ARMA/ARIMA/SARIMA Models

Modeling US Unemployment

For this modeling, I chose to cut off recent data points after January 2020, to remove the spike in data related to the COVID-19 pandemic.

To build an ARIMA model of unemployment, I first determined that the original data is not stationary. Although the ADF test reports a p-value below .05, the test is unreliable here. The ACF plot shows high autocorrelation at many lags, demonstrating that the data is not stationary.
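To make the ACF argument concrete, here is a minimal sketch of a sample ACF in plain Python. The data is a simulated random walk (an assumption for illustration, not the unemployment series), and `sample_acf` is a hypothetical helper name. A non-stationary series keeps a high autocorrelation across many lags, while its first difference drops inside the approximate 95% band almost immediately.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]

# Toy data (assumed): a Gaussian random walk, which is non-stationary by construction.
random.seed(0)
walk = [0.0]
for _ in range(499):
    walk.append(walk[-1] + random.gauss(0, 1))

# First difference of the walk, which is just the white-noise steps.
diffed = [b - a for a, b in zip(walk, walk[1:])]

bound = 1.96 / len(walk) ** 0.5   # approximate 95% band for white noise
acf_walk = sample_acf(walk, 24)
acf_diff = sample_acf(diffed, 24)

print("random walk, lags 1-3:", [round(r, 2) for r in acf_walk[1:4]])
print("differenced, lags 1-3:", [round(r, 2) for r in acf_diff[1:4]])
print("approx. 95% band: +/-", round(bound, 3))
```

In R, the equivalent plot would normally come from `acf()`; the loop above only spells out what that plot is measuring.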

To make the data stationary, I applied 12-month seasonal differencing and a single round of simple differencing. The ADF test of the differenced series returns a p-value of .01. Below are plots of the first- and seasonally differenced data, along with its ACF and PACF.
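The two differencing steps can be sketched in a few lines. The series here is a toy one (a linear trend plus a fixed 12-month pattern, assumed purely for illustration), and `difference` is a hypothetical helper: the seasonal difference removes the repeating pattern, and the simple difference removes what remains of the trend.

```python
def difference(x, lag=1):
    """Return the lag-`lag` difference x[t] - x[t - lag]."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Toy monthly series (assumed): linear trend plus a fixed 12-month pattern.
pattern = [3, 1, 0, -1, -2, -3, -2, -1, 0, 1, 2, 2]
series = [0.5 * t + pattern[t % 12] for t in range(60)]

seasonal = difference(series, lag=12)   # removes the repeating 12-month pattern
both = difference(seasonal, lag=1)      # removes what is left of the trend

print(len(series), len(seasonal), len(both))   # 60 48 47
print(seasonal[:3])                            # [6.0, 6.0, 6.0]
print(both[:3])                                # [0.0, 0.0, 0.0]
```

Each round of differencing shortens the series: 12 observations are lost to the seasonal difference and one more to the simple difference.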

The plots shown above help us identify the order of the SARIMA model. The ACF plot suggests values for *q* and *Q*, and the PACF plot suggests values for *p* and *P*. From the plots, I will try a SARIMA(0,1,0)(3,1,1)12 model.
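For reference, this is how the orders enter the standard multiplicative SARIMA(p,d,q)(P,D,Q)_s form, written with the backshift operator B (where B X_t = X_{t-1}). The symbols follow the usual textbook convention rather than anything specific to this analysis:

$$\phi(B)\,\Phi(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,X_t = \theta(B)\,\Theta(B^{s})\,w_t$$

Here φ and θ are the non-seasonal AR and MA polynomials of degree *p* and *q*, and Φ and Θ are the seasonal polynomials of degree *P* and *Q*. For SARIMA(0,1,0)(3,1,1)12 this reduces to:

$$(1-\Phi_1 B^{12}-\Phi_2 B^{24}-\Phi_3 B^{36})\,(1-B)\,(1-B^{12})\,X_t = (1+\Theta_1 B^{12})\,w_t$$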

Using R's sarima() function, I fitted the SARIMA(0,1,0)(3,1,1)12 model; all of its coefficients are statistically significant.

The equation of the model is:

$$X_t = X_{t-1} + .81X_{t-12} - .81X_{t-13} + .01X_{t-24} - .01X_{t-25} + .03X_{t-36} - .03X_{t-37} + .15X_{t-48} - .15X_{t-49} + w_t - w_{t-12}$$
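One way to check an expansion like this is to multiply the backshift polynomials numerically. The sketch below does that for the autoregressive side of a SARIMA(0,1,0)(3,1,1)12 model. The seasonal AR values in `Phi` are assumptions, back-solved from the lag coefficients printed in the equation rather than taken from the sarima() output. The helper names are hypothetical, and the seasonal MA term is left out because it only affects the w terms.

```python
def poly_mul(a, b):
    """Multiply two polynomials in the backshift operator B (index = power of B)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def seasonal_ar_poly(Phi, s):
    """Build 1 - Phi_1 B^s - Phi_2 B^(2s) - ... as a coefficient list."""
    poly = [0.0] * (len(Phi) * s + 1)
    poly[0] = 1.0
    for k, coef in enumerate(Phi, start=1):
        poly[k * s] = -coef
    return poly

def diff_poly(lag):
    """Build the differencing polynomial 1 - B^lag."""
    poly = [0.0] * (lag + 1)
    poly[0], poly[lag] = 1.0, -1.0
    return poly

# Assumed seasonal AR values, back-solved from the printed lag coefficients.
Phi = [-0.19, -0.18, -0.15]

# Left-hand side: Phi(B^12) * (1 - B) * (1 - B^12)
lhs = poly_mul(poly_mul(seasonal_ar_poly(Phi, 12), diff_poly(1)), diff_poly(12))

# Moving every lagged term to the right-hand side flips its sign.
lag_coefs = {lag: round(-c, 2) for lag, c in enumerate(lhs)
             if lag > 0 and abs(c) > 1e-9}
print(lag_coefs)
```

Because of the (1 - B) factor, every seasonal lag comes out paired with the following lag at the opposite sign, so under these assumed values lags 12 and 13 carry .81 and -.81.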

The following figure contains the diagnostic plots of this model. The top panel shows that the residuals are centered around 0 and have a relatively constant variance, at least until 2020.

The ACF of the residuals looks good, as all values are inside the confidence interval - this means the residuals show no significant autocorrelation and resemble white noise.

The Q-Q plot lies along the straight line for most of its values, which is also a good sign.

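Finally, the last plot shows the p-values of the Ljung-Box statistic at increasing lags. In this plot, p-values below the .05 line indicate autocorrelation remaining in the residuals, and by that measure the model does not look like a very good fit.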
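To show what the Ljung-Box panel is computing, here is a rough sketch of the statistic in plain Python. The residuals are simulated (an assumption, not the model's actual residuals) and the helper names are hypothetical. The p-value helper uses a closed form that only holds for even degrees of freedom. In a real fit, the degrees of freedom would be the number of lags minus the number of fitted ARMA coefficients, which this toy case skips.

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]

def ljung_box_q(resid, h):
    """Ljung-Box Q statistic over lags 1..h."""
    n = len(resid)
    rho = sample_acf(resid, h)
    return n * (n + 2) * sum(rho[k] ** 2 / (n - k) for k in range(1, h + 1))

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

# Toy residuals (assumed): white noise, and a version with leftover autocorrelation.
random.seed(1)
noise = [random.gauss(0, 1) for _ in range(300)]
correlated = [noise[0]]
for e in noise[1:]:
    correlated.append(0.7 * correlated[-1] + e)

h = 12  # df = h here, since no fitted parameters are subtracted in this toy case
for name, resid in [("white noise", noise), ("correlated", correlated)]:
    q = ljung_box_q(resid, h)
    print(name, "Q =", round(q, 1), "p =", chi2_sf_even(q, h))
```

A small p-value rejects the hypothesis that the residuals are uncorrelated, so for a well-fitting model the p-values should stay above .05.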

R's auto.arima() function selected ARIMA(4,1,1)(2,0,2)[12] as the best fit. The following figure shows the diagnostic plots of this suggested model, which has much lower AIC and BIC values than the model I initially tried, suggesting it is a better model.

The following figure shows US unemployment forecasted over the next three years using the ARIMA(4,1,1)(2,0,2)[12] model suggested by auto.arima(). The forecast intervals cover a very broad range of outcomes, perhaps because the model cannot account for the decade-long business cycle.

Modeling Google Searches for Unemployment

Next, I modeled a time series of Google Trends searches for 'unemployment'.

To build an ARIMA model of these Google searches, I first determined that the original data is not stationary. The ADF test reports a p-value of .85, so we cannot reject the null hypothesis of a unit root, which points to non-stationary data. The ACF plot also shows high autocorrelation at many lags, demonstrating that the data is not stationary.

The plots shown above help us identify the order of the SARIMA model. The ACF plot suggests values for *q* and *Q*, and the PACF plot suggests values for *p* and *P*. From the plots, I decided to try a range of models, including:

SARIMA(0,1,0)(2,1,1)12
SARIMA(1,1,0)(2,1,1)12
SARIMA(2,1,0)(2,1,1)12
SARIMA(3,1,0)(2,1,1)12
SARIMA(4,1,0)(2,1,1)12

Of these models, those with p = 2 and p = 4 had the lowest AIC/BIC values, so I'll go with p = 2, because simpler models are generally preferable.
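The comparison above can be sketched as follows. The log-likelihoods and sample size are made up for illustration (they are not the fitted values), and the parameter count assumes three seasonal coefficients plus the innovation variance. AIC and BIC both penalize extra parameters, but BIC penalizes them more heavily, so the two criteria can disagree when a simpler and a larger model fit about equally well.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * loglik

# Hypothetical values, made up for illustration only.
n_obs = 180
logliks = {0: -412.0, 1: -411.6, 2: -405.1, 3: -404.9, 4: -403.0}  # keyed by p
n_seasonal = 3  # two seasonal AR terms + one seasonal MA term in (2,1,1)

scores = {}
for p, loglik in logliks.items():
    k = p + n_seasonal + 1  # + 1 for the innovation variance
    scores[p] = (aic(loglik, k), bic(loglik, k, n_obs))

best_aic = min(scores, key=lambda p: scores[p][0])
best_bic = min(scores, key=lambda p: scores[p][1])

for p, (a, b) in scores.items():
    print(f"p={p}: AIC={a:.1f}  BIC={b:.1f}")
print("lowest AIC at p =", best_aic, "| lowest BIC at p =", best_bic)
```

With these invented numbers, p = 4 narrowly wins on AIC while p = 2 wins on BIC, which is the kind of near-tie where preferring the simpler model is a reasonable tiebreaker.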

However, the fitted model reports 5 coefficients, 3 of which are not statistically significant, suggesting it isn't a very good model.

The equation for this is horrifically long:

$$X_t = .91X_{t-1} - .13X_{t-2} + .22X_{t-3} + 1.13X_{t-12} + .97X_{t-13} + .02X_{t-14} - .13X_{t-15} + .15X_{t-24} + .06X_{t-25} - .015X_{t-26} - .007X_{t-27} - .06X_{t-36} + .055X_{t-37} + .008X_{t-38} + .01X_{t-39} + w_t + X_{t-12}$$

The following figure contains the diagnostic plots of this model. The top panel shows that the residuals are centered around 0. There is a higher variance of these residuals around the middle of the time span, which may be an issue.

The ACF of the residuals looks good, as all values are inside the confidence interval - this means the residuals show no significant autocorrelation and resemble white noise.

The Q-Q plot lies along the straight line for most of its values, which is also a good sign.

Finally, the last plot shows the p-values of the Ljung-Box statistic. These all sit above the .05 line, meaning there is no significant autocorrelation left in the residuals, which suggests a good model fit.

R's auto.arima() function selected ARIMA(0,1,0)(2,0,0)[12] as the best fit. The following figure shows the diagnostic plots of this suggested model. Interestingly, this model has higher AIC and BIC values than the one I show above, so it may not be as good a model. This may show in the Ljung-Box plot, where none of the p-values clear the .05 line.

The equation for this auto.arima() model is:

$$X_t = X_{t-1} + .32X_{t-12} - .32X_{t-13} + .25X_{t-24} - .25X_{t-25} + w_t - .0081$$

The following figure shows Google searches for US unemployment forecasted over the next three years using the ARIMA(2,1,0)(2,1,1)[12] model, which had the lowest AIC/BIC values.