Vector Autoregression (VAR)

To answer many of the questions raised in the 'Introduction' section, I need to apply vector autoregression (VAR) models, which extend autoregressive models to multivariate time series. My main goal with this analysis is to see how Google searches for economic terms might relate to more formal measures of economic well-being, like unemployment or the consumer sentiment index (see 'Data Sources').

For this analysis, I run a VAR model with two formal economic measures (the unemployment rate and the consumer sentiment index) along with two Google Trends search series, for 'unemployment' and 'food stamps'. I did not include data on actual food stamp enrollment because my visualization showed that it was very smooth across years and didn't track particularly well with the other economic variables. For these monthly time series, I limit the data to dates from January 2004 through February 2020. The start date is when the Google Trends data becomes available, and the end date cuts out the significant COVID-era outliers (see 'Data Visualizations').
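
In code, the setup looks something like the sketch below (object names such as unemp_rate and gt_foodstamps are placeholders for the actual series, assumed here to be monthly ts objects with frequency 12):

    # Combine the four hypothetical monthly series into one multivariate ts
    y <- cbind(unemp_rate, sentiment, gt_unemployment, gt_foodstamps)

    # Restrict to January 2004 through February 2020 to drop the COVID-era outliers
    y <- window(y, start = c(2004, 1), end = c(2020, 2))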

Visualizing Correlation

First, I plot a scatterplot matrix to explore the potential correlation between these four variables.
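
A minimal base-R version of this plot, assuming the combined series y from above:

    # Scatterplot matrix of the four monthly series
    pairs(as.data.frame(y),
          labels = c("Unemployment rate", "Consumer sentiment",
                     "GT: 'unemployment'", "GT: 'food stamps'"))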

Fitting a VAR(1) Model

The next step is to fit a VAR model on these four time series.

I first modeled a simple VAR with p = 1 to see if there was statistical significance. This was run on the raw data, which were not stationary. The results of the VAR(1) model are shown below. They generally have very high R-squared values (ranging from 0.84 to 0.99), but this is likely because each variable is regressed on the one-month lag of itself; since the data are not differenced, there is already a very high correlation between, for example, unemployment in month t and in month t-1.
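
A sketch of the fit with the vars package, again using the combined series y:

    library(vars)

    # VAR(1) with a constant term, fit on the raw, undifferenced series
    var1 <- VAR(y, p = 1, type = "const")
    summary(var1)  # per-equation coefficients and R-squared values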

Here is the equation of the VAR(1) model on these four variables, written in its standard general form:
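
    \mathbf{y}_t = \mathbf{c} + \Phi_1 \mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t

where y_t stacks the unemployment rate, consumer sentiment, and the two Google Trends series at month t; c is a 4 x 1 intercept vector; Phi_1 is a 4 x 4 matrix of coefficients on the first lag; and epsilon_t is a white-noise error vector. Each row of Phi_1 corresponds to one of the four equations discussed below.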

The VAR(1) model on unemployment has all statistically significant coefficients.

The VAR(1) model on Google searches for 'unemployment' also has all statistically significant coefficients.

The VAR(1) model on consumer sentiment was significantly related only to the single lagged value of itself, and not to the other variables. This is surprising because the scatterplots above seem to show some relationship between consumer sentiment and unemployment.

The VAR(1) model on Google searches for 'food stamps' was significantly related to every lagged variable except consumer sentiment.

Model Alternatives

I also ran R's VARselect() function, which suggested that the AIC of these models would be minimized at a lag order of p = 4. However, those models mostly did not have significant coefficients, suggesting they are not appropriate. Per the principle of parsimony, it is better to stick with p = 1.
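
The selection call looks like this (lag.max = 8 is an assumption):

    # Lag order minimizing each criterion (AIC, HQ, SC, FPE)
    VARselect(y, lag.max = 8, type = "const")$selection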

Model Evaluation

It is necessary to examine the residuals of the VAR(1) model on the undifferenced data.

The following figures show the ACF plots of the residuals for these four models. The model of searches for 'unemployment' still has a number of significant values, suggesting it is not the best possible fit. In particular, there are high values at lags 6, 12, and 18, suggesting seasonality.
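
A sketch of how these plots can be produced from the fitted model:

    # ACF of the residuals from each of the four VAR(1) equations
    resid_mat <- residuals(var1)
    par(mfrow = c(2, 2))
    for (v in colnames(resid_mat)) {
      acf(resid_mat[, v], lag.max = 24, main = paste("Residual ACF:", v))
    }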

Portmanteau Test

The portmanteau test returned a p-value of 7.8 x 10^-9, which rejects the null hypothesis that the residuals are white noise. This matches the significant lags in the ACF plots above and suggests the VAR(1) is not fully capturing the serial structure (likely the seasonality) in these series.
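
The test can be run with the vars package (lags.pt = 16 is an assumption):

    # Multivariate portmanteau test for serial correlation in the VAR(1) residuals
    serial.test(var1, lags.pt = 16, type = "PT.asymptotic")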

Fitting an ARMAX Model

I also fit an ARMAX model, which is like an ARIMA model that takes several exogenous variables into account. While the VAR fits all the possible models at once (with each variable as a possible outcome), the ARMAX takes one outcome variable and fits an automatically selected ARIMA with the other variables as exogenous inputs. I chose to fit an ARMAX model of the unemployment rate, with three exogenous variables: Google searches for 'unemployment', Google searches for 'food stamps', and consumer sentiment. R's auto.arima() function chose an ARIMA(1,1,5).
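
A sketch with the forecast package, reusing the placeholder column names from the setup above:

    library(forecast)

    # Exogenous regressors as a matrix, aligned with the outcome series
    xreg <- cbind(gt_unemployment = y[, "gt_unemployment"],
                  gt_foodstamps   = y[, "gt_foodstamps"],
                  sentiment       = y[, "sentiment"])

    # auto.arima() selects the (p, d, q) orders; here it settled on ARIMA(1,1,5)
    armax <- auto.arima(y[, "unemp_rate"], xreg = xreg)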

The residuals of this ARMAX are shown below. There are some residual lags that are significant, but not many, and the residuals themselves resemble white noise, which is good.
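
One way to produce these diagnostics is the forecast package's checkresiduals(), which plots the residuals and their ACF and runs a Ljung-Box test:

    # Residual time plot, ACF, histogram, and Ljung-Box test for the ARMAX fit
    checkresiduals(armax)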
