
Time

Overview

This lesson introduces time series modeling.

Objectives

After completing this module, students should be able to:

  1. Difference time series data to make it stationary.
  2. Use PACF and ACF functions to assess AR and MA models.
  3. Estimate ARIMA models using R and predict new values.

Readings

Lander, Chapter 21.1

Time series

Temporal data is some of the trickiest to deal with, and is something we can only touch on here. To be honest, the main purpose of this lesson is just to illustrate how tricky it is, and how you should be very careful when making inferences from temporal data until you’ve taken an entire course on the subject.

The main problem with time is that in multiple regression we have been assuming all the observations are independent of each other; with time, almost by definition, the observation at time \(t\) is likely to be correlated with the observation at time \(t-1\). Furthermore, if you seek to find the effect of X on Y and both share an overall trend (e.g., both go up over the years), then you will see a spurious relationship if you regress Y on X, due only to this shared trend and not to any actual causal effect of X on Y.
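To see how easily this happens, here is a minimal simulated illustration (made-up data, not part of the GDP example below): two series that each drift upward but are generated completely independently will still look strongly "related" in a naive regression.

set.seed(1)
fake.x <- cumsum(rnorm(50, mean = 0.2))   # drifts upward; has nothing to do with fake.y
fake.y <- cumsum(rnorm(50, mean = 0.2))   # also drifts upward, generated independently
summary(lm(fake.y ~ fake.x))              # typically reports a highly "significant" slope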

Consider the following example (taken from Lander) of per capita GDP from 1960 to 2011, for the US and Canada:

#install.packages("WDI")
library(WDI)
gdp <- WDI(country=c("US","CA"),indicator="NY.GDP.PCAP.CD",start=1960,end=2011)
colnames(gdp)[3] <- "PerCapGDP"
library(ggplot2)
ggplot(gdp,aes(year,PerCapGDP,color=country)) + geom_line()

If we were to regress these two series on each other, we would find a very strong coefficient:

gdpUS <- gdp$PerCapGDP[gdp$country=="United States"]
gdpCA <- gdp$PerCapGDP[gdp$country=="Canada"]
summary(lm(gdpUS~gdpCA))

Call:
lm(formula = gdpUS ~ gdpCA)

Residuals:
   Min     1Q Median     3Q    Max 
-10260  -1763  -1242   1276   9277 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1976.6600   890.3443    2.22    0.031 *  
gdpCA          1.1216     0.0411   27.29   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3909 on 50 degrees of freedom
Multiple R-squared:  0.9371,    Adjusted R-squared:  0.9358 
F-statistic: 744.6 on 1 and 50 DF,  p-value: < 2.2e-16

But is that because the US economy is driven by Canada’s, or vice versa – or just because most developed countries grew during this time period?

Differencing

The first thing we need to do is to “detrend” the data – remove the overall trend upwards, and make the data more like what we see in a regular regression, where X and Y are a bunch of data points with no overall tendencies up or down from one data point to the next. The key insight is that we are fundamentally interested not in the effect of X on Y if they are both trending up (or going up and down seasonally), but rather the effect of the change in X on the change in Y. That is, if we think a “shock” to X (a short-term change in X) has a short-term effect on Y, then even if they are both trending up (or both going up and down seasonally), we would expect a quick change in X to have a quick effect on Y, even as both are generally trending upward. To model this, we once again do our trick of transforming the variables – this time, by “differencing” them.

What this means is that instead of looking at Y as a function of X, we look at the sequence \(Y_t - Y_{t-1}\) as a function of \(X_t - X_{t-1}\). That is, we regress the year-to-year change in Y on the year-to-year change in X. If Y is {1, 3, 2, 5}, then the first differences are {2, -1, 3}. If we do this, we will tend to see the series flatten out and look more like normal, non-trending data.
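For instance, checking that toy sequence by hand in R (just subtracting each value from the one after it):

y <- c(1, 3, 2, 5)
y[2:4] - y[1:3]   # first differences: 2 -1  3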

Differencing 2

We can generate this first difference automatically via the diff function, or manually by just subtracting each \(Y_{t-1}\) from \(Y_{t}\).

To do it automatically, first we have to code the data as a time series (ts) so that R understands what to do with it:

us <- rev(gdpUS)                    # reverse the order returned by WDI so the series runs 1960 through 2011
us <- ts(us, start=1960, end=2011)  # declare it as an annual time series

Then we can generate the difference using diff or do it manually:

us.d1 <- diff(us) # automatic 

We can also do it manually

lag0 <- us[2:(length(us) - 0)] # Y_t: the series with its first value dropped
lag1 <- us[1:(length(us) - 1)] # Y_{t-1}: the series "lagged" (offset) by one year
us.d1a <- lag0 - lag1
cor(us.d1,us.d1a)
[1] 1

(Note the use of the “lag” – the series \(Y_{t-1}\) is a whole sequence of Y values, just offset by 1: each entry of \(Y_{t-1}\) is the value from the year before the corresponding entry of \(Y_t\), all the way down both vectors. If you regress the \(Y_t\) vector on the \(Y_{t-1}\) vector, you are regressing each year’s GDP on the previous year’s GDP.)
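As a quick sketch (using the lag0 and lag1 vectors built above), that regression is simply:

summary(lm(lag0 ~ lag1))   # each year's GDP regressed on the previous year's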

Differencing 3

The goal is to check whether this new series is “stationary” – that is, whether it is no longer trending up or down (more formally, whether its mean and variance stay roughly constant over time). We can plot it and see:

plot(us.d1)

Hm – that still looks like it’s going up overall. If it is, we can difference it a second time (so that our example sequence would go from {2,-1,3} to {-3,4}), and see whether that fixes it. But how do we know when to stop? Luckily there are tests for whether a sequence is stationary, and we can just keep differencing it until it is. But rather than do that manually, we can of course just use R to tell us, via the ndiffs function in the forecast package:

#install.packages("forecast")
library(forecast)
ndiffs(us)
[1] 2

That tells us the answer is 2. So our stationary version is:

us.d2 <- diff(us, differences=2)  # difference twice (note: diff(us, 2) would instead give the lag-2 difference Y_t - Y_{t-2})
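As a quick check of what the differences argument does, differencing the toy sequence from before twice takes {2, -1, 3} down to {-3, 4}:

diff(c(1, 3, 2, 5), differences = 2)   # [1] -3  4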

Regression

At this point, we could plunge directly into regression, and look at the connection between us.d2 and the equivalent ca.d2 to see whether, after detrending, these two sequences were still related – that is, whether they bounce up and down together independent of their shared tendency to go up over time.

ca <- rev(gdpCA)
ca <- ts(ca,start=1960,end=2011)
ndiffs(ca)
[1] 2
ca.d2 <- diff(ca, differences=2)  # difference the Canadian series twice as well
summary(lm(us.d2 ~ ca.d2))

Call:
lm(formula = us.d2 ~ ca.d2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1636.53  -850.25    38.91   600.87  1635.10 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.317e+03  1.543e+02   8.533 3.49e-11 ***
ca.d2       2.774e-01  4.602e-02   6.029 2.27e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 900.1 on 48 degrees of freedom
Multiple R-squared:  0.4309,    Adjusted R-squared:  0.4191 
F-statistic: 36.35 on 1 and 48 DF,  p-value: 2.266e-07

So they do appear to be correlated, even after differencing twice. But which one is causing the other? One simple way to examine this is to look at which is leading and which is lagging – that is, let’s regress one on the lag of the other, and then reverse the comparison; whichever model fits better suggests which variable is the lead (cause) and which variable is the effect. Is today’s US GDP better seen as a function of yesterday’s CA GDP, or is today’s CA GDP better seen as a function of yesterday’s US GDP?

First we construct the sequences and lagged sequences:

us.d2.l0 <- us.d2[1:(length(us.d2)-1)] # original, but with one observation removed
ca.d2.l0 <- ca.d2[1:(length(ca.d2)-1)] # original, but with one observation removed
us.d2.l1 <- us.d2[2:(length(us.d2)-0)] # lag 1
ca.d2.l1 <- ca.d2[2:(length(ca.d2)-0)] # lag 1

Then we regress them on each other:

summary(lm(us.d2.l0 ~ca.d2.l1 ))

Call:
lm(formula = us.d2.l0 ~ ca.d2.l1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3243.7  -859.7   134.2   914.0  1812.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1525.3806   194.0677   7.860 4.12e-10 ***
ca.d2.l1       0.1537     0.0573   2.681   0.0101 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1115 on 47 degrees of freedom
Multiple R-squared:  0.1327,    Adjusted R-squared:  0.1142 
F-statistic:  7.19 on 1 and 47 DF,  p-value: 0.01008
summary(lm(ca.d2.l0 ~us.d2.l1 ))

Call:
lm(formula = ca.d2.l0 ~ us.d2.l1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3743.9 -1217.0  -103.2   379.7  7179.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -37.9778   614.9244  -0.062  0.95102   
us.d2.l1      0.9314     0.2790   3.338  0.00166 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2262 on 47 degrees of freedom
Multiple R-squared:  0.1916,    Adjusted R-squared:  0.1744 
F-statistic: 11.14 on 1 and 47 DF,  p-value: 0.001659

Based on this, it seems that the model fits better when we model CA as a function of the previous US GDP, rather than vice versa – a result that makes sense, since we would imagine the US is the driver in this relationship, and not vice versa.

Auto-correlation

Unfortunately, things are not quite as simple as that. Y, after all, is not just a function of X: Y is also a function of its own past values, even after differencing. A shock to Y can persist into the next period, and the one after that, and so on before going back to equilibrium. For instance, the “oil shock” in the mid-70s depressed the US economy briefly, and even when the cause of the shock (the oil price hike) disappeared, it took a while for the effects to work out of the system. So we really have to control for past values of Y when regressing Y on X, even after our differencing procedure.

More precisely, we suspect that \(Y_t\) can be a function of its lagged values – maybe its 1-year lag, maybe its 2-year lag, and maybe even more: \(Y_t = \beta_x X + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + ...\) (where X here can be a bunch of X variables, each with its own \(\beta\)). This is known as auto-regression (AR); the correlation of Y with its own past values is called auto-correlation.

How do we know how many lags to include? We actually use a procedure that’s fairly close to multiple regression: we regress Y on all of its lags, and see which ones have an effect.

If we regress \(Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + ...\), we get the effect of year \(t-k\) on today controlling for all the intervening years. It may be that today’s GDP is correlated with the GDP in many past years, but only because year 10 influences year 9, which influences year 8 … all the way up to today. In that case, the causal chain is \(Y_{t-10} \rightarrow Y_{t-9} \rightarrow ... \rightarrow Y_{t-1} \rightarrow Y_{t}\), and as usual for chained causation, controlling for \(Y_{t-1}\) means the earlier years have no additional effect on \(Y_{t}\). But that may not be the whole story – it may be that \(Y_{t-2}\) directly affects \(Y_t\) and not just via \(Y_{t-1}\), perhaps because the US GDP two years ago affects China and then Japan in a way that takes a while to feed back into the US. So what we want to know is how many lags have a direct, independent effect on today’s value.

PACF

The PACF, or partial auto-correlation function, gives the output of this procedure, plotting as vertical lines the sizes of the \(Y_{t-k}\) coefficients (partial correlations) for a large number of lags in the single regression \(Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + ...\); the dashed blue lines indicate the threshold beyond which a lag is statistically significant.

pacf(us.d2)

We can replicate this more or less using regression, shown here for the first four lags:

us.d2.l0 <- us.d2[5:(length(us.d2) - 0)]  # Y_t
us.d2.l1 <- us.d2[4:(length(us.d2) - 1)]  # Y_{t-1}
us.d2.l2 <- us.d2[3:(length(us.d2) - 2)]  # Y_{t-2}
us.d2.l3 <- us.d2[2:(length(us.d2) - 3)]  # Y_{t-3}
us.d2.l4 <- us.d2[1:(length(us.d2) - 4)]  # Y_{t-4}
summary(lm(us.d2.l0 ~ us.d2.l1 + us.d2.l2 + us.d2.l3 + us.d2.l4))

Call:
lm(formula = us.d2.l0 ~ us.d2.l1 + us.d2.l2 + us.d2.l3 + us.d2.l4)

Residuals:
     Min       1Q   Median       3Q      Max 
-2358.01  -202.28   -37.87   347.92  1234.72 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 462.17160  212.44158   2.176 0.035410 *  
us.d2.l1      1.39860    0.15584   8.974 3.18e-11 ***
us.d2.l2     -1.14238    0.27401  -4.169 0.000154 ***
us.d2.l3      0.48005    0.35526   1.351 0.184025    
us.d2.l4      0.04511    0.22682   0.199 0.843349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 608 on 41 degrees of freedom
Multiple R-squared:  0.7404,    Adjusted R-squared:  0.7151 
F-statistic: 29.24 on 4 and 41 DF,  p-value: 1.59e-11

We see that the first lag is the most significant, and the second is, as in the plot, fairly strong and negative, while the third and fourth are probably not significant. The main effect is the first lag: a shock to Y at time t has a direct effect on Y at time t+1, but mainly affects time t+2 via its effect on t+1. So this may be an AR(1) process (ie, an auto-regressive process where only the first lag affects \(Y_t\)), or maybe an AR(2) process. We’ll see in a moment.

Moving average

There is one more wrinkle to the standard time model. Although it is often the case that \(Y_t\) is correlated with \(Y_{t-1}\) and possibly other lags, this can be due to a third mechanism: neither the overall trend, nor the direct effect of the previous Y on the next Y. Rather, Y can be a moving average (MA) of its past values. Rather than having a very short memory, where the value \(Y_t\) is only a function of \(Y_{t-1}\) and a few more lags, \(Y_t\) is a function of lots of past lags, all contributing very small amounts. Essentially, the moving average acts to pull Y back towards some value (eg, the natural GDP growth rate) when it is shocked away from it. Thus a good estimate of Y in the future will be an average of Y in the recent past, even if overall Y has no grand trend up or down.

The result, though, is that Y again is correlated with its past values. But unlike the auto-regressive model, in the moving average model the partial auto-correlation (PACF) of Y with its lags is significant even out to many lags, because all those past values influence the overall moving average. By contrast, in an AR process the bivariate (not partial) correlation of \(Y_t\) with past values \(Y_{t-1}\), \(Y_{t-2}\), etc, can extend quite a way into the past (because a shock at time \(t\) can still influence \(Y_{t+10}\) ten years down the line via all the intermediate \(Y\) values), whereas in an MA process usually only the most recent one or two \(Y_{t-k}\) values are correlated with \(Y_t\) (for reasons too complex to go into here).

So you can usually tell whether a process is AR or MA by looking at the partial auto-correlations (the PACF) and looking at the direct, bivariate correlations with past lags (known as the ACF or auto-correlation function). We can plot the ACF with R just as we did the PACF:

acf(us.d2)

Remember, this is the correlation of \(Y_t\) with \(Y_{t-1}\) (the first line), \(Y_t\) with \(Y_{t-2}\) (the second line), and so on. The PACF, by contrast, came from the regression \(Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + ...\), ie, the correlation with each lag controlling for all the others.

So when the PACF extends to only a few lags and the ACF extends out a long way (as in our example here), that usually means we have an AR process, and the number of lags the PACF extends to (here, 2) determines that we have an AR(2) process. Conversely, if the PACF extended out a long way and the ACF stopped short, we would have an MA process, with the number of significant terms in the ACF determining an MA(1), MA(2), or whatever.
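If you want to see these two signature patterns in a clean setting, you can simulate an AR and an MA series with made-up parameters (a quick illustration, separate from the GDP data) and compare their ACF and PACF plots:

set.seed(1)
sim.ar <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)   # a simulated AR(2) process
sim.ma <- arima.sim(model = list(ma = 0.8), n = 500)           # a simulated MA(1) process
acf(sim.ar); pacf(sim.ar)   # ACF tails off gradually; PACF cuts off after lag 2
acf(sim.ma); pacf(sim.ma)   # ACF cuts off after lag 1; PACF tails off gradually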

ARIMA

Of course, as with all models, in the real world we can’t really assess one or the other model – AR or MA – independently. Instead, our process could be a mix of both – a little bit of auto-regression, where the immediate past values of Y influence the next values; and a little bit of moving average, where the long-term history of Y tends to have a gradual effect on the future. So to estimate the best model, we need to try a variety of mixtures of AR and MA, and see which model best fits the data (using, for instance, the Akaike Information Criterion (AIC), a measure of fit that, unlike \(R^2\), can be compared across many kinds of models, not just linear regression; lower AIC means a better fit).

As usual, R will do our hard work for us. The forecast package includes the auto.arima function, which will find the best mix of AR and MA. ARIMA, of course, stands for AR + MA, plus the I, which stands for “integrated” and usually corresponds to how many times you have to difference the data before doing the AR and MA analysis. Thus an ARIMA(1,2,0) model would be a model with 2 differencings, an AR(1) process (ie, \(Y_t\) is partially correlated with \(Y_{t-1}\) but none of the other lags), and an MA(0) (ie, no moving average) part. An ARIMA(2,1,1) process would have an AR(2) component, 1 differencing, and an MA(1) component.
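Note that you can also fit any particular (p, d, q) combination by hand with R's arima function and compare candidate models by AIC (lower is better); a brief sketch:

fit.120 <- arima(us, order = c(1, 2, 0))   # AR(1), two differences, no MA term
fit.121 <- arima(us, order = c(1, 2, 1))   # AR(1), two differences, MA(1)
AIC(fit.120, fit.121)                      # compare the fits; lower AIC is better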

So what’s our Per Capita GDP then? It’s best to run it on the original series, and let the function tell us how many differences we should take:

auto.arima(us)
Series: us 
ARIMA(1,2,1)                    

Coefficients:
         ar1      ma1
      0.4194  -0.8793
s.e.  0.1576   0.0750

sigma^2 estimated as 294680:  log likelihood=-386.17
AIC=778.33   AICc=778.85   BIC=784.07

This appears to be a (1,2,1) ARIMA model: 2 differences (as we found), an AR(1), and a MA(1). While our independent analysis of the AR part found an AR(2), when we include the MA part, we instead get an AR(1) plus an MA(1) – some autocorrelation (ie, this year’s GDP is affected by last year’s), plus some moving average (ie, the whole process tends to have long-term tendencies it returns to).

Independent variables

So what happened to X? How do we model Y as a function of X variables? What causes the US GDP to surge and decline, apart from its own past values? For instance, we might want to put oil prices in there, or wars, or the GDP of other countries. When we compared the US to the Canadian time series, it looked like the US affected CA more than vice versa – but given that we weren’t controlling for all the past values of US and Canadian GDP influencing the present, that result was surely wrong.

But it turns out the ARIMA model can easily handle independent variables as well. How that is estimated is beyond the scope of this lesson (see here for more on independent variables, and here for more on ARIMA more generally). But the R code is just:

auto.arima(us,xreg=ca)
Series: us 
ARIMA(1,1,0) with drift         

Coefficients:
         ar1     drift      ca
      0.7379  659.5057  0.2216
s.e.  0.0942  168.6984  0.0228

sigma^2 estimated as 107119:  log likelihood=-360.38
AIC=728.76   AICc=729.63   BIC=736.49

As we can see, the model is slightly different now, and the “effect” of CA is positive and significant – it does seem to affect US GDP above and beyond the overall upward trend and the AR process (there is no MA in this version).

But is that the direction we think causation goes, from Canada’s GDP to the US? How about if we try it the other way:

auto.arima(ca,xreg=us)
Series: ca 
ARIMA(3,0,0) with zero mean     

Coefficients:
         ar1     ar2      ar3      us
      1.4090  0.0530  -0.4648  2.5587
s.e.  0.1298  0.2476   0.1319  0.2707

sigma^2 estimated as 1242251:  log likelihood=-442.96
AIC=895.92   AICc=897.23   BIC=905.68

Here we get yet another model, a (3,0,0) ARIMA, although again the US “effect” is strong and positive. How do we decide which of these models is right? Comparing fit statistics directly is tricky here: the AIC of the second model is higher (ie, worse), but the two models have different dependent variables, so their AIC values are not really comparable. So while it may well be that, as we might expect, the US affects CA rather than vice versa, confirming that would take a lot more analysis.

Prediction

Although we’ve generally been more interested in causal analysis (which X’s are affecting Y, and in what way), people are often very interested in time series data for prediction purposes. These models are very easy to extend to prediction, just as we can easily predict a \(\hat{y}\) value from some set of X values using the \(\beta\) coefficients from a regression. We won’t go into how to do it by hand here, but the auto.arima function stores in its output the coefficients to allow the predict function to predict future values.

arus <- auto.arima(us)
predict(arus,n.ahead=5)
$pred
Time Series:
Start = 2012 
End = 2016 
Frequency = 1 
[1] 51083.43 52302.07 53494.99 54677.14 55854.76

$se
Time Series:
Start = 2012 
End = 2016 
Frequency = 1 
[1]  542.8448  996.8105 1429.4254 1846.6179 2255.5708

(Of course, we can’t do this with the model that includes the X values without knowing the future X values, or at least predicting them from their own past values; see the sketch below.)
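One way around that, sketched below just to show the mechanics, is to first forecast the X series (here CA) from its own past and then feed those predicted values into the xreg model via the forecast function; the resulting US forecasts are only as good as the guessed CA values, and depending on your version of the forecast package you may need to wrap the xreg arguments in as.matrix().

arusca <- auto.arima(us, xreg = ca)                  # the xreg model from above
ca.future <- forecast(auto.arima(ca), h = 5)$mean    # first forecast CA from its own past
forecast(arusca, xreg = ca.future, h = 5)            # then forecast US using those CA values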

And we can plot it using the forecast function:

usforecast <- forecast(object = arus, h=10)
plot(usforecast)

Time to hit the stock market!