Overview

This lesson introduces techniques for detecting and correcting common issues in multiple regression, including multicollinearity, heteroskedasticity, and clustered errors.

Objectives

After completing this module, students should be able to:

  1. Detect and correct for multicollinearity among independent variables.
  2. Detect and correct for heteroskedasticity.
  3. Deal with clustered errors.

Reading

None.

1 Multicollinearity

While multiple regression is a robust and effective tool for estimating the effects of independent variables on a dependent variable, there are a number of well-known problems and pitfalls that linear regression can run into. Many of these issues are problems less for the coefficients associated with the independent variables than for the standard errors associated with those variables. This can be a serious problem, of course, because the standard error determines the statistical significance of the effect of the independent variable on the dependent variable, and thus mistaken standard errors can lead us either to incorrectly reject the null or to incorrectly fail to reject it. Luckily, many of these pitfalls have by now been well explored, and there are established tests for these issues, as well as strategies for fixing or at least ameliorating them.

The first issue we will discuss is multicollinearity. The problem is best illustrated with an example: what would we expect if we regressed someone’s height on the length of their leg? Presumably a strong relationship, with a coefficient around 2. Now what would happen if we regressed their height on two variables, the length of their left leg and the length of their right leg? It’s unclear what the coefficients should be, both from a technical perspective (1? 2?) and from a substantive perspective (for most people, it makes no sense to imagine the effect of the right leg’s length on height controlling for left leg length). Essentially, the two legs are measuring the same thing, and we should probably have just one measure (either one leg, or the average of the two) in our model.

Let’s consider an example. What happens if we include two identical independent variables?

set.seed(1)
x1 <- rnorm(1000)
y <- x1 + 5*rnorm(1000)
x2 <- x1 # x2 is identical to x1
summary(lm(y~x1+x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2422  -3.3598  -0.0688   3.7770  18.2216 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08093    0.16452  -0.492    0.623    
## x1           1.03216    0.15904   6.490 1.35e-10 ***
## x2                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.202 on 998 degrees of freedom
## Multiple R-squared:  0.04049,    Adjusted R-squared:  0.03953 
## F-statistic: 42.12 on 1 and 998 DF,  p-value: 1.352e-10

As you can see, R simply drops one of the variables. The coefficients cannot be determined mathematically (the X matrix cannot be inverted), and including the same variable twice makes no substantive sense anyway, so R recognizes the problem and drops the second, identical variable.
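We can see the underlying problem directly by checking the rank of the design matrix – a quick sketch, not part of the original output; Xmat is a new name introduced here. With x2 identical to x1, one column is redundant, which is why the matrix cannot be inverted.

# Design matrix: intercept column plus the two (identical) predictors
Xmat <- cbind(1, x1, x2)
qr(Xmat)$rank # rank 2 rather than 3, since one column duplicates another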

But it’s still a problem even if the variables are slightly different.

x2 <- x1 + 0.01*rnorm(1000) # now x2 is slightly different from x1
cor(x1,x2)
## [1] 0.9999505
summary(lm(y~x1+x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2857  -3.3547  -0.1055   3.7700  18.2618 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.08272    0.16459  -0.503    0.615
## x1          -10.19909   15.99533  -0.638    0.524
## x2           11.22573   15.98667   0.702    0.483
## 
## Residual standard error: 5.204 on 997 degrees of freedom
## Multiple R-squared:  0.04097,    Adjusted R-squared:  0.03904 
## F-statistic: 21.29 on 2 and 997 DF,  p-value: 8.789e-10

This is much worse. R keeps both variables, but neither of them is statistically significant despite y being a direct function of x1 – and in fact, the coefficients are wrong as well, with x1 having a coefficient of about -10 and x2 a coefficient of about 11. This is not entirely incorrect – a 1-unit change in both x1 and x2 is estimated to produce a net +1 change in y, and the F test shows the two are jointly significant – but clearly things aren’t being reported in the way we want for a multiple regression.
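We can confirm the joint significance explicitly by comparing the full model to an intercept-only model with an F test (a quick sketch, not part of the original lesson):

# Joint F test: do x1 and x2 together explain y better than no predictors at all?
anova(lm(y ~ 1), lm(y ~ x1 + x2))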

What happens if x1 and x2 are slightly less correlated?

x2 <- x1 + 0.1*rnorm(1000)
cor(x1,x2)
## [1] 0.9950276
summary(lm(y~x1+x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.275  -3.379  -0.104   3.726  18.127 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08224    0.16461  -0.500    0.617
## x1           0.25962    1.59744   0.163    0.871
## x2           0.77072    1.58576   0.486    0.627
## 
## Residual standard error: 5.204 on 997 degrees of freedom
## Multiple R-squared:  0.04072,    Adjusted R-squared:  0.0388 
## F-statistic: 21.16 on 2 and 997 DF,  p-value: 9.993e-10

This is somewhat better, but the coefficients, while in the right ballpark now, are still wrong. Let’s decrease the correlation a bit more:

x2 <- x1 + 1*rnorm(1000)
cor(x1,x2)
## [1] 0.7100811
summary(lm(y~x1+x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2393  -3.3766  -0.0782   3.7679  18.2613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07988    0.16463  -0.485    0.628    
## x1           0.98244    0.22598   4.348 1.52e-05 ***
## x2           0.05161    0.16656   0.310    0.757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.205 on 997 degrees of freedom
## Multiple R-squared:  0.04059,    Adjusted R-squared:  0.03866 
## F-statistic: 21.09 on 2 and 997 DF,  p-value: 1.072e-09

Now the regression is working fine: x1 has a significant coefficient close to its true value of 1, and x2 correctly shows no independent effect.

1.1 Testing

What’s going on? Basically, when two variables are very similar, they fight for the joint effect, often leaving neither significant, and also destabilizing the coefficients. How correlated is too correlated? There are formal tests for this, but the most common test is somewhat informal: the Variance Inflation Factor (VIF). The VIF is basically a measure, for each independent variable, of how similar it is to the rest of the independent variables – ie, how truly independent it is from the others. A variable that is either highly correlated with another variable, or highly correlated with a linear combination of other variables, will cause problems for the regression due to multicollinearity (collinear means that the variables are approximately linear functions of each other).

So how do we determine how similar a variable is to all the rest? Regression, of course! We can simply regress each independent variable on all the others, and use the \(R^2\) from that regression as a proxy for how closely the other variables explain the first; if that is too high, we declare that that variable is collinear with the others (and turn to solutions in the next section).

Since we want the VIF to be large when \(R^2\) is large, the actual formula is

\[VIF_k = 1/(1-R^2_k)\] where this is the formula for the kth variable, and \(R^2_k\) is the \(R^2\) from regressing the kth independent variable on all the others. We then calculate this for each independent variable, and the rule of thumb is that if the VIF is above 10 (or, some say, 5), then we have multicollinearity. Note from the formula above that a VIF of 10 corresponds to an \(R^2\) of 0.9, so if it’s just two variables, they must be fairly tightly correlated in order to be a problem.

Here are two example VIF calculations from our series of examples above. Note that in the first example the VIF is far too large – as we saw, the regression was very unstable – while in the second example it is only around 2, and we didn’t have any problems when we ran the regression.

x2 <- x1 + 0.1*rnorm(1000)
rsquared <- summary(lm(x2~x1))$r.squared
vif <- 1/(1-rsquared)
vif
## [1] 112.4062
x2 <- x1 + 1*rnorm(1000)
rsquared <- summary(lm(x2~x1))$r.squared
vif <- 1/(1-rsquared)
vif
## [1] 1.971902
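With more than two independent variables, the same idea generalizes: regress each predictor on all of the others and convert each \(R^2\) to a VIF. Here is a minimal sketch (not part of the original lesson) that loops over the columns of a data frame of predictors; preds is a new name introduced here, using the current x1 and x2 as stand-ins for a larger set.

# Gather the predictors into a data frame and compute a VIF for each column
preds <- data.frame(x1=x1, x2=x2)
vifs <- sapply(names(preds), function(v){
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response=v), data=preds))$r.squared
  1/(1 - r2)  # VIF_k = 1/(1 - R^2_k)
})
vifs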

There are also of course packages in R to calculate the VIF automatically, which is a boon when there are lots of independent variables and you would otherwise need to run \(k\) separate regressions.

library(faraway)
vif(lm(y~x1+x2))
##       x1       x2 
## 1.971902 1.971902

1.2 Solutions

When two variables are strongly correlated, that’s often telling you something important. Multicollinearity is surprisingly common with real-world data because you often have multiple variables that are essentially measuring the same thing in different ways: GDP in dollars in one column and GDP in euros in another; scores on two different kinds of tests that measure essentially the same skills; variables for the length, width, height, and volume of a set of objects; or, most commonly, three binary variables, one for each of three possible categories (eg, Democrat, Republican, or Independent). (Think about why these last two cases are examples of multicollinearity!) These sorts of issues can be quite subtle, and it’s important to detect them rather than charge through them.

The solution, therefore, is substantive, not technical. Either one must throw out the variables that are too collinear with the others, taking care to retain the variables that speak most directly to one’s theory; or one must combine the collinear variables, eg by averaging them together (as with the leg lengths), once again being careful to combine them in a way that is substantively meaningful. Detecting multicollinearity is usually helpful, though, because you learn something new about your data and sometimes generate a new, better aggregate variable.
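For instance, with a nearly identical pair like the one from the earlier example, we could average the two measures into a single predictor and use that instead. A small sketch (x2b and xavg are new names introduced here):

x2b <- x1 + 0.01*rnorm(1000) # recreate a nearly identical pair, as before
xavg <- (x1 + x2b)/2         # combine them, eg average the two "leg lengths"
summary(lm(y ~ xavg))        # recovers a single stable, significant coefficient near 1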

2 Heteroskedasticity

As you may recall, we never derived the standard errors on the regression coefficients. But one of the mathematical assumptions behind that derivation is that the size of the model errors \(e_i = y_i - \hat{y}_i\) is unrelated to the X values – eg, that the accuracy of \(\hat{y}_i\) (the predicted \(y\) values) doesn’t get worse with increasing X. Generally, if we plot \(y\) vs \(x\) and the best-fit regression line, we see a cloud of \(y\) values normally distributed around the line. That is homoskedastic data. But if the data are heteroskedastic, the plot can look like this:

set.seed(1)
# Heteroskedastic data
#  Note how we construct the error so that it is correlated with (a function of) x
x <- rnorm(50,3,1)
error <- x*rnorm(50,0,1.6) + rnorm(50,0,0.5)
y <- 1 + 2*x + error
plot(x,y)
abline(1,2)

Note how the variance in \(y\) values around the line increases with increasing \(x\). This causes a problem when estimating the coefficient standard errors because a key mathematical assumption is violated. This usually results in underestimating the coefficient standard errors, which is dangerous because it can lead to incorrectly rejecting the null (ie, false positives).
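A small Monte Carlo sketch (not part of the original lesson, but using the same data-generating process as above) illustrates the underestimation: across many simulated datasets, the actual spread of the slope estimates is larger than the standard error that lm() reports on average.

# Compare the average reported SE for the slope to its true sampling variability
slopes <- ses <- numeric(2000)
for(i in 1:2000){
  xs <- rnorm(50,3,1)
  ys <- 1 + 2*xs + xs*rnorm(50,0,1.6) + rnorm(50,0,0.5)
  cf <- summary(lm(ys ~ xs))$coefficients
  slopes[i] <- cf["xs","Estimate"]
  ses[i] <- cf["xs","Std. Error"]
}
sd(slopes) # actual spread of the slope estimates across simulations
mean(ses)  # average reported standard error, which tends to be smaller here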

We can see that the residuals \(e_i = y_i - \hat{y}_i\) are correlated with x by plotting \(e_i\) vs \(x_i\):

lmout <- lm(y~x)
residuals <- lmout$residuals
plot(x,residuals)

2.1 Testing

We can formally test whether this is an issue by testing whether the residuals are correlated with the X values (and this means all the X variables when there is more than one). Once again the simplest way to test this is with an F test where we regress the squared residuals (ie, the size of the residuals) against all the X variables:

# F test version
residlm <- lm(residuals^2 ~ x) # (Use all indep. vars. here)
summary(residlm) 
## 
## Call:
## lm(formula = residuals^2 ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.37 -23.63 -10.09  11.92 219.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -38.009     23.776  -1.599  0.11646   
## x             21.028      7.412   2.837  0.00665 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.13 on 48 degrees of freedom
## Multiple R-squared:  0.1436, Adjusted R-squared:  0.1258 
## F-statistic: 8.049 on 1 and 48 DF,  p-value: 0.006649

Another common test is the Breusch-Pagan test, a Chi-squared test that takes as its input \(n R^2\) from the residual regression, with \(df = k-1\) (where \(k\) counts the coefficients in that regression, so here \(df = 1\)). Ie, the larger the \(R^2\), the more likely it is that the errors are correlated with the X values.

# Breusch-Pagan test using Chi-squared and R2
r2 <- summary(residlm)$r.squared
pchisq(length(x)*r2,1,lower.tail=F) 
## [1] 0.007369277
# Automated Breusch-Pagan test
library(lmtest)
bptest(lmout)
## 
##  studentized Breusch-Pagan test
## 
## data:  lmout
## BP = 7.1807, df = 1, p-value = 0.007369

2.2 Solutions

A common solution is to fix the math in how the coefficient standard errors are calculated. There are a number of minor variants of this approach, but in each method (eg “HC1” in the example below) we adjust the matrix that determines the error variance of the model, which corrects the \(\beta\) covariance matrix and hence the estimated standard errors associated with each \(\beta\) coefficient.

library(sandwich)
coeftest(lmout, vcov = vcovHC(lmout, type="HC1"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.58032    2.66918  0.5921  0.55659  
## x            1.96004    0.98226  1.9954  0.05169 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note the difference in standard errors between the corrected version above and the uncorrected version below:

summary(lmout)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.238 -3.866 -0.041  2.752 16.619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.5803     2.9333   0.539   0.5925  
## x             1.9600     0.9144   2.143   0.0372 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.322 on 48 degrees of freedom
## Multiple R-squared:  0.08736,    Adjusted R-squared:  0.06834 
## F-statistic: 4.594 on 1 and 48 DF,  p-value: 0.03717

But one should be aware that there is often a reason why errors might be correlated with X, and figuring out why sometimes reveals something important about your data that you overlooked. Why would errors in predicting vote intention (eg) increase with the age of the respondent? There might be something important there that should be added to your model.

Another common source of heteroskedasticity is variables that are very non-normally distributed, as is often the case with variables like income that have a huge mass near 0 and a long tail (the richest 1%). In these cases, the best fix is often to log the independent or dependent variable to make its values more normally distributed; this can in turn fix the heteroskedasticity (as verified by one of the tests above), so that you don’t then need to correct the regression output.
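For example, here is a hypothetical sketch (not the lesson data; educ and income are made-up variables) in which the dependent variable has a long right tail and a multiplicative error, so logging it removes the heteroskedasticity flagged by the Breusch-Pagan test:

# Hypothetical skewed-income example with a multiplicative error
educ <- runif(500, 8, 20)
income <- exp(0.2*educ + rnorm(500, 0, 0.5)) # skewed, with a long right tail
bptest(lm(income ~ educ))      # flags heteroskedasticity on the raw scale
bptest(lm(log(income) ~ educ)) # after logging y, typically no longer flagged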

3 Clustered errors

The third issue here is a common modeling problem, but not really a technical error. Often one’s observations are clustered into groups – individuals from different nations, students from different schools, experimental data collected on different days – and if you don’t carefully model the way that individuals cluster into those groups, you can run into problems.

For instance, say you are looking at the relationship between SAT scores and college performance based on a survey of students from three high schools. You find a strong relationship if you regress college grades on SAT scores. But what you didn’t account for is that the three high schools are of very different quality: in reality, students from the good high school simply do much better both on the SATs and in college, and if you controlled for the high school effect, there would be no independent relationship (within schools) between SAT scores and college performance.

As you can see, this is fundamentally a question of getting your model right, and it’s often fixed by just including a fixed effect (“dummy” or binary variable) for each group (high school) as a control variable in your regression. But sometimes you’re not sure if that’s the right approach, or unsure whether the data are truly clustered. So you can test for and even somewhat “correct” for clustered data.

Consider the following example, where we artificially construct data to have clustered errors. There are 3 groups of 100 observations each (n), each group with a different mean for x (smean) but the same standard deviation (ssd).

set.seed(2)
smean <- 1:3     # group means for x
n <- rep(100,3)  # group sizes
ssd <- rep(.1,3) # within-group standard deviations for x
dat <- c()
for(i in 1:3){   # build each group and stack them into one data frame
  dat <- rbind(dat, data.frame(x=rnorm(n[i],smean[i],ssd[i]),g=as.factor(rep(i,n[i]))))
}
dat$y <- dat$x + 5*rnorm(300) # the noise dwarfs the tiny within-group variation in x
plot(y~x,data=dat)

If you regress \(y\) on \(x\), you get a statistically significant coefficient on \(x\):

lmout <- lm(y~x,data=dat)
summary(lmout)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3043  -3.2177  -0.1543   3.5755  12.9660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0704     0.7450   1.437   0.1518  
## x             0.8148     0.3433   2.373   0.0183 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.938 on 298 degrees of freedom
## Multiple R-squared:  0.01855,    Adjusted R-squared:  0.01526 
## F-statistic: 5.632 on 1 and 298 DF,  p-value: 0.01827

When we correct for clustered errors, we are basically fixing our regression to take into account the fact that we have less information than it appears. If, eg, students within a school are highly similar, we effectively have a much smaller \(n\): in the extreme, if \(y\) varies only between schools and not within them, and we have three schools, then our effective \(n\) is something closer to 3!

The error correction approach is similar to what we do with heteroskedasticity: we correct the error covariance matrix, which is the basis of our calculation of the standard errors of the \(\beta\) coefficients. In order to do this, we also need to tell the correction function which variable defines the groups, \(g\):

# Correction
library(multiwayvcov)
coeftest(lmout, vcov = cluster.vcov(lmout, dat$g))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.07043    1.34380  0.7966   0.4263
## x            0.81478    0.58343  1.3965   0.1636

Note that here our corrected standard error moves the \(x\) coefficient from significant to insignificant, though the coefficient value itself does not change. In other cases, though, the coefficient can change from insignificant to significant. What this is telling us in this case is that there is no detectable relationship between \(x\) and \(y\) once we account for the groups – within each group, \(x\) varies too little relative to the noise for its effect to show up – which matches how we constructed the data in the first place.

However, a better approach is simply to model the individual (x) and group (g) effects together, so that we capture both. Here we use a dummy (binary) variable for each group g (automatically provided by R; note that R omits one of the groups, since otherwise the three binary variables would be perfectly collinear!).

lmout <- lm(y~x + g,data=dat)
summary(lmout)
## 
## Call:
## lm(formula = y ~ x + g, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8218  -3.5174  -0.0891   3.3534  12.5836 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.80536    2.66369   1.429    0.154
## x           -1.24174    2.62735  -0.473    0.637
## g2          -0.05078    2.73041  -0.019    0.985
## g3           4.21421    5.34440   0.789    0.431
## 
## Residual standard error: 4.843 on 296 degrees of freedom
## Multiple R-squared:  0.06223,    Adjusted R-squared:  0.05272 
## F-statistic: 6.547 on 3 and 296 DF,  p-value: 0.000267

Once again we see that the \(x\) variable has no detectable effect when controlling for \(g\), and though it is not obvious from the individual \(g\) dummy coefficients (g2 and g3), the F test for the total regression shows that the variables are jointly significantly related to \(y\) – ie, via the \(g\) groups.
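One way to see the contribution of the group dummies explicitly is a partial F test comparing the models with and without \(g\) (a quick sketch, not part of the original lesson):

# Partial F test: do the group dummies jointly improve on the x-only model?
anova(lm(y ~ x, data=dat), lm(y ~ x + g, data=dat))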