This lesson introduces some ideas in causal analysis.
After completing this module, students should be able to:
distinguish chain, spurious, and direct-and-indirect causal relationships among variables; explain why regression alone cannot tell a chain relationship apart from a spurious one; and use simulated data in R to see how adding or omitting control variables changes regression estimates.
In the previous module we talked about multiple independent variables as “controlling for” each other. That is, a bivariate regression might suggest that education boosts income, but if we control for parental wealth, perhaps we find that there is no relationship between education and income. In that circumstance, we might think that parental wealth causes both education and (the child’s) income, so that once we compare people with the same parental wealth, any remaining variation in education is uncorrelated with variation in income: two people with the same parental wealth are likely to have similar incomes, regardless of their education levels.
Another example might be how income appears to affect ideology when you run a bivariate regression, but if you control for party ID, the effect of income disappears. In this case, it’s not that the new variable (party ID) causes both the other X and the Y, but rather that the new variable intervenes: income seems to cause party ID (richer -> more Republican) and party ID affects ideology, but if you control for party ID – ie, if you compare two people with the same partisanship – any extra variation in income is uncorrelated with ideology.
These are two different causal pathways, although in both cases the regression results are the same: when we regress Y on X1, we get a strong relationship; but when we add X2, the effect of X1 disappears. Sometimes it’s the case that the effect of X1 disappears because X2 causes both X1 and Y. In this case, we have a spurious relationship between X1 and Y. (This is often what people mean by “correlation does not imply causation”.) Alternatively, sometimes it’s the case that the effect of X1 disappears when you control for X2 because the effect of X1 on Y is via X2, and when you control for X2, variations in X1 are unassociated with variations in Y. In this case, we have a chain relationship, where in a sense X1 does cause Y (ie, if you made some policy to boost X1, Y would change as a result), but not when controlling for X2.
Of course, there is also a third possibility, where X1 affects Y partly directly and partly via X2. Then X1 has both a direct and an indirect effect on Y. Let’s look at these relationships graphically to make this clearer.
Here are the basic causal pathways we have discussed above:
Chain:
\(X_1 \rightarrow X_2 \rightarrow Y\)
Spurious:
\(X_1 \leftarrow X_2 \rightarrow Y\)
Direct and indirect:
\(X_1 \rightarrow Y\) and \(X_1 \rightarrow X_2 \rightarrow Y\)
The important thing is that there is no way to distinguish the first two cases from the regression itself – only your expert judgment can determine whether the effect of X1 on Y disappears when controlling for X2 because X2 intervenes, or because X2 causes both. Regression doesn’t see the direction of the arrows, just which variables are correlated with which – and if you can’t see the arrow directions, the chain and spurious diagrams look the same. If you find that education seems to cause happiness based on a bivariate regression, but that the effect of education disappears when you control for income, there are two possibilities: first, that education makes you happier because it boosts your income (an intervening variable); and second, that a higher income allows you to afford more education as well as making you happier (a shared cause, ie a spurious relationship). Which of these is the real answer can’t be settled statistically without an experiment (where, for instance, you randomly boost people’s educations or incomes); with multiple regression, all we can say is that it is probably one or the other, and deciding which is up to our expert substantive judgment.
Let’s illustrate this briefly using R. What we’ll do is create our X1, X2, and Y according to one of the pathways above, so that we – playing god – know what the true causal pathway is. Then we can use regression to see what the statistical method can pick up.
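(If you want your own numbers to be reproducible from run to run, you can set the random seed first; the particular seed value below is arbitrary, and your results will in any case differ slightly from the output shown here.)
set.seed(1234)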
First let’s start with x1:
x1 <- rnorm(1000)
Then let’s create a chain relationship, x1 -> x2 -> y1
x2 <- 3*x1 + 5 * rnorm(1000)   # x1 -> x2
y1 <- 2*x2 + 5 * rnorm(1000)   # x2 -> y1; x1 affects y1 only through x2
What happens if we now regress y1 on x1?
summary(lm(y1~x1))
Call:
lm(formula = y1 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-34.586 -6.909 0.221 7.464 36.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4763 0.3475 -1.371 0.171
x1 5.7580 0.3558 16.181 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.98 on 998 degrees of freedom
Multiple R-squared: 0.2078, Adjusted R-squared: 0.207
F-statistic: 261.8 on 1 and 998 DF, p-value: < 2.2e-16
It looks significant. But if we include x2:
summary(lm(y1~x1+x2))
Call:
lm(formula = y1 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-15.8178 -3.2932 -0.0214 3.2289 14.6200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.23588 0.15067 -1.566 0.118
x1 0.07407 0.17685 0.419 0.675
x2 2.00780 0.03057 65.688 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.761 on 997 degrees of freedom
Multiple R-squared: 0.8513, Adjusted R-squared: 0.851
F-statistic: 2854 on 2 and 997 DF, p-value: < 2.2e-16
The effect of x1 disappears controlling for x2.
Incidentally, why is the coefficient on x1 in the first (bivariate) regression about 6? Well, x1 only affects y1 via x2: a one-unit change in x1 boosts x2 by 3 units, and a one-unit change in x2 boosts y1 by 2 units, so a one-unit change in x1 boosts y1 by 3 × 2 = 6 units.
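We can check this arithmetic against the fitted slope directly, using the objects created above (coef() simply extracts the estimated coefficients from a fitted model):
coef(lm(y1 ~ x1))["x1"]   # should be close to 3 * 2 = 6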
What if we instead make the relationship between x1 and y spurious? Here x2 causes both: x2 -> y2 and x2 -> x1.
x2 <- rnorm(1000)              # x2 is the shared cause
x1 <- 3*x2 + 2 * rnorm(1000)   # x2 -> x1
y2 <- x2 + 2*rnorm(1000)       # x2 -> y2; x1 has no effect on y2
Now what happens when we regress y2 on x1:
summary(lm(y2~x1))
Call:
lm(formula = y2 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-6.7178 -1.3353 -0.0939 1.3129 6.5162
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09047 0.06324 1.431 0.153
x1 0.24380 0.01724 14.143 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.997 on 998 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.1661
F-statistic: 200 on 1 and 998 DF, p-value: < 2.2e-16
It again looks significant. But again if we include x2:
summary(lm(y2~x1+x2))
Call:
lm(formula = y2 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6.9062 -1.2325 -0.1001 1.2621 6.8010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.089527 0.060408 1.482 0.139
x1 0.001433 0.029628 0.048 0.961
x2 1.079373 0.109694 9.840 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.908 on 997 degrees of freedom
Multiple R-squared: 0.2407, Adjusted R-squared: 0.2392
F-statistic: 158 on 2 and 997 DF, p-value: < 2.2e-16
Now x1 is again not significant, but this time it is because the relationship is spurious rather than a chain. Regression can’t tell the difference – that’s up to you.
Incidentally, the coefficient on x2 in the second regression is accurate – about 1. But why is the coefficient on x1 in the bivariate regression so small? Regression estimates are of course never exact, due to random variation, but what do you think the bivariate coefficient on x1 should be, exactly, given how we constructed our data?
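If you want to check your answer numerically, remember that a bivariate OLS slope is just cov(x, y) / var(x); here is a quick sketch using the objects above:
cov(x1, y2) / var(x1)       # the bivariate slope implied by this sample
coef(lm(y2 ~ x1))["x1"]     # the regression estimate, for comparison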
Of course, it can be (and often is) the case that x2 does not completely intervene, so that x1 causes y both directly and via x2.
x1 <- rnorm(1000)
x2 <- 3*x1 + 5*rnorm(1000)        # x1 -> x2
y3 <- x1 + 2*x2 + 5*rnorm(1000)   # x1 -> y3 directly, and also via x2
Multiple regression again gives us the correct results – 1 for x1 and 2 for x2 – in the sense that the true coefficient values are usually within the 95% CIs of the regression estimates (we check this directly after the output below):
summary(lm(y3~x1+x2))
Call:
lm(formula = y3 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-14.8934 -3.4466 -0.1599 3.3703 15.7304
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12950 0.16144 0.802 0.423
x1 1.09338 0.19039 5.743 1.24e-08 ***
x2 1.98512 0.03282 60.488 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.099 on 997 degrees of freedom
Multiple R-squared: 0.8526, Adjusted R-squared: 0.8523
F-statistic: 2883 on 2 and 997 DF, p-value: < 2.2e-16
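To check the claim about the 95% CIs directly, confint() prints the interval for each coefficient; the true values (1 for x1, 2 for x2) should usually – though not always – fall inside them:
confint(lm(y3 ~ x1 + x2))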
But note again that if we leave off x2, we get incorrect results for x1:
summary(lm(y3~x1))
Call:
lm(formula = y3 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-34.581 -7.344 -0.737 6.943 37.731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1775 0.3485 -0.509 0.611
x1 7.3410 0.3454 21.251 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.01 on 998 degrees of freedom
Multiple R-squared: 0.3115, Adjusted R-squared: 0.3108
F-statistic: 451.6 on 1 and 998 DF, p-value: < 2.2e-16
Again – can you guess what the exact coefficient for x1 should be in the bivariate regression?
Things can in fact get even messier – much messier. Here’s one last example, and I leave it to you to puzzle out what’s going on (you can use the same logic as in the previous exercises). In this case, we have x1 causing x2 and x3, and x2 and x3 both causing y4 (but x1 has no direct effect on y4; try drawing out the arrows for yourself).
x1 <- rnorm(1000)               # x1 is the common cause
x2 <- 2*x1 + 5*rnorm(1000)      # x1 -> x2
x3 <- -3*x1 + 5*rnorm(1000)     # x1 -> x3 (negative effect)
y4 <- x2 + x3 + 5*rnorm(1000)   # x2 and x3 -> y4; no direct x1 -> y4 arrow
If we regress y4 on all three variables, we recover roughly the correct coefficients on x2 and x3 (the true values are both 1), while the coefficient on x1 – whose true direct effect is zero – is small, although in this particular sample it happens to sneak under the 0.05 threshold by chance:
summary(lm(y4~x1+x2+x3))
Call:
lm(formula = y4 ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-20.5238 -3.1744 -0.2375 3.1325 12.5913
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07125 0.15244 -0.467 0.6403
x1 0.43860 0.19067 2.300 0.0216 *
x2 0.93968 0.03000 31.318 <2e-16 ***
x3 1.01394 0.03056 33.180 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.816 on 996 degrees of freedom
Multiple R-squared: 0.6897, Adjusted R-squared: 0.6887
F-statistic: 737.9 on 3 and 996 DF, p-value: < 2.2e-16
But if we run the bivariate regressions, we get “suppressed” results, where the regression coefficients are somewhat wrong (ie, the true values are in fact not within the 95% CIs of the estimates):
summary(lm(y4~x2))
Call:
lm(formula = y4 ~ x2)
Residuals:
Min 1Q Median 3Q Max
-25.8871 -5.0121 -0.0907 4.9744 27.8141
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.12136 0.23620 -0.514 0.607
x2 0.79851 0.04339 18.403 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.462 on 998 degrees of freedom
Multiple R-squared: 0.2534, Adjusted R-squared: 0.2526
F-statistic: 338.7 on 1 and 998 DF, p-value: < 2.2e-16
summary(lm(y4~x3))
Call:
lm(formula = y4 ~ x3)
Residuals:
Min 1Q Median 3Q Max
-24.5475 -4.6520 0.0397 4.7411 21.3664
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.29893 0.22373 -1.336 0.182
x3 0.84014 0.03799 22.113 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.075 on 998 degrees of freedom
Multiple R-squared: 0.3289, Adjusted R-squared: 0.3282
F-statistic: 489 on 1 and 998 DF, p-value: < 2.2e-16
Why do you think that is? I’ll leave the calculation of the exact coefficients to you…
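One hint, if you want it: think about how x2 and x3 are related to each other by construction (both are driven by x1, with opposite signs), and recall again that a bivariate slope is just cov(x, y) / var(x).
cor(x2, x3)   # x2 and x3 share x1 as a cause with opposite signs, so they are negatively correlated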
The important thing to bear in mind is that, while leaving out variables can cause big problems – either seeing causation when it’s not there, or thinking it’s not there when it is – adding control variables rarely hurts, and often solves these problems. For instance, including x1 in the regression of y4 on x2 and x3 above doesn’t impair our estimates of the coefficients on x2 and x3, even though x1 has no direct effect on y4 (its estimated coefficient is small, as we saw above). So if we’re not sure, adding the variable into the regression can be a good idea, even if it appears to have no effect on y. Of course, too many variables lead to problems as well, as we will return to in later modules…
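If you want to convince yourself that a harmless extra control really is harmless, try adding a variable that by construction has nothing to do with anything else; z below is a new variable invented purely for this check:
z <- rnorm(1000)                     # pure noise, unrelated to x1, x2, x3, and y4
summary(lm(y4 ~ x1 + x2 + x3 + z))
Its coefficient should come out near zero and (apart from the usual 5% false-positive rate) non-significant, while the estimates for x2 and x3 should be essentially unchanged.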