Overview
This lesson introduces some ideas in causal pathway analysis and variable selection.
Objectives
After completing this module, students should be able to:
Reading
NA
In the previous module we talked about multiple independent variables as “controlling for” each other. That is, a bivariate regression might suggest that education boosts income, but once we control for parental wealth, perhaps we find that there is no remaining relationship between education and income. In that circumstance, we might think that parental wealth causes both education and (the child’s) income, so that any extra variation in education is uncorrelated with variation in income: two people with the same parental wealth are likely to have similar incomes, regardless of their education levels.
Another example might be how income appears to affect ideology when you run a bivariate regression, but if you control for party ID, the effect of income disappears. In this case, it’s not that the new variable (party ID) causes both the other X and the Y, but rather that the new variable intervenes: income seems to cause party ID (richer -> more Republican) and party ID affects ideology, but if you control for party ID – ie, if you compare two people with the same partisanship – any extra variation in income is uncorrelated with ideology.
These are two different causal pathways, although in both cases the regression results are the same: when we regress Y on X1, we get a strong relationship; but when we add X2, the effect of X1 disappears. Sometimes it’s the case that the effect of X1 disappears because X2 causes both X1 and Y. In this case, we have a spurious relationship between X1 and Y. (This is often what people mean by “correlation does not imply causation”.) Alternatively, sometimes it’s the case that the effect of X1 disappears when you control for X2 because the effect of X1 on Y is via X2, and when you control for X2, variations in X1 are unassociated with variations in Y. In this case, we have a chain relationship, where in a sense X1 does cause Y (ie, if you made some policy to boost X1, Y would change as a result), but not when controlling for X2.
Of course, there is also a third possibility, where X1 affects Y partly directly, and partly via X2. Then we have a direct and an indirect effect of X1. Let’s look at these relationships graphically to make this clearer.
Here are the basic causal pathways we have discussed above:
Chain:
\(X_1 \rightarrow X_2 \rightarrow Y\)
Spurious:
\(X_1 \leftarrow X_2 \rightarrow Y\)
Direct and indirect:
\(X_1 \longrightarrow Y\) (the direct path)
\(X_1 \rightarrow X_2 \rightarrow Y\) (the indirect path, via \(X_2\))
The important thing is that there is no way to distinguish the first two cases from each other from the regression itself – only your expert judgment can determine whether the disappearance of the effect of X1 on Y when controlling for X2 is because X2 intervenes, or because X2 causes both. Regression doesn’t see the direction of the arrows, just which variables are correlated with which – and if you can’t see the arrow directions, the chain and spurious diagrams look the same. If you find that education seems to cause happiness based on a bivariate regression, but that when you control for income the effect of education disappears, there are two possibilities: first, that education makes you happier because it boosts your income (intervening); and second, that a higher income allows you to afford more education as well as making you happier (shared cause, or a spurious relationship). Which of these is right can’t be settled statistically without an experiment (where, for instance, you randomly boost people’s educations or incomes); with multiple regression, all we can say is that it is probably one or the other, and which one is up to our expert substantive judgment.
Let’s illustrate this briefly using R. We’ll create our X1, X2, and Y according to one of the pathways above, so that we (playing god) know what the true causal pathway is, and then use regression to see what the statistical method can pick up.
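(One practical note, which is my addition: because the data below are generated with rnorm(), your numbers will differ a little from the output shown every time you run the code. If you want to reproduce a particular run exactly, set R’s random seed first; the specific number is arbitrary.)

set.seed(42) # optional: makes the rnorm() draws, and hence the regression output, repeatable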
First let’s start with x1:
x1 <- rnorm(1000)
Then let’s create a chain relationship, x1 -> x2 -> y1
x2 <- 3*x1 + 5 * rnorm(1000)
y1 <- 2*x2 + 5 * rnorm(1000)
What happens if we now regress y1 on x1?
summary(lm(y1~x1))
Call:
lm(formula = y1 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-37.512 -7.795 0.206 7.944 33.958
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1832 0.3706 0.494 0.621
x1 6.5527 0.3531 18.559 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.71 on 998 degrees of freedom
Multiple R-squared: 0.2566, Adjusted R-squared: 0.2558
F-statistic: 344.4 on 1 and 998 DF, p-value: < 2.2e-16
It looks significant. But if we include x2:
summary(lm(y1~x1+x2))
Call:
lm(formula = y1 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-18.0952 -3.2483 -0.0321 3.4745 13.8089
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.26213 0.16352 1.603 0.1092
x1 0.34060 0.18335 1.858 0.0635 .
x2 1.99480 0.03104 64.259 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.168 on 997 degrees of freedom
Multiple R-squared: 0.8554, Adjusted R-squared: 0.8551
F-statistic: 2949 on 2 and 997 DF, p-value: < 2.2e-16
The effect of x1 disappears controlling for x2.
Incidentally, why is the coefficient on x1 in the first (bivariate) regression about 6? Well, since x1 only affects y1 via x2, and a one-unit change in x1 boosts x2 by 3 units while a one-unit change in x2 boosts y1 by 2 units, a one-unit change in x1 boosts y1 by 3*2 = 6 units.
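We can check this numerically (a quick sketch of my own; the estimate will wobble a bit from run to run):

# implied slope along the chain: (effect of x1 on x2) * (effect of x2 on y1)
3 * 2
# compare with the estimated bivariate slope
coef(lm(y1 ~ x1))["x1"]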
What if we make the relationship between x1 and y spurious instead? Here we have x2 -> y2 and x2 -> x1
x2 <- rnorm(1000)
x1 <- 3*x2 + 2 * rnorm(1000)
y2 <- x2 + 2*rnorm(1000)
Now what happens when we regress y2 on x1:
summary(lm(y2~x1))
Call:
lm(formula = y2 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-6.6995 -1.4377 -0.1365 1.4196 7.4402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01048 0.06569 -0.16 0.873
x1 0.18684 0.01820 10.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.077 on 998 degrees of freedom
Multiple R-squared: 0.0955, Adjusted R-squared: 0.09459
F-statistic: 105.4 on 1 and 998 DF, p-value: < 2.2e-16
It again looks significant. But again if we include x2:
summary(lm(y2~x1+x2))
Call:
lm(formula = y2 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-6.3379 -1.3230 -0.1152 1.3011 7.2530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01414 0.06332 -0.223 0.823
x1 -0.04956 0.03211 -1.543 0.123
x2 1.00593 0.11445 8.789 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.002 on 997 degrees of freedom
Multiple R-squared: 0.1605, Adjusted R-squared: 0.1589
F-statistic: 95.34 on 2 and 997 DF, p-value: < 2.2e-16
Now x1 is again not significant – but this time it is because the relationship is spurious rather than a chain. Regression can’t tell the difference – that’s up to you.
Incidentally, the coefficient on x2 in the second regression is accurate – about 1. But why is the coefficient on x1 in the bivariate regression so small? Obviously the regression estimates are never exact due to random variation. What do you think the bivariate coefficient on x1 should exactly be based on how we constructed our data?
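If you would rather check your answer than derive it, one option (a sketch of my own; the _big names are just placeholders) is to regenerate the same spurious structure with a much larger sample, where the estimate will sit very close to its true value:

x2_big <- rnorm(1e6)
x1_big <- 3*x2_big + 2*rnorm(1e6)
y2_big <- x2_big + 2*rnorm(1e6)
coef(lm(y2_big ~ x1_big))["x1_big"] # compare this to your pencil-and-paper answer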
Of course, it can be (and often is) the case that x2 does not completely intervene, and that x1 causes y both directly and via x2.
x1 <- rnorm(1000)
x2 <- 3*x1 + 5*rnorm(1000)
y3 <- x1 + 2*x2 + 5*rnorm(1000)
Multiple regression again gives us the correct results – ie, the true coefficient values (1 for x1 and 2 for x2) are usually within the 95% CIs of the regression estimates:
summary(lm(y3~x1+x2))
Call:
lm(formula = y3 ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-17.2249 -3.1539 0.0805 3.2091 16.9313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07378 0.15752 -0.468 0.64
x1 0.93590 0.18548 5.046 5.37e-07 ***
x2 2.00290 0.03155 63.486 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.979 on 997 degrees of freedom
Multiple R-squared: 0.8626, Adjusted R-squared: 0.8624
F-statistic: 3130 on 2 and 997 DF, p-value: < 2.2e-16
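You can check the claim about the 95% CIs directly with R’s confint() function (a small aside of my own):

confint(lm(y3 ~ x1 + x2)) # the true values (1 for x1, 2 for x2) should usually fall inside these intervals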
But note again that if we leave off x2, we get incorrect results for x1:
summary(lm(y3~x1))
Call:
lm(formula = y3 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-34.985 -7.275 -0.128 8.032 34.690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08907 0.35354 -0.252 0.801
x1 7.34742 0.34919 21.041 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.18 on 998 degrees of freedom
Multiple R-squared: 0.3073, Adjusted R-squared: 0.3066
F-statistic: 442.7 on 1 and 998 DF, p-value: < 2.2e-16
Again – can you guess what the exact coefficient for x1 should be in the bivariate regression?
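As a hint (my addition, using the same path logic as before): when x1 affects y both directly and through x2, the bivariate slope combines the two paths,

\(\text{bivariate slope} = \text{direct effect of } x_1 + (\text{effect of } x_1 \text{ on } x_2) \times (\text{effect of } x_2 \text{ on } y)\)

so you can plug in the numbers used to construct the data above.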
Things can in fact get even messier – much messier. Here’s one last example, and I leave it to you to puzzle out what’s going on (you can use the same logic as in the previous exercises). In this case, we have x1 causing x2 and x3, and x2 and x3 both causing y4 (but x1 has no direct effect on y4; try drawing out the arrows for yourself).
x1 <- rnorm(1000)
x2 <- 2*x1 + 5*rnorm(1000)
x3 <- -3*x1 + 5*rnorm(1000)
y4 <- x2 + x3 + 5*rnorm(1000)
If we regress y4 on x1, x2, and x3 together, we get the correct coefficients (and correctly find no direct effect of x1):
summary(lm(y4~x1+x2+x3))
Call:
lm(formula = y4 ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-15.0330 -3.4876 -0.3498 3.3580 18.6470
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01693 0.15571 -0.109 0.913
x1 0.16974 0.19980 0.850 0.396
x2 0.95305 0.03196 29.824 <2e-16 ***
x3 1.00602 0.03081 32.650 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.92 on 996 degrees of freedom
Multiple R-squared: 0.6761, Adjusted R-squared: 0.6751
F-statistic: 693 on 3 and 996 DF, p-value: < 2.2e-16
But if we do our bivariate regressions, we get “suppressed” results, where our regression coefficients are somewhat wrong (ie, the truth is in fact not within the 95% CI of the regression estimates):
summary(lm(y4~x2))
Call:
lm(formula = y4 ~ x2)
Residuals:
Min 1Q Median 3Q Max
-23.2185 -5.2256 0.0914 5.1671 24.6298
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.23517 0.24068 -0.977 0.329
x2 0.78583 0.04637 16.946 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.61 on 998 degrees of freedom
Multiple R-squared: 0.2234, Adjusted R-squared: 0.2227
F-statistic: 287.2 on 1 and 998 DF, p-value: < 2.2e-16
summary(lm(y4~x3))
Call:
lm(formula = y4 ~ x3)
Residuals:
Min 1Q Median 3Q Max
-24.1383 -4.5962 0.1856 4.8801 23.9326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.11306 0.22042 -0.513 0.608
x3 0.86191 0.03723 23.150 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.966 on 998 degrees of freedom
Multiple R-squared: 0.3494, Adjusted R-squared: 0.3487
F-statistic: 535.9 on 1 and 998 DF, p-value: < 2.2e-16
Why do you think that is? I’ll leave the calculation of the exact coefficients to you…
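A hint for working it out (my addition, using the standard omitted-variable logic): when you regress y4 on x2 alone, the slope you get is approximately

\(\beta_{x_2} + \beta_{x_3} \times (\text{slope from regressing } x_3 \text{ on } x_2)\)

Because x1 pushes x2 and x3 in opposite directions, x2 and x3 are negatively correlated, so that second term is negative and drags both bivariate estimates below their true values of 1.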
The important thing to bear in mind is that, while leaving out variables can cause big problems – either seeing causation when it’s not there, or thinking it’s not there when it is – adding control variables rarely hurts, and often solves these problems. For instance, including x1 in the three-variable regression above didn’t impair our estimates of the coefficients on x2 and x3, and we correctly got a non-effect for x1. So if we’re not sure, adding the variable to the regression can be a good idea, even if it appears to have no effect on y. Of course, too many variables lead to problems of their own, as we will see in later modules…
How does one choose which variables to include in the model? Generally, we begin with a dependent variable we are interested in explaining, and perhaps a few independent variables we think may affect it. But if we have lots of variables to choose from – as we do with the ANES data – how do we pick which ones to include on the right-hand side?
There have been a number of attempts to answer this question, but none of them is entirely satisfactory. The approaches fall into a few categories: algorithmic (following some repeated procedure); criterion-based (optimizing some overall measure, like the adjusted \(R^2\)); or substantive.
In the first category we have stepwise methods. Backward elimination starts with all the plausible variables and removes, one at a time, those with a p value above your alpha level, until all remaining variables are significant. As you might imagine, this is hard to implement if you have dozens or hundreds of possible variables. The reverse is forward selection, where you add variables one at a time, testing at each step every not-yet-added variable (ie, adding it to the ones you already have) and keeping the one with the lowest p value. This too can become cumbersome with many variables. Stepwise regression sits in between these two: you add variables with low p values and drop those whose p values rise too high, iterating until the model settles down. This is both laborious and prone to over-fitting (ie, models that have been cherry-picked to fit the data).
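Incidentally, R’s built-in step() function automates a closely related procedure, adding and dropping terms based on the AIC (discussed below) rather than on individual p values. A quick sketch, reusing the simulated y4 data from above (the model is purely illustrative):

full_mod <- lm(y4 ~ x1 + x2 + x3)
step(full_mod, direction = "backward") # drops terms one at a time as long as the AIC keeps improving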
Criterion-based approaches work not at the level of individual coefficients, but on the model as a whole. One simple approach is to use the adjusted \(R^2\). While the \(R^2\) by itself is no good, since it never falls when you add variables, the adjusted \(R^2\) will fall if you add a new variable that does not contribute more explanatory power than you would expect by chance. Thus you add and remove variables until you find the set with the largest adjusted \(R^2\). But this too can lead to over-fitting, and is also hard to execute systematically with lots of variables. Other criteria include the Akaike Information Criterion (AIC), which, like the adjusted \(R^2\), penalizes the model for adding extra variables, and the Bayesian Information Criterion (BIC), whose penalty grows with both the number of variables and the sample size \(n\) (since a bigger \(n\) can sometimes spuriously increase model fit). In R, the AIC and BIC of a fitted lm() model can be computed with the AIC() and BIC() functions, which in principle allows comparison between models with different sets of independent variables, although here too there is a danger of over-interpretation. Furthermore, the best-fitting model may not be the one with the best explanation.
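For instance (a sketch of mine, again reusing the simulated y4 data; the two candidate models are arbitrary illustrations), fitted models can be compared directly:

mod_small <- lm(y4 ~ x2 + x3)
mod_full <- lm(y4 ~ x1 + x2 + x3)
AIC(mod_small, mod_full) # lower is better; the penalty discourages unnecessary variables
BIC(mod_small, mod_full) # BIC's penalty grows with the sample size as well as the number of parameters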
Substantively, perhaps the best route is to select a model with the variables you believe play important roles, either for the dependent variable or among the independent variables. For instance, in our final regression where we included the party ID measure, we saw that the income variable suddenly became insignificant. By many of these criteria we would then drop that variable – but substantively, that would be a mistake. First, because we are actually interested in null results – when things we think might be significant in fact turn out not to be. A null result is often an important result! And second, while income may not be causing the dependent variable, it may play an important role in changing the effect of one of the other variables, such as (in this case) education. So it is often better to include all the variables that you think (based on your expert knowledge) are at play in the system, regardless of whether they are significant or whether they strictly improve the adjusted \(R^2\) or the AIC.