
Causality

Overview

This lesson introduces some ideas in causal analysis.

Objectives

After completing this module, students should be able to:

  1. Distinguish various causal pathways, including spurious causes and chained causation.
  2. Use regression and R to disentangle causal interactions between independent variables.

Readings

NA

Causal pathways

In the previous module we talked about multiple independent variables as “controlling for” each other. That is, a bivariate regression might suggest that education boosts income, but once we control for parental wealth, perhaps the relationship between education and income disappears. In that circumstance, we might think that parental wealth causes both education and (the child’s) income: two people with the same parental wealth are likely to have similar incomes regardless of their education levels, so once parental wealth is held constant, any extra variation in education is uncorrelated with variation in income.

Another example might be how income appears to affect ideology when you run a bivariate regression, but if you control for party ID, the effect of income disappears. In this case, it’s not that the new variable (party ID) causes both the original X and the Y, but rather that it intervenes: income seems to cause party ID (richer -> more Republican) and party ID in turn affects ideology, but if you control for party ID – ie, if you compare two people with the same partisanship – any extra variation in income is uncorrelated with ideology.

These are two different causal pathways, although in both cases the regression results are the same: when we regress Y on X1, we get a strong relationship; but when we add X2, the effect of X1 disappears. Sometimes it’s the case that the effect of X1 disappears because X2 causes both X1 and Y. In this case, we have a spurious relationship between X1 and Y. (This is often what people mean by “correlation does not imply causation”.) Alternatively, sometimes it’s the case that the effect of X1 disappears when you control for X2 because the effect of X1 on Y is via X2, and when you control for X2, variations in X1 are unassociated with variations in Y. In this case, we have a chain relationship, where in a sense X1 does cause Y (ie, if you made some policy to boost X1, Y would change as a result), but not when controlling for X2.

Of course, there is also a third possibility, where X1 affects Y partly directly and partly via X2. Then we have a direct and an indirect effect of X1. Let’s look at these relationships graphically to make this clearer.

Graphical analysis

Here are the basic causal pathways we have discussed above:

Chain:

\(X_1 \rightarrow X_2 \rightarrow Y\)

Spurious:

\(X_1 \leftarrow X_2 \rightarrow Y\)

Direct and indirect:

\(X_1 \rightarrow Y\) (direct) and \(X_1 \rightarrow X_2 \rightarrow Y\) (indirect via \(X_2\))

The important thing is that there is no way to distinguish the first two cases from each other using the regression itself – only your expert judgment can determine whether the effect of X1 on Y disappears when controlling for X2 because X2 intervenes, or because X2 causes both. Regression doesn’t see the direction of the arrows, just which variables are correlated with which – and if you can’t see the arrow directions, the chain and spurious diagrams look the same. If you find that education seems to cause happiness based on a bivariate regression, but that when you control for income the effect of education disappears, there are two possibilities: first, that education makes you happier because it boosts your income (intervening); and second, that a higher income allows you to afford more education as well as making you happier (shared cause, or a spurious relationship). Which of these is correct can’t be settled statistically without an experiment (where, for instance, you randomly boost people’s educations or incomes); with multiple regression, all we can say is that it is probably one or the other, and deciding which is up to our expert substantive judgment.
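To see why an experiment settles the question when regression alone cannot, here is a brief sketch in R (the variable names and numbers are purely illustrative, echoing the simulations below). In a spurious world, an x1 that is assigned at random – rather than generated by x2 – shows no relationship with y, whereas the observational x1 does:

x2 <- rnorm(1000)
x1_obs <- 3*x2 + 2*rnorm(1000)          # observational x1: driven by x2 (spurious world)
x1_rand <- rnorm(1000)                  # x1 assigned at random, as in an experiment
y <- x2 + 2*rnorm(1000)                 # y is caused by x2 only
coef(lm(y ~ x1_obs))["x1_obs"]          # clearly non-zero, but entirely spurious
coef(lm(y ~ x1_rand))["x1_rand"]        # near zero: randomization reveals x1 has no effect

In the chain world, by contrast, a randomized x1 would still move y (through x2), so the experiment can tell the two stories apart.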

R simulation: chained causation

Let’s illustrate this briefly using R. What we’ll do is create our X1, X2, and Y according to one of the pathways above, so that we as god know what the true causal pathway is. And then we can use regression to see what the statistical method can pick up.
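A practical note before we start: rnorm() draws fresh random numbers every time it is run, so if you run this code yourself your coefficients will differ slightly from the output shown below. If you want your numbers to be exactly reproducible, you can fix the random seed first (any number will do):

set.seed(42)   # fixes R's random number stream so the draws repeat exactly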

First let’s start with x1:

x1 <- rnorm(1000)

Then let’s create a chain relationship, x1 -> x2 -> y1

x2 <- 3*x1 + 5 * rnorm(1000)
y1 <- 2*x2 + 5 * rnorm(1000)

What happens if we now regress y1 on x1?

summary(lm(y1~x1))

Call:
lm(formula = y1 ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.586  -6.909   0.221   7.464  36.507 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.4763     0.3475  -1.371    0.171    
x1            5.7580     0.3558  16.181   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.98 on 998 degrees of freedom
Multiple R-squared:  0.2078,    Adjusted R-squared:  0.207 
F-statistic: 261.8 on 1 and 998 DF,  p-value: < 2.2e-16

It looks significant. But if we include x2:

summary(lm(y1~x1+x2))

Call:
lm(formula = y1 ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.8178  -3.2932  -0.0214   3.2289  14.6200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.23588    0.15067  -1.566    0.118    
x1           0.07407    0.17685   0.419    0.675    
x2           2.00780    0.03057  65.688   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.761 on 997 degrees of freedom
Multiple R-squared:  0.8513,    Adjusted R-squared:  0.851 
F-statistic:  2854 on 2 and 997 DF,  p-value: < 2.2e-16

The effect of x1 disappears controlling for x2.

Incidentally, why is the coefficient on x1 in the first (bivariate) regression about 6? Well, since x1 only affects y1 via x2, we know that a one-unit change in x1 boosts x2 by three units, and a one-unit change in x2 boosts y1 by 2 units: so a one-unit change in x1 boosts y1 by 3*2 = 6 units.
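If you want to verify this chain arithmetic on the simulated data, one quick (if slightly informal) check is to estimate each link separately and multiply the coefficients – the product should come out close to the bivariate slope of y1 on x1:

b_x1_to_x2 <- coef(lm(x2 ~ x1))["x1"]   # the x1 -> x2 link, about 3
b_x2_to_y1 <- coef(lm(y1 ~ x2))["x2"]   # the x2 -> y1 link, about 2
b_x1_to_x2 * b_x2_to_y1                 # product of the two links, about 6
coef(lm(y1 ~ x1))["x1"]                 # the bivariate slope, also about 6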

Spurious association

What if we instead make the relationship between x1 and y2 spurious? Here x2 causes both: x2 -> x1 and x2 -> y2

x2 <- rnorm(1000)
x1 <- 3*x2 + 2 * rnorm(1000)
y2 <- x2 + 2*rnorm(1000)

Now what happens when we regress y2 on x1:

summary(lm(y2~x1))

Call:
lm(formula = y2 ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7178 -1.3353 -0.0939  1.3129  6.5162 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.09047    0.06324   1.431    0.153    
x1           0.24380    0.01724  14.143   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.997 on 998 degrees of freedom
Multiple R-squared:  0.167, Adjusted R-squared:  0.1661 
F-statistic:   200 on 1 and 998 DF,  p-value: < 2.2e-16

It again looks significant. But again if we include x2:

summary(lm(y2~x1+x2))

Call:
lm(formula = y2 ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9062 -1.2325 -0.1001  1.2621  6.8010 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.089527   0.060408   1.482    0.139    
x1          0.001433   0.029628   0.048    0.961    
x2          1.079373   0.109694   9.840   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.908 on 997 degrees of freedom
Multiple R-squared:  0.2407,    Adjusted R-squared:  0.2392 
F-statistic:   158 on 2 and 997 DF,  p-value: < 2.2e-16

Now x1 is again not significant – but this time it is because the relationship is spurious rather than a chain. Regression can’t tell the difference – that’s up to you.

Incidentally, the coefficient on x2 in the second regression is accurate – about 1. But why is the coefficient on x1 in the bivariate regression so small? Obviously the regression estimates are never exact due to random variation, but what do you think the bivariate coefficient on x1 should be, based on how we constructed our data?

0
No. Zero is the correct answer when controlling for x2, but as you can see from the p-value, the bivariate regression produces a significantly non-zero estimate.
0.10
Nope.
0.33
Yes (approximately). A one-unit change in x2 produces a one-unit change in y2, while x1 moves three units for every one-unit change in x2, so the slope of y2 against x1 comes out at roughly 1/3. (Strictly speaking, because x1 also contains noise unrelated to x2, the expected bivariate slope is cov(x1, y2)/var(x1) = 3/13 ≈ 0.23, which is why the estimate comes in a bit below 1/3.) Either way, the number is causally wrong – x1 actually has no effect on y2 at all – which shows the danger of trusting bivariate regressions!
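If you want to check this against the simulated data, recall that a bivariate OLS slope is exactly the sample covariance divided by the sample variance of the predictor, so the two quantities below should match (and both should land near 3/13 ≈ 0.23):

cov(x1, y2) / var(x1)         # sample covariance over variance of the predictor
coef(lm(y2 ~ x1))["x1"]       # the bivariate regression slope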

Direct and indirect

Of course, it can be (and often is) the case that x2 does not completely intervene, and x1 causes y both directly and via x2.

x1 <- rnorm(1000)
x2 <- 3*x1 + 5*rnorm(1000)
y3 <- x1 + 2*x2 + 5*rnorm(1000)

Multiple regression again gives us the correct results – 1 for x1 and 2 for x2 – in the sense that the true coefficient values are usually within the 95% CIs of the regression estimates:

summary(lm(y3~x1+x2))

Call:
lm(formula = y3 ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.8934  -3.4466  -0.1599   3.3703  15.7304 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.12950    0.16144   0.802    0.423    
x1           1.09338    0.19039   5.743 1.24e-08 ***
x2           1.98512    0.03282  60.488  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.099 on 997 degrees of freedom
Multiple R-squared:  0.8526,    Adjusted R-squared:  0.8523 
F-statistic:  2883 on 2 and 997 DF,  p-value: < 2.2e-16

But note again that if we leave off x2, we get incorrect results for x1:

summary(lm(y3~x1))

Call:
lm(formula = y3 ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.581  -7.344  -0.737   6.943  37.731 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1775     0.3485  -0.509    0.611    
x1            7.3410     0.3454  21.251   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.01 on 998 degrees of freedom
Multiple R-squared:  0.3115,    Adjusted R-squared:  0.3108 
F-statistic: 451.6 on 1 and 998 DF,  p-value: < 2.2e-16

Again – can you guess what the exact coefficient for x1 should be in the bivariate regression?

5
Nope.
6
Nope.
7
Yup. A one-unit change in x1 directly produces a one-unit change in y3 – that’s the direct effect. But in addition, a one-unit change in x1 produces a 3-unit change in x2, and a one-unit change in x2 produces a 2-unit change in y3, so a one-unit change in x1 also produces a 6-unit indirect effect on y3. Thus the total effect of a one-unit change in x1 is a 7-unit change in y3.
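One way to see this decomposition in the data (a quick sketch using the variables we just created) is to add the direct effect from the multiple regression to the indirect effect, estimated as the x1 -> x2 link times the x2 coefficient from the multiple regression; by the standard omitted-variable algebra, the sum equals the bivariate slope exactly:

full <- lm(y3 ~ x1 + x2)
direct <- coef(full)["x1"]                               # direct effect of x1, about 1
indirect <- coef(lm(x2 ~ x1))["x1"] * coef(full)["x2"]   # about 3 * 2 = 6
direct + indirect                                        # total effect, about 7
coef(lm(y3 ~ x1))["x1"]                                  # equals the bivariate slope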

Suppression

Things can in fact get even messier – much messier. Here’s one last example, and I leave it to you to puzzle out what’s going on (you can use the same logic as in the previous exercises). In this case, we have x1 causing x2 and x3, and x2 and x3 both causing y4 (but x1 has no direct effect on y4; try drawing out the arrows for yourself).

x1 <- rnorm(1000)
x2 <- 2*x1 + 5*rnorm(1000)
x3 <- -3*x1 + 5*rnorm(1000)
y4 <- x2 + x3 + 5*rnorm(1000)

If we regress y4 on x1, x2, and x3 together, we get roughly the correct coefficients – about 1 each for x2 and x3, and close to 0 for x1:

summary(lm(y4~x1+x2+x3))

Call:
lm(formula = y4 ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.5238  -3.1744  -0.2375   3.1325  12.5913 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.07125    0.15244  -0.467   0.6403    
x1           0.43860    0.19067   2.300   0.0216 *  
x2           0.93968    0.03000  31.318   <2e-16 ***
x3           1.01394    0.03056  33.180   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.816 on 996 degrees of freedom
Multiple R-squared:  0.6897,    Adjusted R-squared:  0.6887 
F-statistic: 737.9 on 3 and 996 DF,  p-value: < 2.2e-16

But if we run the bivariate regressions, we get "suppressed" results, where the regression coefficients are noticeably wrong (ie, the true value of 1 is not within the 95% CI of either estimate):

summary(lm(y4~x2))

Call:
lm(formula = y4 ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.8871  -5.0121  -0.0907   4.9744  27.8141 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.12136    0.23620  -0.514    0.607    
x2           0.79851    0.04339  18.403   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.462 on 998 degrees of freedom
Multiple R-squared:  0.2534,    Adjusted R-squared:  0.2526 
F-statistic: 338.7 on 1 and 998 DF,  p-value: < 2.2e-16
summary(lm(y4~x3))

Call:
lm(formula = y4 ~ x3)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.5475  -4.6520   0.0397   4.7411  21.3664 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.29893    0.22373  -1.336    0.182    
x3           0.84014    0.03799  22.113   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.075 on 998 degrees of freedom
Multiple R-squared:  0.3289,    Adjusted R-squared:  0.3282 
F-statistic:   489 on 1 and 998 DF,  p-value: < 2.2e-16

Why do you think that is? I’ll leave the calculation of the exact coefficients to you…

The important thing to bear in mind is that, while leaving out variables can cause big problems – either seeing causation when it’s not there, or thinking it’s not there when it is – adding control variables rarely hurts, and often solves these problems. For instance, including x1 alongside x2 and x3 doesn’t impair our estimates of the coefficients on x2 and x3, and the estimate for x1 is close to its true value of zero (though with random data it can, as here, happen to sneak past the 0.05 significance threshold). So if we’re not sure, adding a variable to the regression can be a good idea, even if it appears to have no effect on y. Of course, too many variables lead to problems as well, as we will see in later modules… Here is the full regression once more:

summary(lm(y4~x1+x2+x3))

Call:
lm(formula = y4 ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.5238  -3.1744  -0.2375   3.1325  12.5913 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.07125    0.15244  -0.467   0.6403    
x1           0.43860    0.19067   2.300   0.0216 *  
x2           0.93968    0.03000  31.318   <2e-16 ***
x3           1.01394    0.03056  33.180   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.816 on 996 degrees of freedom
Multiple R-squared:  0.6897,    Adjusted R-squared:  0.6887 
F-statistic: 737.9 on 3 and 996 DF,  p-value: < 2.2e-16