Overview
This lesson introduces multiple regression.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 16; Schumacker, Chapter 17 through p. 402.
In the previous module we introduced bivariate regression for modeling the effect of a single independent variable X on our dependent variable Y. But what do we mean by “the effect”? Most fundamentally, we are interested in causality – what is the effect of changing X on Y? This is both a scientific question – how are they causally related? – and a policy question: how can we change outcome Y by changing policy/action X? If X doesn’t actually cause Y, and instead they are (for instance) both caused by some third factor Z, then we might find a nice relationship between X and Y, but if we actually intervened and boosted X, Y wouldn’t change.
For instance, imagine we regress income on education level in a large, nationally representative sample, and find a statistically significant positive relationship. We might conclude that education boosts income, and from a policy point of view, you might conclude that if you go back to school and get a couple more years of education, your income will go up. But what if some third factor instead affects both? For instance, it may be that harder-working people both get more education and work their way up in any subsequent jobs to get higher pay. Or perhaps people with wealthy parents both can afford more education and have the social connections to get fancier jobs. In both of these cases, we might see a strong correlation and a significant \(\beta\) relating education to income, but we might find that, had we invested the extra years in education, our income would be no higher than our peers with equal work ethics and similarly wealthy parents. That is, the education (in this scenario) didn’t in fact cause the income boost; it just gave the appearance of doing so, because education was also tracking the real causes, hard work and parental income.
This is why multiple regression is so important. With multiple regression, we can control for these other variables, to get the real effect of education on income (for instance). From a policy or causal point of view, we want to know what happens to our dependent variable (income) if we change just a single independent variable (education), leaving all the others constant. In the real world, these variables are not independent – people with higher education are also those who work harder and have wealthier parents – but we can leverage what little variation there is in the data to try to tease out the effect of the variable we care about (education) from the variables we don’t care about or can’t change (parental income; work ethic). This is what multiple regression does.
The multiple regression model is fundamentally very like the bivariate regression model:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \epsilon\]
That is, we assume our dependent variable is a linear function of a number of different independent variables. If you boost one of those independent variables, \(X_i\), by 1 unit (in whatever units that variable is measured in), then Y will increase by \(\beta_i\). E.g.:
\[(\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \beta_1\]
Here, going from a value of \(x_1\) to \(x_1 + 1\), we see that our predicted \(\hat{y}\) increases by \(\beta_1\). This is just saying that \(\beta_1\) is not, in the multiple regression case, like the simple correlation between \(X_1\) and \(Y\), but rather like the partial correlation between \(X_1\) and \(Y\). The partial correlation is the correlation between \(X_1\) and \(Y\) when you hold all the other variables constant; similarly, \(\beta_1\) is the effect of \(X_1\) on \(Y\) when you hold all the other variables constant.
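To make this concrete, here is a minimal simulated sketch in R (fake data, not the ANES): we fit a two-predictor model and check that moving \(x_1\) up by one unit, holding \(x_2\) fixed, shifts the prediction by exactly the estimated \(\beta_1\).
# Simulated illustration: the coefficient on x1 is the change in the
# prediction when x1 increases by 1 with x2 held fixed.
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + 2*x1 - 0.5*x2 + rnorm(100)
m <- lm(y ~ x1 + x2)
p0 <- predict(m, newdata=data.frame(x1=0, x2=0))
p1 <- predict(m, newdata=data.frame(x1=1, x2=0))
p1 - p0 # identical to coef(m)["x1"]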
We’ll turn to the derivation of these coefficients in a moment, but let’s first consider a real-world example in some detail.
Let’s consider some real-world data. Say we are interested in the effect of education on ideology. That is, we want to know the effect of education holding all the other possible confounding variables constant. Ideally, we would do an experiment: take a sample of people, randomly give half of them an extra year of education, and then see what the ideological differences are between the two groups afterwards. Obviously such an approach is often impossible in social research. So instead we turn to surveys. The American National Election Studies (ANES), for instance, surveys a nationally representative sample of Americans every two years and asks them a wide variety of questions about politics and their own backgrounds. The data is available from many places online for free, and ranges back to the survey’s start in 1948. Let’s take this data and prepare it for a multiple regression analysis of the effect of education on ideology.
# First read the giant table, containing hundreds of variables over many decades
# Data obtained from: http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/3/studies/8475
anes_all <- read.table("ANES-1948-2008/ANES-Data.tsv",sep="\t",header=TRUE,stringsAsFactors=FALSE)
# Keep only the latest year, 2008
anes_2008 <- anes_all[anes_all$VCF0004==2008,]
# Keep only a subset of variables
anes_2008tr <- data.frame(age=anes_2008$VCF0101,
                          race_white=anes_2008$VCF0106,
                          gender_male=anes_2008$VCF0104,
                          education=anes_2008$VCF0140a,
                          income=anes_2008$VCF0114,
                          ideology_con=anes_2008$VCF0803,
                          partyid_rep=anes_2008$VCF0301,
                          vote_rep=anes_2008$VCF0704a,
                          voted=anes_2008$VCF0702)
Before we can move on to the regression stage, let’s spend a little time going over how we prepare the data. So far we’ve been cheating a bit and working mainly with fake or very simple datasets, so it’s worth spending a little time now on the tools for working with real data.
Note that ANES, like many large surveys, has unhelpful variable names, mainly because there are nearly 1000 of them in the full dataset. To figure out which ones I wanted, I consulted the codebook that comes with the dataset, which explains each variable and what its numeric codes stand for. For instance, there are actually a number of different questions in the dataset that cover education, using a variety of different sub-categories. I found and chose to use the one with the most categories, VCF0140a:
VCF0140a R Education 7-category
=============================================================================
QUESTION:
---------
1952-1972: How many grades of school did you finish?
1974 AND LATER: What is highest grade of school or year of college you have completed? Did you get a high school diploma or pass a high school equivalency test?
1974,1976: Do you have a college degree? (IF YES:) What degree is that?
1978-1984: Do you have a college degree? (IF YES:) What is the highest degree that you have earned?
1986 AND LATER: What is the highest degree that you have earned?
VALID_CODES:
------------
1. 8 grades or less ('grade school')
2. 9-12 grades ('high school'), no diploma/equivalency
3. 12 grades, diploma or equivalency
4. 12 grades, diploma or equivalency plus non-academic training
5. Some college, no degree; junior/community college level degree (AA degree)
6. BA level degrees
7. Advanced degrees incl. LLB
MISSING_CODES:
--------------
8. DK
9. NA; RF; no Pre IW; short-form 'new' Cross Section (1992); Inap. question not used
Note the presence of the “DK” (don’t know) and NA (not available) categories. These are annoying but important to watch out for. As you can see, the education level increases with the code number from 1-7, but 8 and 9 are DK and NA; if we just threw the data in as coded, we would be mixing the effect of going up one unit (e.g., from Some College to a BA) with the effect of going “up” from Advanced Degrees (7) to Don’t Know (8). Obviously the latter would be a mistake, since we don’t think the DKs are actually more educated than the Advanced Degrees. So we need to either drop the 8s and 9s, or recode them as the mean value, which shouldn’t bias our outcomes too much.
So before we move on to the analysis, we need to recode all the DKs, NAs, and the rest. Again, some people prefer to drop any observation with an NA or DK in it, but as you can imagine, the problem is that if you have 10 variables and on average each person skips one question, then everyone gets dropped and you have no data! There are entire courses on how best to impute these missing values, but probably the safest route is to use the mean value for that variable, which at worst tends to degrade your data without biasing your results.
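Here is a minimal sketch of that mean-imputation step in R, using a made-up vector x with education-style missing codes (8 = DK, 9 = NA):
x <- c(1, 3, 8, 5, 9, 2)          # made-up responses
missing <- x %in% c(8, 9)         # flag the DK/NA codes
x[missing] <- mean(x[!missing])   # replace them with the mean of the rest
x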
So in the next stage, we recode each variable in a couple of ways. First, we reassign the DK/NAs to the mean for each variable. Second, we recode and rename each variable so that it is clear exactly what it means. For instance, a variable labeled “gender” is confusing to interpret: is it coded as 1 for male and 2 for female; or 0 for female and 1 for male; or what? If we see a coefficient, we won’t know how to interpret it; we know it means a 1-unit increase in X produces a \(\beta\) increase in Y, but what is the unit of X?
Thus we need to be as careful as we can in naming our variables. As you can see in the next slide, I create a race variable called race_white, a binary variable that equals 0 for all non-white respondents and 1 for white respondents. I do the same for gender_male, and in both cases this requires recoding the ANES 1/2 scale to a more informative 0/1 scale, where you know exactly from the name of the variable what it means. This is less clear for education, income, and ideology, but at least for the first two we know that increasing values always mean more (education or income), while for ideology, I call it ideology_con to remind us that a higher value is more conservative (as the ANES decided to code it).
Similarly, partyid_rep is the party ID of the respondent, with Republican being the higher value; vote_rep is whether they voted for the Republican presidential candidate (1) in the last election or the Democrat (0); and voted is a binary indicating whether they voted in the last election (1) or not (0).
Again, note how the direction and basic meaning of these variables is fairly clear from their names, which is essential to keeping track of your results during complex analyses. But also note that fundamentally we will always need to keep our coding scheme handy. For instance, while we’ve defined education in the previous slide, how is income coded? Rather than raw dollars, it turns out that for comparison across years (where inflation changes things) ANES has coded it by percentiles:
VALID_CODES:
------------
1. 0 to 16 percentile
2. 17 to 33 percentile
3. 34 to 67 percentile
4. 68 to 95 percentile
5. 96 to 100 percentile
So in interpreting what a 1-unit change means (ie, what \(\beta_{income}\) means), we have to bear in mind that it is a change in the percentile bracket of the respondent, not raw income.
More generally, the point here is not to learn the ANES coding scheme, but that real-world data will always have these coding oddities, and much of our time is actually spent tidying it up for our particular needs before we can proceed. In particular, dealing with DK/NA’s and recoding data into sensible, well-named variables is essential. There are many different ways to do it and mostly it is up to you, but the important thing is to have a coding system that you know and can readily interpret.
So here is the R code for recoding our variables:
anes_2008tr$race_white[anes_2008tr$race_white != 1] <- 0 # non-white to 0
anes_2008tr$gender_male[anes_2008tr$gender_male == 2] <- 0 # non-male to 0
anes_2008tr$education[anes_2008tr$education > 7] <- 3 # DK/NA to middle
anes_2008tr$income[anes_2008tr$income == 0] <- 3 # DK/NA to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 9] <- 4 # DK to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 0] <- 4 # NA to middle
anes_2008tr$partyid_rep[anes_2008tr$partyid_rep == 0] <- 4 # DK/NA to middle
anes_2008tr$vote_rep <- anes_2008tr$vote_rep - 1 # 1/2 -> 0/1
anes_2008tr$vote_rep[anes_2008tr$vote_rep == -1] <- NA # DK/NA to NA
anes_2008tr$voted <- anes_2008tr$voted - 1 # 1/2 -> 0/1
anes_2008tr$voted[anes_2008tr$voted == -1] <- 0 # DK/NA to 0
Finally, we save our laboriously recoded data before moving on to the analysis:
write.table(anes_2008tr,"anes_2008tr.csv",sep=",",row.names=FALSE)
This dataset is available in Course Resources if you want to play with it yourself.
Let’s reload the data and return to our original question: the effect of education on ideology.
anes_2008tr <- read.table("anes_2008tr.csv",sep=",",header=TRUE,stringsAsFactors=FALSE)
We begin with a bivariate regression. (We use the summary command to summarize the lm (linear model) output because summary gives more information in a more readable format.)
bv1 <- lm(ideology_con ~ education,data=anes_2008tr)
summary(bv1)
Call:
lm(formula = ideology_con ~ education, data = anes_2008tr)
Residuals:
Min 1Q Median 3Q Max
-3.2079 -0.1727 -0.1374 0.8626 3.0034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.24309 0.07358 57.665 <2e-16 ***
education -0.03522 0.01661 -2.121 0.0341 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.291 on 2320 degrees of freedom
Multiple R-squared: 0.001935, Adjusted R-squared: 0.001504
F-statistic: 4.497 on 1 and 2320 DF, p-value: 0.03406
The result is that education has a negative effect, but it is just barely significant (p = 0.0341), and the effect is small. Increasing education by 1 unit (e.g., from a high school diploma to some college, or from a BA to an advanced degree) causes ideology_con to decrease by 0.035 units. Since ideology is a 7-point scale from Strongly Liberal to Strongly Conservative, this is a tiny shift leftward (negative, as coded here) for each large step up in education. Furthermore, the \(R^2\) is very small, at 0.002: very little of the variation in ideology is explained by education.
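(If you just want the slope and its confidence interval, rather than the whole summary table, coef and confint work on any lm object:)
coef(bv1)["education"]    # the estimated slope, about -0.035
confint(bv1, "education") # its 95% confidence interval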
But what about our previous concern, that education may not be having any effect at all, and may just be along for the ride, as it were? Perhaps income is affecting both education and ideology – the more income you have, the higher your education, and the more conservative (or liberal) your ideology. So let’s add income to the model:
mr1 <- lm(ideology_con ~ education + income,data=anes_2008tr)
summary(mr1)
Call:
lm(formula = ideology_con ~ education + income, data = anes_2008tr)
Residuals:
Min 1Q Median 3Q Max
-3.3406 -0.3224 -0.0764 0.7915 3.2356
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.03765 0.08588 47.015 < 2e-16 ***
education -0.06604 0.01785 -3.700 0.000221 ***
income 0.12299 0.02682 4.585 4.77e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.285 on 2319 degrees of freedom
Multiple R-squared: 0.0109, Adjusted R-squared: 0.01005
F-statistic: 12.78 on 2 and 2319 DF, p-value: 3.019e-06
Notice first that income appears to have a positive effect: i.e., the more income you make, the more conservative you are. And this effect appears to be somewhat larger than education’s, though it is very tricky to compare the relative effects of different variables, since they are measured in different units. (There are methods designed to do this, where you rescale every variable to have a mean of 0 and an sd of 1 and then run the regression, producing “standardized betas” whose sizes can be directly compared with each other. But this has its own problems, and is rarely done these days. Generally we try not to say which variable has the most impact, and just talk about them individually; or, if we are interested in policy interventions, we can still choose whichever one gives us the most leverage over the dependent variable given our finite resources.)
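For illustration only – again, this is not recommended practice these days – here is how one would compute those standardized betas in R, using scale() to put each variable in standard-deviation units:
mr1_std <- lm(scale(ideology_con) ~ scale(education) + scale(income),
    data=anes_2008tr)
coef(mr1_std)  # slopes are now in sds of Y per sd of X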
More interestingly, note that education now has a stronger effect than before – almost twice as large, with a much lower p-value. Why is this? Presumably it is because education, independent of income, has a strong(ish) liberalizing effect, but in practice people with higher education levels also have higher incomes, and income, independent of education, has a conservatizing effect. So if you just look at the relationship between education and ideology, you find a weak or non-existent effect, because people with higher education (more liberal) also have higher incomes (more conservative), and those two things almost cancel out. But if you control for income, you are looking at the effect of education among people of the same income, which is to say, you are determining the effect of education independent of income.
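“Controlling for income” has a precise mechanical meaning, which we can verify directly (this is sometimes called the Frisch-Waugh-Lovell result): regress both ideology and education on income, take the residuals – the parts of each not explained by income – and then regress one set of residuals on the other. The slope matches the education coefficient from the multiple regression:
res_y <- resid(lm(ideology_con ~ income, data=anes_2008tr)) # ideology, income removed
res_x <- resid(lm(education ~ income, data=anes_2008tr))    # education, income removed
coef(lm(res_y ~ res_x))["res_x"]  # equals coef(mr1)["education"]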
Of course, in real life, education and income go together for most people, so the variation in the data – those people with high education and low income or vice versa – reflects precisely those people who are a bit unusual. So substantively, we should always be a bit wary of over-interpreting these things when they are based on oddballs. But that’s often the best we can do, and it does illuminate the key difference between the bivariate effect and the multiple-regression effect controlling for other variables.
Of course, income is only one of the possible confounding variables that might be affecting both education (our independent variable of interest) and ideology (our dependent variable). What other factors might play a role in ideology? Some of the basic demographics certainly affect both:
mr2 <- lm(ideology_con ~ age + gender_male + race_white +
education + income,data=anes_2008tr)
summary(mr2)
Call:
lm(formula = ideology_con ~ age + gender_male + race_white +
education + income, data = anes_2008tr)
Residuals:
Min 1Q Median 3Q Max
-3.5159 -0.4137 -0.0326 0.6346 3.3294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.680935 0.114812 32.061 < 2e-16 ***
age 0.005543 0.001456 3.807 0.000144 ***
gender_male 0.111990 0.053669 2.087 0.037026 *
race_white 0.277814 0.055208 5.032 5.22e-07 ***
education -0.073291 0.018011 -4.069 4.88e-05 ***
income 0.101125 0.026980 3.748 0.000182 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.272 on 2316 degrees of freedom
Multiple R-squared: 0.03202, Adjusted R-squared: 0.02993
F-statistic: 15.32 on 5 and 2316 DF, p-value: 7.635e-15
As expected, the demographic variables also affect ideology. Increasing age makes one more conservative (e.g., 10 years boosts your score by about 0.055; make sure you understand why!); being male has a barely significant effect in the conservative direction; being white, on the other hand, has a much stronger effect. We can compare the latter two directly, since both are binary and thus in the same units: while being male increases your conservatism by about a tenth of a point, being white has 2.5 times that effect. And if anything, the education effect has further strengthened as we have controlled for additional confounders (e.g., age and whiteness both correlate with education, but like income, both are conservative-leaning forces, and thus, like income, they served to attenuate the apparent effect of education).
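Both of those back-of-the-envelope comparisons can be read straight off the fitted model:
10 * coef(mr2)["age"]                              # ten years of age: about 0.055
coef(mr2)["race_white"] / coef(mr2)["gender_male"] # about 2.5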
There is one additional variable in our dataset that probably plays a strong role in shaping ideology: one’s party affiliation. It is of course quite unclear whether party affiliation causes ideology, or vice versa, but there is strong evidence that one’s party affiliation tends to match one’s parents’, so it may be that party ID affects ideology rather than (as many think) vice versa. What happens if we include that in the regression?
mr3 <- lm(ideology_con ~ age + gender_male + race_white +
education + income + partyid_rep,data=anes_2008tr)
summary(mr3)
Call:
lm(formula = ideology_con ~ age + gender_male + race_white +
education + income + partyid_rep, data = anes_2008tr)
Residuals:
Min 1Q Median 3Q Max
-4.5688 -0.6376 0.0913 0.6234 3.8186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.952996 0.105709 27.935 < 2e-16 ***
age 0.007132 0.001291 5.524 3.69e-08 ***
gender_male 0.037217 0.047631 0.781 0.434662
race_white -0.191276 0.052315 -3.656 0.000262 ***
education -0.059686 0.015963 -3.739 0.000189 ***
income 0.024077 0.024092 0.999 0.317723
partyid_rep 0.324924 0.012876 25.236 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.127 on 2315 degrees of freedom
Multiple R-squared: 0.2409, Adjusted R-squared: 0.2389
F-statistic: 122.4 on 6 and 2315 DF, p-value: < 2.2e-16
The results illustrate the tricky nature of multiple regression. It’s easy to run, but the interpretation is where knowledge of both statistics and substance is crucial. Gender has now lost its already-meager significance, but more surprisingly, so has income! This suggests that income’s effect on ideology may have been going through party ID: that is, increased income may not in fact make one directly more conservative, but instead just make one more Republican, which in turn makes one more conservative. Without party ID, the effect of income appears strong; with party ID, it disappears altogether, although the rest of the variables remain similar. So again, the interpretation seems to be that income \(\rightarrow\) Republicanism \(\rightarrow\) ideology, so that among people of equal Republicanism, the effect of changing income on ideology is non-existent. Only if that income makes you more Republican does it make you more conservative. (We will return to the precise interpretation of these causal pathways in the next module.)
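One informal way to probe the first link of that suggested pathway – does income predict party ID at all? – is a simple bivariate regression; this is only a sketch, not a formal mediation analysis, which we defer to the next module:
summary(lm(partyid_rep ~ income, data=anes_2008tr))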
Finally, note that the \(R^2\) has gone up considerably between this model and the previous one. That is because there is a very strong correlation between party ID and ideology; so much so that some people consider them two sides of the same coin, and would not want to include party ID at all, since it is in some sense just another way of measuring the dependent variable. But that is a substantive debate.
In the next lesson, we will return to this more formally, and see the derivation of the coefficients and the other regression outputs.