Overview

This lesson introduces multiple regression.

Objectives

After completing this module, students should be able to:

  1. Perform a multiple regression and interpret the main results.
  2. Explain the meaning of controlling for a variable, and contrast a multiple regression with a bivariate regression.

Reading

Lander, Chapter 16; Schumacker, Chapter 17 through p. 402.

1 Multiple independent variables

In the previous module we introduced bivariate regression for modeling the effect of a single independent variable X on our dependent variable Y. But what do we mean by “the effect”? Most fundamentally, we are interested in causality – what is the effect of changing X on Y? This is both a scientific question – how are they causally related – and a policy question: how can we change outcome Y by changing policy/action X? If X doesn’t actually cause Y, and instead they are (for instance) both caused by some third factor Z, then we might find a nice relationship between X and Y, but if we actually intervened and boosted X, Y wouldn’t change.

For instance, imagine we regress income on education level in a large, nationally-representative sample, and find a statistically significant positive relationship. We might conclude that education boosts income, and, from a policy point of view, that if you go back to school and get a couple more years of education, your income will go up. But what if some third factor instead affects both? For instance, it may be that harder-working people both get more education and work their way up in any subsequent jobs to get higher pay. Or perhaps people with wealthy parents both can afford more education and have the social connections to get fancier jobs. In both of these cases, we might see a strong correlation and a significant \(\beta\) relating education to income, but we might find that, had we invested the extra years in education, our income would be no higher than that of our peers with equal work ethics and similarly wealthy parents. That is, the education (in this scenario) didn’t in fact cause the income boost, but rather just gave the appearance of it, because education was also tracking the real causes, hard work and parental income.

This is why multiple regression is so important. With multiple regression, we can control for these other variables, to get the real effect of education on income (for instance). From a policy or causal point of view, we want to know what happens to our dependent variable (income) if we change just a single independent variable (education), leaving all the others constant. In the real world, these variables are not independent – people with higher education are also those who work harder and have wealthier parents – but we can leverage what little variation there is in the data to try to tease out the effect of the variable we care about (education) from the variables we don’t care about or can’t change (parental income; work ethic). This is what multiple regression does.
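
To make this concrete, here is a minimal simulation sketch (the variables and numbers are made up purely for illustration, not taken from any real dataset): a single unobserved “work ethic” factor drives both education and income, so the bivariate regression reports a sizable education effect even though education has no direct effect at all, while controlling for the confounder makes that effect vanish.

# Minimal confounding simulation (hypothetical variables; illustration only)
set.seed(1)
n <- 1000
work_ethic <- rnorm(n)                      # unobserved confounder
education  <- 2 * work_ethic + rnorm(n)     # work ethic raises education
income     <- 3 * work_ethic + rnorm(n)     # work ethic raises income; education has no direct effect
coef(lm(income ~ education))                # bivariate: a sizable, spurious "effect" of education
coef(lm(income ~ education + work_ethic))   # controlling for the confounder: education coefficient near 0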

1.1 Regression coefficients

The multiple regression model is fundamentally very like the bivariate regression model:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \epsilon\]

That is, we assume our dependent variable is a linear function of a bunch of different independent variables. If you boost one of those independent variables by 1 unit (in whatever units that variable is measured in), holding the others constant, then Y will increase by \(\beta_i\). Eg:

\[(\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \beta_1\]

Here, going from a value of \(x_1\) to \(x_1 + 1\), we see that our predicted \(\hat{y}\) has increased by \(\beta_1\). This is just saying that \(\beta_1\) is not, in the multiple regression case, like the simple correlation between \(X_1\) and \(Y\), but rather like the partial correlation between \(X_1\) and \(Y\). The partial correlation is the correlation between \(X_1\) and \(Y\) when you hold all the other variables constant; similarly, \(\beta_1\) is the effect of \(X_1\) on \(Y\) when you hold all the other variables constant.
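
One way to see this “holding the others constant” interpretation is with a quick simulated sketch (hypothetical variables, not our survey data): the coefficient on \(x_1\) from the multiple regression is the same slope you get by first removing from \(x_1\) everything that \(x_2\) explains, and then regressing \(y\) on what remains.

# The multiple-regression slope on x1 equals the slope of y on the part of x1
# that is unrelated to x2 (simulated data, illustration only)
set.seed(2)
x2 <- rnorm(500)
x1 <- 0.6 * x2 + rnorm(500)
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(500)
b_multiple <- coef(lm(y ~ x1 + x2))["x1"]
x1_resid   <- resid(lm(x1 ~ x2))            # the variation in x1 with x2 held constant
b_partial  <- coef(lm(y ~ x1_resid))["x1_resid"]
c(b_multiple, b_partial)                    # the two slopes are identical (up to rounding)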

We’ll turn to the derivation of these coefficients in a moment, but let’s first consider a real-world example in some detail.

2 Example

Let’s consider some real-world data. Say we are interested in the effect of education on ideology. That is, we want to know the effect of education holding all the other possible confounding variables constant. Ideally, we would do an experiment: take a sample of people, randomly give half of them an extra year of education, and then see what the ideological differences are between the two groups afterwards. Obviously such an approach is often impossible in social research. So instead we turn to surveys.

The American National Election Studies survey is a large, biennial political survey that asks a representative sample of Americans a large set of questions about their background and political views. It can be downloaded here or here once you register, and there are a number of formats it can be downloaded in, along with extensive “codebooks” describing the data collection and coding procedures. Even though a huge amount of work has gone into preparing the dataset, it nevertheless requires a large amount of work by the researcher to prepare it for their own analysis. Let’s take this data and prepare it for a multiple regression analysis of the effect of education on ideology.

2.1 Understanding the data

To use a dataset we need to know what’s in it first. The ANES, like many datasets you will find, has unhelpful variable names, mainly because there are nearly 1000 of them in the full dataset. To figure out which ones I want, I consulted the codebook that comes with the dataset, which explains each variable and what its numeric codes stand for. For education, for instance, there are actually a number of different questions in the dataset, using a variety of different sub-categories. I found and chose to use the one with the most categories, VCF0140a, as explained in the codebook:

VCF0140a  R Education 7-category
=============================================================================
QUESTION: 
--------- 
1952-1972: How many grades of school did you finish? 
1974 AND LATER: What is highest grade of school or year of college you have completed? Did you get a high school diploma or pass a high school equivalency test? 
1974,1976: Do you have a college degree? (IF YES:) What degree is that? 
1978-1984: Do you have a college degree? (IF YES:) What is the highest degree that you have earned? 
1986 AND LATER: What is the highest degree that you have earned?

VALID_CODES: 
------------ 
1. 8 grades or less ('grade school')
2. 9-12 grades ('high school'), no diploma/equivalency 
3. 12 grades, diploma or equivalency 
4. 12 grades, diploma or equivalency plus non-academic
training 
5. Some college, no degree; junior/community college
level degree (AA degree) 
6. BA level degrees 
7. Advanced degrees incl. LLB

MISSING_CODES: 
-------------- 
8. DK 
9. NA; RF; no Pre IW; short-form 'new' Cross Section (1992) Inap. question not used

Note the presence of the “DK” (don’t know) and NA (not available) categories. These things are annoying but important to watch out for. As you can see, the education level is increasing in the code number from 1-7, but 8 and 9 are DK and NA; if we just threw the data in as coded, we would be mixing in the effects of going up one unit, eg from Some College to BA, with the effect of going “up” from Advanced Degrees (7) to Don’t Know (8). Obviously the latter would be a mistake, since we don’t think the DK’s are actually more educated than the Advanced Degrees. So we need to either drop those 8 and 9’s, or recode them as the mean value, which shouldn’t bias our outcomes too much.
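
As a quick hedged sketch of the two options (using a made-up vector educ standing in for the real column; we take the recode-to-the-middle route in the recoding section below):

educ <- c(3, 5, 7, 8, 9, 2)                   # hypothetical responses; 8 = DK, 9 = NA
educ_dropped <- educ[educ <= 7]               # option 1: drop the DK/NA cases
educ_recoded <- ifelse(educ > 7, 4, educ)     # option 2: recode DK/NA to a middle value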

As another potentially confusing example, how is income coded here? Rather than raw dollars, it turns out that for comparison across years (where inflation changes things) ANES has coded it by percentiles:

VALID_CODES:
------------ 
1. 0 to 16 percentile 
2. 17 to 33 percentile 
3. 34 to 67 percentile 
4. 68 to 95 percentile 
5. 96 to 100 percentile

So in interpreting what a 1-unit change means (ie, what \(\beta_{income}\) means), we have to bear in mind that it is a change in the percentile bracket of the respondent, not raw income.

We need to do this for each variable we want to keep and analyze, and there are no shortcuts. We read through the codebook (or used “find” to speed things up) to find the variables that cover aspects that we are interested in – education, ideology, demographics like age or income, etc – and then we have to read the dataset into R, select the variables we want, and recode and clean them in order to get rid of stuff like “DK” and to make sure we understand the units that each variable is measured in.

2.2 Reading the dataset

There are many formats of this data available on the website, but let’s use the simplest version of the survey dataset, in raw text format, ANES-Data.tsv from here. To even know how each cell in the data is separated requires reading the codebook or website, glancing at the raw text using a text editor, or looking at the file’s suffix, .tsv, which stands for “tab-separated values.” Once we know that, we can read the data in. It’s not guaranteed that there are column names (a header), so again, looking at the codebook or glancing at the data first using a text editor or the like can be useful. It is essential to include “header=TRUE” if there is a header, though, since otherwise every column will be read in as text to match the text of the first row, the column names. And as before, we use “stringsAsFactors=FALSE” just to make it easier to re-encode data as needed later.

anes_all <-  read.table("ANES-1948-2008/ANES-Data.tsv",sep="\t",header=TRUE,stringsAsFactors=FALSE)
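
It never hurts to confirm that the file came in as expected before moving on; a quick check (output not shown, and the exact dimensions depend on the file you downloaded):

dim(anes_all)        # number of respondents (rows) and variables (columns)
anes_all[1:5, 1:5]   # peek at the first few rows and columns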

2.3 Subset data

Now, in this example, I am interested in examining 2008 data only, and I know from the website and codebook that this dataset encompasses all surveys from 1948 to 2008. Since the variable names in this dataset are extremely non-self-explanatory, we have to look into the codebook to even know what variable encodes the year of survey, which is called VCF0004. So we subset that year:

anes_2008 <- anes_all[anes_all$VCF0004==2008,]

Now for this year, I’m interested in analyzing just a few variables, which I already know I’m interested in because I have substantive knowledge about what I’m studying. For the basic demographics, I’m looking for age, race, gender, education, income; and I also want party ID, whether they voted in 2008, and who they voted for. So I laboriously look through the codebook, which in this case is 388 pages, noting that some of these demographic features are represented in multiple variables, eg listing party ID by four groups (Democrat, Republican, Independent, Other) or via a spectrum (eg, Strong Republican, etc). I pick out the ones I want, and create a new dataset with just those. Note how we create variable names that include not just the general name (eg “gender”), but also give a hint of which way the variable will be encoded when that is ambiguous (eg, “gender_male” will indicate that 0 is non-male and 1 is male).

anes_2008tr <- data.frame(age=anes_2008$VCF0101,
                          race_white=anes_2008$VCF0106,
                          gender_male=anes_2008$VCF0104,
                          education=anes_2008$VCF0140a,
                          income=anes_2008$VCF0114,
                          ideology_con=anes_2008$VCF0803,
                          partyid_rep=anes_2008$VCF0301,
                          vote_rep=anes_2008$VCF0704a,
                          voted=anes_2008$VCF0702)
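
Before recoding, it is worth confirming that the raw codes look like what the codebook describes (a quick check; output not shown here):

summary(anes_2008tr)                            # ranges should match the codebook's valid and missing codes
table(anes_2008tr$education, useNA = "ifany")   # should show codes 1-7 plus any 8/9 missing codes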

2.4 Recode data

Finally comes the hard part. For each of these variables, this particular codebook does a good job of describing what each number means, but for each one, a different number might mean NA (“not available”) or DK (“don’t know”), so we have to go through them one at a time. In most cases, one wants to preserve NA values for subsequent analysis, because missing data is meaningful and should never just be ignored or papered over. But in this case, I’d like to recode (“impute”) missing data elements to the median, mean, or mode of each variable. If it’s a binary or categorical variable, I want to use the mode; if it’s a continuous variable, I’ll use the median, which is often more robust than the mean, eg if there are a few outliers it won’t be biased by them. It’s also safer because sometimes you have a variable that takes only integer values that you mistakenly believe to be continuous, but if you use the median, your imputed value for the missing element will also be an integer (assuming your “median” function picks one of the two middle values rather than averaging them when there is an even number of elements; note that base R’s median() averages the two middle values, so you would need a small custom function for this).

There are nice packages for imputing, such as imputeMissings, whose impute function does a good job of automatically using the median or mode as needed. But if possible, it’s always better to do it yourself to make sure it’s done right. That means that for each variable, we have to think about how it should be encoded for subsequent analysis. For instance, race is a categorical variable with many different possibilities. Do we want a separate, “dummy” (0/1) variable for each possibility? In this case, we will just create a single variable for whether or not the respondent is in the modal category, White. Education and Income are both coded as a set of integer levels which require the codebook to understand. How should we encode the DK or NA values here? I choose the median, which is probably inaccurate (non-response is more likely to come from respondents with less than average education), but such cases are both rare and unlikely to bias our results too much. And so forth. Each line below recodes a new variable, each via debatable rules. The advantage of doing it this way is that one gets to know the dataset very well before moving on to analysis.
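
For variables where the median rule is the right one, the logic can be wrapped in a small helper function; this is just my own sketch (the function name and codes are illustrative), and below we instead recode each variable explicitly so every decision stays visible:

# Hypothetical helper: convert codebook missing codes to NA, then impute the median
impute_median <- function(x, missing_codes) {
  x[x %in% missing_codes] <- NA
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
# eg: anes_2008tr$education <- impute_median(anes_2008tr$education, c(8, 9))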

anes_2008tr$race_white[anes_2008tr$race_white != 1] <- 0 # non-white to 0 (white = 1)
anes_2008tr$gender_male[anes_2008tr$gender_male == 2] <- 0 # female (coded 2) to 0 (male = 1)
anes_2008tr$education[anes_2008tr$education > 7] <- 3 # DK/NA to middle
anes_2008tr$income[anes_2008tr$income == 0] <- 3 # DK/NA to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 9] <- 4 # DK to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 0] <- 4 # NA to middle
anes_2008tr$partyid_rep[anes_2008tr$partyid_rep == 0] <- 4 # DK/NA to middle
anes_2008tr$vote_rep <- anes_2008tr$vote_rep - 1 # make binary
anes_2008tr$vote_rep[anes_2008tr$vote_rep == -1] <- NA # DK/NA/did not vote to missing (NA)
anes_2008tr$voted <- anes_2008tr$voted - 1 # make binary
anes_2008tr$voted[anes_2008tr$voted == -1] <- 0 # DK/NA to no

In this particular dataset, I don’t preserve any categorical variables with more than 2 categories, but one often does have a variable (such as religion) that should not be collapsed down to two categories. In that case, one would use as.factor() to recode the variable from a string back into a factor. One might then want to recode or rename the factors in various ways (eg, changing a “1” to “Catholic”, and so on), which can be complex. A good quick introduction to how to do this is at the Cookbook for R here and here.

2.5 Examine for outliers

Now that our data is partially recoded, it is always worth examining it again with summary(). It is also worth examining the data visually, using histograms of each variable, to see whether there are any “outliers”, or data elements that seem much larger or smaller than the rest of the elements in that variable. If so, it is always a judgement call about what to do. If it looks like something miscoded, or due to an entirely different mechanism (eg, a dataset of dogs with one element in the “height” variable being 5 feet, which is either a miscode, or a Great Dane that might be very unusual for a dog), one might want to drop it (set it to NA or drop that row) or recode it in some other way. That is a substantive question though, and depends entirely on the meaning of the data.
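
A hedged sketch of that check for our recoded data (the plots themselves are not shown here):

summary(anes_2008tr)                        # quick look at the ranges after recoding
for (v in names(anes_2008tr)) {             # one histogram per variable, to spot outliers
  hist(anes_2008tr[[v]], main = v, xlab = v)
}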

2.6 Save

Finally, we want to save our data before using it in later analysis. In this case I’ll save it in a general text format instead of RData, so it can be used by others more easily if I want to share it.

write.table(anes_2008tr,"anes_2008tr.csv",sep=",",row.names=FALSE)

3 Multiple regression 1

Let’s reload the data and return to our original question: the effect of education on ideology.

anes_2008tr <- read.table("anes_2008tr.csv",sep=",",header=TRUE,stringsAsFactors=FALSE)

We begin with a bivariate regression. (We use the summary command to summarize the lm (linear model) output because summary gives more information in a more readable format.)

bv1 <- lm(ideology_con ~ education,data=anes_2008tr)
summary(bv1)

Call:
lm(formula = ideology_con ~ education, data = anes_2008tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2079 -0.1727 -0.1374  0.8626  3.0034 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.24309    0.07358  57.665   <2e-16 ***
education   -0.03522    0.01661  -2.121   0.0341 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.291 on 2320 degrees of freedom
Multiple R-squared:  0.001935,  Adjusted R-squared:  0.001504 
F-statistic: 4.497 on 1 and 2320 DF,  p-value: 0.03406

The result is that education has a negative effect, but it is only barely significant (p = 0.0341), and the effect is small. Increasing education by 1 unit (ie, from high-school to college, or college to a higher degree) causes ideology_con to decrease by 0.035 units. Since ideology is a 7-point scale from Strongly Liberal to Strongly Conservative, this is a tiny shift leftward (negative, as coded here) for each large step up in education. Furthermore, the \(R^2\) is very small, at 0.002; very little of the variation in ideology is explained by education.

3.1 Multiple regression 2

But what about our previous concern, that education may not be having any effect at all, and may be just along for the ride, as it were? Perhaps income is affecting both education and ideology – the more income you have, the higher your education, and the more liberal or conservative your ideology? So let’s add income into the model:

mr1 <- lm(ideology_con ~ education + income,data=anes_2008tr)
summary(mr1)

Call:
lm(formula = ideology_con ~ education + income, data = anes_2008tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3406 -0.3224 -0.0764  0.7915  3.2356 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.03765    0.08588  47.015  < 2e-16 ***
education   -0.06604    0.01785  -3.700 0.000221 ***
income       0.12299    0.02682   4.585 4.77e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.285 on 2319 degrees of freedom
Multiple R-squared:  0.0109,    Adjusted R-squared:  0.01005 
F-statistic: 12.78 on 2 and 2319 DF,  p-value: 3.019e-06

Notice first that income appears to have a positive effect: ie, the more income you make, the more conservative you are. And this effect appears to be a bit larger than that of education, though it is very tricky to compare the relative effects of different variables, since they are measured in different units. (There are methods designed to do this, where you rescale every variable to have a mean of 0 and a sd of 1 and then run the regression, producing “standardized betas” whose sizes can be directly compared with each other. But this has its own problems, and is rarely done these days. Generally we try not to say which variable has the most impact, and just talk about them individually; or if we are interested in policy interventions, we can still choose which one would give us the most leverage over the dependent variable given whatever our finite resources are.)
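
For reference, here is a hedged sketch of that standardization procedure (not something we rely on below): rescale the variables to mean 0 and sd 1 with scale() and rerun the regression, so the coefficients are in standard-deviation units and can be compared directly.

# Standardized ("beta") coefficients: every variable rescaled to mean 0, sd 1
anes_std <- as.data.frame(scale(anes_2008tr[, c("ideology_con", "education", "income")]))
summary(lm(ideology_con ~ education + income, data = anes_std))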

More interestingly, note that education now has a stronger effect than before – almost twice as large, with a much lower p-value. Why is this? Presumably it is because education independent of income has a strong(ish) liberal-izing effect, but of course in practice, people with higher education levels also have higher incomes, and income independent of education has a conservative-izing effect. So if you just look at the relationship between education and ideology, you find a weak or non-existent effect, because people with higher educations (more liberal) also have higher incomes (more conservative), and those two things almost cancel out. But if you control for income, you are looking at the effect of education among people of the same income, which is to say, you are determining the effect of education independent of income.

Of course, in real life, education and income go together for most people, so the variation in the data – those people with high education and low income or vice versa – reflects precisely those people who are a bit unusual. So substantively, we should always be a bit wary of over-interpreting these things, if they are based on odd-balls. But that’s often the best we can do, and it does illuminate the key difference between the bivariate effect and the multiple-regression effect controlling for other variables.

3.2 Multiple regression 3

Of course, income is only one of the possible confounding variables that might be affecting both education (our independent variable of interest) and ideology (our dependent variable). What other factors might play a role in ideology? Some of the basic demographics certainly affect both:

mr2 <- lm(ideology_con ~ age + gender_male + race_white + 
                          education + income,data=anes_2008tr)
summary(mr2)

Call:
lm(formula = ideology_con ~ age + gender_male + race_white + 
    education + income, data = anes_2008tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5159 -0.4137 -0.0326  0.6346  3.3294 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.680935   0.114812  32.061  < 2e-16 ***
age          0.005543   0.001456   3.807 0.000144 ***
gender_male  0.111990   0.053669   2.087 0.037026 *  
race_white   0.277814   0.055208   5.032 5.22e-07 ***
education   -0.073291   0.018011  -4.069 4.88e-05 ***
income       0.101125   0.026980   3.748 0.000182 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.272 on 2316 degrees of freedom
Multiple R-squared:  0.03202,   Adjusted R-squared:  0.02993 
F-statistic: 15.32 on 5 and 2316 DF,  p-value: 7.635e-15

As expected, the demographic variables also affect ideology. Increasing age makes one more conservative (eg, 10 years boosts your score by 0.055; make sure you understand why!); being male has a barely significant effect in the conservative direction; being white, on the other hand, has a much stronger effect. We can compare the latter two directly since both are binary and thus in the same units: while being male increases your conservatism by a tenth of a point, being white has 2.5 times that effect. And if anything, education has further strengthened, as we have controlled for additional confounders (eg, age and whiteness both correlate with education, but like income, both of those are conservative-leaning forces, and thus like income served to attenuate the apparent effect of education).
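
To see where that 0.055 comes from (a small check using the fitted model object above):

10 * coef(mr2)["age"]   # ten extra years of age: about +0.055 on the 7-point conservatism scale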

3.3 Multiple regression 4

There is one additional variable in our dataset that probably plays a strong role in shaping ideology: one’s party affiliation. It is of course quite unclear whether party affiliation causes ideology, or vice versa, but there is strong evidence that one’s party affiliation tends to match one’s parents’, so it may be that party ID affects ideology rather than (as many think) vice versa. What happens if we include that in the regression?

mr3 <- lm(ideology_con ~ age + gender_male + race_white + 
                          education + income + partyid_rep,data=anes_2008tr)
summary(mr3)

Call:
lm(formula = ideology_con ~ age + gender_male + race_white + 
    education + income + partyid_rep, data = anes_2008tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5688 -0.6376  0.0913  0.6234  3.8186 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.952996   0.105709  27.935  < 2e-16 ***
age          0.007132   0.001291   5.524 3.69e-08 ***
gender_male  0.037217   0.047631   0.781 0.434662    
race_white  -0.191276   0.052315  -3.656 0.000262 ***
education   -0.059686   0.015963  -3.739 0.000189 ***
income       0.024077   0.024092   0.999 0.317723    
partyid_rep  0.324924   0.012876  25.236  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.127 on 2315 degrees of freedom
Multiple R-squared:  0.2409,    Adjusted R-squared:  0.2389 
F-statistic: 122.4 on 6 and 2315 DF,  p-value: < 2.2e-16

The results illustrate the tricky nature of multiple regression. It’s easy to run it, but the interpretation is where knowledge of both statistics and substance is crucial. Gender has now lost its already-meager significance, but more surprisingly, so has income! This suggests that income’s effects on ideology may have been going through party id: that is, increased income may not in fact make one directly more conservative, but instead just make one more Republican, which in turn makes one more conservative. Without party id, the effect of income appears strong, but with party id, it disappears altogether, although the rest of the variables remain similar. So again, the interpretation seems to be that income \(\rightarrow\) Republicanism \(\rightarrow\) ideology, so that if you control for people of equal Republicanism, the effect of changing income on ideology is non-existent. Only if that income makes you more Republican does it make you more conservative. (We will return to the precise interpretation of these causal pathways in the next module.)

Finally, note that the \(R^2\) has gone up considerably between this model and the previous one. That is because there is a very strong correlation between party id and ideology; so much so that some people consider them two sides of the same coin, and would not want to include party id at all, since it is in some sense just another way of measuring the dependent variable. But that is a substantive debate.
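
To pull the two \(R^2\) values out directly for comparison (a small convenience, not part of the original output):

c(without_partyid = summary(mr2)$r.squared,   # model without party id
  with_partyid    = summary(mr3)$r.squared)   # model with party id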

In the next lesson, we will return to this more formally, and see the derivation of the coefficients and the other regression outputs.