Overview

This lesson introduces regression for categorical dependent variables.

Objectives

After completing this module, students should be able to:

  1. Perform a logistic regression and interpret the results.

Reading

Lander, Chapter 17.1; Schumacker, Chapter 18.

1 Categorical dependent variables

In general so far, we have been modeling Y as a linear function of X: every 1-unit increase in \(x_1\) produces a \(\beta_1\) increase in y. But what if Y is a categorical variable, such as a binary? Did you complete high school; did you use drugs in the last year; did you pass a class; etc. These are all yes/no questions, and it usually doesn’t make sense to think about a continuous increase in Y with X (though we can also think of a variant of Y that is continuous, such as years of high school).

But if we define Y as 1 or 0, while it doesn’t make sense to model \(y = \beta_0 + \beta_1 x\), we can instead talk about the probability of y being 1 (eg, completing high school) conditional on X. This probability is continuous, which makes it better suited to regression, and our potential new model might be \(P(Y=1|X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\). In this equation, a 1-unit increase in \(x_1\) produces a \(\beta_1\) increase in the probability that y will be 1. But there’s a problem…

Consider a model that looks at maternal employment status as a function of the number of children and the mother’s education level, where the dependent variable is a binary – employed full-time, or not. Again, we begin by considering Y as a probability – the probability of working full-time. But say each child decreases the probability of full-time work by 0.3. Then four children would imply a negative probability of working! That makes no sense. Similarly, if each year of education increases the probability by 0.1, then with more than 10 years of education the probability is more than 1, which is also impossible. We need to redefine our function such that y can never be below 0 or above 1 for any values of X on the right hand side.

1.1 Logistic function

The figure below shows the basic structure we want for our function: when X is sufficiently large, Y is almost certainly 1 (assuming \(\beta\) is positive), and when X is sufficiently small (or negative), Y is almost certainly 0.

So how do we construct this function, which is very different from the straight best-fit lines we’ve been dealing with so far? We want some function f of the probability, \(f(P(y=1))\), that allows the dependent variable to range between negative infinity and positive infinity, just as X can. This is similar to how we created new variables \(x^2\) or \(x_1 * x_2\) that are functions of existing variables, although as we will see, in estimating our final equation, we don’t actually have to manually redefine Y at all.

There are two steps in how we redefine \(P(y=1)\). First, we work with the odds (often loosely called the odds ratio), defined as \(P(y=1)/P(y=0)\), which of course is also equal to \(P(y=1)/(1-P(y=1))\), since y is either 0 or 1. If \(P(y=1)\) is 0.75, then the odds are \(0.75/0.25 = 3\). If it’s a gamble and you have a 75% chance of winning and a 25% chance of losing, your odds are 3:1; if the odds double, they are now 6:1; if they halve, they are 1.5:1.

As you can see, as \(P(y=1)\) gets closer to 1, the odds go to infinity, and as \(P(y=1)\) gets closer to 0, the odds go to 0. So we’re halfway there – we now have a dependent variable value that can range from 0 to infinity. But what about the other direction, from negative infinity to 0? Thus the second step is to take the log of the odds, or the log odds: the log of \(P(y=1)/P(y=0)\) ranges from negative infinity (since the log of a number near 0 like 0.0001 is large and negative) to positive infinity (since the log of a large number like 10000 is also a large number).
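To see these two steps numerically, here is a quick sketch in R (the probability values are just arbitrary choices for illustration):

# The odds run from near 0 up toward infinity as P(y=1) goes from 0 to 1,
# and the log odds run from very negative to very positive
p <- c(0.0001, 0.25, 0.5, 0.75, 0.9999)
odds <- p / (1 - p)
log_odds <- log(odds)
round(cbind(p, odds, log_odds), 4)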

Thus our final equation relating Y to X is:

\[log(\frac{P(y=1)}{1-P(y=1)}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\]

1.2 The intercept

\[log(\frac{P(y=1)}{1-P(y=1)}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\]

This function \(log(\frac{P(y=1)}{1-P(y=1)})\) is known as the logit, or log-odds, function, and is often abbreviated as logit\((P(y=1))\). But how do we conceptually relate this weird function of \(P(y=1)\) to X and our \(\beta\) values? It’s now a lot more complicated than just \(y = \beta_0 + \beta_1 x ...\). What effect does X have on \(P(y=1)\)?

Well, the easiest thing we can say is that when everything on the right hand side sums to 0, we have logit\((P(y=1)) = 0\), which implies that \(\frac{P(y=1)}{1-P(y=1)} = e^0 = 1\).

(The log here is the natural log, which is in base \(e \approx 2.718\). The base-10 log of x, \(log_{10}(x)\), answers the question, “10 to the what power equals x,” while the base-e log of x, \(log(x)\), answers “e to the what power equals x.” And anything to the 0 power always equals 1.)

So if \(\frac{P(y=1)}{1-P(y=1)} = 1\), this means \(P(y=1) = 1-P(y=1)\), which means that \(P(y=1) = 0.5\). Thus Y is equally likely to be 1 or 0 exactly when the right hand side sums to 0. When all the X values are 0, all we have left on the right is \(\beta_0\) – thus \(\beta_0\) determines the probability of Y when all the X are zero, just like the intercept in a multiple regression determines the value of Y when all the X are 0. If \(\beta_0\) is 0, then Y is equally likely to be 0 or 1 when all the X are 0; if \(\beta_0\) is negative, \(P(y=1) < 0.5\) when all X = 0; and if \(\beta_0\) is positive, \(P(y=1) > 0.5\) when all X = 0.
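You can check this in R with the built-in function plogis(), which converts a log odds back into a probability (we will meet this “inverse logit” formally in the next section); here are a few hypothetical \(\beta_0\) values with all the X set to 0:

# plogis(q) = e^q / (1 + e^q), i.e. it turns a log odds into a probability
# With all the X at 0, the log odds is just beta_0:
plogis(-2)   # beta_0 negative: P(y=1) is about 0.12, below 0.5
plogis(0)    # beta_0 = 0:      P(y=1) is exactly 0.5
plogis(2)    # beta_0 positive: P(y=1) is about 0.88, above 0.5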

1.3 Logit

But that’s just the intercept. How does Y relate to X across all values of X?

Well, if our logistic regression model is:

\[log(\frac{P(y=1)}{1-P(y=1)}) = \beta_0 + \beta_1 x_1\]

then this is equivalent to

\[\frac{P(y=1)}{1-P(y=1)} = e^{\beta_0 + \beta_1 x_1}\]

by the definition of the log. And with a little basic algebra, we can rearrange the above to get:

\[P(y=1) = \frac{e^{\beta_0 + \beta_1 x_1}}{1+e^{\beta_0 + \beta_1 x_1}}\]
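The algebra behind that last step, writing \(p\) for \(P(y=1)\) and \(z\) for \(\beta_0 + \beta_1 x_1\), is:

\[\frac{p}{1-p} = e^{z} \Rightarrow p = e^{z} - p\,e^{z} \Rightarrow p\,(1 + e^{z}) = e^{z} \Rightarrow p = \frac{e^{z}}{1+e^{z}}\]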

Now we have \(P(y=1)\) as a function of x, so we can plot it and see what it looks like. The function on the right hand side is, for obvious reasons, known as the inverse logit, and is sometimes written \(P(y=1) = \textrm{invlogit}(\beta_0 + \beta_1 x_1)\). The plot of it (here with \(\beta_0 = 0\) and \(\beta_1 = 1\)) should look familiar:

library(ggplot2)
# Inverse logit: maps any real number to a probability between 0 and 1
invlogit <- function(x){ exp(x) / (1 + exp(x)) }
ggplot(data=data.frame(x=c(-10, 10)), aes(x)) + stat_function(fun=invlogit) + ylab("P(y=1)")

This is just what we wanted. As X increases, \(P(y=1)\) goes from almost certainly 0 to almost certainly 1, and is at 0.5 when the right hand side is 0 – which, since in our example here \(\beta_0 = 0\), is when \(x = 0\).

1.4 Interpreting \(\beta\)

So how do we interpret what \(\beta\) means? Obviously, as with the quadratic function, a 1-unit change in x will have different effects on y depending on what our x value is. Looking at the previous graph, when x is large, a 1-unit change in x produces only a very slight change in y (because the slope is very flat), and the same is true when x is very negative. In fact, the effect of x on y is largest when y is at 0.5, where the slope is steepest. So how can we interpret \(\beta\) then?
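(Incidentally, we can see why the effect peaks at \(P(y=1) = 0.5\): differentiating \(P(y=1) = \textrm{invlogit}(\beta_0 + \beta_1 x_1)\) with respect to \(x_1\) gives

\[\frac{\partial P(y=1)}{\partial x_1} = \beta_1 \, P(y=1)\,(1 - P(y=1))\]

which is largest when \(P(y=1) = 0.5\), where the slope of the curve is \(\beta_1/4\).)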

The simplest method is just to calculate \(\hat{P}(y=1)\) for a couple of specific x values and take the difference. For instance, we can ask what the effect of a 1-unit change in \(x_1\) is from \(\bar{x}_1\) to \(\bar{x}_1 + 1\); or we can ask what the effect is of a change in \(x_1\) from 1/2 a sd below \(\bar{x}\) to 1/2 a sd above \(\bar{x}\) (ie, a total change of 1 sd). And if x is a binary, we can also simply ask what the effect of a change in x is from 0 to 1. In all cases, we just calculate \(\hat{P}(y=1)\) at the first x value, \(\hat{P}(y=1)\) at the second, and take the difference to get the estimated effect of the change in x.

So for the logit, the estimated effect of a shift in x from \(x=b\) to \(x=a\) is:

\[\hat{P}(y=1|x=a) - \hat{P}(y=1|x=b) = \textrm{invlogit}(\beta_0 + \beta_1 a) - \textrm{invlogit}(\beta_0 + \beta_1 b) = \frac{e^{\beta_0 + \beta_1 a}}{1+e^{\beta_0 + \beta_1 a}} - \frac{e^{\beta_0 + \beta_1 b}}{1+e^{\beta_0 + \beta_1 b}}\]

In other words, this gives you the change in the probability that y = 1 (eg, the probability that the person will have graduated high school) given a shift in their x variable (eg, parental wealth) from \(b\) to \(a\) (eg, from $30,000 a year to $60,000 a year).

In the case of our diagram above, \(\beta_0 = 0\) and \(\beta_1 = 1\), so the effect of a change in x from (for example) 2 to 4 is:

invlogit(4) - invlogit(2)
[1] 0.1012167

As you can see from the previous diagram, this matches the shift in \(P(y=1)\) that you get when changing x from 2 to 4.

1.5 Example

Let’s look at another example. Say we are interested in the effect of the race of a defendant and a victim on whether the defendant gets the death sentence (y=1). All of our variables are binaries in this case, where \(x_1 = 1\) if the defendant is white and \(x_2 = 1\) if the victim is white.

So our model is logit\([P(y=1)] = \beta_0+ \beta_1 x_1 + \beta_2 x_2\). Say after running this logistic regression we get values of \(\beta_0 = -3.596\), \(\beta_1 = -0.868\) and \(\beta_2 = 2.404\).

Thus if the defendant is white, that lowers their chance of getting the death penalty, but if the victim is white, that increases the defendant’s chance. (Presumably there is also an interaction effect, where if the defendant is black and the victim is white, that boosts the odds of the death penalty by even more than the sum of the two terms, but we won’t get into that here.)

Specifically, if we want to know the probability of the death penalty if the defendant is black (\(x_1=0\)) and the victim is white (\(x_2=1\)), that’s just:

\[\hat{P}(y=1) = \frac{e^{-3.596-0.868*0 + 2.404*1}}{1+e^{-3.596-0.868*0+2.404*1}}= 0.233\]

We can calculate \(\hat{P}(y=1)\) again for \(x_1=0\) and \(x_2=0\) and take the difference to get the overall effect of a shift in the race of the victim from black to white (or vice versa).
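As a quick sketch of both calculations in R, using plogis() as the inverse logit and the \(\beta\) values above:

b0 <- -3.596; b1 <- -0.868; b2 <- 2.404
p_white_victim <- plogis(b0 + b1*0 + b2*1)   # black defendant, white victim: about 0.23
p_black_victim <- plogis(b0 + b1*0 + b2*0)   # black defendant, black victim: about 0.03
p_white_victim - p_black_victim              # effect of the victim's race: about 0.21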

1.6 Odds ratio effect

Another more analytic way to interpret \(\beta_1\) is to return to the odds ratio formula:

\[\frac{P(y=1)}{1-P(y=1)} = e^{\beta_0 + \beta_1 x_1}\]

This can be rearranged as:

\[\frac{P(y=1)}{1-P(y=1)} = e^{\beta_0} (e^{\beta_1})^{x_1}\]

Looked at this way, each 1-unit increase in \(x_1\) multiplies the right hand side by another factor of \(e^{\beta_1}\) (in the same way that going from \(a^3\) to \(a^4\) is equivalent to multiplying by another factor of \(a\)). So each 1-unit change in \(x_1\) increases the odds by \(e^{\beta_1}\).

To return to our example, shifting \(x_1\) from 0 to 1 changes the odds by a factor of \(e^{\beta_1}\), or \(e^{-0.868} = 0.42\). That means that a white defendant has less than half the odds of getting the death penalty as a black defendant, controlling for the race of the victim. Similarly, a white victim (relative to a black victim) multiplies the odds of the death penalty by \(e^{2.404} = 11.1\).

1.7 Significance

We still haven’t gone into how we estimate the various \(\beta\) coefficients, but let’s hold off on that for now. If we’re interested in the statistical significance of the effect of some \(x\), we again look at the standard error on its \(\beta\) and use it to test whether the \(\beta\) is significantly different from 0, in exactly the way we did for linear regression.

We can also construct a confidence interval around \(\beta\), but since \(\beta\) by itself is hard to interpret, we are often more interested in the confidence interval for the odds-ratio effect.

Say the standard error for the victim variable is 0.601. Then the 95% CI (assuming a large n) is \(2.404 \pm 1.96 * 0.601 = [1.23,3.58]\). But a more interpretable quantity is the 95% CI for the effect of X on the odds, which we get by exponentiating the endpoints: \([e^{1.23},e^{3.58}]\), or \([3.4,35.9]\). That is, our best guess is that the victim being white increases the odds of the defendant getting the death penalty by a factor of 11.1, but we are 95% sure the factor is between 3.4 and 35.9. Since multiplying the odds by 1 doesn’t change them, and the 95% range for the odds doesn’t include 1, we can reject the null that the effect of the victim’s race is insignificant. We could also conclude significance from the fact that the raw \(\beta\) 95% CI of \([1.23,3.58]\) doesn’t include 0 (just as we do for regular regression), or by generating our test statistic of \(2.404/0.601 = 4\) and noting that it is well above the threshold of 1.96.
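Here is that arithmetic as a short R sketch, using the coefficient and standard error quoted above (for a fitted glm object, confint() will produce confidence intervals for the coefficients directly):

b <- 2.404; se <- 0.601
ci_beta <- b + c(-1.96, 1.96) * se   # 95% CI for beta: about [1.23, 3.58]
exp(ci_beta)                         # 95% CI for the odds multiplier: about [3.4, 35.9]
b / se                               # test statistic: 4, well above 1.96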

1.8 Example 2

Here is one more example using our ANES dataset. We have an additional variable in this data set, which records whether the respondent voted in the last election (where of course 1 = voted). What’s the probability that someone will turn out given their various demographic characteristics?

# Read in the ANES data and fit a logistic regression for turnout
anes_2008tr <- read.table("anes_2008tr.csv",sep=",",header=TRUE,stringsAsFactors=FALSE)
# family="binomial" tells glm that the dependent variable is binary (logit link)
lr1 <- glm(voted ~ age + race_white + gender_male + education + 
             income + ideology_con + I(ideology_con^2) + partyid_rep,
           data=anes_2008tr,family="binomial")
summary(lr1)

Call:
glm(formula = voted ~ age + race_white + gender_male + education + 
    income + ideology_con + I(ideology_con^2) + partyid_rep, 
    family = "binomial", data = anes_2008tr)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6808  -1.1149   0.6252   0.8553   1.7542  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -0.23447    0.43286  -0.542 0.588041    
age                0.01934    0.00265   7.299 2.90e-13 ***
race_white        -0.02513    0.10602  -0.237 0.812663    
gender_male       -0.26983    0.09613  -2.807 0.005001 ** 
education          0.34633    0.03407  10.166  < 2e-16 ***
income             0.17651    0.04872   3.623 0.000292 ***
ideology_con      -0.68102    0.17808  -3.824 0.000131 ***
I(ideology_con^2)  0.08314    0.02107   3.946 7.94e-05 ***
partyid_rep       -0.08321    0.02954  -2.817 0.004849 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2873.8  on 2321  degrees of freedom
Residual deviance: 2618.7  on 2313  degrees of freedom
AIC: 2636.7

Number of Fisher Scoring iterations: 4

glm stands for generalized linear model, and is actually a broad family of models that allow for various transformations of y or x. In this case, the “binomial” option lets it know that the dependent variable is binary and the transformation is logistic. How the coefficients are estimated from that is beyond the scope of what we will cover here.
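Note that the “binomial” option is shorthand for the default logit link; spelling the link out explicitly fits the identical model, and makes the logistic transformation visible in the call:

# binomial() defaults to the logit link, so this is the same model as lr1
lr1b <- glm(voted ~ age + race_white + gender_male + education +
              income + ideology_con + I(ideology_con^2) + partyid_rep,
            data=anes_2008tr, family=binomial(link="logit"))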

Looking at the coefficients, we can see that most are significant and in the directions we might expect: age increases the likelihood of voting, as do education and income, while being male, being conservative, or being Republican all decrease it. Note that I’ve also included an ideology^2 term, whose coefficient is positive, indicating that more extreme ideologues (of either left or right) are more likely to vote than centrists.

We can more easily interpret the coefficients themselves by exponentiating them (\(e^{\beta}\)) to get their effect on the odds:

exp(lr1$coef)
      (Intercept)               age        race_white 
        0.7909889         1.0195320         0.9751866 
      gender_male         education            income 
        0.7635055         1.4138630         1.1930495 
     ideology_con I(ideology_con^2)       partyid_rep 
        0.5060996         1.0866958         0.9201625 

As usual, interpreting this requires knowing the units for each x, but age increases the odds of voting by about 2% for each additional year, while education increases them by about 41% for each additional level (although remember that there are many more years than education levels!).
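If you prefer, the exponentiated coefficients can be converted directly into percentage changes in the odds:

# Percent change in the odds of voting for a 1-unit increase in each x
round((exp(coef(lr1)) - 1) * 100, 1)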

2 More than 2 responses

In this lesson we’ve mainly been focusing on dependent variables that are binary. But what if there are more than 2 categories? There are two ways this can happen: first, if the dependent variable is ordinal (ordered categories), and second, if it is nominal (unordered categories).

In both cases, the set-up is much like we had with categorical independent variables: it’s all relative to some baseline category.

If the dependent variable is nominal (such as religion) where there is no order among the options, we just have to choose one as our baseline and compare the others relative to it. If we have \(m\) categories, we essentially have \(m-1\) separate logistic regressions. If we have three categories, for instance, we have one logit where our dependent variable is \(log(P(y=a)/P(y=c))\), and another one where our dependent variable is \(log(P(y=b)/P(y=c))\). Note how in both cases, the reference category is “c” and we are interested in what affects the odds of being a or b relative to being c. But effectively, we have two separate logistic regressions here, one for each pairing, each with separate \(\beta\) coefficients.
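In R, one common way to fit this kind of nominal (multinomial) logit is the multinom() function from the nnet package. A minimal sketch, assuming a data frame dat with a nominal outcome relig and predictors age and income (all hypothetical names):

library(nnet)
# Fits m-1 binary logits, each relative to the first (baseline) level of relig
mn1 <- multinom(relig ~ age + income, data=dat)
summary(mn1)   # one row of coefficients for each non-baseline category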

Conversely, if we have an ordinal dependent variable (such as education level from 1 to 4), we have one combined regression but \(m-1\) separate intercept values \(\beta_{0m}\). We’re still interested in the log odds, but now it’s the odds of being in category 1 vs all the rest; categories 1 & 2 vs all the rest; categories 1-3 vs all the rest; etc. We still chop things up into separate comparisons, but because the dependent variable is ordered, these aren’t totally separate logistic regressions: instead, we are interested in the general question “what’s the chance of having education level \(m\) or less relative to having more than \(m\).” The overall function is \(log(P(y \leq m )/P(y > m)) = \beta_{0m} + \beta_1 x_1 + \beta_2 x_2 + ...\), where each \(\beta_{0m}\) is a different intercept for each level. Conceptually, the idea is that X increases your odds of having a higher education level, but you need less X to get past education level 2 than to get past level 3. Each intercept determines the probability of having \(m\) or less education given all the X values being 0; and since \(P(y \leq m)\) can only grow as \(m\) increases, \(\beta_{0m}\) is larger for larger levels of \(m\) (ie, your baseline probability (the intercept) of having high school or less can be no greater than your baseline probability of having college or less).
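In R, a standard choice for the ordinal case is polr() (proportional-odds logistic regression) from the MASS package. Again a minimal sketch with hypothetical variable names:

library(MASS)
# The outcome must be an ordered factor; polr estimates one shared set of
# slopes plus m-1 intercepts (reported as cutpoints between adjacent levels).
# Note that polr reports slopes with the opposite sign of the equation above:
# a positive coefficient pushes toward higher categories.
dat$educ_level <- ordered(dat$educ_level)
ol1 <- polr(educ_level ~ age + income, data=dat, method="logistic")
summary(ol1)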

So when we run a nominal logit, we get a complete set of \(\beta\) coefficients for all the \(m-1\) categories. When we run an ordinal logit, we get one shared set of \(\beta_x\) values plus \(m-1\) \(\beta_0\) values. Interpretation becomes a bit harder though, especially for the nominal logit, when X might be significant for some DV category pairs but not others. But we won’t go into the details of that here.