Overview
This lesson introduces the chi-square test.
Objectives
After completing this module, students should be able to:
Reading
Schumacker, Chapter 11.
So far we have examined a sequence of progressively more complex tests:
The latter two, in particular, are central to the experimental method, eg for verifying that a treatment actually has an effect.
But often we wish to go beyond the simple comparison of groups to examine the relationship between multiple variables. This is especially important with observational data, where we can’t (for practical or moral reasons) directly manipulate X in order to see whether it has an effect on Y. For instance, is there a connection between income and political affiliation? If so, which way does the correlation go (does increased wealth correlate with increased liberalism or conservatism)? And if there is a correlation, is it causal, that is, does income cause people to become more conservative, or does (for instance) increased age or some other factor cause both?
These are the sorts of questions that are often at the heart of the social and other sciences. And like this example, it is often difficult or impossible to create an experiment to test the causal question directly (how would you do it in this case?), and we must instead rely on observational data.
We will pursue this chain – from independence to correlation to causation – over the next few modules. But the most basic question is whether two variables are independent of each other, and the most straightforward and common test of this is the chi-square test.
The chi-square test is kind of like the dependent-sample t test, inasmuch as we must have multiple measures of the same thing. Consider a simpler version of the question posed above: are gender and political party independent of each other, or are they somehow correlated or dependent? Here we must have at least two measures for each person: their gender, and their party affiliation. We’re no longer looking at two or more groups and asking whether (for instance) their heights are different. Instead, we are now interested in whether two different aspects of the same person or object or (more generally) “observation” are connected to each other. When we turn to multiple regression, we might ask what the causal relations are between a large number of variables – age, income, party, gender, etc – but the most basic approach is looking at whether any two are connected; and the most basic way to do that is with categorical (as opposed to numerical) data, where we just have a finite number of categories. Eg, gender (male and female) and party (Democrat, Republican, Independent). (Obviously in both cases the reality is more complex and variable, but we will use this simplification for now.)
So this is the simplest question we could possibly ask about two potentially-connected variables: are they independent, or not? There’s no issue of how they are connected, and we aren’t even dealing with continuous numbers. But nevertheless, this test is quite powerful for at least showing there is some dependence between the two.
For the chi-square test then, the null and research hypotheses are usually:
\(H_0\): The variables are independent of each other.
\(H_1\): The variables are not independent (ie, they are dependent).
What do we mean by dependent? Basically we mean the same thing as we meant in the probability section: A and B are independent if \(P(A \& B) = P(A)P(B)\). That is, knowing that B has occurred gives you no information about A: there is no interaction between these two processes, and nothing affecting them both. Two dice are independent when the roll of the first tells you nothing about how the second will come up. They are not independent if someone has loaded them such that they always come up with 7 total: if you know how the first came up, you of course now know how the second will.
We can imagine an even simpler version, with two coins: now we have four outcomes (HH, HT, TH, TT), and we might wonder whether the two coins are independent or not. Suppose we arrange the outcomes into a 2x2 table (H/T down the side for coin 1 and H/T along the top for coin 2), repeatedly flip the pair of coins, and put a check in the appropriate box depending on the outcome. After a large number of flips, we would expect to see about the same number of check marks in each box. If we thought the coins were rigged (eg, to come up HH more often), we might expect to see more checks in that box than the rest. But of course this is a random process, so we would need to ask whether, if we did see more in the HH box, it was really due to cheating rather than just random chance. How do we test that? The chi-square test.
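We can simulate this situation to see what independence looks like in practice. Here is a minimal sketch (the object names, the seed, and the use of sample() are just illustrative choices, not part of the reading):

# Two independent fair coins, flipped many times
set.seed(42)                                   # for reproducibility
n <- 10000
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)
table(coin1, coin2)    # each of the four cells should hold roughly n/4 flips

Because the two coins are generated independently, the four cell counts come out roughly equal; a rigged pair of coins would instead pile up checks in one cell.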
We often encounter a similar situation in observational data. Say we ask a bunch of people two questions: their gender, and their political affiliation. For each person we have two variables, and we create the following table of counts, where each number corresponds to the number of people in that category (eg, female Republicans):

|  | Dem | Indep | Rep | Total |
|---|---|---|---|---|
| Female | 573 | 516 | 422 | 1511 |
| Male | 386 | 475 | 399 | 1260 |
| Total | 959 | 991 | 821 | 2771 |
If there is no relationship between gender and politics, we might expect to see an equal number of people in each cell. But that’s not quite right: maybe there are more Democrats than Republicans in our survey. In that case, we would of course expect to see equal numbers of male and female Democrats, but more of each than male or female Republicans. But that’s not quite right either: perhaps there are also more females in our survey. Then we would expect to see more female Democrats than male Democrats, and the same for Republicans, but also more female Democrats than female Republicans, and the same for males.
Specifically, if in the overall sample there are, say, 60% women and 40% men, and 70% Democrats vs 30% Republicans, then if these two things are independent (like separate coin flips), we would expect the proportion of female Democrats to be \(0.60*0.70 = 0.42\), male Democrats \(0.40*0.70 = 0.28\), female Republicans \(0.60*0.30 = 0.18\), and male Republicans \(0.40*0.30 = 0.12\). (And to go from proportions to counts, we’d multiply by the total number of people in our survey.) So this is what we would expect to see if the two things were totally independent of each other. If on the other hand they are not independent – if for instance being female makes it more likely you’re a Democrat – then we would see different counts or proportions than we expected. But how different is enough to show it’s not just random?
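As a quick check of that arithmetic in R (using only the toy 60/40 and 70/30 numbers from the paragraph above, not the actual survey data):

p_female <- 0.60; p_male <- 0.40
p_dem <- 0.70; p_rep <- 0.30
p_female * p_dem    # expected proportion of female Democrats: 0.42
p_male * p_dem      # male Democrats: 0.28
p_female * p_rep    # female Republicans: 0.18
p_male * p_rep      # male Republicans: 0.12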
So how do we formalize this into a test? We want a single number that summarizes how far off the counts in each cell are from what we would expect if the two variables are independent. Once we’ve calculated this number (a test statistic), we then want to be able to say, well, how unlikely was it to get a number of that size? And that’s just like the t test or the F test. Something like the central limit theorem tells us how we would expect this test statistic to be distributed around the truth (just like the means of samples are distributed around the truth if we took lots of separate samples), and based on this distribution, we can say how unlikely the draw (test statistic) we got was, and thus whether the null was likely to be true (ie, whether the two variables are independent of each other).
So before turning to the statistical test, let’s first work on measuring in one number how far the counts in all the cells differ from what we would expect to see if the two variables were entirely independent.
So what would we expect to see in each cell if gender and party ID are independent?
Proportion female: 1511 / 2771 = 0.545
Proportion male: 1260 / 2771 = 0.455
Proportion Dem: 959 / 2771 = 0.346
Proportion Indep: 991 / 2771 = 0.358
Proportion Rep: 821 / 2771 = 0.296
If being Dem and being female are independent of each other, then
\(p(Dem \& F) = p(Dem) * p(F) = 0.346*0.545 = 0.189\)
Thus the total number of female Democrats we would expect to see (the total in that cell) would be \(0.189*2771 \approx 523\), or 522.9 if we avoid rounding the proportions. (Equivalently, the expected count for any cell is its row total times its column total divided by the grand total: \(1511*959/2771 = 522.9\).)
We can do the exact same calculation for each cell, and the expected totals we get are:

|  | Dem | Indep | Rep |
|---|---|---|---|
| Female | 522.9 | 540.4 | 447.7 |
| Male | 436.1 | 450.6 | 373.3 |
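These expected counts are easy to compute in R. A small sketch, entering the observed counts as a matrix (the names obs and expected are just illustrative):

# Observed counts from the survey
obs <- matrix(c(573, 516, 422,
                386, 475, 399),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("female", "male"), c("dem", "indep", "rep")))
# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)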
So how far is what we actually observed from what we would have expected to see if the variables were independent of each other? Well, basically we just add up the differences between the observed count and the expected count for each cell, and that’s our number. Of course, it’s not quite so simple: like the t test, where the difference between the observed mean and the null hypothesis is scaled by dividing by the standard error, we have to do a bit more to get our total number into the right scale before we can talk about the statistics.
In particular, our test statistic, called the chi-squared (or sometimes chi-square) statistic, is:
\[\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}}\]
Where \(f_{o} =\) observed number in a cell and \(f_{e} =\) expected number in a cell, and the summation is over all the cells. That is, for each cell, we take the difference between what we observe and what we would expect to observe if the two variables were independent, square that difference, and take it as a fraction of the expected count; and then we just add them all up.
So in this case, our statistic is:
\(\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}} = \frac{(573-522.9)^{2}}{522.9} + \frac{(516-540.4)^{2}}{540.4} + ... + \frac{(399 - 373.3)^{2} }{373.3} = 16.2\)
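We can let R do the same arithmetic, reusing the obs and expected objects from the sketch above:

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chisq_stat <- sum((obs - expected)^2 / expected)
chisq_stat    # roughly 16.2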
But is that a big number? As with the t test statistic, we need to know the underlying distribution to answer that…
Like the uniform, binomial, Poisson, normal, t, and F distributions, the chi-square is just another distribution. Whereas the normal and t distributions, for instance, describe sample statistics such as means, the chi-square distribution describes the sum of squared standard normal variables (mean 0, standard deviation 1). If you look back at the \(\chi^{2}\) equation, you see that the terms in the numerator are squared, and of course the denominator is also positive (being a count), so the \(\chi^{2}\) is always positive, and thus can’t be normally distributed. If we square a bunch of standard normal samples and add them up, we get a distribution that looks like the \(\chi^{2}\):
# Sum of squared standard normal draws; three draws per sum gives the chi-square shape with 3 df
z1 <- rnorm(1000)   # standard normal: mean 0, sd 1
z2 <- rnorm(1000)
z3 <- rnorm(1000)
zsq_tot <- z1^2 + z2^2 + z3^2
hist(zsq_tot, breaks=30)
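To see that this simulated pile of sums really does follow a \(\chi^{2}\) with 3 degrees of freedom, we can overlay the theoretical density on a density-scaled histogram (a quick sketch):

# Redraw the histogram on the density scale and overlay the chi-square(3) density
hist(zsq_tot, breaks = 30, freq = FALSE)
curve(dchisq(x, df = 3), from = 0, to = max(zsq_tot), add = TRUE)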
As usual, the \(\chi^{2}\) distribution also has a shape parameter which, like the t, corresponds to the degrees of freedom. For the \(\chi^{2}\), however, the degrees of freedom depend not on the number of observations but on the number of cells in the table. Like the t, we again have a minus-1: instead of being \(\#rows*\#columns\) (ie, the number of cells), the degrees of freedom is
\(df = (r-1)(c-1)\), where \(r =\) number of rows, \(c =\) number of columns.
The shape of the \(\chi^{2}\) varies with the df.
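We can draw a few of these curves with dchisq (a quick sketch; the particular df values chosen are arbitrary):

# Chi-square densities for several degrees of freedom
x <- seq(0.01, 20, length.out = 500)
plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 0.5),
     xlab = "value", ylab = "density")
lines(x, dchisq(x, df = 2), lty = 2)
lines(x, dchisq(x, df = 4), lty = 3)
lines(x, dchisq(x, df = 8), lty = 4)
legend("topright", legend = paste("df =", c(1, 2, 4, 8)), lty = 1:4)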
As usual, we consider our test statistic a draw from the \(\chi^{2}\) distribution (with the appropriate degrees of freedom), and the farther out it is, the less likely it is. The \(\chi^{2}\) test is fundamentally one-tailed: we are only interested in whether the statistic is larger than we would expect if the variables were independent, and of course it can’t be negative due to squaring the differences. And once again, if it falls into the rejection region – eg, the region of the right tail of the distribution that accounts for \(\leq 0.05\) of the total – then we know that number was unlikely to be that large just by chance alone. We can also, equivalently, calculate the exact p-value (ie, the area to the right of our test statistic), and reject the null (that the variables are independent) if the exact p-value is less than 0.05.
To return to our example, the df is \((r-1)(c-1) = (2-1)(3-1) = 2\), and our test statistic was 16.2.
Our 95% threshold value is thus
qchisq(.95, df=2)
[1] 5.991465
Our test statistic is clearly much larger (16.2 > 5.99), so we reject the null that these two variables (gender and political affiliation) are independent. We could similarly calculate the p-value directly and likewise reject the null:
1-pchisq(16.2, df=2)
[1] 0.0003035391
Although with modern computation we don’t really need them any more, we could also determine the threshold value using a \(\chi^{2}\) table:
As usual, we find the \(df\) on the right, and look for the \(\alpha\) level along the top. Eg, for an \(\alpha\) of 0.05, we look under \(\chi^{2}_{0.050}\), and once again we see our threshold value of 5.991.
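In fact, we can generate a small version of such a table ourselves with qchisq (a sketch; the particular df rows and \(\alpha\) columns shown are arbitrary):

# Critical values of the chi-square for a few df and alpha levels
alpha <- c(0.10, 0.05, 0.01)
dfs <- 1:5
crit <- outer(dfs, alpha, function(d, a) qchisq(1 - a, df = d))
dimnames(crit) <- list(paste("df =", dfs), paste("alpha =", alpha))
round(crit, 3)

The df = 2, alpha = 0.05 entry is the 5.991 threshold we computed above.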
So what did we learn in this example? That gender and party are not independent. But can we say anything more than that? One simple thing we can do is go back to our table of expected vs observed frequencies and calculate a signed score for each cell: \(\frac{(f_{o}-f_{e})^{2}}{f_{e}}\), given a negative or positive sign according to whether \(f_{o}-f_{e}\) is negative or positive.
Thus the score in the top left would be \(\frac{(573-522.9)^{2}}{522.9} = 4.8\) and the top middle would be \(-1 * \frac{(516-540.4)^{2}}{540.4} = -1.1\) (note how the second number is now negative). This gives a sense, cell by cell, of which counts are over their expected values, which are under, and by how much. Here are the scores by cell:
|  | Dem | Indep | Rep |
|---|---|---|---|
| Female | \(4.80\) | \(-1.10\) | \(-1.48\) |
| Male | \(-5.76\) | \(1.32\) | \(1.77\) |
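These signed scores can also be computed in one line, again reusing the obs and expected objects from the earlier sketch:

# Signed score per cell: (observed - expected)^2 / expected, carrying the sign of (observed - expected)
signed <- sign(obs - expected) * (obs - expected)^2 / expected
round(signed, 2)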
So we can see that the main divergence is among Democrats, where we see many more female Democrats and many fewer male Democrats than we would expect by chance. In the next module we will be examining correlations and then later regressions, which would show a negative correlation between the Democrat-Republican spectrum and the male-female spectrum. But although regression methods are in general much more robust, especially when dealing with more than two variables, the \(\chi^2\) test is nice in that it shows in this case that it’s not just a matter of more males / fewer females the further you go from Democrat to Republican, but more a specific function of Democrats in particular where the gender divide is most pronounced. What that means substantively, of course, remains up to the researcher.
Finally, we can also conduct this test directly in R.
sexparty <- data.frame(dem=c(573,386),indep=c(516,475),rep=c(422,399),row.names=c("female","male"))
sexparty
dem indep rep
female 573 516 422
male 386 475 399
chisq.test(sexparty)
Pearson's Chi-squared test
data: sexparty
X-squared = 16.202, df = 2, p-value = 0.0003033
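The object returned by chisq.test also stores the intermediate quantities we computed by hand, which is a handy way to check our work:

out <- chisq.test(sexparty)
out$expected     # expected counts under independence
out$residuals    # Pearson residuals: (observed - expected) / sqrt(expected)

Note that the stored residuals are on the \((f_{o}-f_{e})/\sqrt{f_{e}}\) scale; squaring them gives each cell's contribution to the \(\chi^{2}\) statistic.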
Now that we have seen how to do this simple test of dependence between two categorical variables, we are ready to move on to the more complex case, where our variables are continuous. This will lead us into correlation and regression methods in the next modules.