Overview

This lesson introduces ANOVA and the F test.

Objectives

After completing this module, students should be able to:

  1. Test for differences among multiple groups.
  2. Explain the derivation of the F test.
  3. Conduct an F test by hand and using R.

Reading

Lander, Chapter 15.4. Schumacker, Chapter 14.

1 Testing multiple groups

In the last lesson we moved from a single sample test to a two-sample test. But what if we have more than two groups? Say we have a more complex experiment, with a number of different treatments, and we wish to know whether at least one of them is different from the others. This kind of question actually comes up surprisingly often, not just in experiments, but in multiple regression, where we are investigating the effects of potentially numerous independent variables on some dependent variable, and might wish to know whether at least one of them has an effect.

In both observational and experimental data, it's sometimes the case that if you compare any one of your samples to some baseline (eg, the control group), it doesn't appear significantly different (eg, via a t test), but if you look at lots of different treatments, although individually they may each not be significant, across all of them the pattern may achieve significance. (Eg, you have five related treatments, none of which individually produces significantly longer life, but all five of them nevertheless have slightly higher means than the control, suggesting that collectively you may be on to something.)

2 ANOVA

The traditional way this sort of multiple-mean test has been conducted is via an Analysis of Variance, or ANOVA. The specific test is called the F test, and it works much like the t test: you calculate an F statistic (test statistic) given your sample, and compare that to a threshold value which is based on the F distribution. If the test statistic is in the rejection region (or equivalently, the p-value is sufficiently low), you reject the null that all the means are equal, and accept the alternative that at least one of them is different – although the test itself doesn’t tell you which.

Formally, we have:

\(H_{0}\): \(\mu_{1} = \mu_{2} = ... = \mu_{G}\)

\(H_{a}\): at least one is different.

Recall that for the t test, we have two samples, and want to know whether they are significantly different. It’s not enough to just take the difference between the two means: two means may be 100 apart, but if they both have standard errors of 1000, they are unlikely to be significantly different. Conversely, two means that are 1 apart but have standard errors of 0.001 are clearly significantly different. Thus it’s a matter of the difference relative to the standard errors – and the same holds for multiple groups.

Picture a number of different groups, each with its own standard deviation. The same means may be significantly or non-significantly different depending on the spread of each distribution. The more different the means of each group (the apex of each mound), the greater the probability that the groups are different; but the more spread out each group is, the less different they are likely to be. So we need a ratio as a measure of the total evidence that they are different:

F-statistic = \(\frac{\textrm{average variance between groups}}{\textrm{average variance within groups}}\)

The higher the numerator, the more different they are; but the higher the denominator, the less significant that difference is.

2.1 Between-group variance

So let’s take the numerator and denominator separately. For the numerator, we want an overall measure of how different the means of all the groups are from each other. This is the between-group or between variance.

Say there are \(G\) groups, each of size \(n_i\), and each with mean \(\bar{y}_i\). If the overall mean of all of the samples pooled together is \(\bar{y}\), then we can measure how much each group \(i\) differs from that overall mean, and add up all those squared differences to get a single number that captures the variance of all those \(\bar{y}_i\) values around their collective mean \(\bar{y}\). This works almost exactly like the calculation of variance for a single sample, except that now we have group means \(\bar{y}_i\) instead of individual observations \(y_i\), and the \(n\) is not the number of individuals, but the number of groups \(G\):

\[\textrm{Between variance } = \frac{n_{1}(\bar{y}_{1} - \bar{y})^{2}+ ... + n_{G}(\bar{y}_{G} - \bar{y})^{2} }{G-1}\]

There is one more difference, as you can see: each difference \((\bar{y}_{i} - \bar{y})^{2}\) is weighted (multiplied) by the size of the group \(n_i\); this makes sense, since a group of 1000 individuals should have much more effect on the overall difference between the groups than a group of just 3.
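As a sketch of this calculation, here is the between variance computed in R for a hypothetical set of three small groups (the numbers are made up for illustration):

```r
# Made-up data: three groups of unequal sizes
groups <- list(g1 = c(2, 3, 4, 3),
               g2 = c(5, 6, 5, 6, 5),
               g3 = c(8, 9, 8))
ni     <- sapply(groups, length)   # group sizes n_i
ybar_i <- sapply(groups, mean)     # group means
ybar   <- mean(unlist(groups))     # overall (pooled) mean
G      <- length(groups)

# Weighted squared deviations of group means, divided by G - 1
between <- sum(ni * (ybar_i - ybar)^2) / (G - 1)
between
# 24.4 for these made-up numbers
```

Note that the weighting by \(n_i\) means the large middle group, whose mean sits near the overall mean, contributes almost nothing, while the small but distant third group contributes a lot.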

2.2 Within-group variance

Next we want the denominator of F: a measure of the average variance within each group – the spread of each mound. This is the within-group or within variance. The calculation here is even more straightforward: it resembles the average variance of all the groups, but again weighted by (approximately) the size of each group:

\[\textrm{Within variance } = \frac{(n_{1}-1)s_{1}^{2}+ ... + (n_{G}-1)s_{G}^{2} }{N-G}\]

Once again, each individual variance \(s_i^2\) is weighted by the size of that group, \(n_i-1\) (minus 1 for degree-of-freedom reasons), reflecting the greater contribution of the bigger groups. In the denominator we have something a little different: \(N\) is the total number of individuals in all the groups, but we divide not by \(N\) but by \(N-G\) for yet another degree-of-freedom reason – essentially, we have a slightly smaller effective sample because of the groups, so we subtract the number of groups \(G\) from \(N\), and divide instead by \(N-G\). But since \(N\) is usually much larger than \(G\), that’s often pretty close to just dividing by \(N\).
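The within variance can be sketched in R the same way, again using a made-up set of three small groups:

```r
# Made-up data: three groups of unequal sizes
groups <- list(g1 = c(2, 3, 4, 3),
               g2 = c(5, 6, 5, 6, 5),
               g3 = c(8, 9, 8))
ni  <- sapply(groups, length)   # group sizes n_i
si2 <- sapply(groups, var)      # sample variance s_i^2 of each group
N   <- sum(ni)                  # total number of individuals
G   <- length(groups)

# Variances weighted by n_i - 1, divided by N - G
within <- sum((ni - 1) * si2) / (N - G)
within
```

Because R's `var()` already divides by \(n_i - 1\), multiplying by \(n_i - 1\) recovers each group's sum of squared deviations, so the numerator is just the pooled sum of squares.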

3 Example: Party ID and Ideology

Say we are interested in whether Party ID (Democrat, Independent, or Republican) and Ideology (liberal vs conservative) are related to each other. We ask a few hundred people their party ID and their ideology on a 7-point scale from strong liberal to strong conservative. The following table shows the number of people in each sub-category:

Is there some association between party ID and ideology here? One way to answer this is to look at the mean ideological score for each party group: Democrats, Independents, and Republicans, and ask whether or not they are all the same.

This is a test of three groups, and thus we need the F test. Remember though that the F test is a limited test: it only tests for whether at least one of these groups is different from the others; it doesn’t tell us which ones.

3.1 Calculating the F statistic

F-statistic = \(\frac{\textrm{average variance between groups}}{\textrm{average variance within groups}}\)

\(\textrm{Between variance } = \frac{n_{1}(\bar{y}_{1} - \bar{y})^{2}+ ... + n_{G}(\bar{y}_{G} - \bar{y})^{2} }{G-1}\)

\(\textrm{Within variance } = \frac{(n_{1}-1)s_{1}^{2}+ ... + (n_{G}-1)s_{G}^{2} }{N-G}\)

Overall mean \(\bar{y} = 3.89\).

\(BV = \frac{91(3.23-3.89)^{2} + 111(3.90-3.89)^{2} + 74(4.70-3.89)^{2}}{3-1} = 44.2\)

\(WV = \frac{(91-1)(1.28)^{2} + (111-1)(1.43)^{2} + (74-1)(1.10)^{2}}{276-3} = 1.68\)

\(F = 44.2/1.68 = 26.3\).
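We can reproduce this arithmetic in R from the summary statistics reported above (the small discrepancies with the hand-rounded 44.2 and 26.3 come from rounding the group means and standard deviations):

```r
# Summary statistics from the worked example
ni     <- c(91, 111, 74)         # group sizes: Dem, Ind, Rep
ybar_i <- c(3.23, 3.90, 4.70)    # group mean ideology scores
si     <- c(1.28, 1.43, 1.10)    # group standard deviations
ybar   <- 3.89                   # overall mean
G <- length(ni)
N <- sum(ni)

BV <- sum(ni * (ybar_i - ybar)^2) / (G - 1)   # between variance
WV <- sum((ni - 1) * si^2) / (N - G)          # within variance
Fstat <- BV / WV
round(c(between = BV, within = WV, F = Fstat), 2)
```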

As with the t test, of course, we can’t say anything about whether 26.3 is big enough to reject the null or not unless we know the F distribution. And as with the t, normal, and all the others, there are shape parameters too; as with the t, those shape parameters are determined by the degrees of freedom of the system.

3.2 Degrees of freedom

The t distribution has one parameter (besides the mean and se) that affects its shape, the degree of freedom, which determines how much of the mass of the distribution is in the tails vs in the middle. The F distribution, by contrast, has two shape parameters, reflecting two different degrees of freedom:

The first degree of freedom is the denominator of the between variance: \(df_{1} = G-1\). The second degree of freedom is the denominator of the within variance: \(df_{2} = N-G\). These two together determine the shape of the F distribution, and thus whether the F test statistic is large enough to reject the null.

Again, as with the t, the idea is that we assume the null is true, that these groups are all drawn from the same underlying population, and any observed variation is simply due to chance. The variation among these groups – the F number – will tend (if we were to do the sampling repeatedly) to be distributed not with a t distribution, but an F distribution. Given that the F distribution defines the random variation we would see, we need to know whether the F statistic we actually got is sufficiently unlikely to have been drawn by chance – if it’s sufficiently far out in the tail – that we can decide that the variation among the means is not due to chance alone, and instead likely reflects a genuine difference between the groups. If so, we reject the null, that all the means are the same, in favor of the alternative, that at least one is different from the rest.

Note that unlike the t, the F is always positive – we are adding up a bunch of squared numbers. We are only interested in large values: how different our groups are. A big number reflects more different means, whereas the smallest possible number, 0, reflects means that are all identical (the null). Thus the F test is always one-tailed.

3.3 Calculating the F threshold

Returning to our example, our degrees of freedom are: \(df_{1} = G-1 = 2\) and \(df_{2} = N-G = 273\). And our F statistic is \(F = 26.3\).

Once again, we can calculate our threshold values using R, or using a table. To do so using R, we once again use our \(\alpha\) level; if \(\alpha = 0.05\), then we want the F value such that area of the distribution greater than this value (ie, the right tail) is 5% of the total.

Thus we can calculate our threshold value as:

qf(0.95,2,273)
[1] 3.028847

The value we got – 26.3 – is clearly far greater than this, and thus is well within the rejection region. That means we can conclude that such a large number was unlikely to have occurred by chance assuming the null hypothesis that each of the three groups was drawn from the same population. So we reject the null that the means of all three groups are the same, in favor of the alternative hypothesis that at least one of them is different. Substantively, this isn’t too surprising given that we already believe that party ID and ideology are not at all independent of each other.

3.4 Using p-values or tables

Again as with the t test, we can also calculate directly the p-value for the score we got (26.3), which is the probability of getting something that large or larger assuming the null is true.

1-pf(26.3,2,273)
[1] 3.587419e-11

This is clearly much, much lower than 0.05, so once again we can reject the null.

We can also do our calculations the old-fashioned way, using F tables. But putting this in tables is even more complicated than with the t distribution, since now we have three variables: the two degrees of freedom, plus the \(\alpha\) level. But since we so often use an \(\alpha\) of 0.05, we can just give a table for that level: along the left is the first degree of freedom, along the top is the second, and the values in the table are the critical threshold values.
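In fact, a small slice of such an \(\alpha = 0.05\) table is easy to generate in R with qf (the df values chosen here are just for illustration):

```r
# A few critical values of the F distribution at alpha = 0.05:
# rows are df1 (G - 1), columns are df2 (N - G)
df1 <- 1:3
df2 <- c(10, 30, 273)
tab <- outer(df1, df2, function(a, b) qf(0.95, a, b))
dimnames(tab) <- list(df1 = df1, df2 = df2)
round(tab, 2)
```

Reading off the row for \(df_1 = 2\) and the column for \(df_2 = 273\) recovers the 3.03 threshold computed above.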

As we can see for our example, with \(df_{1} = 2\) and \(df_{2} = 273\), our critical value is around 3, and 26.3 is well above it, so once again we reject the null.

4 Doing the F test in R

Of course, we can also use R to conduct the entire F test for us, just as we can do the t test with t.test. The function in this case is aov, for ANOVA. In this example we use experimental data from the Personality Project (see R code for the URL).

datafilename="http://personality-project.org/r/datasets/R.appendix1.data"
data.ex1=read.table(datafilename,header=T)
head(data.ex1)
  Dosage Alertness
1      a        30
2      a        38
3      a        35
4      a        41
5      a        27
6      a        24

In this case, we have “alertness” measurements for three different groups, each of which received a different dosage of some drug. We might want to test whether, across all these different dosages (coded as a, b, and c), there is any difference at all in alertness levels. Thus we run ANOVA:

aov.ex1 = aov(Alertness~Dosage,data=data.ex1) 
summary(aov.ex1)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Dosage       2  426.2  213.12   8.789 0.00298 **
Residuals   15  363.8   24.25                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This immediately gives the p-value, which is much less than 0.05, and therefore we can reject the null that these three groups are equal. To check R’s work, we can see that our between variance is 213.12 and our within variance is 24.25, thus our F is \(213.12/ 24.25 = 8.788\). Our degrees of freedom are \(df_1 = 2\) (there are three groups) and \(df_2 = 15\) (there are 18 observations). The p value is therefore 1-pf(8.788,2,15) = 0.00298.
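We can make the same check fully self-contained (no download required) by simulating data with the same structure – the dosage means of 30, 35, and 25 below are made up – and confirming that aov’s F value matches the hand formula:

```r
# Simulated data in the same shape as the dosage example
set.seed(42)
d <- data.frame(Dosage    = rep(c("a", "b", "c"), each = 6),
                Alertness = rnorm(18, mean = rep(c(30, 35, 25), each = 6), sd = 5))
fit <- aov(Alertness ~ Dosage, data = d)

# Hand computation of the F statistic
ni     <- tapply(d$Alertness, d$Dosage, length)
ybar_i <- tapply(d$Alertness, d$Dosage, mean)
si2    <- tapply(d$Alertness, d$Dosage, var)
BV <- sum(ni * (ybar_i - mean(d$Alertness))^2) / (3 - 1)   # df1 = G - 1
WV <- sum((ni - 1) * si2) / (18 - 3)                       # df2 = N - G
c(by_hand = BV / WV, from_aov = summary(fit)[[1]][["F value"]][1])
```

The two numbers agree to machine precision, since aov’s mean squares are exactly the between and within variances defined earlier.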

Although we won’t go into it here, there are also ANOVA equivalents of the dependent-sample (paired) t test. But in some ways ANOVA has become less common these days, having been largely replaced by regression methods, which we turn to in the next modules. However, the F test underlying ANOVA remains important even in regression, as we will soon see.