Overview
This lesson shows how to conduct difference in means tests.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 15.3.2. Schumacker, Ch. 13.
So far we have just looked at tests against a specific numerical null, such as \(\mu = 0\). But perhaps a more common question is not whether a population mean differs from some number, but whether two population means differ from each other. E.g., are men and women of different heights? Or are drivers on a phone worse than drivers listening to the radio? The difference here is that we have two standard errors associated with two means, rather than one mean and standard error versus a fixed number.
Our height question might be formulated as:
\(H_{a}\): The heights of men and women are different.
\(H_{0}\): The heights of men and women are the same.
As before, these are hypotheses about the population:
\(H_{a}\): \(\mu_{m} \neq \mu_{w}\)
\(H_{0}\): \(\mu_{m} = \mu_{w}\)
And again, we don’t have to prove \(\mu_{m} = \mu_{w}\); that’s the null hypothesis. Instead, we have to reject the null by showing that the two means are sufficiently different (relative to their standard errors) and that that difference is unlikely to be due to chance alone.
The reason this kind of test is so important is that this is the way most experiments work: we measure, say, reaction times for people using a cellphone vs the radio, and then test whether the reaction times of the first group are larger than those of the second. All of experimental science is based, fundamentally, on this one test. (Plus a few others we’ll look at in the next module.)
To formulate it more precisely, the question for an experiment is: Does the mean (for some measurement of interest) for the treatment group differ from the control group?
\(H_{a}\): \(\mu_{t} \neq \mu_{c}\)
\(H_{0}\): \(\mu_{t} = \mu_{c}\)
Again, it is up to us to decide what the control is. For our cellphone driving experiment, there can be various versions of the null:
\(H_{0}\): Talking on the phone is no different than paying full attention.
\(H_{0}\): Talking on the phone is no different than listening to the radio.
Deciding between these two is not a matter of statistics but of theory, and it just depends on what theory we are testing: cellphones vs nothing, or cellphones vs listening alone. There could even be others. But the main idea is that these are binary tests: treatment vs control, and we need statistics to adjudicate.
As you probably know, the usual approach in experimental science to ensure that participants are otherwise the same apart from the treatment or control condition is to assign people at random to one group or the other. That way the only difference will be the condition – or at least, that will be the only systematic difference.
To return to our cellphone experiment, we assign one half the subjects at random to talk on the phone, and the other to listen to the radio. They are using a driving simulator, and at random intervals are shown sudden red lights and asked to stop as quickly as they can. We measure the (mean) reaction time for each person, so our data consist of one number for each person: how long (on average) it takes them to brake in response to the red light.
Say these are our data:
Cell-phone group, mean reaction time = \(\bar{y}_{cell} = 585.2 ms\); \(s_{cell} = 89.6\), \(n = 32\).
Control group, mean reaction time = \(\bar{y}_{radio} = 533.7 ms\); \(s_{radio} = 65.3\), \(n = 32\).
We can see a difference between the two groups in the plot above, and the difference in their means is \(585.2 - 533.7 = 51.5ms\). But is that difference statistically significant, or is the difference we see just due to chance (the null hypothesis)? That is, do cell phones slow down reactions (the two groups are different) or not (the two groups are the same, and the difference we see is just due to random chance)?
To answer that question, as before there are three approaches, all of which are equivalent and produce the same answer: comparing the test statistic to a threshold value, calculating a p-value, and constructing a confidence interval.
For the first approach we need to calculate the test statistic and the threshold value, and to calculate the latter based on the t distribution, we need the degrees of freedom. For the second approach, we need the test statistic and the p-value, and to calculate the latter we again need the degrees of freedom. And for the third, we need the numerator and denominator from the test statistic (ie, the difference in the means and the standard error of the difference), as well as the threshold value, in order to calculate the confidence interval.
From the one-sample test, we know that a sample \(\bar{x}\) will be t-distributed around the mean \(\mu\), ie that \(\frac{\bar{x} - \mu_{0}}{se}\) will be distributed according to the t distribution. Things are almost the same here, except that now it’s the difference between the two means that is t-distributed:
\[\textrm{T statistic} = \frac{\bar{y}_{cell} - \bar{y}_{radio}}{se_{diff}}\]
What is the standard error associated with this difference in means? It’s not the standard error of the cellphone group and it’s not the se of the control group – it must be some combination of the two. Specifically, it is:
\[se_{diff} = \sqrt{se_{1}^{2} + se_{2}^{2}}\]
(You might recognize this equation as the formula for the length of the hypotenuse of a right triangle; the geometry is similar but not worth going into here.)
So for our example we have:
Cell: \(se_{1} = s_{1} / \sqrt{n_{1}} = 89.6 / \sqrt{32} = 15.84\)
Radio: \(se_{2} = s_{2} / \sqrt{n_2} = 65.3 / \sqrt{32} = 11.54\)
\(se_{diff} = \sqrt{se_{1}^{2} + se_{2}^{2}} = \sqrt{15.84^{2} + 11.54^{2}} = 19.6\)
Our test statistic is therefore: \((585.2 - 533.7) / 19.6 = 2.62\).
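These hand calculations are easy to check in R from the summary statistics alone. Here is a quick sketch; all the numbers are the ones given above:

```r
# Summary statistics from the example above
s1 <- 89.6; n1 <- 32   # cell-phone group
s2 <- 65.3; n2 <- 32   # radio group

se1 <- s1 / sqrt(n1)             # = 15.84
se2 <- s2 / sqrt(n2)             # = 11.54
se_diff <- sqrt(se1^2 + se2^2)   # = 19.6

t_stat <- (585.2 - 533.7) / se_diff
t_stat   # about 2.63 (2.62 in the text, due to rounding)
```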
For any of the three testing approaches discussed above, we also need to know the degrees of freedom in order to calculate either the threshold value or the p-value.
What are the degrees of freedom, then? Both groups have \(n = 32\), and before the answer was \(n-1\). So does that mean the degrees of freedom here are 31, or 31+31, or 32+32-1, or what? What happens if the two groups are of different sizes? It turns out that with two groups we need to know not just what their \(n\)s are, but also what their standard deviations are.
When \(n\) and \(s\) are equal across the two groups, the degrees of freedom are relatively simple: \(df = 2n - 2\).
When \(s\) is equal but \(n\) isn’t, \(df = n_{a} + n_{b} - 2\) (note how this reduces to the previous one if \(n_a = n_b\)).
And finally, for any \(n\) and \(s\) we have the unwieldy:
\[df = \frac{se_{diff}^{4}}{se_{a}^{4}/(n_{a}-1) + se_{b}^{4}/(n_{b}-1) }\]
With a little bit of algebra, you can show that the general version reduces to \(2n-2\) when both \(n\) and \(s\) are equal. Note that this equation will usually produce fractions rather than whole numbers, but that’s fine: the degrees of freedom don’t have to be an integer.
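For instance, when both \(n\) and \(s\) are equal across the groups, each group’s standard error is \(s/\sqrt{n}\), and the reduction is quick:

\[se_{diff}^{2} = \frac{s^{2}}{n} + \frac{s^{2}}{n} = \frac{2s^{2}}{n}, \qquad df = \frac{(2s^{2}/n)^{2}}{2\,(s^{2}/n)^{2}/(n-1)} = 2(n-1) = 2n - 2\]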
In our example, \(n\) is equal but \(s\) isn’t, so we have to use the general equation: \(df = 19.6^4/(15.84^4/(32-1) + 11.54^4/(32-1)) = 56.70\).
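That unwieldy formula is only a couple of lines in R. This is the Welch–Satterthwaite approximation, which is also what R’s t.test computes by default; a sketch using the numbers from our example:

```r
se1 <- 89.6 / sqrt(32)   # cell-phone group standard error
se2 <- 65.3 / sqrt(32)   # radio group standard error
se_diff <- sqrt(se1^2 + se2^2)

# general (Welch-Satterthwaite) degrees of freedom
df <- se_diff^4 / (se1^4 / (32 - 1) + se2^4 / (32 - 1))
df   # about 56.7
```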
Let’s finish our hypothesis test using each of the three approaches, assuming an alpha of 0.05 and a two-tailed test.
The threshold value is qt(0.975, 56.70) = 2.00. The test statistic (2.62) is above this critical threshold, so we reject the null and conclude that the difference is statistically significant.
The p-value is 2*pt(2.62, 56.70, lower.tail=F) = 0.011. This is less than 0.05, so we again reject the null.
And the 95% CI is \(51.5 \pm 2.00 \cdot 19.6 = [12.3, 90.7]\), which does not include 0, so once more we reject the null.
Remember that since these are all equivalent, you don’t actually have to do all three – doing any of them will answer the fundamental question of whether the difference between the group means is statistically significant.
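All three checks are one-liners in R, using the test statistic and degrees of freedom calculated above:

```r
t_stat <- 2.62; df <- 56.70
diff <- 51.5; se_diff <- 19.6

# 1. Threshold: compare the test statistic to the critical value
qt(0.975, df)                    # = 2.00
t_stat > qt(0.975, df)           # TRUE -> reject the null

# 2. P-value: probability of a difference this large under the null
2 * pt(t_stat, df, lower.tail = FALSE)   # about 0.011, below 0.05

# 3. 95% confidence interval for the difference: does it include 0?
diff + c(-1, 1) * qt(0.975, df) * se_diff
```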
There is one last variation on the two-sample test, which is actually a bit simpler than the independent sample case we discussed previously.
Say that instead of two separate groups of 32 people each (ie, 64 different people total), this experiment actually used the same 32 people in both conditions (presumably randomizing the order each person got the treatment vs control).
These would then be dependent samples: each person has a cell score and a radio score. As before, we are interested in whether one group has a different mean from the other, but now there is a much easier way to do it. We just create a new data column that is the individual differences, and do a one-sample test using that single vector of data:
person 1: cell = 636, radio = 604, diff = 636 - 604 = 32
person 2: cell = 623, radio = 556, diff = 623 - 556 = 67
person 3: cell = 615, radio = 540, diff = 615 - 540 = 75
… and so on.
Thus our vector of differences is: \(y_{diff} = \{32,67,75,...\}\)
At this point our test is just a standard one-sample test, where the question is, is \(\bar{y}_{diff}\) different from 0? Ie, our null hypothesis is simply \(\bar{y}_{diff} = 0\). Note that we can only do this because each row of the original data is the same individual, allowing us to subtract one column from the other to boil it down to a single vector; this is also known as a paired test.
So to do this we proceed just as we have done in the one-sample test, and once again any of the three testing approaches is fine. From the \(y_{diff}\) vector we can calculate \(s_{y_{diff}} = 52.5\) and thus \(se_{y_{diff}} = 52.5 / \sqrt{32} = 9.28\). The test statistic is therefore \(51.4 / 9.28 = 5.53\), and the degrees of freedom are \(n - 1 = 31\). Since the threshold value is qt(0.975, 31) = 2.04, the test statistic of 5.53 is clearly in the rejection region. The p-value is 2*pt(5.53, 31, lower.tail=F) = 0.0000047, so again we reject the null. And the 95% CI is \(51.4 \pm 2.04 \cdot 9.28 = [32.47, 70.33]\), which does not include 0 and thus again we reject the null.
As before, we can of course do all these tests in R.
For the independent sample test, here’s a silly example using the built-in cars dataset:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
t.test(cars$speed, cars$dist)
Welch Two Sample t-test
data: cars$speed and cars$dist
t = -7.4134, df = 53.119, p-value = 9.628e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-35.04152 -20.11848
sample estimates:
mean of x mean of y
15.40 42.98
Each of the values you see in the R output above you can now calculate yourself (t, df, p-value, CI, group means), so we don’t really need R – but it certainly makes life easier! Unsurprisingly, car speed and car distance are not statistically equal.
Another common approach is when you have one vector (such as reaction times) and a second dummy variable that denotes which group (such as treatment vs control) each observation belongs to. Let’s create a dummy variable like that (for instance, it could be cars that got a special gasoline or not):
carfake <- cbind(cars,dummy=round(runif(50)))
head(carfake)
speed dist dummy
1 4 2 0
2 4 10 0
3 7 4 0
4 7 22 0
5 8 16 1
6 9 10 0
Now to test for a difference between the two subsets of dist we would run:
t.test(dist ~ dummy, data=carfake)
Welch Two Sample t-test
data: dist by dummy
t = -0.7828, df = 46.47, p-value = 0.4377
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-19.969470 8.784285
sample estimates:
mean in group 0 mean in group 1
40.40741 46.00000
The above are both independent sample tests, since the two groups are totally different individuals. If you want a dependent sample test with two matched columns where each row is the same individual, you can run it this way:
simdat <- data.frame(x=1:10,y=rnorm(10))
t.test(simdat$x,simdat$y, paired=TRUE)
Paired t-test
data: simdat$x and simdat$y
t = 6.2647, df = 9, p-value = 0.000147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.819359 8.136604
sample estimates:
mean of the differences
5.977982
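As a sanity check on the paired logic above, the paired test really is just a one-sample test on the differences: t.test(x, y, paired=TRUE) and t.test(x - y) give identical results. A quick sketch with hypothetical simulated data (the means and sds echo our driving example, but the values are made up):

```r
# Hypothetical data: 32 paired measurements per condition
set.seed(42)
cell  <- rnorm(32, mean = 585, sd = 90)
radio <- rnorm(32, mean = 534, sd = 65)

paired    <- t.test(cell, radio, paired = TRUE)
onesample <- t.test(cell - radio)   # one-sample test on the differences

paired$statistic == onesample$statistic   # TRUE
paired$p.value   == onesample$p.value     # TRUE
```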