
Difference-in-means testing

Overview

This lesson shows how to conduct difference-in-means tests.

Objectives

After completing this module, students should be able to:

  1. Recognize when to apply a difference-in-means test.
  2. Apply this test to experimental data.
  3. Calculate test statistics and degrees of freedom.
  4. Conduct tests using dependent samples.

Readings

Lander, Chapter 15.3.2. Schumacker, Ch. 13.

Testing with two samples

So far we have just looked at tests against a specific numerical null, such as \(\mu = 0\). But perhaps a more common question is not whether a population mean is different from some number, but instead whether two population means are different from each other. Eg, are men and women of different heights? Or are drivers on a phone worse than drivers listening to the radio? The difference here is that we have two standard errors associated with two means, and not just one mean and se versus a fixed number.

Our height question might be formulated as:

\(H_{a}\): The heights of men and women are different.

\(H_{0}\): The heights of men and women are the same.

As before, these are hypotheses about the population:

\(H_{a}\): \(\mu_{m} \neq \mu_{w}\)

\(H_{0}\): \(\mu_{m} = \mu_{w}\)

And again, we don’t have to prove \(\mu_{m} = \mu_{w}\); that’s the null hypothesis. Instead, we have to reject the null by showing that the two means are sufficiently different (relative to their standard errors) that the difference is unlikely to be due to chance alone.

Experiments: Treatment vs Control

The reason this kind of test is so important is that this is the way most experiments work: we measure, say, reaction times for people using a cellphone vs the radio, and then test whether the reaction times of the first group are larger than those of the second. All of experimental science is based, fundamentally, on this one test. (Plus a few others we’ll look at in the next module.)

To formulate it more precisely, the question for an experiment is: Does the mean (for some measurement of interest) for the treatment group differ from the control group?

\(H_{a}\): \(\mu_{t} \neq \mu_{c}\)

\(H_{0}\): \(\mu_{t} = \mu_{c}\)

Again, it is up to us to decide what the control is. For our cellphone driving experiment, there can be various versions of the null:

\(H_{0}\): Talking on the phone is no different than paying full attention.

\(H_{0}\): Talking on the phone is no different than listening to the radio.

Deciding between these two is not a matter of statistics but of theory, and it just depends on what theory we are testing: cellphones vs nothing, or cellphones vs listening alone. There could even be others. But the main idea is that these are binary tests: treatment vs control, and we need statistics to adjudicate.

Cellphone experiment in more detail

As you probably know, the usual approach in experimental science to ensure that participants are otherwise the same apart from the treatment or control condition is to assign people at random to one group or the other. That way the only difference will be the condition – or at least, that will be the only systematic difference.

To return to our cellphone experiment, we assign one half of the subjects at random to talk on the phone, and the other half to listen to the radio. They are using a driving simulator, and at random intervals are shown sudden red lights and asked to stop as quickly as they can. We measure the (mean) reaction time for each person, so our data consist of one number per person: how long (on average) it takes them to brake in response to the red light.

Say these are our data:

Cell-phone group, mean reaction time = \(\bar{y}_{cell} = 585.2 ms\); \(s_{cell} = 89.6\), \(n = 32\).

Control group, mean reaction time = \(\bar{y}_{radio} = 533.7 ms\); \(s_{radio} = 65.3\), \(n = 32\).

Is the difference between these two groups statistically significant, or just due to chance?

The test statistic with two samples

We can easily calculate the difference between the mean reaction time for the two groups:

\(\bar{y}_{cell} - \bar{y}_{radio} = 51.5 ms\)

But even though the cell group’s mean is larger (ie, they are slower on average), is that difference more than we might expect to see by chance alone?

In order to answer this, as usual we need to calculate a test statistic. We saw before that a sample mean \(\bar{x}\) will be distributed normally around the population mean \(\mu\), or in other words that \(\frac{\bar{x} - \mu_{0}}{se}\) will be distributed according to the t distribution (ie, almost normally). Things are almost the same here, except that now our test statistic is this:

\[\frac{\bar{y}_{cell} - \bar{y}_{radio}}{se_{diff}}\]

There is no \(\mu_0\) of course, just the two \(\bar{y}\) values (we’ve switched to \(y\) to distinguish this case from the single-sample case, but the variable name is arbitrary). As we might expect, this difference is distributed according to the t distribution, but what is the standard error? It’s not the standard error of the cellphone group and it’s not the se of the control group – it must be some combination of the two. Specifically, it is:

\[se_{diff} = \sqrt{se_{1}^{2} + se_{2}^{2}}\]

(You might recognize this equation from the length of the hypotenuse of a right triangle. It arises because the variances of independent quantities add, so the squared standard errors add; the geometry is similar but not worth going into here.)

Calculating the test statistic

So in our example we have:

Cell: \(se_{1} = s_{1} / \sqrt{n_{1}} = 89.6 / \sqrt{32} = 15.84\)

Radio: \(se_{2} = s_{2} / \sqrt{n_2} = 65.3 / \sqrt{32} = 11.54\)

\(se_{diff} = \sqrt{se_{1}^{2} + se_{2}^{2}} = \sqrt{15.84^{2} + 11.54^{2}} = 19.6\)

Our test statistic is therefore: \((585.2 - 533.7) / 19.6 = 2.63\).
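
As a quick sanity check, here is the same arithmetic sketched in R (the variable names are just ours, nothing standard):

ybar1 <- 585.2; s1 <- 89.6; n1 <- 32   # cell-phone group
ybar2 <- 533.7; s2 <- 65.3; n2 <- 32   # radio group
se1 <- s1 / sqrt(n1)                   # 15.84
se2 <- s2 / sqrt(n2)                   # 11.54
se_diff <- sqrt(se1^2 + se2^2)         # 19.6
tstat <- (ybar1 - ybar2) / se_diff     # 2.63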

So is this above our threshold, assuming a two-tailed test with an \(\alpha\) of 0.05?

Degrees of freedom

Well, now we run into another problem – what are the degrees of freedom? Both groups are \(n = 32\), but does that mean it’s 31, or 63, or what? And what happens if the two groups are of different sizes?

Recall that the degrees of freedom for our original single-sample test were \(df = n-1\). Alas, with two groups it matters not just what their \(n\)s are, but also what their standard deviations are.

When \(n\) and \(s\) are equal across the two groups, the degrees of freedom are relatively simple: \(df = 2n - 2\).

When \(s\) is equal but \(n\) isn’t, \(df = n_{a} + n_{b} - 2\) (note how this reduces to the previous one if \(n_a = n_b\)).

And finally, if \(n\) and \(s\) are both different (or actually, for any \(n\) and \(s\)), we have the unwieldy:

\[df = \frac{se_{diff}^{4}}{se_{a}^{4}/(n_{a}-1) + se_{b}^{4}/(n_{b}-1) }\]

(Don’t worry, you will never be asked to remember this!) Note that this in general gives fractions rather than whole numbers, but that’s fine: qt() and pt() are unfazed by that.

Finally, once we’ve determined our degrees of freedom, we just figure out the critical threshold value in the usual way (using qt()), and ask whether our test statistic is greater than it in absolute value (ie, whether it falls in the rejection region).
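
Continuing the R sketch from above, we can let qt() do the work for our example (again, the variable names are our own):

# Welch degrees of freedom: about 56.7, a bit less than the 62 that 2n - 2 would give
df_welch <- se_diff^4 / (se1^4 / (n1 - 1) + se2^4 / (n2 - 1))
crit <- qt(0.975, df_welch)   # two-tailed critical value at alpha = 0.05: about 2.00
abs(tstat) > crit             # TRUE: 2.63 is in the rejection region

Note how the unequal standard deviations cost us a few degrees of freedom relative to the simple \(2n - 2\) formula.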

Dependent Samples

There is one last variation on the two-sample test, which is actually a bit simpler than the more general case we discussed previously.

Say that instead of two groups of 32 people, this experiment actually used the same 32 people in both conditions (presumably randomizing the order in which each person experienced the treatment and the control).

These would then be dependent samples: each person has a cell score and a radio score. As before, we are interested in whether one group has a different mean from the other, but now there is an easier way to do it: We just create a new data column that is the individual differences:

person 1: cell = 636, radio = 604, diff = 32

person 2: cell = 623, radio = 556, diff = 67

person 3: cell = 615, radio = 540, diff = 75

\(y_{diff} = \{32,67,75,...\}\)

Now we just look at \(y_{diff}\), where our question is: is \(\bar{y}_{diff}\) different from 0? Ie, our null hypothesis is simply \(\mu_{diff} = 0\).

To test this, we just proceed exactly as we’ve always done with a single-sample test: we can calculate that \(s_{y_{diff}} = 52.5\) and thus \(se_{y_{diff}} = 52.5 / \sqrt{32} = 9.28\). This is also known as a paired t-test.

Our test statistic is therefore \(51.5 / 9.28 = 5.55\) and our degrees of freedom are \(n-1\), or 31, which means 5.55 is clearly in the rejection region. If we wanted to construct our 95% CI around the difference, it would be \(51.5 \pm 2.04 \cdot 9.28\), which of course does not include 0. (Make sure you know where 2.04 came from.)
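
If we had the raw data rather than just the summary statistics, R would do all of this in one line. A minimal sketch, assuming hypothetical vectors cell and radio holding each person’s two reaction times (in the same subject order):

diffs <- cell - radio               # per-person differences: 32, 67, 75, ...
t.test(diffs, mu = 0)               # single-sample test on the differences
t.test(cell, radio, paired = TRUE)  # the same thing, as a paired t-test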

Calculating with R

As before, we can do a t test with two groups using R. Usually R assumes you have a single data frame with at least two variables: one containing the measurement of interest, and another binary variable that demarcates which observations are treatment and which are control.

So let’s just construct some fake data using the built-in cars dataset:

# randomly assign each of the 50 cars to group 0 or group 1
carfake <- cbind(cars, dummy = round(runif(50)))

Now let’s test whether the carfake$dist distances are different between the two groups (they shouldn’t be, since the two groups were assigned at random!):

t.test(dist ~ dummy, data=carfake)

    Welch Two Sample t-test

data:  dist by dummy
t = -1.2253, df = 30.868, p-value = 0.2297
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.287760   6.059189
sample estimates:
mean in group 0 mean in group 1 
       36.60000        45.71429