Overview
This lesson introduces the basic concepts of a hypothesis test.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 15.3.1. Schumacker, Ch. 10
So far, we have mainly applied statistics to the job of sampling: estimating true, unknown population parameters using random samples from those populations. But in the social sciences – and the sciences more generally – we are often interested in answering specific research questions. For instance, is the average American liberal or conservative? Do cellphones make people worse drivers? And so on. There are a million theories we might have about the world, and what we want to do in building up a scientific body of knowledge is decide which of these hypotheses are true, and which are false. To do this, we almost always need statistical tests.
The basic idea is that, at any given time, we have a store of theories we believe, and we want to know whether to add a new hypothesis (or claim or theory) to our store of knowledge – usually at the cost of replacing something we already believed, or at least the simplest default theory we would have held if we had thought about it.
For instance, here are two hypotheses again, put into slightly (although not entirely) more precise form:
Hypothesis 1: The average American is conservative.
Hypothesis 2: People talking on cellphones drive worse than people not talking on cellphones.
But if we want to test these hypotheses, we need to know what the alternative is – what we consider the plausible default belief. This is called the null hypothesis.
There is much ground for debate about what the null hypothesis should be for any given test hypothesis, but the usual idea is to set the burden of proof high: we usually take as the null, or default, hypothesis something neutral and broad – the sort of thing most people might believe just by default. For instance:
Null hypothesis 1: The average American is a political moderate.
Null hypothesis 2: People talking on cellphones are the same as other drivers.
But we can be more precise still. Say for Hypothesis 2 we are interested in the effect of conversing on the phone, not just the effect of listening to something. Then we might have:
Null hypothesis 2: People talking on cellphones drive the same as drivers listening to the radio.
But whatever the null, the key thing is that it must exist for us to test our hypothesis against. Nothing is ever proven: we always just knock out a worse hypothesis and replace it with a (provisionally) better one.
Of course, statistics is a mathematical science, so we need to put these theories in numerical form.
For Hypothesis 1, we have to survey people and ask them their political leanings, eg on a -10 to 10 scale (where positive means more conservative, say). Then our competing hypotheses would be:
Hypothesis 1: The average political score is \(> 0\).
Null hypothesis 1: The average political score is \(= 0\).
So how do we test between these two?
As we should now expect, we begin by taking a random sample from the population – remember, we are ultimately interested in the whole population – and surveying that sample. Say our survey data are: \(\bar{x} = 2\), \(s = 5\), \(n = 100\).
What does this tell us? \(\bar{x}\) by itself tells us nothing, since the chance of \(\bar{x}\) being exactly 0 is almost non-existent. If the test were just whether \(\bar{x}\) is greater than 0, then even if the true \(\mu\) actually equalled 0, random sampling variation would be bound to push \(\bar{x}\) above or below 0 by sheer chance, and we would pick our test hypothesis over the null hypothesis 50% of the time (because 50% of the time \(\bar{x}\) would be positive, and 50% negative) – so if the truth really were 0, we would be wrong half the time!
So how can we guard against this error – falsely rejecting the null when in fact we shouldn’t have?
As you might guess, the answer lies with the standard errors and confidence intervals we developed earlier. What is the standard error in this case?
Remember, the standard error works in two ways. The standard error determines the distribution of the \(\bar{x}\) values around the true \(\mu\), if we were to do our survey over and over. But the standard error also (with the rearranging algebra we saw previously) allows us to make claims about where \(\mu\) is relative to \(\bar{x}\).
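To make this concrete, here is a quick sketch in R of the standard error for the survey data above (the variable names are just for illustration):

s <- 5             # sample standard deviation
n <- 100           # sample size
se <- s / sqrt(n)  # standard error of the mean
se
[1] 0.5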
So here’s our logic for a test. Assume, for the sake of argument, that the null hypothesis is true. Then what is the chance of getting the \(\bar{x}\) we got (\(2\)), assuming the null hypothesis (\(\mu = 0\)) is true? If we believe the true average human height is 5’8” and we do a random survey and all 100 surveyed people are over 6 feet, then either our survey was biased, or we got spectacularly unlucky, or we were wrong in our default hypothesis that the average human height is 5’8”. If we don’t think we were biased, and we don’t believe in spectacular luck, then our only option is to reject our default theory (the null hypothesis) that the average height is 5’8” – there is almost no chance we would have gotten the data we got if 5’8” were the true population mean. And once we have rejected that null, we can (provisionally) accept our current best guess – the mean of our new survey.
So how do we know when we have enough evidence to reject the null?
Returning to our political survey example: if the null hypothesis is true – the average American has a political score of 0 – then how unlikely is it to have gotten a \(\bar{x}\) of 2 for a survey of 100 people? We know from the Central Limit Theorem that if we were to conduct our 100-person survey 1000 times, 95% of the final results – 95% of the \(\bar{x}\)’s – would be within \(\mu \pm 2*se\) (the 2 is approximate of course). Here \(se = s/\sqrt{n} = 5/\sqrt{100} = 0.5\), so if the null hypothesis were true, there is a 95% chance that \(\bar{x}\) would be between \(0 - 2*0.5\) and \(0 + 2*0.5\), or in the range \([-1,1]\). The actual result we got of \(2\) is well outside that range, so we would have to be spectacularly unlucky to get that value if the truth were in fact 0. Assuming that our survey was unbiased, we can instead confidently say that the null hypothesis – the average American is a moderate with \(\mu = 0\) – is not true.
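(As a quick check in R, using the standard error of 0.5 and the approximate multiplier of 2:)

0 + c(-2, 2) * 0.5   # where 95% of the xbar's would fall if the null were true
[1] -1  1

Our observed \(\bar{x}\) of 2 is nowhere near this range.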
We can also flip the test around for a second version, which is logically identical to the first. Returning to the logic of the survey, we can say we are 95% confident that the true \(\mu\) is within \(\bar{x} \pm 2*0.5\), ie, between 1 and 3. Since the null hypothesis value, \(\mu = 0\), is well outside this confidence interval, we can again reject the hypothesis that \(\mu = 0\).
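(And the same check flipped around, again with the rough multiplier of 2:)

2 + c(-2, 2) * 0.5   # approximate 95% confidence interval for mu, given xbar = 2
[1] 1 3

The null value of 0 falls well outside this interval.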
Once we’ve concluded that the null hypothesis is not true, then what? Strictly speaking, we are left with our research hypothesis, that \(\mu > 0\), and nothing more. This is the usual outcome if we reject the null: we say we “reject the null in favor of the research hypothesis.” We have not proven our hypothesis, nor do we necessarily “accept” the research hypothesis, although people often use the latter locution. One of the philosophical oddities of the frequentist approach, as this whole method of statistical testing is known, is that you never really prove anything – you just disprove things, and at best provisionally accept others until they are ultimately replaced with even better theories. But that’s philosophy – the main take-away here is that we generally just speak of rejecting the null in favor of the research hypothesis.
On the other hand, we do still have our data, with our best guess that \(\mu = \bar{x}\), or 2. So if we wholly reject the null, our best guess for \(\mu\) is now 2. If we still had some lingering belief in the null, then we might instead think that the truth is likely somewhere in between 0 – our original theory – and 2 – our new estimate. This is the Bayesian approach – but though plausible, it’s not the approach taken in traditional statistics. For now, we are frequentists, and our test either rejects the null – in which case we provisionally accept whatever the research hypothesis is – or we “fail to reject the null”, in which case the null hypothesis remains our continuing belief.
And to be clear, we “fail to reject the null” when our data – eg, our \(\bar{x}\) – could plausibly be consistent with the null hypothesis. So again, if our data were \(\bar{x} = 2\) with \(se = 10\), then this is quite consistent with \(\mu = 0\), and getting a \(\bar{x}\) of 2 from a random sample under these circumstances would not be unlikely at all. We will make this more precise shortly.
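In R terms, with this made-up \(se\) of 10, our \(\bar{x}\) is only a fraction of a standard error away from the null value:

(2 - 0) / 10   # how many standard errors xbar is from the null value of 0
[1] 0.2

Being 0.2 standard errors away from 0 is entirely unremarkable, so we would fail to reject the null.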
So far we have been talking about 95% confidence intervals, but why 95%? Basically, what we want to do is say, if our data are really unlikely given the assumption that the null is true, then we reject the null in favor of our research hypothesis; if the data aren’t all that unlikely given that the null is true, then we fail to reject the null, and continue under the assumption that the null is true. But how unlikely is “really unlikely”?
Well, if \(\bar{x}\) is outside of the 95% confidence interval for \(\mu\) (or equivalently, if \(\mu\) is outside the 95% confidence interval for \(\bar{x}\)), then if the null was true, there would only have been a 5% chance of getting an \(\bar{x}\) as extreme as the one we observed. It’s still possible – 1 chance in 20 – but for whatever reason, statisticians a century ago decided 5% was a good benchmark. We might in fact be rejecting the null by mistake, just because we happened to get a \(\bar{x}\) that was especially large or small – and in fact, if the null is true, we will incorrectly reject it 5% of the time just by sheer chance. But at least this threshold means that, most of the time, we won’t be rejecting it by accident, but because the truth really is inconsistent with the null.
|  | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| We reject the null | False positive (Type 1 error) | OK |
| We fail to reject the null | OK | False negative (Type 2 error) |
The table above illustrates the relationship between test results and the truth. If the null hypothesis is true but we reject it in favor of something else, then we have made a mistake – this is known as a false positive, or Type 1 error. For instance, if a drug seems to work better than a placebo, but only because the subjects taking the drug in our sample happened by chance to get a bit better than those taking the placebo, then we have incorrectly concluded that the drug is better than nothing, and have made a false positive / Type 1 error.
Conversely, if the null hypothesis is false but we fail to reject it, then we have also made an error – a false negative, or Type 2 error. This can often happen if our test is weak, eg, if our survey is small. Returning to our earlier example, suppose the average American really is somewhat conservative (so the null of \(\mu = 0\) is false) and our sample mean again comes out at 2, but we only surveyed 3 people: our standard error will be very large and we will fail to reject the null – erroneously, simply because we didn’t survey enough people.
The other two cells in the table are when the right thing happens: we reject the null when it’s false, or fail to reject it when it’s true. The 95% threshold was basically designed to avoid Type 1 errors, even at the cost of greater Type 2 error: much better to mistakenly fail to reject the null than to mistakenly reject the null. That’s because we want science to be conservative: to only reject hard-won beliefs if the evidence against them is very strong. We only reject the null if, assuming the null were true, there would be a 5% or less chance of getting data as extreme as ours; ie, we only reject the null if we think there is a 5% or less chance that we are making a Type 1 error.
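As a rough illustration of these two error rates – this is just a simulation sketch with made-up numbers, not anything from the reading – we can check that a two-tailed test at the 5% threshold rejects a true null about 5% of the time, and that with a tiny sample it usually fails to reject a false null:

set.seed(1)
reps <- 10000

# Type 1 error rate: the null (mu = 0) really is true, n = 100
type1 <- replicate(reps, {
  x <- rnorm(100, mean = 0, sd = 5)
  abs(mean(x) / (sd(x) / sqrt(100))) > 1.96   # TRUE means we (wrongly) reject
})
mean(type1)   # close to 0.05

# Type 2 error rate: the null is false (mu is really 2), but n is only 3
type2 <- replicate(reps, {
  x <- rnorm(3, mean = 2, sd = 5)
  abs(mean(x) / (sd(x) / sqrt(3))) <= 1.96    # TRUE means we (wrongly) fail to reject
})
mean(type2)   # usually well above 0.5: Type 2 errors are common with so few people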
Moving beyond the 95% threshold or 95% confidence interval, we can be a bit more precise about just how unlikely our test data are assuming the null is true. For instance, assuming \(\mu = 0\) and \(se = 1\), what’s the chance of getting a \(\bar{x}\) of 2 or greater? How do we calculate this in R?
pnorm(2,0,1,lower.tail=F)
[1] 0.02275013
(Note how lower.tail=F gives us 1-p, ie, the area above 2, rather than the default area below 2. Also note that I am using the normal distribution here rather than the t. In usual practice, it is better to use the t and specify your degrees of freedom, but we can use the normal for simplicity here, which is equivalent to assuming our \(n\) is large.)
This is what’s known as the p-value. It’s not the probability of getting exactly 2 assuming that the null is true – that probability is essentially 0 (the chance of getting a \(\bar{x}\) of exactly 2.000000 is vanishingly small). Rather, it’s the probability of getting something as large as 2 or larger.
But this is kind of a weird idea: what if instead of a mean of 2, we got a mean of -3? Well, that would be even more unlikely, but we might fail to reject the null if we are only fixated on values greater than 0 due to our research hypothesis that \(\mu > 0\). This is what’s known as a one-tailed test: You are only interested in \(\mu > 0\) (or only interested in \(\mu < 0\)), and you only ask yourself: “What is the probability of getting something as large as this assuming the null is true?” (or as small, but not both).
Some people use one-tailed tests, but they are generally a bad idea, for exactly the reason we just stated: what if you get a really unlikely sample statistic but in the opposite direction of what you expected? Do you really fail to reject the null? No. Much better to do a two-tailed test, which says that we reject the null if we get something extreme in either direction. Specifically, we reject the null if the probability of getting data as extreme as ours (assuming the null is true) is \(\leq 0.05\).
In the two-tailed case, our p-value will not be 0.02275 – that’s the probability only of getting something as high as 2 or higher. But the two-tailed test allows for low extremes as well. Thus the chance of getting something as extreme as 2 (or -2) in either direction is twice that, or 2*0.02275 = 0.0455.
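In R, that doubling is simply:

2 * pnorm(2, 0, 1, lower.tail = FALSE)
[1] 0.04550026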
It might seem like a two-tailed test is an easier bar to pass than a one-tailed test, since the data can be extreme in either direction in order to reject the null. But actually, that’s not the case. For either test, we are interested only in statistics that are less than 5% likely to happen by chance. Let’s return to our example:
In our one-tailed test example, we were only interested in values \(> 0\). Given a mean of 0 and a standard error of 1, what is the lowest \(\bar{x}\) value such that the probability of getting an \(\bar{x}\) that large or larger is less than 0.05?
qnorm(.95,0,1)
[1] 1.644854
(That is, we want an \(\bar{x}\) value such that 95% of the area is below it; and again we are using the normal rather than the t for simplicity.) So anything above 1.645 is sufficient to reject the null. The figure below illustrates the rejection region for a one-tailed (positive) test; any value within this region has less than a 5% probability of occurring by chance alone.
But now what about a two-tailed test? In this case, we only reject the null if we get something extreme in either direction – ie, only if we get something with a probability of 5% or less. But since either direction is now possible, the area is split between the tails, and thus each is smaller. Since we would reject the null with either a very small or a very large number, we have doubled our chances of rejecting the null, and thus each tail must be half as large (2.5%) to make sure the total probability is still only 5%. See the figure for an illustration of this.
Now what is our threshold? We have two: one positive, and one negative. In fact, this is exactly the same as the calculation of the 95% Confidence Interval! We want 2.5% on the top, and 2.5% on the bottom. Thus our two thresholds are:
qnorm(.025,0,1)
[1] -1.959964
qnorm(.975,0,1)
[1] 1.959964
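(As a check, the total area beyond these two cutoffs is 2.5% per tail, or 5% overall:)

2 * pnorm(1.959964, 0, 1, lower.tail = FALSE)   # approximately 0.05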
So a two-tailed test is actually a bit harder to pass, especially if you have a guess which way your result is going to go. Since scientists get much more attention when they reject the null rather than “fail” to reject it, it’s often tempting to use a one-tailed test, with its threshold of 1.645 standard errors (for large \(n\); the threshold will be higher for smaller \(n\)), rather than the higher threshold of 1.96 se’s for the two-tailed test. But that temptation is essentially just hoping for a False Positive in your favor, which is bad science.
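(A final footnote on the small-\(n\) point: if we use the t distribution rather than the normal, the cutoffs are indeed a bit higher. A quick sketch, assuming, say, 10 degrees of freedom:)

qt(.95, df = 10)    # one-tailed cutoff: about 1.81, versus 1.645 for the normal
qt(.975, df = 10)   # two-tailed cutoff: about 2.23, versus 1.96 for the normal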