Overview
This lesson introduces the t distribution.
Objectives
After completing this module, students should be able to:
Reading
Schumacker, pp 106-112.
There is one last nuance to tackle before we can charge off into making and interpreting our own surveys, or moving on to the statistical tests of the next section.
It turns out that, when our sample size is small (less than 30, approximately), our sample statistics (such as \(\bar{x}\)) are not quite normally distributed around \(\mu\). It’s close, and as \(n\) gets above 30 it gets much closer, and it’s indistinguishable from the normal for \(n\) above 100. But lots of times we encounter samples on the small end, and we need to know how to deal with that.
The answer is the T distribution. It looks like the Z (normal), except that unlike the Z, it’s not always the same shape: there’s one additional parameter (in addition to the mean and standard deviation) that essentially shifts how much of the mass is in the center versus the tails of the bell curve. That parameter is a function of \(n\): when \(n\) is small, we get some bonus uncertainty, on top of the usual contribution via the standard error.
To calculate our statistic for some value of \(x\) we do the same as before: \((x-\bar{x})/se\). But now to get the percentile for that number, we need to use the t table rather than the z table. Conversely, if we want to calculate the 95% CI, we can’t just use 1.96 – we need to figure out a number specific to our \(n\). That number depends on a quantity called the degrees of freedom, which will recur in a number of places in the next few modules. It is often a function of \(n\), but the exact form of that function depends on the distribution we’re working with. In this case, the degree of freedom is \(n-1\), just as it was with the sample standard deviation.
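For instance, here is a quick sketch of that first calculation in R, using made-up numbers (a hypothetical sample of 8 people with mean 25 and standard deviation 6; the value 22 is just an arbitrary \(x\) to score):

xbar <- 25; s <- 6; n <- 8
se <- s / sqrt(n)    # the standard error, as before
x <- 22              # some value we want to score
(x - xbar) / se      # about -1.41; we look this up in a t table with n - 1 = 7 degrees of freedom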
The basic idea behind “degrees of freedom” is that our sample (or other data set) may actually have less independent information in it than the \(n\) items in the sample. For instance, if you have three items, a, b, and c, but c = a + b, then you really only have two independent pieces of information, since c can be deduced just from a and b. Similarly, the sample standard deviation has slightly less information in it because we calculate it using not \(\mu\) – which we don’t know – but \(\bar{x}\); and since \(\bar{x}\) is derived from the data, it essentially (like c) uses up one piece of information. So we divide by \(n-1\), which makes the standard deviation a little bigger (ie, it increases our uncertainty) than it would otherwise have been.
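To see the \(n-1\) at work, here is a small sketch with a made-up sample, showing that R’s built-in sd() function divides by \(n-1\) rather than \(n\):

x <- c(4, 7, 9, 10, 15)      # a hypothetical sample of 5 values
n <- length(x)
ss <- sum((x - mean(x))^2)   # sum of squared deviations from the sample mean
sqrt(ss / n)                 # dividing by n gives a slightly smaller number
sqrt(ss / (n - 1))           # dividing by n - 1 gives a slightly bigger one...
sd(x)                        # ...which is what sd() reports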
Like a smaller \(n\), a lower degree of freedom generally means more uncertainty in your estimate. This is how it works in the T statistic as well: not only does a lower \(n\) directly increase uncertainty, it also changes the shape of the T distribution in ways that produce additional uncertainty on top of that.
Here is a picture of the T distribution. As you can see, it looks like the normal distribution, but gets “fatter” in the tails as the degree of freedom (sample size) gets smaller. That is, with a low degree of freedom, there is more of a chance of getting a value far from the mean; this also means that the 95% confidence interval will in general be wider, since you have to go farther out into the tails to encompass 95% of the population.
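You can sketch a version of this picture yourself in R (the particular degrees of freedom plotted here are an arbitrary choice):

curve(dnorm(x), from = -4, to = 4, ylab = "density")   # the standard normal
curve(dt(x, df = 2), add = TRUE, lty = 2)              # t with 2 degrees of freedom: much fatter tails
curve(dt(x, df = 10), add = TRUE, lty = 3)             # t with 10: already quite close to the normal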
The t table is somewhat like the z table for the normal distribution, but since there is one additional parameter (the degree of freedom), you can’t just arrange things by score and percentile. Instead, what is usually done is that only a few percentiles are shown along the top, while the degree of freedom is shown along the left, and the t score is shown in the middle. Once again, you can go from score to percentile or vice versa (assuming you know the degree of freedom), as with the z table; but you have fewer percentiles to choose from.
This shows a portion of the table:
The percentiles are shown along the top, with the important difference that it shows not the xth percentile, but 1 minus that (for reasons we will get into in the next module). Thus the 95% confidence interval has, as you recall, 2.5% of the population in each of the tails (leaving 95% in the middle). With the normal distribution, the z score is about 1.96 – ie, if you go 1.96 standard deviations from the mean in both directions, you encompass 95% of the population. For the t distribution (when \(n\) is low), you have to go farther out, so the score will in general be larger than 1.96 for the 95% CI. To see this on the table, we look for \(t_{0.025}\), which corresponds to the 97.5th percentile. If our sample size is 8, then our degree of freedom is \(n-1\) or 7. So to get the corresponding t score, we look at row 7, column \(t_{0.025}\), and get 2.365. This is considerably larger than 1.96 because \(n\) is so small; for such a small sample, we have to go out 2.365 standard deviations in either direction to encompass 95% of the population.
More concretely, if we have taken a small sample of the population – say, we have surveyed 8 people – our best guess about the population mean age (say) is still the sample mean. But we are less confident in our answer, both due to the low \(n\) (which creates a large standard error), and also due to the fact that our guess of \(\bar{x}\) is now distributed not quite normally, but via a t distribution with its fatter tails. Our 95% CI is now \(\bar{x} \pm 2.365*se\).
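Here is that calculation as a sketch in R, with made-up ages for 8 hypothetical respondents:

ages <- c(23, 31, 27, 45, 38, 29, 52, 34)   # hypothetical survey responses
n <- length(ages)
se <- sd(ages) / sqrt(n)                    # the standard error of the mean
mean(ages) + c(-1, 1) * 2.365 * se          # the 95% CI, with 2.365 taken from row 7 of the t table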
Referring back to the t table, what would be our 90% CI with a sample mean of 3, a standard deviation of 2, and a sample size of 4?
Make sure you know why each of the wrong answers was wrong. Imagine you had to grade this mini-exam: what error would a student be making for each of the wrong answers?
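Once you have worked it out by hand, one way to check a calculation like this is with R, using the qt function introduced just below (a sketch, with the numbers from the question):

xbar <- 3; s <- 2; n <- 4
se <- s / sqrt(n)                # = 1
t_crit <- qt(0.95, df = n - 1)   # a 90% CI leaves 5% in each tail, so we want the 95th percentile
xbar + c(-1, 1) * t_crit * se    # the 90% CI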
Of course, we can also use R instead of the table to go back and forth between percentiles and scores (and thus to confidence intervals), using the built-in function qt.
The format is the same as ever. For instance, to get the score for the 97.5th percentile with a degree of freedom of 7, we write:
qt(.975,7)
[1] 2.364624
And correspondingly, if we want the percentile for a score of 2.365 with a degree of freedom of 7, we write:
pt(2.365,7)
[1] 0.9750138
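The two functions are inverses of each other, and you can also use qt to reproduce any corner of the t table you like; for instance, as a sketch:

pt(qt(0.975, 7), 7)   # recovers 0.975
df <- c(1, 5, 7, 10, 30)
round(cbind(t_0.05 = qt(0.95, df), t_0.025 = qt(0.975, df)), 3)   # a few rows of the t table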
In this lesson we have discussed how a small \(n\) leads to additional uncertainty above and beyond the direct effect of \(n\) on the standard error. But there is also a correction we need to make when \(n\) is very large relative to \(N\) – that is, when our sample is close to the entire population. Estimating our standard error for \(\bar{x}\) as \(s/\sqrt{n}\) will be wrong, since as \(n\) approaches \(N\), our standard error should drop to 0 – when we’ve sampled everyone, our estimate \(\bar{x}\) will be exactly the true mean!
This is easily corrected using the finite population correction, where we multiply the standard error by:
\[\sqrt{\frac{N-n}{N-1}}\]
Obviously, as \(n\) approaches \(N\), this quantity goes to zero, and thus our standard error goes to 0. In practice, \(n\) is usually much smaller than \(N\) (eg, 1000 people sampled out of 300 million in the US), which means that this “correction” will be close to 1, which is to say, our standard error will be the usual one without need of the finite population correction. For the purposes of the lessons in this course, for simplicity we will assume our samples are a small percentage of the total population, and proceed in our calculations (including homeworks!) without the correction. But one should bear it in mind if ever working with small populations.
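As a quick sketch of the correction in R, with hypothetical numbers for a small population:

N <- 500                         # hypothetical population size
n <- 200                         # hypothetical sample size
s <- 10                          # hypothetical sample standard deviation
se <- s / sqrt(n)                # the usual standard error
fpc <- sqrt((N - n) / (N - 1))   # the finite population correction, about 0.78 here
se * fpc                         # the corrected (smaller) standard error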