Overview
This lesson introduces the fundamental concept of sampling.
Objectives
After completing this module, students should be able to:
Reading
Schumacker, Ch 4-6. Lander, 15.1.
One of the most fundamental goals in statistics is to make claims about unknown populations using smaller samples from those populations. Thus in a national survey, we randomly select from the population of everyone in the country a small subset of them, and use measurements of that sample to make inferences about the true, unknown population characteristics.
This is the job of statistics: moving from a sample and some mathematical measure of it (such as the sample average) to a characteristic of the population (such as its true mean). Thus we might want to know the average height of a male US resident; we sample 1000 US males at random, and take the average of their heights. This measurement of our sample is called a sample statistic, and we might denote it \(\bar{x}\) – the average of all our sample measurements \(x_i\). Ie, \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\).
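As a quick illustration, the sample mean is just the sum of the measurements divided by \(n\). A sketch in Python (used here simply as a calculator), with made-up heights:

```python
# Hypothetical sample of 5 heights in cm (made-up numbers for illustration).
heights = [178.2, 171.5, 183.0, 175.4, 180.1]

# The sample mean: x_bar = (1/n) * sum of the x_i.
n = len(heights)
x_bar = sum(heights) / n
print(x_bar)  # 177.64
```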
But of course our sample is not the same thing as measuring everyone in the country. The true mean height we might denote \(\mu\), but this is something we will never know. We know that \(\bar{x}\) is probably close to it, and perhaps we suspect that the more people we sample, the closer \(\bar{x}\) is likely to be to \(\mu\), but it will never be exactly it – or at least, we can’t be sure. But how do we know how close we are? What claims can we confidently make about \(\mu\)? That is what we need statistics for.
In general, what we want is to make inferences about the truth – the true, unknown population – based on what we do know, a small sample of the truth. The truth, as you might have noticed, is usually written in Greek letters, whereas our estimators are in Latin script. There are lots of parameters (population characteristics) we might want to know about the population: its mean \(\mu\), its variance \(\sigma^2\) or standard deviation \(\sigma\), its median, etc. For each of these we have an estimate based on a sample, and if we assume our sample is random – a big assumption! – we can make precise claims about the population based on our sample statistics.
For instance, if we assume the population is normally distributed and our sample is random, then our best guess about the population mean \(\mu\) (which we don’t truly know) is our sample average \(\bar{x}\). But we can be more precise than that. Unsurprisingly, the larger the size \(n\) of our sample, the closer \(\bar{x}\) will usually be to \(\mu\).
To be precise, the population mean is by definition \(\mu = \frac{1}{N} \sum_{i=1}^N x_i\), where the big \(N\) means we are averaging over the whole population and not just a sample of size \(n\) (eg, 300 million people instead of 1000). We also know that the population standard deviation is by definition
\[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}\]
where in accord with our intuition about the meaning of \(\sigma\) as the “spread” in the distribution, the more distance there is between each \(x_i\) and the mean \(\mu\), the bigger \(\sigma\) is and the more spread out the distribution is.
We also know the standard deviation of our sample, where the equation is pretty similar to that for the population:
\[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\]
It’s divided by \(n-1\) rather than \(n\) for reasons we will go into later (related to “degrees of freedom”), but otherwise it’s the same idea. However, \(s\) by itself doesn’t tell us how accurate our guess about \(\mu\) is: it describes the spread of the individual measurements in our sample, not how far \(\bar{x}\) is likely to be from the truth.
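Python’s standard library happens to make the \(n\) versus \(n-1\) distinction explicit, which makes for a quick check (the numbers below are made up for illustration):

```python
import statistics

# A small made-up sample, just for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# pstdev divides by n: the population formula (only appropriate
# if x really is the entire population).
sigma = statistics.pstdev(x)   # 2.0

# stdev divides by n-1: the sample formula s, the usual estimator.
s = statistics.stdev(x)        # ~2.138, slightly larger
print(sigma, s)
```

The \(n-1\) version is always a bit larger, which compensates for the fact that deviations are measured from \(\bar{x}\) rather than from the true \(\mu\).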
So apart from making \(\bar{x}\) (based on the sample) our best guess for \(\mu\), what else can we say about how sure we are about our guess? How do we incorporate the fact that presumably with a larger sample, our guess \(\bar{x}\) will be closer to \(\mu\)?
This is where we are rescued by the Central Limit Theorem. The Central Limit Theorem says that for a wide class of statistics like \(\bar{x}\), the errors in those estimators (how far they are from the true \(\mu\)) will be distributed normally, and we can specify exactly what the standard deviation of those errors is – known as the standard error.
What does this mean? It means that if we were to take a sample of, say, 5 people from our population, calculate \(\bar{x}\), and compare it to the truth (if we knew it), we would find that it was near \(\mu\) but not exactly right. Now say we do it again: draw another 5 people, calculate \(\bar{x}\), and record that. Now repeat that a thousand times, getting a thousand different \(\bar{x}\)’s. These \(\bar{x}\)’s will cluster around the truth, \(\mu\), and be distributed normally. Amazingly, this is true even if the true population is totally non-normal or we don’t know anything about how it is shaped.
To see this for yourself, try the very nice simulator here: http://onlinestatbook.com/stat_sim/samp_dist_js/index.html
At the top, you can draw with your mouse any population distribution you choose, such as the one I’ve drawn. This is the true population, which in general we don’t know, but we do for the purposes of this simulation. In the middle, if you click “Animated” it will draw five samples from this population, and in the lower area plot the mean of those five. Note that this \(\bar{x}\) is near the true mean (the thin blue line below the population distribution at the top), but not exactly at it. If you click “5” in the second panel, it will show (without animation) the means of five 5-item samples stacked in the lower panel. And if you hit “10,000,” it will draw 10,000 5-item samples and plot all the \(\bar{x}\)’s. And with that large set of samples, you will now see how the \(\bar{x}\)’s are all normally distributed around the true \(\mu\) (the thin blue line). This works with any population distribution you can draw at the top, as long as you have enough samples.
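The same experiment is easy to run in code. A short Python sketch (the exponential population is an arbitrary choice, picked precisely because it is very non-normal):

```python
import random
import statistics

random.seed(0)  # for reproducibility

# A deliberately non-normal (heavily skewed) population:
# exponential with true mean 1.0.
def draw_sample(n):
    return [random.expovariate(1.0) for _ in range(n)]

# Take many small samples and record each sample's mean.
means = [statistics.mean(draw_sample(5)) for _ in range(10_000)]

# The sample means cluster around the true mean (1.0), roughly
# normally, even though the population itself is skewed.
print(statistics.mean(means))   # close to 1.0
print(statistics.stdev(means))  # close to 1/sqrt(5), about 0.447
```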
Of course, in reality we usually only take one sample of \(n\) items, with a single \(\bar{x}\) and a standard deviation \(s\). But statistics is all about quantifying error. Given our single sample, what can we say about \(\mu\)?
Luckily, the central limit theorem doesn’t just tell us that the \(\bar{x}\)’s are distributed normally around the truth; it also tells us the standard error of that distribution:
\[se = \frac{s}{\sqrt{n}}\]
Remember that \(s\) is the standard deviation of the sample, \(n\) is the size of the sample, and \(se\) is a measure of the spread of \(\bar{x}\)’s around the truth \(\mu\) – which we call the “standard error”.
Just as we previously expected, the error in \(\bar{x}\) in estimating \(\mu\) goes down with increasing \(n\) – the bigger the \(n\), the bigger the denominator and the smaller \(se\) is – ie, those \(\bar{x}\)’s cluster more tightly around \(\mu\). (Again, you can see this in the simulator: look at the spread of blue boxes at the bottom when you choose 100,000 repetitions: it’s much smaller than with 10,000.)
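Because \(n\) sits under a square root, the standard error shrinks slowly: quadrupling the sample size only halves it. A quick sketch (the value of \(s\) is hypothetical):

```python
import math

# Hypothetical sample standard deviation, just for illustration.
s = 2.617

def standard_error(s, n):
    """se = s / sqrt(n): the spread of x-bar around the true mean."""
    return s / math.sqrt(n)

# Quadrupling n halves the standard error.
print(standard_error(s, 100))   # ~0.262
print(standard_error(s, 400))   # ~0.131
```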
But again – we are only taking one sample, so what does this get us? With a little bit of algebra, quite a lot.
For instance, if the standard error is \(se\), we know that about 95% of all those possible \(\bar{x}\) are between \(\mu - 1.96\, se\) and \(\mu + 1.96\, se\) (we will make this more precise in the next lesson).
Ie, if we were to just take one of those \(\bar{x}\) results at random,
\[P(\mu - 1.96 se \leq \bar{x} \leq \mu + 1.96 se) \approx 0.95\]
And that is exactly what has happened: we took one sample which gave us a \(\bar{x}\), no different from any of the other \(\bar{x}\) we might have gotten with another random sample.
Now we can just rearrange the terms inside the \(P()\) parentheses (just as we do for equations; verify this for yourself!) to get:
\[P(\bar{x} - 1.96 se \leq \mu \leq \bar{x} + 1.96 se) \approx 0.95\]
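To verify the rearrangement for yourself, split the event inside \(P()\) into its two inequalities and move terms across each one (adding or subtracting the same quantity on both sides preserves an inequality):

\[\mu - 1.96\, se \leq \bar{x} \iff \mu \leq \bar{x} + 1.96\, se, \qquad \bar{x} \leq \mu + 1.96\, se \iff \bar{x} - 1.96\, se \leq \mu\]

Both inequalities hold together exactly when \(\bar{x} - 1.96\, se \leq \mu \leq \bar{x} + 1.96\, se\), so the two events are identical and must have the same probability.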
This tells us now the inverse of what we just had. It says that, given a single sample of size \(n\), with a single \(\bar{x}\), we can conclude that the true \(\mu\) is 95% likely to be between \(\bar{x} - 1.96 se\) and \(\bar{x} + 1.96 se\).
This follows straight from the algebra. But in the same way, we can use all our knowledge of distributions now to say a lot of things about where \(\mu\) is likely to be based on \(\bar{x}\) and the \(se\).
In one stroke, we now have the basis for much of modern statistics. All we need is a random, unbiased sample, and we can use its characteristics – the mean, standard deviation, and \(n\) – to make a variety of claims about the population as a whole.
For instance, the General Social Survey asks a sample of the US how many hours of TV they watch. There are 899 respondents, with a mean response of 2.865 hours, and a standard deviation of 2.617. From these data alone we can now make statements about the population of the entire country (presuming our sample was truly random).
If our best guess about the US mean number of hours watched is \(\bar{x} = 2.865\), what can we say about how accurate that guess is? Well, we know that \(P(\bar{x} - 1.96 se \leq \mu \leq \bar{x} + 1.96 se) \approx 0.95\), so our 95% confidence interval (CI) is
\[CI_{0.95} = [2.865-1.96*2.617/\sqrt{899},2.865+1.96*2.617/\sqrt{899}] = [2.694,3.036]\]
Ie, we’re 95% sure that the true population mean \(\mu\) is between 2.694 and 3.036. What does being 95% sure mean? It means that, if we were to run our survey many times, each time computing \(\bar{x}\) and its 95% CI, then about 95% of those intervals would contain the true \(\mu\).
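The interval is easy to reproduce yourself (Python is used here purely as a calculator, with the GSS figures from above):

```python
import math

# GSS TV-hours figures from the text.
x_bar, s, n = 2.865, 2.617, 899

# 95% CI: x_bar +/- 1.96 * s / sqrt(n)
se = s / math.sqrt(n)
lo = x_bar - 1.96 * se
hi = x_bar + 1.96 * se
print(round(lo, 3), round(hi, 3))  # 2.694 3.036
```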
There are a lot of reasons we could be wrong in our guess of 2.865, the most important of which is that we might not be surveying everyone at random (eg, we are more likely to get people who sit around at home watching TV and answering telephone surveys). The statistics only quantify the error due to the random part, not any other biases or sources of error. In that sense, they place a floor on our uncertainty: no matter how good our other survey methods, sampling error alone limits what we can know. On the other hand, by surveying just 0.0003% of the country, we can get a remarkably accurate measure of any quantity people are willing to tell us on the phone.
As in the last module when we worked with the normal distribution, we can run these calculations in a number of different directions depending on what information we have and what we want.
Example 1. Say you wanted to have a 95% CI that was much tighter, eg, plus or minus 0.1 hours. How many people would you have to survey, assuming the sample mean (2.865) and sample standard deviation (2.617) stayed the same?
We would need to solve for \(n\):
\[0.1 = 1.96*2.617/{\sqrt n}\]
Or \(n = 2631\) people. Make sure you see where we got that equation from: just the definition of the CI from the previous page.
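Solving for \(n\) by hand means squaring both sides after isolating \(\sqrt{n}\); a sketch of the same arithmetic:

```python
import math

# Target half-width of the CI, plus the sample sd from the text.
half_width = 0.1
s = 2.617

# Solve 0.1 = 1.96 * s / sqrt(n) for n:
#   sqrt(n) = 1.96 * s / 0.1,  so  n = (1.96 * s / 0.1)^2
n = (1.96 * s / half_width) ** 2
print(math.ceil(n))  # 2631 (round up: we can't survey a fraction of a person)
```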
Example 2. Returning to our original survey, how certain are we that the true mean number of hours watched is less than 3?
Well, our standard error is \(2.617/ \sqrt{899} = 0.087\). So we can just plug this into our normal distribution (or better, use R) to figure it out. To figure it out by hand, we would ask how far, in standard errors, 3 is from the mean: \((3-2.865)/0.087 = 1.55\). Then we would look up 1.55 in our Z (normal) distribution table, to see what proportion of a normal population falls below a value 1.55 standard deviations above the mean. Or we could just use R: pnorm(3, 2.865, 0.087) = 0.94. That is, we are 94% confident that the true mean is less than 3 hours.
Remember that our 95% CI was \([2.694,3.036]\). If we plug \(3.036\) into pnorm, we get pnorm(3.036, 2.865, 0.087) = 0.975. That’s not 0.95 though – what’s going on? Remember that the 95% CI contains 95% of the observations, but is symmetrical, which means that 2.5% of the observations are not included on the right, and 2.5% of the observations are not included on the left. That is, the right bound of the 95% CI is at the 97.5th percentile mark, and the left bound is at the 2.5th percentile mark (97.5 - 2.5 = 95). It’s the same trick to keep in mind as in the previous module, just with standard errors now.