Overview

This lesson shows how to construct a simple script to verify the use of confidence intervals.

Objectives

After completing this module, students should be able to:

  1. Distinguish a non-normally distributed population from a normally distributed sample statistic.
  2. Write a script to sample from a population.
  3. Write a script to repeat that sampling, and evaluate the accuracy of the confidence intervals.
  4. Explain the statistical meaning of the confidence interval.

Reading

Schumacker, Ch 4-6.

1 Simulating sampling

Let’s write our own script in R to verify the Central Limit Theorem as we’ve been using it. What we want to do is similar to the online simulator in the previous lesson. First, we construct a population that we pretend (as survey scientists) not to know the truth about. Then we draw a sample from that population, calculate its mean and standard deviation, and use that to make inferences about the (pretend unknown) population.

For instance, if we claim that based on the 95% CI derived from our sample, we are 95% sure the true mean \(\mu\) is between A and B, then if we repeat the whole survey 100 times, do we find that the true \(\mu\) is actually between A and B in about 95 of those runs? Hopefully so!

The basic idea is that we want a couple loops inside each other. For the whole process, we’ll specify what the true population mean and distribution is. In the inner loop, we will be taking a 1000-unit sample from our population. We’ll calculate the mean, the standard deviation, the standard error, and the 95% CI for that sample. In the outer loop, we’ll just repeat that whole process 100 times, and for each run, we’ll record the mean and the lower and upper limit for the 95% CI (ie, three numbers). Then when we’re done, we’ll see whether the true population mean \(\mu\) is in fact in the 95% CI for each individual run 95% of the time. If so, then we have verified that (in this case at least) the standard error and CI process seems to work.

1.1 The population

We’ll construct and examine each of the pieces of the script in turn before we run anything.

First, what is our population? Let’s make it somewhat interesting – after all, the whole point of the Central Limit Theorem is that the original population doesn’t have to be normally distributed for the stastistic (\(\bar{x}\)) to be normally distributed around the truth.

So let’s imagine we have a beetle, say, where the males and females are of differnt sizes, but we’re interested in the overall average size of the beetle. There are equal numbers of males and females, and the males are 10mm with a standard deviation of 2mm, while the females are 20mm with a standard deviation of 5mm (both are normally distributed). What does the overall population look like?

1.2 Sampling from the population

Let’s take a big sample from our population and plot the histogram (and let’s assume our true population, as is the case with beetles, is approximately infinite in size).

# 1. Set our sample size
nsamples <- 10000
# 2. Create an empty vector to store our samples
sampler <- rep(NA,nsamples)
# 3. Our sampling loop
for(i in 1:nsamples){
  # 4. At random we get either a male or female beetle
  #    If it's male, we draw from the male distribution
  if(runif(1) < 0.5){
    sampler[i] <- rnorm(1,10,2)
  }
  #   If it's female, we draw from the female distribution
  else{
    sampler[i] <- rnorm(1,20,5)
  }
}
# 5. Check our results
head(sampler)
[1]  7.655140  5.317862 19.395948  9.180670 15.160586 22.523289

1.3 Visualizing the sample

Now if we want to plot it to take a look:

library(ggplot2)
ggplot(data=data.frame(sampler),aes(x=sampler)) + geom_histogram()

(Note that we have to make our sampler vector into a data frame to make ggplot happy.)

1.4 Calculating the sample mean and sd

As we can see, the beetle population is very non-normal due to the differences between males and females. But we can figure out the mean size and the overall standard deviation easily enough. The mean we can calculate directly: if there are equal numbers of males and females, then the mean of the total population must be the mean of 10 and 20, or 15. The standard deviation we could also calculate directly, but we’re not going to go into that here. Instead, we can just do it using our 10,000-beetle sample:

mean(sampler)
[1] 14.98135
sd(sampler)
[1] 6.276115

Great! Now let’s do that whole thing 100 times!

2 Repeat the sampling 100 times

The next step is to build the outer loop, where we specify how many times to run the whole thing, and where to store our summary statistics (\(\bar{x}\) and the 95% CI intervals) for each run.

Here’s the other loop, which we won’t run yet:

# 1. Set how many times we do the whole thing
nruns <- 100
# 2. Set how many samples to take in each run
nsamples <- 1000
# 3. Create an empty matrix to hold our summary data: the mean and the upper and lower CI bounds.
sample_summary <- matrix(NA,nruns,3)
# 4. Run the loop
for(j in 1:nruns){
  
  # 5. Insert the previous code that actually does the sampling in here.
  
  
  # 6. Finally, calculate the mean and 95% CI's for each sample 
  #    and save it in the correct row of our sample_summary matrix
  sample_summary[j,1] <- mean(sampler)  # mean
  standard_error <- sd(sampler)/sqrt(nsamples) # standard error
  sample_summary[j,2] <- mean(sampler) - 1.96*standard_error # lower 95% CI bound
  sample_summary[j,3] <- mean(sampler) + 1.96*standard_error # lower 95% CI bound
}

2.1 The whole simulation

Finally, here’s the whole code:

# 1. Set how many times we do the whole thing
nruns <- 100
# 2. Set how many samples to take in each run (1000 rather than the previous 10,000)
nsamples <- 1000
# 3. Create an empty matrix to hold our summary data: the mean and the upper and lower CI bounds.
sample_summary <- matrix(NA,nruns,3)
# 4. Run the loop
for(j in 1:nruns){
  sampler <- rep(NA,nsamples)
  # 5. Our sampling loop
  for(i in 1:nsamples){
    # 6. At random we get either a male or female beetle
    #    If it's male, we draw from the male distribution
    if(runif(1) < 0.5){
      sampler[i] <- rnorm(1,10,2)
    }
    #   If it's female, we draw from the female distribution
    else{
      sampler[i] <- rnorm(1,20,5)
    }
  } 
  # 7. Finally, calculate the mean and 95% CI's for each sample 
  #    and save it in the correct row of our sample_summary matrix
  sample_summary[j,1] <- mean(sampler)  # mean
  standard_error <- sd(sampler)/sqrt(nsamples) # standard error
  sample_summary[j,2] <- mean(sampler) - 1.96*standard_error # lower 95% CI bound
  sample_summary[j,3] <- mean(sampler) + 1.96*standard_error # lower 95% CI bound
}

2.2 Are our 95% CI’s correct 95% of the time?

Now we want to test it. We know the true population mean is 15. What proportion of the 95% CI’s across all of our runs actually include the true mean in their range?

counter = 0
for(j in 1:nruns){
  # If 15 is above the lower CI bound and below the upper CI bound:
  if(15 > sample_summary[j,2] && 15 < sample_summary[j,3]){
    counter <- counter + 1
  }
}
counter
[1] 95

Not bad! This means that the assumption that the \(\bar{x}\) are distributed normally with a standard error of \(s/\sqrt{n}\) is a good guide. It allows us to calculate 95% CI’s that are right about 95% of the time. When we have a small sample, our CI will be larger, which means on the one hand we are less sure about the true mean, but on the other hand, we are still 95% confident that that true mean is somewhere in our CI range.

As we will see in the next module, the reason that matters is that we want to make definite claims about the population. Is the average beetle size bigger than 13? Are the males truly smaller than the females, or did we just get a weird sample? These are the sorts of questions we will want to answer, and now we’re ready.