Overview
This lesson explores some common probability distributions.
Objectives
After completing this lesson, students should be able to:
Reading
Schumacker, Chapter 5; Lander, Chapter 14.
Perhaps the most important tools in statistics are the distributions used to model different phenomena. Each distribution presumes a different process by which data are generated, so the goal is often to ascertain whether the data-generating process (DGP) that we’re interested in resembles the process behind one of the known distributions. If so, we can make inferences about the origins of that data using statistics based on the relevant distribution.
In this module we present a few of the more important distributions and the functions that generate them. The key thing to bear in mind is the process that generates each distribution, and when we might want to apply it in the real world. There are dozens if not hundreds of distributions in the statistician’s toolkit, but a few of the most popular ones do by far the majority of the work – often to the point of being applied to processes they don’t quite match, but where assuming they do makes our analysis much easier.
How do we know a coin is biased? We flip it a lot, and see whether we get more heads or tails than we would expect. But of course we’re never going to get exactly 50% heads and 50% tails. So how many heads or tails is enough to be suspicious? The binomial distribution gives the probability of getting \(x\) events (eg, heads) out of a total of \(n\) trials, where each of the events we’re interested in has probability \(p\).
If we’re asking about, say, three heads out of four flips, we could get HHHT, HTHH, etc. Each such sequence has the same probability, \(0.5^4\), but there are several different ways of getting three heads and one tail. The binomial pdf is:
\[P(x;n,p) = {n \choose x}p^x (1-p)^{n-x}\]
The semi-colon separates the random variable, \(x\), from the parameters \(p\) and \(n\). For any given \(p\) and \(n\), we can derive the probability of getting \(x\) events for any \(x\).
The \(p^x (1-p)^{n-x}\) on the right-hand side is just the probability of getting any one such sequence. For instance, with three heads and one tail, \(x = 3\) and \(n = 4\). If it were a biased coin, \(p\) could be something other than 0.5, such as 0.1; if it were a fair die, \(p\) would be 1/6. The chance of getting two 6’s and one non-6 in a particular order would then be \(1/6 \cdot 1/6 \cdot 5/6\).
To get the total probability of any such sequence (such as three heads and a tail), we just add up the probability of each of these sequences: P(HHHT) + P(HTHH) + … etc, each of which is individually the same probability. \({n \choose x}\), read as “n choose x,” counts up the number of ways we could get three H’s and one T, and equals \(\frac{n!}{x!(n-x)!}\). You can read elsewhere about how that combinatoric equation is derived, but it’s basically just a matter of counting up all the possible arrangements of three H’s and one T (eg).
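To make this concrete, here is the three-heads-in-four-flips calculation assembled by hand in base R (choose() computes the binomial coefficient):
choose(4, 3)                                  # 4 orderings of three H's and one T
factorial(4) / (factorial(3) * factorial(1))  # the same, straight from the formula
choose(4, 3) * 0.5^3 * 0.5^1                  # 0.25, the total probability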
As usual, R has four functions for dealing with binomial distributions, analogous to the ones we saw for the uniform distribution: rbinom, pbinom, dbinom, and qbinom.
What’s the probability of getting two 6’s out of three rolls of a die?
dbinom(2,3,1/6)
[1] 0.06944444
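The same answer can be assembled by hand from the pieces above: three orderings, each with probability \((1/6)^2 \cdot (5/6)\):
choose(3, 2) * (1/6)^2 * (5/6)   # 0.06944444, matching dbinom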
What’s the cumulative probability of getting three or fewer heads out of four flips?
pbinom(3,4,0.5)
[1] 0.9375
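That cumulative probability is just the pdf summed up through \(x\), which we can confirm directly:
sum(dbinom(0:3, 4, 0.5))   # 0.9375, matching pbinom(3, 4, 0.5)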
Flip a coin four times and give me the total number of heads:
rbinom(1,4,.5)
[1] 1
As usual, we can derive the dbinom result from rbinom by repeatedly flipping our coin 4 times and, for each set of four flips, writing down the total number of heads, then plotting the histogram of those totals:
library(ggplot2)
headcounts <- rbinom(100,4,.5)
ggplot(data=data.frame(headcounts),aes(x=headcounts)) + geom_histogram(aes(y=..density..)) + xlim(-1, 5) + ylab("density") + xlab("outcome")
Strictly speaking, this is a probability mass function rather than a pdf (since \(x\) is not continuous), but the figure above gives an idea of the sorts of shapes one encounters. Here’s another version with many more flips per set:
library(ggplot2)
headcounts <- rbinom(10000,100,.5)
ggplot(data=data.frame(headcounts),aes(x=headcounts)) + geom_histogram(aes(y=..density..)) + xlim(-2, 102) + ylab("density") + xlab("outcome")
Note how the distribution approaches the bell-curve shape as the number of trials goes up. As we will see, this is a general property, and one reason why the normal distribution (which we will turn to shortly) is so important.
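As a rough numerical check of this tendency: for 100 fair flips the binomial has mean \(np = 50\) and standard deviation \(\sqrt{np(1-p)} = 5\), and its pdf is already very close to the matching normal curve (dnorm is previewed here; we introduce the normal distribution properly below):
dbinom(48:52, 100, 0.5)           # binomial probabilities near the mean
dnorm(48:52, mean = 50, sd = 5)   # nearly identical normal densities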
Another important discrete distribution is the Poisson distribution. This is similar to the binomial distribution in that it measures the total number of events, but in this case, we assume that the events happen during some time (or space) interval, such as the number of emails you receive in a day or the number of times you say “I” in an hour.
As before, we have a pdf, and a cdf which describes the probability of seeing x or fewer events in the time period. There is one parameter, \(\lambda\), which is the average number of events one generally sees in the period. The pdf is:
\[P(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\]
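As a quick check that the formula and the R function agree, here is the probability of seeing 3 events when \(\lambda = 5\), computed both ways:
5^3 * exp(-5) / factorial(3)   # ~0.1404, straight from the formula
dpois(3, 5)                    # same answer from R's built-in pmf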
As before, we can plot it by generating samples using rpois, where each sample is a total number of observed events:
randpois <- rpois(10000,1)
ggplot(data=data.frame(randpois),aes(x=randpois)) + geom_histogram(aes(y=..density..),binwidth=1) + xlim(0, 20) + ylab("density") + xlab("outcome")
randpois <- rpois(10000,5)
ggplot(data=data.frame(randpois),aes(x=randpois)) + geom_histogram(aes(y=..density..),binwidth=1) + xlim(0, 20) + ylab("density") + xlab("outcome")
For rare events (where the mean number is 1 or less), the distribution is not bell-shaped, but for larger mean values, it too resembles a bell curve. In fact, the Poisson distribution is the limit of the binomial as the number of trials grows large and \(p\) grows small, with the mean number of events \(np = \lambda\) held fixed.
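We can see this convergence numerically: holding the mean number of events at \(np = 5\) while the number of trials grows, the binomial probabilities approach the Poisson’s:
dbinom(3, 100, 5/100)       # ~0.1396
dbinom(3, 10000, 5/10000)   # ~0.1404
dpois(3, 5)                 # ~0.1404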
As before, we can calculate the cdf and all the rest using the cognate functions: dpois, ppois and qpois.
poisfun <- function(x){dpois(x,5)}
ggplot(data=data.frame(randpois),aes(x=randpois)) + geom_histogram(aes(y=..density..),binwidth=1) + xlim(0, 20) + ylab("density") + xlab("outcome") +
stat_function(fun=poisfun,color="yellow")
(Note that the function is spiky because it is only defined for integer values of \(x\). In fact, plotting poisfun in this way produces lots of warnings as ggplot tries to evaluate it for non-integer \(x\) values, because dpois is a probability mass function, not a density function.)
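The other cognate functions, ppois and qpois, work just like their binomial cousins:
ppois(7, 5)     # probability of 7 or fewer events when lambda = 5: ~0.867
qpois(0.9, 5)   # smallest x with cumulative probability at least 0.9: 8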
The normal distribution is the most important and ubiquitous distribution. Its ubiquity is a result of the central limit theorem, which we will discuss in the next section. But as you know, normal, bell-curve distributions turn up almost everywhere you see random variables in the real world.
The equation for the normal distribution is:
\[p(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{\frac{-(x-\mu)^2}{2 \sigma^2}}\]
There are two parameters here: the mean \(\mu\) and the standard deviation \(\sigma\) (\(\sigma^2\) is the variance). The first describes where the center of the distribution is, and the second describes how dispersed the distribution is.
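As with the other distributions, dnorm is just this equation. We can check by computing the height of the standard normal pdf at \(x = 1\) by hand:
exp(-(1 - 0)^2 / (2 * 1^2)) / (1 * sqrt(2 * pi))   # ~0.242, from the equation
dnorm(1, mean = 0, sd = 1)                         # same answer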
As usual, we can generate a distribution by taking multiple draws using the R function rnorm:
randnorms <- rnorm(10000,0,1)
ggplot(data=data.frame(randnorms),aes(x=randnorms)) + geom_histogram(aes(y=..density..)) + xlim(-4, 4) + ylab("density") + xlab("outcome")
Here the mean is 0 and the standard deviation (sd) is 1. As a rule of thumb, for the normal distribution about 68% of observations appear within 1 sd of the mean, and about 95% of observations appear within 2 sd’s of the mean.
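We can check those rules of thumb directly against the draws we just generated:
mean(abs(randnorms) <= 1)   # share within 1 sd: ~0.68
mean(abs(randnorms) <= 2)   # share within 2 sd's: ~0.95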
We can plot the normal distribution for various combinations of \(\mu\) and \(\sigma\), but the key thing to bear in mind is that they are all basically the same shape, just shifted and stretched.
normfun <- function(x){dnorm(x,0,1)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("density") + xlab("outcome") +
stat_function(fun=normfun)
normfun <- function(x){dnorm(x,1,0.25)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("density") + xlab("outcome") +
stat_function(fun=normfun)
Because this is a true density function, the value at any given \(x\) doesn’t itself tell us very much. What we are generally interested in is the probability that an event will be between some \(a\) and \(b\), or \(P(X \in [a,b])\). This requires the cdf of the normal, which doesn’t have a closed-form equation we can write down, but as usual it can be approximated by generating a histogram of random draws and adding up the share of events below \(x\) to give us \(F(x)\) for each value of \(x\).
Luckily, R (and many others before!) has done the hard work:
pnorm(0,0,1)
[1] 0.5
pnorm(0,1,0.25)
[1] 3.167124e-05
This gives the proportion of events less than 0 for \(N(0,1)\) and \(N(1,0.25)\), to use the usual abbreviation, where \(\mu\) is the first number and \(\sigma\) is the second.
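With the cdf in hand, interval probabilities are just differences: \(P(X \in [a,b]) = F(b) - F(a)\). For a standard normal, this recovers the 95% rule of thumb from above:
pnorm(2, 0, 1) - pnorm(-2, 0, 1)   # ~0.954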
We can also plot the cdf using the pnorm function, to get a sense of how it looks:
cumlfun <- function(x){pnorm(x,0,1)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("cumulative distribution") + xlab("outcome") +
stat_function(fun=cumlfun)
As you can see, half the events are below 0, and nearly all are below 2 (ie, 2 standard deviations above 0, since sd = 1 in this case.)
Distribution | Parameters determining shape of PDF | Cumulative probability defined by input \(x\)
---|---|---
Uniform | 2: upper and lower bounds | Probability of drawing a number \(\leq x\)
Binomial | 2: probability of some event; total number of trials | Probability of observing \(\leq x\) events in \(n\) trials
Poisson | 1: average number of events in a period of time | Probability of observing \(\leq x\) events in a period of time
Normal | 2: mean; sd | Probability of drawing a number \(\leq x\)
All four R functions (r–, p–, d–, q–) take in one main input, plus a number of extra parameters that define the shape of the pdf being used. Each R function is basically a way of calculating with the height or area of the pdf that’s defined by those extra parameters.
dnorm(x,mean,sd): Give me the height of the pdf at x. This is just the PDF itself: an equation where you input an x and get an output of p(x), the height of the curve at x.
rnorm(n,mean,sd): Give me n random numbers drawn from the distribution.
pnorm(q,mean,sd): Give me the probability (P) that a random draw from the distribution will be less than q (some x value). Equivalently, give me the area under the curve up through point q. This is the CDF, or F(x).
qnorm(p,mean,sd): This is the inverse of pnorm, as you can see from the p/q reversal: give me the q (an x value) such that the area under the curve up to point q equals p. Equivalently, give me the q (x) such that there is a probability p of drawing a random number less than q, as the short check below shows.
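A quick round trip confirms the inverse relationship:
qnorm(0.975, 0, 1)          # ~1.96: 97.5% of draws fall below this x
pnorm(qnorm(0.975, 0, 1))   # 0.975: feeding it back recovers p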
Here is a plot to help you visualize how these all interrelate using the normal distribution, but the same logic holds for any pdf.