Overview

This lesson introduces probability distributions.

Objectives

After completing this lesson, students should be able to:

  1. Calculate and use discrete distributions.
  2. Sample from distributions using R functions.
  3. Calculate probability density functions and cumulative distribution functions.
  4. Understand the uniform distribution.

Reading

Schumacker, Chapter 4.

1 Discrete distributions

The outcome of a roll of a die, a flip of a coin, or any other random event is known as a random variable. We usually denote this variable with a capital letter such as X, and for our die, we might write \(P(X=2) = 1/6\). Since in this case X can take on six different values, we might want to describe the probability of all those values in a single function, the probability mass function. It is easy to depict this function graphically:
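
For instance, here is one way to sketch it in R (a minimal example; we use ggplot2, which also appears later in this lesson, and the thin bars are just a plotting choice):

library(ggplot2)
# pmf of a fair six-sided die: each outcome has probability 1/6
fair_die <- data.frame(outcome = 1:6, probability = rep(1/6, 6))
ggplot(fair_die, aes(x = outcome, y = probability)) +
  geom_col(width = 0.1) +
  ylim(0, 1) + xlab("outcome") + ylab("probability")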

That is, the chance of getting each of the X values is 1/6. The key thing of course is that these probabilities must all sum to 1. For instance, here is the probability mass function for a strange three-sided die:
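
A sketch of this pmf in R (this is the same die we will sample from in section 2.2: outcomes 1, 3, and 7 with probabilities 0.2, 0.5, and 0.3) might look like:

# pmf of the unfair three-sided die, which favors 3
odd_die <- data.frame(outcome = c(1, 3, 7), probability = c(0.2, 0.5, 0.3))
ggplot(odd_die, aes(x = outcome, y = probability)) +
  geom_col(width = 0.1) +
  ylim(0, 1) + xlab("outcome") + ylab("probability")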

While this is not a fair die and favors 3, it is a legitimate random variable, and the probabilities of its possible outcomes sum to 1.

2 Sampling

The sample function in R takes as its input any vector or list, and randomly selects items from this list. For instance, if we have a die with six sides numbered 1-6, we can sample two of these at random using sample:

die1 = c(1,2,3,4,5,6)
sample(die1, 2)
[1] 4 1

By default, sample assumes “without replacement” – ie, once an item is sampled, it can’t be sampled again. But what we usually want is “with replacement” – each roll of the die is separate, and you don’t use anything up. Thus we can simulate 10 rolls of our die by taking 10 samples with replacement:

die1 = c(1,2,3,4,5,6)
sample(die1, 10, replace=TRUE)
 [1] 3 4 2 3 1 4 2 6 6 6

2.1 Probability mass function

We can approximate our probability mass function – ie, the probability of getting each of our possible outcomes – by simply rolling our die many times and plotting the proportion for each outcome.

library(ggplot2)
sout <- sample(die1, 1000, replace=TRUE)
ggplot(data=data.frame(sout),aes(x=sout)) + 
  geom_histogram(aes(y=..count../sum(..count..))) + 
  xlim(0, 7) + ylab("density") + xlab("outcome")

(The aes mapping inside geom_histogram, y=..count../sum(..count..), tells ggplot we want proportions rather than raw counts on the y axis.)
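
If we want the simulated proportions as numbers rather than a picture, one quick option is to tabulate them:

# proportion of each outcome among the simulated rolls; each should be close to 1/6
table(sout) / length(sout)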

2.2 Example calculation

We can also do the same thing for our previous example, where 1 was the outcome 20% of the time, 3 was the outcome 50% of the time, and 7 the outcome 30% of the time. The easiest way to do this is to create a 10-sided die with two 1s, five 3s, and three 7s, and then sample from that:

die2 <- c(1,1,3,3,3,3,3,7,7,7)
sout <- sample(die2, 1000, replace=TRUE)
ggplot(data=data.frame(sout),aes(x=sout)) + 
  geom_histogram(aes(y=..count../sum(..count..))) + 
  xlim(0, 8) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).

3 Probability Density Functions

So far, our probability mass functions have been discrete – you can get a 1 or a 2 as an outcome (say), but not all the points in between. But we are often interested in random variables with continuous values: a random number between 0 and 1, or a random direction, distance, height, etc.

Intuitively we know what it means to choose a number between 0 and 1 at random, but in practice and theory it’s a bit more complex. Any given number – such as 0.1232342524534 – has essentially a 0% chance of being selected, but at the same time, we know that we have a 50% chance of getting a number between 0 and 0.5, and a 10% chance of getting a number between 0.8 and 0.9.

To conceptualize this, instead of the previous probability mass function with spikes at each outcome whose heights must add up to 1, we have a probability density function (pdf), which is continuous and whose area must sum to 1. Unlike the heights of the spikes, the height of this function by itself doesn’t tell us much: instead, we’re interested in the area, such that in our previous example, the area between 0.8 and 0.9 should add up to 0.10, and the total area should add up to 1.

3.1 Uniform distribution

As you might expect, R has a function for generating a random number between A and B, where any number between A and B is equally likely:

runif(1,0,1)
[1] 0.5432848

The first argument is the number of numbers to generate, the second is the lower bound, and the third is the upper bound. As before, we can generate a large number of these and plot the histogram to see our pdf:

randunifs <- runif(10000,0,1)
ggplot(data=data.frame(randunifs),aes(x=randunifs)) +  geom_histogram(aes(y=..density..)) +  xlim(-1, 2) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).

This is what’s known as a uniform distribution: everything between A and B is equally likely.
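
We can also use these simulated draws to check the interval claims from the start of this section (the exact proportions will vary a little from run to run):

mean(randunifs <= 0.5)                      # should be close to 0.5
mean(randunifs >= 0.8 & randunifs <= 0.9)   # should be close to 0.1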

3.2 Visualization

Rather than as shaded-in histograms, pdfs are often drawn as curves, although you can always imagine them as idealized histograms. Thus the true pdf for the uniform distribution between 0 and 1 – sometimes abbreviated U[0,1] – is shown below:

uniformfun <- function(x){ifelse(x>=0&x<=1,1,0)}
ggplot(data=data.frame(x=c(-1, 2)), aes(x)) + stat_function(fun=uniformfun)

(To create this graph, first we create our uniform function, which is 1 between 0 and 1 and 0 elsewhere. Then we use the stat_function layer in ggplot2 to add the function to the plot, where the data frame is just a placeholder setting the x range.)

3.3 Cumulative distribution function

The probability that a random number drawn from this distribution is less than some \(x\), as a function of \(x\), is known as the cumulative distribution function or cdf. For the uniform pdf, you can see that the chance that a draw is less than 0 is 0; the chance that a draw is less than 0.25 (for instance) is 25%, or 0.25; and the chance that a draw is less than 1 is 100%, or 1. Thus while the pdf of U[0,1] is the line y = 1 for all x between 0 and 1 and y = 0 elsewhere, the cdf is simpler still: the chance that a draw is less than x is just x, so we can draw the cdf as y = x for \(x \in [0,1]\), y = 0 for \(x \leq 0\), and y = 1 for \(x \geq 1\).

uniformcdf <- function(x){ifelse(x>=0&x<=1,x,ifelse(x<0,0,1))}
ggplot(data=data.frame(x=c(-1, 2)), aes(x)) + stat_function(fun=uniformcdf)

(Note how we nested the second ifelse within the first in defining our cdf. There are plenty of other ways to do it though.)
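
For instance, one alternative (just a sketch) is to clamp x to the interval [0, 1] using pmin and pmax:

# 0 below 0, x within [0, 1], 1 above 1
uniformcdf2 <- function(x){pmin(pmax(x, 0), 1)}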

3.4 PDF vs CDF

While the pdf gives you the density for any given x, denoted \(pdf(x)\) or simply \(p(x)\), the cdf is much more straightforward: it just gives you the probability that you will get a draw less than x, which is equal to \(cdf(x)\), sometimes abbreviated \(F(x)\).

Again, this is also equal to the area under the pdf to the left of x, as we discussed with the uniform pdf. As you may recall from calculus, this area is equal to the integral of the pdf up to that point, or:

\[P(X < a) = F(a) = \int_{- \infty }^a p(x) dx\]

(Note how we use capital P or F for a probability and lowercase p for a probability density function; capital X for the random variable, and lowercase x and a for specific values of that random variable.)
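
We can check this numerically for U[0,1] using R's integrate function along with the uniformfun and uniformcdf functions defined above (the integration result is approximate):

# area under the pdf to the left of 0.7, versus the cdf at 0.7;
# since the pdf is 0 below 0, integrating from -1 captures the whole left tail
integrate(uniformfun, lower = -1, upper = 0.7)$value   # approximately 0.7
uniformcdf(0.7)                                        # 0.7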

The nice thing is that often we are interested in the chance of getting a draw between a and b. We know from before that the chance of drawing a number from U[0,1] that’s between, say, 0.8 and 0.9, is 0.1 (since it’s 10% of the interval). But we can also get this using the cdf, since the chance of getting something between a and b is just the chance of getting something less than b, minus the chance of getting something less than a. That is:

\[P(X \in [a,b]) = F(b) - F(a) = \int_{- \infty }^b p(x) dx - \int_{- \infty }^a p(x) dx = \int_{a}^b p(x) dx\]

3.5 Calculating with a CDF

That makes it look much more complicated than it is though. Really all we do is use the simple difference \(F(b) - F(a)\). That is, looking at our previous figure, if we want the probability that a random draw will be between 0.234 and 0.345, say, we just calculate \(F(0.345) - F(0.234)\), that is, the y value at 0.345 minus the y value at 0.234, which for this super-simple function is just

\[F(0.345) - F(0.234) = 0.345 - 0.234 = 0.111\]

Ie, there is an 11.1% chance that our random draw will be in that range.
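
With the uniformcdf function we defined above, the same calculation in R is just:

uniformcdf(0.345) - uniformcdf(0.234)   # 0.111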

In some ways, the U[0,1] distribution is a bit more confusing than others, because \(F(x) = x\) (between 0 and 1), so the cdf value looks just like the x value. But say we have a uniform chance of drawing a number between 1 and 3. In that case, we have a pdf that looks like this:

randunifs <- runif(10000,1,3)
ggplot(data=data.frame(randunifs),aes(x=randunifs)) +  geom_histogram(aes(y=..density..)) +  xlim(0, 5) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).

3.6 Uniform with different bounds

The pdf is now only half as high, because the total area (total probability) still has to sum to 1, and we’ve increased the width by a factor of 2. Once again, it is easy enough to plot the true pdf and cdf, and to see that now the chance of getting a draw less than 2 is 0.5.

uniformfun <- function(x){ifelse(x>=1&x<=3,1/2,0)}
ggplot(data=data.frame(x=c(0, 4)), aes(x)) + stat_function(fun=uniformfun)

uniformcdf <- function(x){ifelse(x>=1&x<=3,(x-1)/2,ifelse(x<1,0,1))}
ggplot(data=data.frame(x=c(0, 4)), aes(x)) + stat_function(fun=uniformcdf)
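
As a quick check, the hand-built cdf and the simulated draws agree that about half the probability lies below 2:

uniformcdf(2)         # exactly 0.5
mean(randunifs < 2)   # close to 0.5 in the simulation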

4 R functions: Density from the PDF

We can also get the pdf, cdf, and the inverse of the cdf using built-in R functions similar to runif:

To get the density (pdf) we use dunif, which, like runif, defaults to the range [0,1], though this can be changed with the second and third arguments:

dunif(-1)
[1] 0
dunif(.5)
[1] 1
dunif(2.5,1,3)
[1] 0.5

In general, the equation for the pdf of a uniform distribution \(U[a,b]\) is:

\[p(x; a,b) = \begin{cases} \frac{1}{b-a} & \textrm{if } x \in [a,b] \\ 0 & \textrm{otherwise} \end{cases}\]
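
Written as an R function (a sketch of the same formula; the name unif_pdf and the default bounds are just for illustration), this is:

unif_pdf <- function(x, a = 0, b = 1){ifelse(x >= a & x <= b, 1/(b - a), 0)}
unif_pdf(2.5, 1, 3)   # 0.5, matching dunif(2.5, 1, 3)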

4.1 R: Probability from CDF

To get the cdf, we use punif, which, again, gives the probability that a draw will be less than x:

punif(-1)
[1] 0
punif(.5)
[1] 0.5
punif(2)
[1] 1
punif(2.5,1,3)
[1] 0.75

(Make sure you understand all these results!)
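
We can also combine two punif calls to get the chance of a draw landing in an interval, just as with \(F(b) - F(a)\) above; for example, for U[1,3]:

punif(2.5, 1, 3) - punif(1.5, 1, 3)   # 0.5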

4.2 R: CDF value from probability

And finally, to get the inverse of the cdf we use qunif, which, given an input probability x, gives us the value \(a\) such that the chance of getting a draw less than \(a\) is x.

qunif(0.5)
[1] 0.5
qunif(0.5,1,3)
[1] 2
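
Since qunif is the inverse of punif, feeding the result of one into the other should return the probability we started with:

punif(qunif(0.75, 1, 3), 1, 3)   # 0.75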

As we will see, there are many different probability distributions of interest, and many of them have similar built-in R functions akin to the four described for the uniform distribution.