Overview
This lesson introduces the concept of correlation.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 15.2. Schumacker, Chapter 15.
In the last few modules, we have moved from descriptive statistics to statistical tests, motivated by the desire to test new theories against old ones and to build a statistically grounded body of scientific knowledge. But so far, our tests have been somewhat limited: we can test a hypothesized mean against a null, two or more means against each other, or two categorical variables for independence. And the results of these tests tend to be fairly narrow: that means are different, or that categories are not independent.
In each of these cases, we do gain a bit more: a better guess for the population mean, or a better guess for the expected difference between two groups; and in the case of the F test or chi-square test, we can look more closely at the individual group differences to determine which are likely to be responsible for a high test statistic. As far as it goes, these results should not be dismissed: in particular, the difference between two groups is the foundation of experimental science. We care not just whether the treatment group is different from the control group, but also about the size and direction of that effect: does the medicine help, and if so, by how much?
But often our data, and our hypotheses, are more complex. What if the independent variable that affects our dependent variable is not categorical (treatment, control) but continuous (eg, a dollar amount)? What if we have multiple independent variables affecting our dependent variable, as is so often the case in the real world? What if our dependent variable is categorical (eg, a college degree) but our independent variables are continuous? Or everything is varying in time, not just across individual subjects? These scenarios are all both more complex and more common than many of those we have considered so far, especially if we are dealing with observational rather than experimental data.
We can organize our tests so far in terms of the types of variables we’ve been testing. Y traditionally denotes the dependent variable, and X denotes the independent variable or variables. Although we’ve generally put them in terms of the substantive questions we’ve wanted to test, each of the tests we’ve seen so far can also be categorized by examining how an independent X affects a dependent Y:
| Var X | Var Y | Statistical Procedure |
|---|---|---|
| Categorical (2) (eg, sex) | Continuous (eg, income) | Difference-in-means t test |
| Categorical (eg, sex) | Categorical (eg, religion) | \(\chi^{2}\) test |
| Categorical \((> 2)\) (eg, religion) | Continuous (eg, income) | ANOVA and F test |
| Continuous (eg, height) | Continuous (eg, income) | Correlation and regression |
Thus for the t test, our variable of interest is divided into two groups according to some categorization, and we wish to know whether that categorization is correlated with our variable of interest. If we divide a bunch of people randomly, we would expect that the means (for any variable) of those two groups would not be statistically different: eg, two different random groups of people would not have significantly different mean incomes. But if we divide them up non-randomly – eg, by sex – and that categorical variable is not independent of the dependent variable, then we might see a difference between the two groups. In that sense, the independent variable – sex – has an effect on the dependent variable – income. Of course, “effect” implies that the former causes the latter, whereas it might in fact be that the reverse is true, or that some third thing causes both. But speaking a bit more loosely, we often say that X affects Y, or that there is an effect of X on Y (note the a/e distinction!), when the two are not independent.
Similarly, we might say there is an effect of sex on religion if the two are non-independent according to the chi-square test, or an effect of religion on income if an F test finds that at least one of the income means for the religion groups is different. But perhaps the most common scenario is when both our dependent and independent variables are continuous…
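To make the table above concrete, here is a minimal sketch using made-up data (the variables sex, religion, and income here are purely illustrative, not from any real dataset) showing which base R function corresponds to each row:

```r
# Hypothetical data, just to map the table's rows onto R functions
set.seed(42)
sex      <- sample(c("Female", "Male"), 200, replace = TRUE)   # categorical (2)
religion <- sample(c("A", "B", "C"), 200, replace = TRUE)      # categorical (>2)
income   <- rnorm(200, mean = 50000, sd = 10000)               # continuous

t.test(income ~ sex)                 # categorical (2) X, continuous Y
chisq.test(table(sex, religion))     # categorical X, categorical Y
summary(aov(income ~ religion))      # categorical (>2) X, continuous Y: ANOVA / F test
```

Since the data here are generated independently of each other, none of these tests should come out significant except by chance.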
When both are continuous, we need another kind of test. So far, our independent variable (X) has always been categorical, and our tests have basically been whether our category means are different from each other. And if we want to go beyond the test to prediction, that too is simple: we just take the mean for each category, and that’s our best guess for the population means for these categories. But with continuous variables, we need something new. And indeed, there are a whole host of related questions we might want to answer.
For instance, say we are interested in the effect of height on income. The first question is the basic test: are the two independent of each other, or not? But we might also want to know:
We will tackle each of these questions soon, although the fourth we will discuss at greater length in the next module.
The most basic descriptive measure of the connection between two continuous variables is their correlation. The best way to visualize the correlation between two variables is to plot them, X (horizontal) vs Y (vertical), one point for each observation. Correlated variables tend to cluster in a narrow band, while uncorrelated variables tend to be diffused in a round cloud. With correlated variables, you can predict Y from X fairly accurately, whereas with uncorrelated variables X gives you no information about Y: any variation in X is unrelated to variation in Y.
Here is a chart of a variety of scatter plots, along with their associated correlations:
Note that correlation ranges between -1 and 1, where values near either extreme indicate strongly correlated variables, and 0 indicates uncorrelated ones. A positive correlation means that when X is above its mean, Y tends to be as well; a negative correlation means that when X is above its mean, Y tends to be below its own mean – but it is easier just to visualize this from the diagram.
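As a quick constructed illustration (not taken from the figure above), correlations of exactly +1 and -1 arise when Y is an exact increasing or decreasing linear function of X, while an independent variable produces a correlation near 0:

```r
# Constructed examples of correlation at the extremes and at zero
set.seed(7)
a <- rnorm(500)
cor(a,  2 * a + 1)      # exactly  1: an increasing linear function of a
cor(a, -2 * a + 1)      # exactly -1: a decreasing linear function of a
cor(a, rnorm(500))      # near 0: the second variable is independent of a
```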
Note also that correlation does not capture everything. For instance, the variables in each of the bottom panels are clearly related, but in all of those cases the correlation measure fails to pick up on that dependence. We will need more sophisticated methods for that, which we briefly touch on in the final segment of this course.
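Here is a small constructed example of that failure: y below depends strongly on x, but because the relationship is U-shaped rather than linear, the correlation comes out near zero.

```r
# Correlation measures only *linear* association (constructed example)
set.seed(7)
x_sym <- seq(-3, 3, length.out = 200)
y_sq  <- x_sym^2 + rnorm(200, sd = 0.5)   # clearly a function of x_sym, but not linear
cor(x_sym, y_sq)                          # close to 0 despite the strong dependence
```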
To calculate the correlation between two variables, we need a measure that is strongly positive or negative when the two variables are positively or negatively correlated, but near 0 when they are uncorrelated; and it should be bounded between -1 and 1 regardless of the scale of X or Y. The basic idea is to follow the template for our calculation of the variance.
Recall that the sample variance (the square of the sd) is:
\[\textrm{Var}(y) = \frac{1}{n-1}\sum_{i} (y_{i} - \bar{y})^{2}\]
Ie, it is the sum of the squared deviations of each \(y_i\) from the mean \(\bar{y}\), divided by \(n-1\). We can also write this as:
\[\textrm{Var}(y) = \frac{1}{n-1}\sum_{i} (y_{i} - \bar{y})(y_{i} - \bar{y})\]
The covariance between two variables is very similar to this:
\[\textrm{Cov}(x,y) = \frac{1}{n-1} \sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})\]
When both \(x_i\) and \(y_i\) (eg, a person’s height and income) are above their respective means, the product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) will be positive; when they are both below their means, the product is also positive. Thus when the two variables tend to move together, this sum will be positive – a positive covariance. Similarly, when we tend to see \(x_i\) above its mean whenever \(y_i\) is below its mean, the product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) will be negative, and thus the sum will be negative – a negative covariance. And when the two variables are unrelated, the sum will be a bunch of positive and negative numbers that average out to zero.
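A small simulated check of that intuition (the variable names here are just for illustration): the sum of deviation products comes out positive when the two variables move together and negative when they move in opposite directions.

```r
# Sign of the sum of deviation products (simulated illustration)
set.seed(7)
u     <- rnorm(200)
v_pos <- u + rnorm(200)    # tends to be above its mean when u is
v_neg <- -u + rnorm(200)   # tends to be below its mean when u is above its

sum((u - mean(u)) * (v_pos - mean(v_pos)))   # positive
sum((u - mean(u)) * (v_neg - mean(v_neg)))   # negative
```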
However, while the covariance is an important and essential concept that we will be returning to frequently, it’s not quite what we want for a correlation. The main reason is that, like the variance and the sd, its size depends on the units of X and Y – if Y is in dollars, the covariance will be larger than if Y is measured in thousands of dollars.
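For instance (a constructed example), rescaling income from dollars to thousands of dollars shrinks the covariance by a factor of 1000, while the correlation is unchanged:

```r
# Covariance depends on units; correlation does not (constructed example)
set.seed(7)
height  <- rnorm(200, 170, 10)                      # height in cm
dollars <- 1000 * height + rnorm(200, sd = 20000)   # income in dollars

cov(height, dollars)          # some large number
cov(height, dollars / 1000)   # 1000 times smaller: income in thousands
cor(height, dollars)          # unchanged...
cor(height, dollars / 1000)   # ...by the rescaling
```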
If we want something that will always range between -1 and 1, we need to rescale it. That’s what the correlation is (denoted \(r\) for the sample, and \(\rho\) for the population):
\[r = \frac{\textrm{Cov}(x,y)}{ s_{x} s_{y}}\]
It has the same sign and relative size as the covariance, but dividing \((x_{i} - \bar{x})\) by the standard deviation of X (\(s_x\)), and similarly for Y, means that the sum will always lie between -1 and 1. We won’t prove that here, but you can get a feel for it by writing out the equations for \(s_x\) and \(s_y\).
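Equivalently (a small sketch, not from the reading): since dividing by \(s_x\) and \(s_y\) just standardizes the two variables, \(r\) can be computed as the average product of their z-scores.

```r
# r as the (n-1)-averaged product of z-scores (sketch)
set.seed(7)
a <- rnorm(50)
b <- a + rnorm(50)
za <- (a - mean(a)) / sd(a)
zb <- (b - mean(b)) / sd(b)
sum(za * zb) / (length(a) - 1)   # same value as...
cor(a, b)                        # ...R's built-in correlation
```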
To calculate the covariance or correlation in R is simple enough:
```r
set.seed(1)
library(ggplot2)

# Two independently generated variables: x ~ N(3, sd 2), y ~ N(-1, sd 5)
x <- rnorm(100, 3, 2)
y <- rnorm(100, -1, 5)
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_point()

cov(x, y)
## [1] -0.008554794
cor(x, y)
## [1] -0.0009943199
```
Unsurprisingly, our variables are uncorrelated in this case. But is -0.001 significantly different from 0? We clearly need a statistical test! But we’ll hold off on that until regression.
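(For the impatient: R does already provide cor.test(), which performs exactly this test of whether a correlation differs from zero; regression will give us the same answer, and much more. A quick preview:)

```r
# Preview: the built-in test of H0: rho = 0 (we will revisit this via regression)
cor.test(x, y)
```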
Let’s confirm our calculations using the formulas for covariance and correlation:
```r
# Covariance "by hand", following the formula above
cov2 <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
# Correlation: the covariance rescaled by the two standard deviations
cor2 <- cov2 / (sd(x) * sd(y))

cov(x, y)
## [1] -0.008554794
cov2
## [1] -0.008554794
cor(x, y)
## [1] -0.0009943199
cor2
## [1] -0.0009943199
```
Bingo.
Here’s a somewhat more interesting simulation that introduces some actual correlation. Note how we create Y as a function of X plus some independent noise: \(Y = X + \epsilon\). We’ll return to this formulation in the next lesson, but this is a standard way of thinking about how a dependent variable Y is formed: some effect of X, plus some other stuff that’s independent of X (\(\epsilon\)).
```r
# y is now x plus independent noise, so the two are correlated
y <- x + rnorm(100, 10, 2)
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_point()

cov(x, y)
## [1] 3.295355
cor(x, y)
## [1] 0.6635631
```
Note how the correlation is now much closer to 1 – much more interesting. But returning to our bullet points from above, how do we predict Y from X, or gauge the statistical significance of these relationships? That’s a job for regression.