Overview
This lesson introduces the concepts of \(R^2\) and coefficient significance.
Objectives
After completing this module, students should be able to: explain what \(R^2\) measures and how it is built from the TSS and SSE; test whether a regression coefficient is statistically significantly different from 0 using its standard error; and interpret the coefficient table, \(R^2\), and confidence intervals in the output of a bivariate regression in R.
Reading
Schumacker, Chapter 16.
Given our equation \(y = \beta_{0} + \beta_{1}x\) and our goal of finding the best-fit line, how do we judge (a) how well our line fits our data, and (b) whether this fit is statistically significant in some sense, given that, as usual, we will always be able to come up with some set of numbers, just as we did with sample means? How do we know the slope – the relationship between X and Y – is statistically meaningful, and not just a random number generated from the equations in the previous lesson, due only to the random noise inherent in any statistical process?
First let’s tackle (a), and come up with some measure of how well our line fits our data. As we know from our discussion of correlation, if the correlation is near 1 or -1 (thinking back to the figure in the first lesson), X predicts Y very precisely: no \(y_i\) value is very much above or below the line, and the points cluster tightly around it. Conversely, we might imagine a cloud of points with the same best-fit slope, but the points themselves are much more loosely bunched around the line: the slope is the same, but the correlation is lower. How do we measure this?
If we denote the correlation as \(r\), then we denote the fit from the regression as \(R^2\). This actually does equal \(r^2\) for the case of two variables, but it is worth discussing in slightly more general terms, since it does not equal \(r^2\) when there are multiple X variables (how could it?). But it is nevertheless a meaningful measure of fit for any multiple or bivariate regression.
\(R^2\) is essentially a measure of the proportion of the total variation in Y explained by X. If every time \(y_i\) is above \(\bar{y}\), \(x_i\) is too (by a proportional amount), then X explains a lot of the variation in Y; but if we are less able to predict \(y_i\) with \(x_i\), then less of the variation in Y is explained by X.
So how do we put this measure into mathematical terms?
First we define the total variation in Y, which is very much like the variance in Y and is called (for obvious reasons) the Total Sum of Squares or TSS:
\[TSS = \sum_{i} (y_{i} - \bar{y})^{2}\]
That is, the TSS is a measure of the total variation of Y around its mean, much like the variance.
Conversely, the Sum of Squared Errors or SSE is like the variance of our prediction, and should be familiar from the end of the previous lesson: \[SSE = \sum_{i} (y_{i} - \hat{y}_{i})^{2}\]
Ideally, we might just be interested in the SSE, since it measures how close the various predictions \(\hat{y}_i\) are to the true values \(y_i\). But this measure by itself is not very useful, since it depends on the unit we’ve measured Y in, as well as on the amount of variation there already is in Y. If instead we want something more like the correlation \(r\), which ranges between -1 and 1 for any Y and X, and thus allows us to compare fits across different models, we need to rescale it as a proportion, just as we did in moving from the covariance to correlation. We thus define \(R^2\) as:
\[R^{2} = \frac{TSS - SSE}{TSS}\]
Unlike the correlation, this value can range only between 0 and 1. The total amount of error is the TSS, which is the most the model can explain. If the model is perfect, then \(\hat{y}_i = y_i\) for every i, and the SSE is 0. Thus the \(R^2\) is of course 1. If the model is terrible, then X explains none of the variation in Y, and basically we do no better than just guessing \(\bar{y}\) for any given \(y_i\); in that case, the SSE equals the TSS (\(\hat{y}_{i} = \bar{y}\)), and \(R^2\) equals 0. Thus the best we can do is 1, and the worst is 0.
\(R^{2}\) is also sometimes called the proportional reduction in error (ie, the reduction in the TSS). So if \(R^{2} = .3\), we say we have reduced the error by \(30\%\), or \(30\%\) of the variation in \(y\) is explained by the model. If \(R^2\) is 1, then we have reduced the error by 100% and explained all the variation, and if \(R^2\) is 0, we have reduced the error not at all, and explained none of the variation with our model.
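To make this concrete, here is a minimal sketch in R. The simulated data (and the seed) are made up purely for this illustration, not the lesson’s dataset, and the slope and intercept are computed with the standard least-squares estimates:
set.seed(2)                               # an arbitrary seed, just for this illustration
x <- rnorm(50, 3, 2)                      # a made-up X
y <- 5 + 3*x + rnorm(50, 0, 10)           # a made-up Y with noise
b1 <- cov(x, y) / var(x)                  # least-squares slope
b0 <- mean(y) - b1 * mean(x)              # least-squares intercept
y_hat <- b0 + b1 * x                      # predicted values
TSS <- sum((y - mean(y))^2)               # total variation in Y
SSE <- sum((y - y_hat)^2)                 # variation left unexplained
(TSS - SSE) / TSS                         # R^2 by hand
cor(x, y)^2                               # same value: r^2 = R^2 in the bivariate case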
With sample means, we always get some number, and if our null hypothesis was that the population mean was 0, we couldn’t just ask whether the sample mean was not equal to 0, since that would almost always be the case simply by chance, even if the true population mean really was 0. Similarly, with a regression, we will always get some value for \(\beta_1\), but we want to know whether that value is significant: does a change in X really produce a change in Y, or is it just a statistical illusion? We can’t just look at how big \(\beta_1\) is, because that depends on the units of X and Y – if we measure GDP in thousands, \(\beta_1\) is bigger than if we measure GDP in cents. (Make sure you understand why.) And while \(R^2\) is a better guide – presumably if \(R^2\) is near 1, then it can’t be by chance that X explains so much of the variation in Y – that’s still not a statistical test. We need something more precise.
Specifically, what we are fundamentally interested in is whether \(\beta_1\) (and \(\beta_0\) too) is significantly different from the null hypothesis, of no effect. No effect of X on Y means \(\beta_1 = 0\), so we are interested in testing whether \(\beta_1 \neq 0\), and the same for \(\beta_0\). We know how to do that! All we need is the standard error on \(\beta_1\).
Again, we’ll skip the derivation, but the equations for the standard error of \(\beta_0\) and \(\beta_1\) are:
\[se_{\hat{y}} = \sqrt{ \frac{\sum (y_i-\hat{y}_i)^2 }{n-2}}\]
\[se_{\beta_0} = se_{\hat{y}} \sqrt{ \frac{\sum x_i^2}{n \sum (x_i - \bar{x})^2}}\]
\[se_{\beta_1} = se_{\hat{y}} \frac{1}{\sqrt{\sum (x_i - \bar{x})^2}}\]
(The first equation is just there to help set up the second two.) Once we have our standard error, the test proceeds exactly as we’ve always done it, using a t test. For instance, if we got \(\beta_{1} = 0.31\) and \(se_{\beta_{1}} = 0.10\), our t statistic is just \((0.31-0)/0.10 = 3.1\). The degrees of freedom for a regression are \(n - k - 1\), where \(k\) is the number of X variables; so in this case, with a single X variable, that is \(n - 1 - 1 = n - 2\).
As usual, we almost always do a two-tailed test here (since who knows whether \(\beta_1\) will be positive or negative), so with an alpha of 0.05 and an \(n\) of 23, our threshold value is 2.08. Thus since \(3.1 > 2.08\) we can conclude that the effect of X on Y is statistically significant (ie, it is different from 0).
We can also present the 95% CI on the effect size: \(\beta_{1} = 0.31 \pm 2.08 \cdot 0.10 = 0.31 \pm 0.208 = [0.102, 0.518]\). As usual, there is uncertainty: we are 95% certain that the effect of a thousand-dollar increase in per-capita GDP is between 0.102 and 0.518 tons of CO2, with our best guess being 0.31 tons per $1K increase in per capita GDP.
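For those who want to check this arithmetic in R, a quick sketch using the numbers above (qt() and pt() are the same base-R functions we have used before):
b1  <- 0.31                                              # estimated slope
se1 <- 0.10                                              # its standard error
n   <- 23
(b1 - 0) / se1                                           # t statistic: 3.1
qt(0.975, df = n - 2)                                    # two-tailed threshold at alpha = 0.05: about 2.08
2 * pt((b1 - 0) / se1, df = n - 2, lower.tail = FALSE)   # the two-tailed p-value
b1 + c(-1, 1) * qt(0.975, df = n - 2) * se1              # the 95% CI: about [0.102, 0.518]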
Let’s return to our simulated dataset from the previous lesson, but tweak it just a little:
set.seed(1)
library(ggplot2)
x <- rnorm(100,3,2)
y <- 5 + 3*x + rnorm(100,0,10)
dat <- data.frame(x=x,y=y)
ggplot(dat, aes(x=x, y=y)) + geom_point()
Since we’ve created the data, we know what \(\beta_0\) and \(\beta_1\) are: 5 and 3. The third term in the y <- ... line is the error \(\epsilon\), which is normally distributed, as is usually the assumption.
To estimate our model – ie, to regress Y on X – we simply run:
biv_model <- lm(y~x,data=dat)
summary(biv_model)
Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.768   -6.138   -1.395    5.394   23.462 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.6390     1.9826    2.34   0.0213 *  
x             2.9947     0.5386    5.56 2.34e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.628 on 98 degrees of freedom
Multiple R-squared:  0.2398, Adjusted R-squared:  0.232
F-statistic: 30.91 on 1 and 98 DF,  p-value: 2.335e-07
We can now interpret everything in this output. The “Residuals” section is just a summary of the errors (\(y_i - \hat{y}_i\)) across all i, and is not very interesting. More important is the stuff below “Coefficients:”. The first column gives the values for \(\beta_0\) (the intercept) and \(\beta_1\) (x), and the second column gives their standard errors. The third column is the t statistic, which is just column 1 divided by column 2. As usual, if \(n\) is fairly large, a good rule of thumb is that if the t statistic is larger than about 2 (or smaller than -2), the coefficient is significant. The final column is the (two-tailed) p-value, which, for instance for \(\beta_1\), is just equal to 2*pt(5.56,98,lower.tail=F). (Make sure you understand this! The logic is exactly the same as with our earlier t tests.)
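As a quick check of that last claim, using base R:
2 * pt(5.56, df = 98, lower.tail = FALSE)   # the p-value reported for x in the output above
qt(0.975, df = 98)                          # about 1.98: where the "larger than about 2" rule of thumb comes from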
Note how the \(\beta_1\) coefficient is quite close to the truth, while the intercept is less close. The error on x is relatively small (due to how we set up the simulated data), while the error on the intercept is relatively larger, as reflected in their different t statistics. Of course, all these estimates are always uncertain, so we often report both the coefficient, and the 95% CI, which is calculated in the usual way.
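In R, one convenient way to get those intervals is confint(), which is base R and applies the same t-based formula; the short sketch below also checks the standard-error formulas from earlier in this lesson against the summary output above:
confint(biv_model)                                         # 95% CIs for the intercept and slope
se_yhat <- sqrt(sum(residuals(biv_model)^2) / (100 - 2))   # matches the 'Residual standard error' above
se_yhat / sqrt(sum((dat$x - mean(dat$x))^2))               # matches the Std. Error reported for x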
Finally, the output also shows the \(R^2\), as well as the adjusted \(R^2\) and the F statistic, which we will discuss in the next module.
We can also easily plot our regression line using ggplot:
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method=lm, se=FALSE)
`geom_smooth()` using formula 'y ~ x'
This adds the regression line from lm, without the standard error shown. If we want to show our uncertainty, we can plot:
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method=lm)
`geom_smooth()` using formula 'y ~ x'
The shaded area is essentially the 95% CI on the line. Our best guess is \(y = \beta_0 + \beta_1 x\), but remember that \(\beta_0\) and \(\beta_1\) are, like all sampling-based estimates, fundamentally uncertain, just as the sample mean is. The gray area combines the uncertainty in the intercept (ie, the line could be higher or lower) and the uncertainty in the slope (ie, the line could be steeper or flatter), giving the flared-ends shape that we customarily see. Remember, this is not a measure of the spread of the points, just a measure of the variation in the line that best fits the points. But our best point guess is always \(\hat{y}_i = \beta_0 + \beta_1 x_i\).
(In fact, for those who worry about such things, since we carefully distinguish between \(\mu\), the true population mean, and \(\bar{x}\), the sample mean, we should similarly carefully distinguish between the true \(\beta\) and our sample-based estimate, which we could call \(b\) (to reflect the Greek/Latin distinction), although more commonly it is denoted \(\hat{\beta}\) (like \(\hat{y}\)) to emphasize that it is the sample-based estimate of \(\beta\) rather than the true, unknowable population \(\beta\).)