Overview
This lesson shows how the regression coefficients are derived using matrix algebra, along with a number of additional regression-related measures: the \(R^2\), the adjusted \(R^2\), and the F test.
Objectives
After completing this module, students should be able to:
Reading
Lander, Chapter 16; Schumacker, Chapter 17 through p. 402.
The derivation of the regression coefficients \(\beta_0\), \(\beta_1\), \(\beta_2\) … is quite straightforward using matrix algebra. As before, we assume that each observation \(y_i\) is a function of an intercept plus the X variables: \(y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \epsilon_i\). Using the same notation as before, we have our Y data as:
\[Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \end{bmatrix}\]
That is, a vector of Y values. Our X variables can be viewed as a matrix, where each column is a different variable:
\[X = \begin{bmatrix} x_{1,1} & x_{1,2} & \dots \\ x_{2,1} & x_{2,2} & \dots \\ x_{3,1} & x_{3,2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}\]
And of course this is essentially how the data is represented when it’s stored as a data frame or CSV file. Thus in matrix notation, we have
\[Y = X \mathbf{\beta} + \epsilon\]
Where \(\mathbf{\beta}\) is now a vector of all the \(\beta_k\) values, and \(X \mathbf{\beta}\) is a matrix (X) times a vector (\(\mathbf{\beta}\)) using the matrix multiplication we saw in the first module. The one variation is that the matrix \(X\) also includes a single column of 1s at the front, which gets multiplied by \(\beta_0\), the intercept, since that is not multiplied by any X value.
So for any given row (observation) \(i\) in our data, we have \(y_i = X_i \cdot \mathbf{\beta} + \epsilon_i = 1 \cdot \beta_0 + x_{i,1} \beta_1 + x_{i,2} \beta_2 + \dots + \epsilon_i\), where the \(\cdot\) in \(X_i \cdot \mathbf{\beta}\) is the inner (“dot”) product of row \(i\) of X and the \(\mathbf{\beta}\) vector (again, see the first module for a refresher).
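To make that dot product concrete, here is a tiny sketch in R with made-up numbers (both the row of X values and the coefficients below are hypothetical, not taken from any dataset):
# one observation's systematic part: the dot product of its row of X
# (with a leading 1 for the intercept) and the beta vector
x_row <- c(1, 35, 1, 0, 5, 4)                  # made-up values for one row of X
beta  <- c(2.0, 0.01, 0.5, -0.2, 0.1, 0.05)    # made-up coefficients
sum(x_row * beta)                              # same result as x_row %*% beta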
Thus far this is just notation: a quick way of writing Y as a function of a bunch of X variables. But we can now use it to solve for what we want to know – the \(\beta\) coefficients.
We start with the assumption that Y is a linear function of the X variables (plus some noise):
\[Y = X \mathbf{\beta} + \epsilon\]
What we want to know is \(\mathbf{\beta}\). This can be solved with some simple matrix algebra. The key elements to remember are that \(A'\) is the transpose of A, which is what you get when the columns and rows have been swapped, and \(A^{-1}\) is the inverse of A, the matrix analogue of the reciprocal of a number. Just as \(a^{-1} = 1/a\) gives \(a \cdot a^{-1} = 1\) for scalars, for matrices \(AA^{-1} = \textbf{1}\), the identity matrix. So we solve our equation almost exactly as we would if we were using regular scalar numbers instead of matrices:
Again, start with:
\(Y = X \mathbf{\beta} + \epsilon\)
Now multiply both sides with \(X'\):
\(X'Y = X'X \mathbf{\beta} + X'\epsilon\)
We know that \(X'\epsilon = 0\), because we assume that the noise \(\epsilon\) is uncorrelated with X. Why that assumption of non-correlation means that \(X'\epsilon = 0\) is beyond what we will cover here, but the basic idea is that when we add up all the \(x_i*\epsilon_i\) values, they will average to 0 since the \(\epsilon_i\) values are random. So taking \(X'\epsilon = 0\) on trust, we then have:
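As a small illustration of that averaging argument, here is a quick simulation with made-up data (nothing from the course datasets): when the noise is generated independently of the X columns, each entry of \(X'\epsilon\) divided by n is an average of \(x_i \epsilon_i\) terms and comes out close to 0.
set.seed(1)                                    # for reproducibility of this sketch
X   <- cbind(1, matrix(rnorm(300), ncol = 3))  # 100 made-up rows: intercept plus 3 x variables
eps <- rnorm(100)                              # noise generated independently of X
t(X) %*% eps / 100                             # each entry is close to 0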
\(X'Y = X'X \mathbf{\beta}\)
Now if we multiply both sides by \((X'X)^{-1}\), the inverse of \(X'X\), we get:
\((X'X)^{-1}X'Y = (X'X)^{-1}X'X \mathbf{\beta}\)
And since \((X'X)^{-1}X'X = \textbf{1}\) by the definition of the inverse, we thus have:
\((X'X)^{-1}X'Y = \mathbf{\beta}\)
And that’s our solution for all the \(\beta\) values, in one swoop.
Let’s return to our previous dataset and see it in action.
anes_2008tr <- read.table("anes_2008tr.csv",sep=",",header=TRUE,stringsAsFactors=FALSE)
mr2 <- lm(ideology_con ~ age + gender_male + race_white +
education + income,data=anes_2008tr)
summary(mr2)
Call:
lm(formula = ideology_con ~ age + gender_male + race_white +
education + income, data = anes_2008tr)
Residuals:
Min 1Q Median 3Q Max
-3.5159 -0.4137 -0.0326 0.6346 3.3294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.680935 0.114812 32.061 < 2e-16 ***
age 0.005543 0.001456 3.807 0.000144 ***
gender_male 0.111990 0.053669 2.087 0.037026 *
race_white 0.277814 0.055208 5.032 5.22e-07 ***
education -0.073291 0.018011 -4.069 4.88e-05 ***
income 0.101125 0.026980 3.748 0.000182 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.272 on 2316 degrees of freedom
Multiple R-squared: 0.03202, Adjusted R-squared: 0.02993
F-statistic: 15.32 on 5 and 2316 DF, p-value: 7.635e-15
Now let’s derive it using matrix algebra. Note that t(X) is R’s notation for \(X'\), and solve(X) is R’s notation for \(X^{-1}\).
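As a quick sanity check (with a small made-up 2×2 matrix, purely illustrative), these functions behave just like the transpose and inverse in the derivation:
A <- matrix(c(2, 1, 1, 3), nrow = 2)
t(A)             # the transpose A'
solve(A)         # the inverse A^-1
A %*% solve(A)   # the 2x2 identity matrix (up to rounding)
Now, applying the same tools to the ANES data: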
xmat <- as.matrix(cbind(anes_2008tr$age, anes_2008tr$gender_male, anes_2008tr$race_white,
                        anes_2008tr$education, anes_2008tr$income))
xmat <- cbind(1,xmat) # add the column of 1s for the intercept
head(xmat)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 35 1 0 5 4
[2,] 1 58 0 1 3 3
[3,] 1 39 0 1 6 4
[4,] 1 50 1 1 3 4
[5,] 1 72 1 1 6 4
[6,] 1 71 0 1 5 3
# now we solve for beta in one step: (X'X)^-1 X'Y
solve( t(xmat) %*% xmat ) %*% t(xmat) %*% anes_2008tr$ideology_con
[,1]
[1,] 3.680934918
[2,] 0.005543151
[3,] 0.111989722
[4,] 0.277814088
[5,] -0.073290632
[6,] 0.101124718
Of course, we haven’t done anything to derive the standard errors for each \(\beta\), and that derivation goes a bit beyond the scope of this course. But the variance of each \(\beta\) and the covariances between them can be readily derived with similar matrix methods.
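As a sketch of what that looks like (using the standard result that the variance-covariance matrix of \(\mathbf{\beta}\) is \(\sigma^2 (X'X)^{-1}\), which we have not derived here), the standard errors can be recovered from the same matrices:
beta_hat <- solve(t(xmat) %*% xmat) %*% t(xmat) %*% anes_2008tr$ideology_con
resids   <- anes_2008tr$ideology_con - xmat %*% beta_hat   # residuals from the matrix solution
sigma2   <- sum(resids^2) / (nrow(xmat) - ncol(xmat))      # estimate of sigma^2, using n - k - 1
sqrt(diag(sigma2 * solve(t(xmat) %*% xmat)))               # compare to the Std. Error column of summary(mr2)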
The \(R^2\), on the other hand, can be as easily derived in the multiple case as in the bivariate case. Recall that \(R^{2} = \frac{TSS - SSE}{TSS}\), where the total sum of squares \(TSS = \sum_{i} (y_{i} - \bar{y})^{2}\) and the model sum of squared errors is \(SSE = \sum_{i} (y_{i} - \hat{y}_{i})^{2}\). Neither of these depend on X directly at all, and can be easily calculated once we generate our predicted values \(\hat{y}\).
We of course could generate \(\hat{y}\) by writing ypred <- 3.68 + 0.005 * age + ..., but luckily R has a built-in function for generating predicted values from simple models:
ypred <- predict(mr2)
# and the rest of it is done as we have done before:
y <- anes_2008tr$ideology_con
tss <- sum((y - mean(y))^2)
sse <- sum((y-ypred)^2)
r2 <- (tss-sse)/tss
r2
[1] 0.03201623
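Equivalently, since \(\hat{y} = X \mathbf{\beta}\), we could have generated the predicted values directly from the matrix solution; a brief sketch:
beta_hat  <- solve(t(xmat) %*% xmat) %*% t(xmat) %*% y   # same beta as before
ypred_mat <- as.vector(xmat %*% beta_hat)                # fitted values via matrix multiplication
1 - sum((y - ypred_mat)^2) / sum((y - mean(y))^2)        # same R^2 as above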
Again, the main point here is that the \(R^2\) measures how much of the variation in Y is explained by the model; it doesn’t care how many variables there are, just how well \(\hat{y}\) matches \(y\).
Although \(R^2\) doesn’t care about how many variables there are in your model, that is actually a flaw in some ways. In principle, you could add dozens of variables to your model; since each one at worst does nothing to improve \(\hat{y}\) and might improve it by accident, with enough variables you could get a very high \(R^2\) entirely by chance. This is called overfitting, and it is an especially severe problem for complex models with many variables, as we will see in a few modules. But even for simple linear models like multiple regression it can be a problem, leading to misleadingly large \(R^2\) values when in fact most of the “explanatory” power of the model is due to pure chance.
To compensate for this, the concept of the adjusted \(R^2\) was developed, which basically penalizes the \(R^2\) for each additional independent variable. The idea is that unlike the \(R^2\), the adjusted \(R^2\) only increases when a new variable increases \(R^2\) more than would be expected by chance alone.
The formula for the adjusted \(R^2\) is similar to that for the \(R^2\), except with an adjustment for the degrees of freedom (ie, the sample size and number of variables):
\[\textrm{adjusted } R^2 = \frac{TSS/df_t - SSE/df_e}{TSS/df_t}\]
where \(df_t = n - 1\) and \(df_e = n - k - 1\) (where k is the number of variables).
Thus to return to our example, we can directly calculate the adjusted \(R^2\):
n <- length(y)
k <- ncol(xmat)-1
dft <- n - 1
dfe <- n - k - 1
(tss/dft - sse/dfe)/ (tss/dft)
[1] 0.02992646
This matches the adjusted \(R^2\) reported in the original regression output, and is smaller than the unadjusted \(R^2\). As it happens, it’s not much smaller, reflecting the fact that most of the variables included in the model do have explanatory power (as reflected by their individual p values) and aren’t just boosting the \(R^2\) by chance alone.
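If you want to check the match programmatically, both versions are also stored in the summary object for the model:
summary(mr2)$r.squared       # the unadjusted R^2
summary(mr2)$adj.r.squared   # the adjusted R^2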
Another way to measure the overall significance of the model is via an F test. In this case, our question is whether any of the \(\beta\) coefficients are significantly different from 0. The null is that none of them are, ie, that the model is entirely insignificant. Of course, one easy way to test this is just to see whether any of the p values for the individual \(\beta\) coefficients are below 0.05, but sometimes all of them are above 0.05 and yet collectively low enough that we would not expect to see that pattern by chance. For instance, if we had 10 variables each with a p value between 0.05 and 0.10, none of them would individually be significant, but the probability of getting 10 variables with p values all that low by pure chance would be very small, allowing us to reject the null.
The F statistic for the multiple regression model should look familiar:
\[F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}\]
Unsurprisingly, it is related to the \(R^2\), since both are measures of overall model fit. The basic idea, again, is to ask what the chance is of getting \(\beta\) values as far from 0 as the ones we observed purely by chance. This is the F-test equivalent of the t test against a fixed null: we are testing a set of sample statistics (the \(\beta\)s) against a fixed value (0).
As with our previous tests, we generate an F statistic, which we either test against a threshold value, or we determine the p value directly. In both cases, we need two different degrees of freedom, as is always the case for the F distribution. As you might expect from the above equation, the first degree of freedom is \(df_1 = k\) and the second is \(df_2 = n - k -1\).
So we can calculate our F statistic and the p value directly:
f <- (r2/k) / ((1-r2)/(n-k-1))
f
[1] 15.32042
pf(f,k,(n-k-1),lower.tail=F)
[1] 7.634769e-15
And again, compare with our original regression a few slides back; R also calculates the F statistic and p value whenever lm() is run. The p value for the F test is extremely low: many of the individual \(\beta\) coefficients are themselves quite significant, so it is not surprising that all of them put together would be quite unlikely to arise by chance.
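As a final check, the same F statistic and its two degrees of freedom are also stored in the summary object:
summary(mr2)$fstatistic   # the F value, followed by df1 and df2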