This lesson introduces the lasso and ridge regression shrinkage methods.
After completing this module, students should be able to:
Lander, 19.1
Recommended: James et al, An Introduction to Statistical Learning, Chapter 6.2.
In the previous module we already saw one potential method for reducing a large number of X variables to a few variables: we could do a factor analysis on all those variables and replace the X variables with the smaller number of factors. Each new variable Z would be the projection of a factor onto the observations: that is, the score for each observation on that factor (e.g., observation \(i\) might score 1.3 on a factor that ranges between sleepy and energetic).
However, this approach, although perfectly good and very popular, has the drawback of replacing our existing, well-understood variables with factors, which may be harder to interpret and may not capture as much of the variation in the data as the original variables did. A different approach is to select which subset of the variables actually influences y, and perhaps even weight them according to how much each contributes to explaining and predicting y. In a sense, this generalizes what we have always done in multiple regression: some \(\beta\) coefficients are statistically significant, and we might keep those in the final regression if we were doing some sort of stepwise variable selection, whereas other \(\beta\) coefficients are not significant, and we declare them to be equivalent to 0.
With shrinkage methods, we take all of our perhaps hundreds of \(\beta\) coefficients and set some to 0, keep some at their original estimated values, and shrink others – those that are significant, say, but not very significant – to something in between their estimated value and 0. That is, instead of such a hard cut-off – which always seems a bit odd given the weirdly sharp line at p=0.05 – we have a more gradual one, where the most significant \(\beta\) coefficients stay as they would be from the regression, and others are shrunk more or less towards 0 depending on their significance levels. For those of you who were worrying about the arbitrariness of the p=0.05 cutoff, this approach might make you somewhat happier, and it is closer to the “Bayesian” mindset we discussed earlier.
A second advantage of shrinkage methods – and perhaps a more practical one than the merely theoretical – is that it actually allows us to use quite a few more independent variables without necessarily over-fitting y. Since the shrinkage rate – how severely the \(\beta\) coefficients with high p values (ie, the least significant ones) get shrunk – is something we can set ourselves, we can simply choose the shrinkage rate that maximizes our out-of-sample accuracy given our data. If keeping too many coefficients unshrunk gives too much over-fitting and bad out-of-sample accuracy, we can just increase the shrinkage rate until only a few coefficients – the most statistically significant ones – are retained, giving a smaller model that extrapolates more accurately to the out-of-sample test.
There are two similar methods for shrinkage, lasso and ridge regression, that have different historical origins but very similar structures.
As you may recall, the basic idea in regular multiple regression is to choose the \(\beta\) coefficients that maximize the match between \(\hat{y}\) and \(y\). That is, we want to choose the \(\beta\) coefficients to minimize the following sum:
\[\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2\]
The solution to this minimization problem is what gives us our formula for calculating our \(\beta\) coefficients from our dataset of y and x variables.
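In matrix form, with X including a column of 1s for the intercept, that least-squares solution is the familiar
\[\hat{\beta} = (X^T X)^{-1} X^T y,\]
which is the solution that lm computes for us behind the scenes.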
The lasso shrinkage model is very similar to this, with one addition. Instead of wanting to minimize the above sum, we instead want to minimize:
\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda \sum_{j=1}^p |\beta_j|\]
That is, instead of just optimizing the fit between \(\hat{y}\) and \(y\) (which makes the sum in the first equation as small as possible), we now want to minimize that sum while also minimizing the sum of the absolute values of all the \(\beta\) coefficients (except the constant). This introduces a trade-off: on the one hand, if we keep all the \(\beta\) coefficients at the values as estimated by the simple regression (top equation), we optimize the fit between \(\hat{y}\) and \(y\), but we also have a large \(\sum_{j=1}^p |\beta_j|\) penalty; on the other hand, we can reduce that second penalty to 0 by shrinking all the \(\beta\) coefficients to 0, but at the cost of worsening the fit between \(\hat{y}\) and \(y\) (and thus increasing the first sum). The value \(\lambda\) sets the tradeoff between the two terms: when \(\lambda\) is 0, we have our usual multiple regression equation with entirely un-shrunk coefficients; and when \(\lambda\) is very large, then the first sum plays almost no role and the equation is governed by the second sum, leading to most or all \(\beta\) coefficients being shrunk to 0.
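To make the trade-off concrete, here is a small illustrative R function (a sketch only; the name lasso_objective is ours, and glmnet does not actually work this way internally) that computes this penalized sum for a given set of coefficients and a given \(\lambda\):
# Illustrative only: the lasso objective for a given coefficient vector and lambda
# X: numeric matrix of predictors; y: outcome; b0: intercept; b: slopes; lambda: penalty weight
lasso_objective <- function(X, y, b0, b, lambda) {
  yhat <- b0 + X %*% b              # predicted values
  fit <- sum((y - yhat)^2)          # first term: how well yhat matches y
  penalty <- lambda * sum(abs(b))   # second term: total size of the (non-intercept) coefficients
  fit + penalty                     # lasso seeks the b0 and b that minimize this total
}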
You might ask, if the multiple regression coefficients are the ones that produce the best fit between \(\hat{y}\) and \(y\), why wouldn’t we want to use those? But the answer lies in over-fitting: sometimes it is worth getting a worse fit between \(\hat{y}\) and \(y\) in-sample, because it produces a better fit out-of-sample – and thus is better capturing the true population model, rather than just some random noise. But to get this better out-of-sample validity, we often have to try a bunch of \(\lambda\) values to find the one that works best out-of-sample.
In addition, the shrinkage reduces the number of non-zero coefficients: both the statistically insignificant ones and (depending on \(\lambda\)) the less significant ones get shrunk to 0, leaving many fewer to interpret.
The second variant is ridge regression, which looks very similar to lasso:
\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda \sum_{j=1}^p (\beta_j)^2\]
This too penalizes the model for large \(\beta\) coefficients, shrinking them down towards 0. The practical difference is that, while lasso tends to shrink many \(\beta\) coefficients all the way to 0, ridge regression shrinks most of them towards 0 but few all the way to 0. This means that the lasso results are often easier to interpret, because you end up with fewer non-zero coefficients overall.
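One way to see why ridge rarely produces exact zeros: unlike lasso, ridge regression has a closed-form solution. With the predictors standardized and the intercept left unpenalized, the slope estimates are
\[\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y,\]
which shrinks every coefficient smoothly towards 0 as \(\lambda\) grows but (outside of degenerate cases) never sets one exactly to 0, whereas the kink in the absolute-value penalty lets lasso push coefficients exactly to 0.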
One last option is to combine the two into a single shrinkage model, called elastic net:
\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda (\alpha \sum_{j=1}^p |\beta_j| + (1-\alpha) \sum_{j=1}^p (\beta_j)^2 )\]
As you can see, this once again has a single parameter \(\lambda\) which sets the total amount of shrinkage, plus another parameter \(\alpha\) that sets the proportion between the lasso and ridge penalties (\(\alpha\) ranges between 0 and 1, so the mix can be anywhere from 100% lasso through 50/50 to 100% ridge).
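In the glmnet package we use below, \(\alpha\) is set with the alpha argument. For instance, a 50/50 elastic net fit would look something like this (a sketch; x and y are the predictor matrix and outcome vector we construct in a moment):
library(glmnet)
# alpha=0.5 splits the penalty evenly between the lasso and ridge terms;
# glmnet picks its own sequence of lambda values unless we supply one
enet.mod <- glmnet(x, y, alpha=0.5)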
We use here an example from An Introduction to Statistical Learning with Applications in R (ISLR), which is an excellent book (and can be found online here) and happens to have been written by some of the creators of elastic net and of the R package we will be using, glmnet.
We use a dataset from ISLR, on various baseball players, including their performance levels and salaries:
#install.packages("glmnet")
#install.packages("ISLR") #datasets
library(ISLR)
data(Hitters)
dim(Hitters)
[1] 322 20
head(Hitters)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
-Andy Allanson 293 66 1 30 29 14 1 293 66
-Alan Ashby 315 81 7 24 38 39 14 3449 835
-Alvin Davis 479 130 18 66 72 76 3 1624 457
-Andre Dawson 496 141 20 65 78 37 11 5628 1575
-Andres Galarraga 321 87 10 39 42 30 2 396 101
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133
CHmRun CRuns CRBI CWalks League Division PutOuts Assists
-Andy Allanson 1 30 29 14 A E 446 33
-Alan Ashby 69 321 414 375 N W 632 43
-Alvin Davis 63 224 266 263 A W 880 82
-Andre Dawson 225 828 838 354 N E 200 11
-Andres Galarraga 12 48 46 33 N E 805 40
-Alfredo Griffin 19 501 336 194 A W 282 421
Errors Salary NewLeague
-Andy Allanson 20 NA A
-Alan Ashby 10 475.0 N
-Alvin Davis 14 480.0 A
-Andre Dawson 3 500.0 N
-Andres Galarraga 4 91.5 N
-Alfredo Griffin 25 750.0 A
As we can see, there are 20 variables in this dataset and 322 observations. We can use elastic net to model salaries as a function of the players’ various skills and achievements. glmnet works much like regression, but unlike lm it doesn’t understand factors, so it expects only a matrix of numeric variables. If you have factors, they can first be converted to dummies, using for instance model.matrix or build.x in the useful package (see Lander or the ISLR book for more on this). But for our purposes here, we’ll just extract the variables from Hitters that are numeric (not including the dependent variable, Salary), and use that:
Hitters <- na.omit(Hitters)             # drop players with missing values (e.g., missing Salary)
numericvars <- c(1:13,16:18)            # column indices of the 16 numeric predictors
x <- as.matrix(Hitters[,numericvars])   # glmnet needs a numeric matrix of predictors
y <- Hitters$Salary                     # the dependent variable
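(Had we wanted to keep the factor variables too, a sketch of the dummy-coding approach might look like the following; x.full is just an illustrative name, and we don’t use it below.)
# model.matrix() turns League, Division, and NewLeague into 0/1 dummy columns;
# [,-1] drops the intercept column that model.matrix adds
x.full <- model.matrix(Salary ~ ., data=Hitters)[,-1]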
glmnet lets you set your alpha level for the proportion of lasso vs ridge regression; alpha=0 means pure ridge, and alpha=1 means pure lasso. Since the key question is setting the appropriate shrinkage level \(\lambda\) (lambda), glmnet is designed to run repeatedly over a large number of different lambda settings, after which you can predict \(\hat{y}\) for each shrinkage level and see which predicted \(\hat{y}\) best matches the real \(y\) values out-of-sample.
So next we set the levels of lambda (shrinkage) to tell glmnet to try, ranging from a very high lambda to a very low lambda:
lambdalevels <- 10^seq(7,-2,length=100)
This gives us a 100-item sequence ranging from \(10^7\) down to \(10^{-2}\), which will cause glmnet to run the model 100 times, once with each of these lambda values.
First let’s try it with alpha=0, ie, pure ridge. The formula is simple enough:
library(glmnet)
ridge.mod=glmnet(x,y,alpha=0,lambda=lambdalevels)
glmnet stores the \(\beta\) coefficients for every lambda level, so there are p × (number of lambda levels) coefficients stored. To see the 100th set of coefficients (for the smallest lambda), we can write:
coef(ridge.mod)[,100]
(Intercept) AtBat Hits HmRun Runs
126.74245930 -2.19339700 7.67471127 1.71439290 -1.88406702
RBI Walks Years CAtBat CHits
0.12400003 6.10591224 -2.74289815 -0.18258474 0.16444736
CHmRun CRuns CRBI CWalks PutOuts
0.03237573 1.50756927 0.68881161 -0.77369053 0.29461825
Assists Errors
0.38724591 -2.95337311
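coef() will also give us the coefficients at any particular lambda via its s= argument (the value 100 here is just an arbitrary illustration; glmnet interpolates between the lambda values it actually fit):
coef(ridge.mod, s=100)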
We can also see how the various \(\beta\) coefficients get shrunk down as we increase lambda from a small level of shrinkage (lambda=\(10^{-2}\)) to a large level (lambda=\(10^7\)):
plot(ridge.mod,xvar="lambda")
The coefficient values are plotted along the y axis, and the (natural log of the) lambda value along the x; a vertical slice, for instance at log lambda=0, gives the values of the 16 coefficients at that level of shrinkage. One thing you can discern is that “shrinkage” is too simple a term: some coefficients can actually briefly grow in value for some levels of \(\lambda\) if that minimizes the sum in the equation we defined above, although all will eventually shrink to 0 when lambda is large enough.
We can do the same for a pure lasso model:
lasso.mod=glmnet(x,y,alpha=1,lambda=lambdalevels)
plot(lasso.mod,xvar="lambda")
As can be seen, the results are not too dissimilar for these data.
Of course, what we most want is the value of lambda that produces the best out-of-sample fit – ie, that minimizes the out-of-sample mean squared error (MSE) between \(\hat{y}\) and \(y\). glmnet therefore has a cross-validation procedure built into it, cv.glmnet, which does k-fold cross-validation for every level of lambda, calculates the MSE for each, and chooses the lambda with the lowest error.
We can then plot the average MSE for each lambda level (red dot), along with the 95% CI for that MSE (because we have \(k\) of them for each lambda level):
set.seed(1)
cv.lasso.mod=cv.glmnet(x,y,alpha=1,lambda=lambdalevels)
plot(cv.lasso.mod)
The dotted vertical line on the left shows the lambda that gets the lowest MSE, and the dotted line on the right marks the largest lambda whose MSE is within one standard error of that minimum. The values along the top are the number of non-zero coefficients (after shrinkage). As we can see for these data, the best fit has about 12 non-zero coefficients, but we do only one standard error worse (ie, statistically indistinguishable) with a model with only about 4 non-zero coefficients. Thus we can choose either the model with the best out-of-sample accuracy, with 12 coefficients, or an only slightly worse but much simpler model with 4 coefficients.
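If we prefer that simpler model, cv.glmnet stores its lambda as lambda.1se, and we can pull out its coefficients directly (a sketch; the exact set of surviving coefficients can vary a little with the random folds):
coef(cv.lasso.mod, s="lambda.1se")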
If we want to know the coefficients for the best lambda, we can get the best lambda and plug it into predict, with s= for the best lambda and type="coefficients" to get the coefficients:
bestlambda <- cv.lasso.mod$lambda.min
predict(lasso.mod, type="coefficients",s=bestlambda)
17 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 76.4896752
AtBat -1.7145436
Hits 6.1037612
HmRun .
Runs .
RBI 0.2484785
Walks 4.7562058
Years -9.3407580
CAtBat .
CHits .
CHmRun 0.5407148
CRuns 0.6858672
CRBI 0.3235562
CWalks -0.4882173
PutOuts 0.2800842
Assists 0.1852786
Errors -1.7645945
Note how four have been shrunk to 0.
So finally, let’s see how the shrinkage model compares to regular multiple regression. Imagine we are in a contest: we are given one subset of the data, the training set. We can do whatever we want with it to fit the model, and then we are given a second set of data, the test set. We use our model from the training set to predict \(\hat{y}\) in the test set, and whichever model has the best MSE between its \(\hat{y}\) and the true \(y\) wins (and may in fact be the more “true” model).
First we divide our data randomly into two equal halves, the training set and the test set:
set.seed(1)
train <- sample(1:nrow(x),nrow(x)/2)   # randomly pick half of the row indices for the training set
test <- (-train)                       # the test set is every row not in the training set
trainx <- x[train,]
testx <- x[test,]
trainy <- y[train]
testy <- y[test]
Let’s first try regression. We fit the model on the training set, use the test data to predict \(\hat{y}\), and then calculate the MSE of that against the true test \(y\):
lmout <- lm(trainy~trainx)
yhat.r <- cbind(1,testx) %*% lmout$coefficients
# ^^ we predict via matrix multiplication rather than just using predict()
# because x was made a matrix to make glmnet happy
mse.reg <- sum((testy - yhat.r)^2)/nrow(testx)
mse.reg
[1] 128137
Now let’s do it with lasso:
cv.lasso.mod=cv.glmnet(trainx,trainy,alpha=1,lambda=lambdalevels)
yhat.l <- predict(cv.lasso.mod$glmnet.fit, s=cv.lasso.mod$lambda.min, newx=testx)
# ^^ note how we give predict() our best lambda.min value to use for prediction
mse.las <- sum((testy - yhat.l)^2)/nrow(testx)
mse.las
[1] 106427.9
So the lasso MSE is better. Is that difference a lot? Well, we can construct a number like \(R^2\) to compare: we compute the total sum of squares (TSS) as the squared error between testy and the mean of testy. We then compare the sum of squared errors sum((testy - yhat)^2) for our other two methods (regression and lasso), and construct an \(R^2\)-like measure for each. (It’s not exactly \(R^2\), even for the regression, because we are comparing \(\hat{y}\) as predicted from one sample with \(y\) in a different sample, rather than all in the same sample as with the proper \(R^2\).)
tss <- sum((testy - mean(testy))^2)
sse.reg <- sum((testy - yhat.r)^2)
sse.las <- sum((testy - yhat.l)^2)
r2.r <- (tss - sse.reg) / tss
r2.l <- (tss - sse.las) / tss
r2.r
[1] 0.3349609
r2.l
[1] 0.4476324
As we can see, the out-of-sample pseudo-\(R^2\) is considerably higher for the lasso than for the regression method. Many other tests have established lasso and ridge regression (and elastic net more generally) as some of the most successful out-of-sample modeling methods. It is likely that the lasso model is doing a better job of fitting the real population, and not over-fitting due to having a high number of independent variables, the way regular multiple regression is prone to do. Similarly, we might reasonably believe that the lasso coefficients in the previous slide are likely to be more accurate measures of the true effects than the coefficients we get from just running the lm regression.
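If you want to extend the contest, the same pattern works for pure ridge (alpha=0) or any elastic net mix (e.g., alpha=0.5); a sketch, reusing the training and test objects defined above:
set.seed(1)
cv.ridge.mod <- cv.glmnet(trainx, trainy, alpha=0, lambda=lambdalevels)
yhat.rdg <- predict(cv.ridge.mod$glmnet.fit, s=cv.ridge.mod$lambda.min, newx=testx)
mse.rdg <- sum((testy - yhat.rdg)^2)/nrow(testx)   # out-of-sample MSE, comparable to mse.reg and mse.las
mse.rdg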