Overview

This lesson introduces the lasso and ridge regression shrinkage methods.

Objectives

After completing this module, students should be able to:

  1. Explain how shrinkage methods work to reduce the number of variables in a model.
  2. Distinguish between lasso, ridge regression, and elastic net.
  3. Apply the glmnet package to estimate models with many independent variables.
  4. Test and validate those models out-of-sample.

Reading

Lander, 19.1

Recommended: James et al, An Introduction to Statistical Learning, Chapter 6.2.

1 Shrinkage

In the previous module we have already seen one potential method for reducing a large number of X variables to a smaller set: we could do a factor analysis on all those variables, and replace the X variables with the fewer factors. Again, each new variable Z would be the projection of that factor onto the observations: that is, the score for each observation on that factor (eg, observation \(i\) might score 1.3 on the factor that ranges between sleepy and energetic).

However, this approach, although perfectly good and very popular, does have the drawback of replacing our existing and well-understood variables with the factors, which might be harder to interpret and might not capture as much of the variation in the data as the original variables did. A different approach is to select more carefully which subset of the variables actually influences y, and perhaps even weight them according to which ones contribute more or less to explaining and predicting y. In a sense, this generalizes what we have always done in multiple regression: some of the \(\beta\) coefficients are significant, and we might include those in the final regression if we were doing some sort of stepwise variable selection; other \(\beta\) coefficients are not statistically significant, and we declare them to be equivalent to 0.

With shrinkage methods, we take all of perhaps hundreds of \(\beta\) coefficients and declare some 0, leave some at their original estimated value, and set others – that are significant, say, but not very significant – somewhere in between their estimated value and 0. That is, instead of having such a hard cut-off – which always seems a bit odd given the weirdly hard line at p=0.05 – we have a more gradual one, where the most significant \(\beta\) coefficients stay as they would be from the regression, and others are shrunk more or less towards 0 depending on their significance levels. For all of you who were worrying about the arbitrariness of the p=0.05 cutoff, this approach might make you somewhat happier, and is closer to the “Bayesian” mindset we discussed earlier.

A second advantage of shrinkage methods – and perhaps a more practical one – is that they actually allow us to use quite a few more independent variables without necessarily over-fitting y. Since the shrinkage rate – how severely \(\beta\) coefficients with high p values (ie, the least significant ones) are shrunk – is something we can set ourselves, we can simply choose the shrinkage rate that maximizes our out-of-sample accuracy given our data. If keeping too many coefficients unshrunk gives too much over-fitting and bad out-of-sample accuracy, we can just increase the shrinkage rate until only a few coefficients – the most statistically significant ones – are retained, giving a smaller model that more accurately extrapolates to the out-sample test.

1.1 Lasso

There are two similar methods for shrinkage, lasso and ridge regression, that have different historical origins but very similar structures.

As you may recall, the basic idea in regular multiple regression is to choose the \(\beta\) coefficients that maximize the match between \(\hat{y}\) and \(y\). That is, we want to choose the \(\beta\) coefficients to minimize the following sum:

\[\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2\]

The solution to this minimization problem is what gives us our formula for calculating our \(\beta\) coefficients from our dataset of y and x variables.
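
For reference, in matrix form that solution is the familiar least-squares estimator,

\[\hat{\beta} = (X^TX)^{-1}X^Ty\]

where X is the matrix of independent variables (with a leading column of 1s for the constant) and y is the vector of outcomes.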

The lasso shrinkage model is very similar to this, with one addition. Instead of wanting to minimize the above sum, we instead want to minimize:

\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda \sum_{j=1}^p |\beta_j|\]

That is, instead of just optimizing the fit between \(\hat{y}\) and \(y\) (which makes the sum in the first equation as small as possible), we now want to minimize that sum while also minimizing the sum of the absolute values of all the \(\beta\) coefficients (except the constant). This introduces a trade-off: on the one hand, if we keep all the \(\beta\) coefficients at the values as estimated by the simple regression (top equation), we optimize the fit between \(\hat{y}\) and \(y\), but we also have a large \(\sum_{j=1}^p |\beta_j|\) penalty; on the other hand, we can reduce that second penalty to 0 by shrinking all the \(\beta\) coefficients to 0, but at the cost of worsening the fit between \(\hat{y}\) and \(y\) (and thus increasing the first sum). The value \(\lambda\) sets the tradeoff between the two terms: when \(\lambda\) is 0, we have our usual multiple regression equation with entirely un-shrunk coefficients; and when \(\lambda\) is very large, then the first sum plays almost no role and the equation is governed by the second sum, leading to most or all \(\beta\) coefficients being shrunk to 0.
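
To make the trade-off concrete, here is a small illustrative sketch (not part of the lesson's code; the function name and arguments are made up) that simply computes the quantity lasso minimizes for a given set of coefficients:

lasso.objective <- function(X, y, beta0, betas, lambda) {
  # X: numeric matrix of x variables; y: outcome vector
  # beta0: constant; betas: vector of coefficients; lambda: shrinkage rate
  yhat <- beta0 + X %*% betas                    # predicted values
  sum((y - yhat)^2) + lambda * sum(abs(betas))   # fit term plus penalty (constant excluded)
}

Setting lambda to 0 recovers the ordinary regression objective; making lambda huge lets the penalty dominate, which pushes the minimizing betas towards 0.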

You might ask, if the multiple regression coefficients are the ones that produce the best fit between \(\hat{y}\) and \(y\), why wouldn’t we want to use those? But the answer lies in over-fitting: sometimes it is worth getting a worse fit between \(\hat{y}\) and \(y\) in-sample, because it produces a better fit out-of-sample – and thus is better capturing the true population model, rather than just some random noise. But to get this better out-of-sample validity, we often have to try a bunch of \(\lambda\) values to find the one that works best out-of-sample.

In addition, the shrinkage reduces the number of non-zero coefficients, which means that both the statistically insignificant ones and (depending on \(\lambda\)) the less significant ones get shrunk to 0, leaving many fewer to interpret.

1.2 Ridge regression and elastic net

The second variant is ridge regression, which looks very similar to lasso:

\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda \sum_{j=1}^p (\beta_j)^2\]

This too penalizes the model for large \(\beta\) coefficients, shrinking them down towards 0. The practical difference is that, while lasso tends to shrink many \(\beta\) coefficients all the way to 0, ridge regression shrinks most of them towards 0 but rarely all the way to 0. Intuitively, the absolute-value penalty keeps pushing a coefficient toward 0 at the same rate no matter how small it gets, so it can drive coefficients exactly to 0; the squared penalty's push fades as a coefficient nears 0, so coefficients get small but rarely hit 0 exactly. This means that the lasso results are sometimes easier to interpret, because you have fewer non-zero coefficients overall.

One last option is to combine the two into a single shrinkage model, called elastic net:

\[\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_1 - \beta_2 x_2 ...)^2 + \lambda (\alpha \sum_{j=1}^p |\beta_j| + (1-\alpha) \sum_{j=1}^p (\beta_j)^2 )\]

As you can see, this once again has a single parameter \(\lambda\) which sets the total amount of shrinkage, plus another parameter \(\alpha\) that sets the proportion between the lasso and ridge penalties (\(\alpha\) ranges between 0 and 1, so the mix can be anywhere from 100% lasso through 50/50 to 100% ridge).
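
In the glmnet package we use below, alpha is exactly this mixing parameter (glmnet parameterizes the ridge part of the penalty slightly differently, but alpha plays the same role). As a hedged sketch, assuming the x, y, and lambdalevels objects defined in the example that follows, a 50/50 elastic net would look like:

library(glmnet)
enet.mod <- glmnet(x, y, alpha=0.5, lambda=lambdalevels)  # half lasso, half ridge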

2 Example

We use here an example from An Introduction to Statistical Learning with Applications in R (ISLR), which is an excellent book (and is freely available online) and happens to have been written by some of the creators of elastic net and the R package we will be using, glmnet.

We use a dataset from ISLR, on various baseball players, including their performance levels and salaries:

#install.packages("glmnet")
#install.packages("ISLR") #datasets
library(ISLR)
data(Hitters)
dim(Hitters)
[1] 322  20
head(Hitters)
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Andy Allanson      293   66     1   30  29    14     1    293    66      1
-Alan Ashby         315   81     7   24  38    39    14   3449   835     69
-Alvin Davis        479  130    18   66  72    76     3   1624   457     63
-Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
-Andres Galarraga   321   87    10   39  42    30     2    396   101     12
-Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
                  CRuns CRBI CWalks League Division PutOuts Assists Errors
-Andy Allanson       30   29     14      A        E     446      33     20
-Alan Ashby         321  414    375      N        W     632      43     10
-Alvin Davis        224  266    263      A        W     880      82     14
-Andre Dawson       828  838    354      N        E     200      11      3
-Andres Galarraga    48   46     33      N        E     805      40      4
-Alfredo Griffin    501  336    194      A        W     282     421     25
                  Salary NewLeague
-Andy Allanson        NA         A
-Alan Ashby        475.0         N
-Alvin Davis       480.0         A
-Andre Dawson      500.0         N
-Andres Galarraga   91.5         N
-Alfredo Griffin   750.0         A

As we can see, there are 20 variables in this dataset and 322 observations. We can use elastic net to model salaries as a function of the players’ various skills and achievements. glmnet works much like regression, but unlike lm it doesn’t understand factors, so it expects only a matrix of numeric variables. If you have factors, they can first be converted to dummies, using for instance model.matrix or build.x in the useful package (see Lander or the ISLR book for more on this). But for our purposes here, we’ll just extract the variables from Hitters that are numeric (not including the dependent variable, Salary), and use that.

(Note that we have to create separate X and Y data structures for the sake of glmnet, where the first is a numeric matrix of independent variables, and the second is the numeric dependent variable vector. Make sure you don’t accidentally include your dependent Y variable as part of your X variable matrix!)

Hitters <- na.omit(Hitters)
numericvars <- c(1:13,16:18)
x <- as.matrix(Hitters[,numericvars])
y <- Hitters$Salary
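
As an aside, if we did want to keep the factor variables (League, Division, and NewLeague), one hedged alternative, following the model.matrix route mentioned above, would be something like the following (x.full is just an illustrative name):

x.full <- model.matrix(Salary ~ ., data=Hitters)[,-1]  # converts factors to dummies; drop the intercept column

We stick with the purely numeric subset for the rest of this example.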

glmnet lets you set your alpha level for the proportion of lasso vs ridge regression; alpha=0 means pure ridge, and alpha=1 means pure lasso. Since the key question is setting the appropriate shrinkage level \(\lambda\) (lambda), glmnet is designed to run repeatedly on a large number of different lambda settings, after which you can predict \(\hat{y}\) for each shrinkage level, and see which predicted \(\hat{y}\) best matches the real \(y\) values out-of-sample.

So next we set the levels of lambda (shrinkage) to tell glmnet to try, ranging from a very high number to a very low number:

lambdalevels <- 10^seq(7,-2,length=100)
head(lambdalevels)
[1] 10000000  8111308  6579332  5336699  4328761  3511192
tail(lambdalevels)
[1] 0.02848036 0.02310130 0.01873817 0.01519911 0.01232847 0.01000000

This gives us a 100-item sequence ranging from \(10^7\) down to \(10^{-2}\), which will cause glmnet to run the model 100 times, once with each of these lambda values.

2.1 Example, ridge

First let’s try it with alpha=0, ie, pure ridge. The formula is simple enough:

library(glmnet)
ridge.mod=glmnet(x,y,alpha=0,lambda=lambdalevels)

glmnet stores the \(\beta\) coefficients for every lambda level, so the output is a matrix with each column being the coefficients for a given lambda level. To see the 100th set of coefficients (for the smallest lambda), we can write:

coef(ridge.mod)[,100]
 (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
126.74245930  -2.19339700   7.67471127   1.71439290  -1.88406702   0.12400003 
       Walks        Years       CAtBat        CHits       CHmRun        CRuns 
  6.10591224  -2.74289815  -0.18258474   0.16444736   0.03237573   1.50756927 
        CRBI       CWalks      PutOuts      Assists       Errors 
  0.68881161  -0.77369053   0.29461825   0.38724591  -2.95337311 

We can also see how those various \(\beta\) coefficients were shrunk down as we increase lambda from a small level of shrinkage (lambda=\(10^{-2}\)) to a large level (lambda=\(10^7\)).

plot(ridge.mod,xvar="lambda")

The coefficient values are plotted along the y axis, and the (natural log) lambda value along the x; a vertical slice for instance at log lambda=0 gives the values of the 16 coefficients at that level of shrinkage. One thing you can discern is that “shrinkage” is too simple a term: some coefficients actually can briefly grow in value for some levels of \(\lambda\) if that minimizes the sum in the equation we defined above, although all will eventually shrink to 0 when lambda is large enough.
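
If you want that vertical slice as numbers rather than reading it off the plot, you can also ask coef() for the coefficients at a particular lambda via its s= argument (values between the fitted lambdas are interpolated); for instance, log lambda = 0 corresponds to s=1:

coef(ridge.mod, s=1)  # coefficients at lambda = 1, ie log(lambda) = 0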

2.2 Example, lasso

We can do the same for a pure lasso model:

lasso.mod=glmnet(x,y,alpha=1,lambda=lambdalevels)
plot(lasso.mod,xvar="lambda")

As can be seen, the results are not too dissimilar for these data. Note that lasso will eventually shrink all coefficients to 0, whereas with ridge they get close to 0 but never exactly 0. That’s why in the lasso plot the numbers along the top (the number of non-zero coefficients) eventually go down to 0, but remain at 16 the whole time for ridge – even though for ridge, by the time you get to a large log lambda of 15 the coefficients are all very close to 0.
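
Those counts along the top of each plot are also stored in the fitted objects, in the df component (the number of non-zero coefficients at each lambda), so you can check them directly:

lasso.mod$df  # falls to 0 as lambda grows large
ridge.mod$df  # stays at 16 here, since ridge never shrinks a coefficient exactly to 0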

2.3 Choosing lambda

Since we are dealing with high-dimensional data, there is a huge danger of over-fitting: getting a model that fits well in-sample, but does terribly when predicting with new data. So what we need to figure out is the value of lambda that produces the best out-of-sample fit – ie, that minimizes the out-of-sample mean squared error (MSE) between \(\hat{y}\) and \(y\). glmnet therefore has a cross-validation procedure built into it (cv.glmnet), which automatically does k-fold cross-validation to find the best lambda. For whatever data (x,y) you give it, it automatically does the following (a rough hand-rolled version is sketched in code after the list):

  1. The data are divided into \(k\) portions.
  2. For each level of lambda and each portion, the model is fit on the other \(k-1\) portions; \(\hat{y}\) is predicted using the held-out x; and the MSE for that lambda level and portion is calculated.
  3. Thus for each level of lambda there are \(k\) MSE values, one for each of the \(k\) out-of-sample tests.
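
To make clear what cv.glmnet is automating, here is a rough hand-rolled sketch of the same procedure (assuming k = 10 and the lasso, alpha=1; cv.glmnet's internals differ in the details, and the object names here are made up):

k <- 10
folds <- sample(rep(1:k, length.out=nrow(x)))  # randomly assign each row to a portion
cvmse <- matrix(NA, nrow=k, ncol=length(lambdalevels))
for (j in 1:k) {
  fit <- glmnet(x[folds!=j,], y[folds!=j], alpha=1, lambda=lambdalevels)  # fit without portion j
  yhat <- predict(fit, newx=x[folds==j,], s=lambdalevels)  # one column of predictions per lambda
  cvmse[j,] <- colMeans((y[folds==j] - yhat)^2)  # out-of-sample MSE for each lambda on portion j
}
lambdalevels[which.min(colMeans(cvmse))]  # the lambda with the lowest average out-of-sample MSE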

We can then plot the average MSE for each lambda level (red dot), along with error bars of plus or minus one standard error for that MSE (which we can compute because we have \(k\) estimates for each lambda level):

set.seed(1)
cv.lasso.mod=cv.glmnet(x,y,alpha=1,lambda=lambdalevels)
plot(cv.lasso.mod)

The dotted vertical line on the left shows the lambda that gets the best (lowest) MSE, and the dotted line on the right marks the largest lambda whose MSE is within one standard error of that minimum. The values along the top are the number of non-zero coefficients (after shrinkage). As we can see for these data, the best fit keeps about 14 coefficients, but we do only one standard error worse (ie, statistically indistinguishable) with a model with only about 5 non-zero coefficients. Thus we can choose either the model with the best out-of-sample accuracy with 14 coefficients, or an only slightly-worse but much simpler model with 5 coefficients.

To see the results of cv.glmnet summarized, just call the output object; it tells you the best (minimum) lambda value (the dotted line above corresponding to the lowest red dot) and its associated standard error (SE), as well as the lambda value one standard error away in the direction of fewer “Nonzero” beta values.

cv.lasso.mod

Call:  cv.glmnet(x = x, y = y, lambda = lambdalevels, alpha = 1) 

Measure: Mean-Squared Error 

    Lambda Measure    SE Nonzero
min   1.23  116070 18434      14
1se  81.11  133376 12780       5

If we want to know the coefficients for the best lambda, we can extract that lambda from the cross-validation output and plug it into predict, with s= set to that lambda and type="coefficients" to get the coefficients:

bestlambda <- cv.lasso.mod$lambda.min
predict(lasso.mod, type="coefficients",s=bestlambda)
17 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) 108.94266036
AtBat        -2.06818052
Hits          6.93062388
HmRun         .         
Runs         -0.45633769
RBI           0.47800316
Walks         5.43594076
Years        -7.05006588
CAtBat       -0.06823454
CHits         .         
CHmRun        0.27024056
CRuns         1.13592168
CRBI          0.51872590
CWalks       -0.67345310
PutOuts       0.29211026
Assists       0.29233065
Errors       -2.39211866

Note how two of the coefficients (HmRun and CHits) have been shrunk to 0.
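
If we instead wanted the simpler model that the cross-validation plot pointed to (the 1se row above, with 5 non-zero coefficients), we could pass the one-standard-error lambda, stored as lambda.1se, in the same way:

predict(lasso.mod, type="coefficients", s=cv.lasso.mod$lambda.1se)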

2.4 Comparison with regression

So finally, let’s see how the shrinkage model compares to regular multiple regression. Imagine we are in a contest: we are given one subset of the data, the training set. We can do whatever we want with it to fit the model, and then we are given a second set of data, the test set. We use our model from the training set to predict \(\hat{y}\) in the test set, and whichever model has the best MSE between its \(\hat{y}\) and the true \(y\) wins (and may in fact be the more “true” model).

First we divide our data randomly into two equal halves, the training (in-sample) set and the test (out-sample) set:

set.seed(1)
trainingrows <- sample(1:nrow(x),nrow(x)/2) # half the rows at random
testrows <- setdiff(1:nrow(x),trainingrows) # the other half
trainx <- x[trainingrows,]  # matrix
trainy <- y[trainingrows] # vector
testx <- x[testrows,] # matrix
testy <- y[testrows] # vector

First let’s try regression. We fit the model on the training set, use the test data to predict \(\hat{y}\), and then calculate the MSE of that vs the true test \(y\):

lmout <- lm(trainy~trainx)
yhat.r <- cbind(1,testx) %*% lmout$coefficients
# ^^ we predict via matrix multiplication rather than just using predict() 
#    because x was made a matrix to make glmnet happy
mse.reg <- sum((testy - yhat.r)^2)/nrow(testx)
mse.reg
[1] 186205.5

Now let’s do it with lasso:

cv.lasso.mod=cv.glmnet(trainx,trainy,alpha=1,lambda=lambdalevels)
yhat.l <- predict(cv.lasso.mod$glmnet.fit, s=cv.lasso.mod$lambda.min, newx=testx)
# ^^ note how we give predict() our best lambda.min value to use for prediction
mse.las <- sum((testy - yhat.l)^2)/nrow(testx)
mse.las
[1] 147785.1

So the lasso MSE is better. Is that difference a lot? Well, we can construct a number like \(R^2\) to compare: we construct the total sum of squares (TSS) as the sum of squared differences between testy and the mean of testy. We then compare the sum of squared errors sum((testy - yhat)^2) for our two methods (regression and lasso), and construct an \(R^2\)-like measure for each. (It’s not exactly \(R^2\) even for the regression because of course we are comparing \(\hat{y}\) as predicted from one sample with \(y\) in a different sample, rather than all in the same sample as we do with the proper \(R^2\).)

tss <- sum((testy - mean(testy))^2)
sse.reg <- sum((testy - yhat.r)^2)
sse.las <- sum((testy - yhat.l)^2)
r2.r <- (tss - sse.reg) / tss
r2.l <- (tss - sse.las) / tss
r2.r
[1] 0.1709626
r2.l
[1] 0.3420207

As we can see, the out-of-sample pseudo-\(R^2\) is considerably higher for the lasso than for the regression method. Many other tests have established lasso and ridge regression (and elastic net more generally) as some of the most successful out-of-sample modeling methods. It is likely that the lasso model is doing a better job of fitting the real population, and not over-fitting due to the high number of independent variables, the way regular multiple regression is prone to do. Similarly, we might reasonably believe that the lasso coefficients shown earlier are likely to be more accurate measures of the true effects than the coefficients we get from just running the lm regression.