Overview
This lesson introduces supervised learning and out-of-sample validation.
Objectives
After completing this module, students should be able to:
Readings
Lander, 18.3
Previously: the “big data” problem with a large \(p\) (a large number of variables): what were the underlying patterns (factors or clusters) in our unstructured data?
This week: the “big data” problem with a large \(p\): how do we model the dependent variable as a function of dozens or even hundreds of other variables? Which ones best predict \(y\)? Can we use all of the variables? There are two methods to be discussed this week:
one for modeling a continuous \(y\), and
one for modeling a binary \(y\).
The fundamental idea behind modeling data is to detect underlying relationships that hold in the population, independent of whichever random sample we happen to have drawn.
If \(x\) influences \(y\) in random sample 1, it should also do so in sample 2.
We can predict the new \(\hat{y}\) based on our estimated beta coefficients and the new \(x\) values in that second sample.
If our predicted values, say \(\hat{y} = 1 + 2x\) using the coefficients estimated from the first sample, were a poor match for the true \(y\) values in that second sample, then we would suspect we had done something wrong in estimating our coefficients in the first sample (e.g., we missed large standard errors showing the coefficients were not significantly different from 0, and may have been entirely mistaken in the first place).
This is out-of-sample validation:
1. Draw a sample from the population.
2. Estimate a model of Y as a function of the X variables, yielding some estimated parameters \(\mathbf{\beta}\).
3. Draw a second sample from the population, and from that sample, use the X values plus the estimated coefficients from (2) to predict \(\hat{y}\) for those new observations.
4. Compare those predicted \(\hat{y}\) to the true \(y\) values for the second sample. If those predictions are better than we might have done by chance (e.g., just guessing the mean \(\bar{y}\) for every \(y_i\)), then our model estimated in (2) generalizes to the population, and is not just a chance artifact of the first sample.
Step 4 can be expanded as a way of comparing multiple different methods.
The better one method predicts the \(y\) in sample 2 using coefficients derived from sample 1, the better that method has managed to extract the hidden patterns connecting \(y\) to \(X\) in the population at large.
Thus if we look at the fit between \(\hat{y}\) and \(y\) in the out-sample (sample 2), we can rank different methods for generating \(\hat{y}\) based on how well they have done.
This is a better approach than looking at the fit between \(\hat{y}\) and \(y\) using just the in-sample (sample 1): you can predict \(y\) quite well in some sample using the coefficients derived from that same sample, but when you try to extend the model to a new sample it can fail, because it was over-fitted to the first sample.
Why do we need out-of-sample testing when the whole point of our statistical tests for the significance of our \(\beta\) coefficients was to determine whether they were significant or not?
True:
For multiple regression, the statistics are well-developed, and we do have the statistical tests to answer this.
In fact, if we estimate \(\beta\) using a sample (as we usually do) and get a 95% CI for \(\beta\), then if we take another 100 complete samples, each of size \(n\), and estimate \(\beta\) for each one, we should find that approximately 95 of those \(\beta\) estimates fall within the 95% CI from our first sample.
The statistical standard error predicts exactly what we would get if we did run multiple out-of-sample tests.
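To see this concretely, here is a minimal simulation, not from the reading (the sample size, number of replications, and seed are arbitrary choices), showing that the Std. Error reported by lm() for a single sample approximates the spread of the slope estimates we would get across many repeated samples:
set.seed(10)
betahats <- replicate(1000, {
  x <- rnorm(50)
  y <- 3*x + rnorm(50)      # same data-generating process for each sample
  coef(lm(y ~ x))["x"]      # slope estimate from this sample
})
sd(betahats)   # compare to the "Std. Error" for x from summary(lm(y ~ x)) in any one sample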
But:
Many other methods are more complex than multiple regression, and we cannot derive the complete statistics, such as standard errors and p-values, for their coefficients.
We need some more theory-independent method for testing the validity of those models, and evaluating their relative accuracies.
So: Out-of-sample testing
Without knowing anything of the statistics or errors of a model or its parameters, we can still test its validity to the population as a whole by out-of-sample testing.
We would fit the model on sample 1, and then use the fitted coefficients (or other parameters) from sample 1 to predict \(\hat{y}\) in sample 2 using the new \(x\) values in sample 2.
This is a black-box procedure: we don’t have to know anything about the method that was used to fit the model in sample 1 or to predict \(\hat{y}\) in sample 2.
Let’s examine the concept of out-of-sample testing with a few simple regressions.
Consider the following simulation, which consists of 20 observations, a random x variable, and a y variable that we construct to be \(y = 3x + \epsilon\), where as usual \(\epsilon\) is random noise:
set.seed(2)                     # for reproducibility
x <- rnorm(20)                  # a random x variable
y <- 3*x + rnorm(20)            # y = 3x + random noise
dataf <- data.frame(y=y, x=x)
We divide that data into two halves, choosing half the rows at random for the in-sample, and the other rows for the out-sample.
insamplerows <- sample(1:20, 10)               # 10 random rows for the in-sample
outsamplerows <- setdiff(1:20, insamplerows)   # the other 10 rows for the out-sample
insample <- dataf[insamplerows, ]
outsample <- dataf[outsamplerows, ]
Note that since there’s no order to the observations, it doesn’t matter whether we take the first 10 and second 10, or every other observation, or some random subset; nor does it matter for our purposes here whether the two samples are the same size or different sizes.
Next we regress y on x using just the in-sample observations, and show the regression results, which correctly capture the model we constructed at the beginning (\(y = 3x + \epsilon\)):
lmout <- lm(y~x,data=insample)
summary(lmout)
Call:
lm(formula = y ~ x, data = insample)
Residuals:
Min 1Q Median 3Q Max
-2.39407 -0.99907 -0.08799 1.09946 1.85398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03883 0.54258 -0.072 0.944701
x 2.85798 0.49596 5.762 0.000423 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.553 on 8 degrees of freedom
Multiple R-squared: 0.8059, Adjusted R-squared: 0.7816
F-statistic: 33.21 on 1 and 8 DF, p-value: 0.0004229
Finally we predict \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) using the built-in predict function, and then we plot those predicted values against the true \(y\) values:
yhat_in <- predict(lmout)    # predicted y using the in-sample x values
plot(yhat_in, insample$y)    # predicted vs. true y, in-sample
Pretty accurate! The model, at least when fit on the in-sample and then tested on the in-sample, does pretty well.
How well do we do using the second, out-sample? We can use predict again, which takes the \(\hat{\beta}\) coefficients we estimated in lmout and applies them to the new data we specify with newdata=, here the other 10 observations:
yhat_out <- predict(lmout, newdata=outsample)   # in-sample coefficients, out-sample x values
plot(yhat_out, outsample$y)                     # predicted vs. true y, out-sample
Here too we do fairly well: the \(\hat{y}\) values predicted from the out-sample \(x\) values and the in-sample \(\hat{\beta}\) coefficients match the true out-sample \(y\) values fairly closely.
That is, our original regression captured something truly in the population (that is, the model we specified for all the data, \(y = 3x + \epsilon\)), not just a random characteristic of our in-sample data.
What does it look like to appear to do well in-sample, but do poorly out-of-sample?
This is something that comes up a lot when you have lots of independent variables.
set.seed(1)
y <- rnorm(20)     # y is pure random noise
x1 <- rnorm(20)    # eight x variables, each also pure noise, unrelated to y
x2 <- rnorm(20)
x3 <- rnorm(20)
x4 <- rnorm(20)
x5 <- rnorm(20)
x6 <- rnorm(20)
x7 <- rnorm(20)
x8 <- rnorm(20)
dataf <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, x6=x6, x7=x7, x8=x8)
Note: y is not a function of any of the x variables! The values of x should offer no power to predict y, since they have nothing to do with each other.
But what happens if we regress y on our x's? Let's start by using the same data both to fit the model and to predict y. We begin by dividing our data into an in-sample and an out-sample, choosing 50% of our rows at random for the in-sample and the remaining rows for the out-sample:
insamplerows <- sample(1:20,10,replace=F) # 10 random rows
outsamplerows <- setdiff(1:20,insamplerows) # the other 10 rows
insample <- dataf[insamplerows,]
outsample <- dataf[outsamplerows,]
Now let's fit the model on the in-sample and generate in-sample predictions:
lmout <- lm(y~x1+x2+x3+x4+x5+x6+x7+x8,data=insample)
yhat_in <- predict(lmout)
plot(yhat_in,insample$y)
We do a pretty good job matching our predicted \(\hat{y}\) to the real \(y\) values, even though there is no true connection between our x's and our \(y\): the fitted \(\beta\) values are capturing nothing but random noise.
This is the problem of over-fitting: we match the y values well with our x variables simply because we have so many of them.
To quantify the fit between the \(\hat{y}\) and the true \(y\), we can calculate the Mean Squared Error or MSE, \(\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2\), which we will compare to the out-of-sample prediction shortly.
mse_in <- sum((insample$y - yhat_in)^2)/length(yhat_in)   # in-sample MSE
Note that the regression itself is not fooled:
summary(lmout)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = insample)
Residuals:
15 18 1 8 5 17 10 7 13 11
-0.1843 0.3613 -0.6444 0.5706 0.4515 -0.6998 0.1359 -0.1145 -0.4124 0.5361
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.942e-01 6.646e-01 0.443 0.735
x1 -3.489e-01 9.835e-01 -0.355 0.783
x2 3.868e-01 1.080e+00 0.358 0.781
x3 5.331e-05 1.192e+00 0.000 1.000
x4 1.767e-01 1.787e+00 0.099 0.937
x5 -4.918e-01 1.152e+00 -0.427 0.743
x6 -1.705e-01 2.303e+00 -0.074 0.953
x7 1.649e-01 8.975e-01 0.184 0.884
x8 3.123e-01 6.540e-01 0.477 0.716
Residual standard error: 1.445 on 1 degrees of freedom
Multiple R-squared: 0.5767, Adjusted R-squared: -2.81
F-statistic: 0.1703 on 8 and 1 DF, p-value: 0.9584
Let’s interpret:
The \(R^2\) is high and suggests we are explaining a lot of the variance in Y. But the adjusted \(R^2\) suggests something is wrong: it is much lower, having penalized the model for having so many independent variables.
More definitively, the F-test's p-value and the individual p-values for all the coefficients are non-significant, and these are the real tests of statistical significance.
What do we conclude? Our high \(R^2\) and the apparent good fit in-sample may be due to over-fitting rather than a real statistical relationship between Y and the X variables.
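As an aside, if you want to pull these quantities out of the fitted model programmatically rather than reading them off the printed summary, they are stored in the object that summary() returns (shown here for the lmout we just fit):
s <- summary(lmout)
s$r.squared                     # the (misleadingly high) R-squared
s$adj.r.squared                 # the adjusted R-squared, penalized for the many x's
s$coefficients[, "Pr(>|t|)"]    # the individual coefficient p-values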
However, often we don’t have nice statistics with p-values for more complicated models.
So we instead test our fit via out-of-sample prediction. Let’s now apply our in-sample-derived \(\hat{\beta}\) coefficients to new, out-of-sample x variables and try to match the predicted \(\hat{y}\) to the out-of-sample y values:
yhat_out <- predict(lmout,newdata=outsample)
plot(yhat_out,outsample$y)
Now we can see the truth: our out-of-sample predictions are terrible.
All the apparent fit between \(\hat{y}\) and \(y\) in-sample was due to over-fitting, and the out-of-sample test reveals it.
It was not any universal property of our population that we were finding; we were just matching the random noise in y using a bunch of random x variables.
We can also calculate the MSE between prediction and truth for the out-sample:
mse_out <- sum((outsample$y - yhat_out)^2)/length(yhat_out)   # out-of-sample MSE
mse_in
## [1] 0.2088065
mse_out
## [1] 1.578532
Note how the first is much lower than the second: the apparent accuracy of the in-sample predictions was just due to over-fitting, while the true accuracy of the model is reflected in the much higher out-sample MSE.
Out-of-sample accuracy can serve as the benchmark to draw conclusions about a broader population.
Given some in-sample data, we can fit our model, and then we are given some new out-sample data to predict on.
Validating one’s own model using out-of-sample testing with one’s own data is known as cross-validation.
Steps:
1. Split the data at random into an in-sample and an out-sample.
2. Fit the model on the in-sample and measure its accuracy in the out-sample, for instance with the mean absolute error (MAE), \(\frac{1}{n} \sum_{i=1}^n |y_i-\hat{y}_i|\), or the mean squared error (MSE), \(\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2\).
3. Repeat this whole process a number of times, so that we don't depend on getting lucky with a single in-sample/out-sample division, and report the overall average accuracy (a short code sketch of this procedure follows the list).
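The following is a minimal sketch of that repeated random-split procedure, assuming the simulated dataf (with y and x1 through x8) from above; the number of repetitions and the 50/50 split are arbitrary choices, not prescribed here:
nreps <- 100
mse_out_reps <- numeric(nreps)
for (r in 1:nreps) {
  ins  <- sample(1:nrow(dataf), nrow(dataf)/2)              # random in-sample rows
  outs <- setdiff(1:nrow(dataf), ins)                       # the remaining rows
  fit  <- lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8, data=dataf[ins, ])
  yhat <- predict(fit, newdata=dataf[outs, ])
  mse_out_reps[r] <- mean((dataf$y[outs] - yhat)^2)         # out-sample MSE (use abs() for MAE instead)
}
mean(mse_out_reps)   # overall average out-of-sample accuracy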
A similar process, known as \(k\)-fold cross-validation:
1. Divide your data into \(k\) equal chunks.
2. Take all but one of those chunks as your in-sample and test your model on the remaining chunk.
3. Repeat this \(k\) times, leaving out a different chunk each time as your out-sample.
A common value of \(k\) is 5 or 10.
Another variant is to make \(k\) equal to \(n\) (leave-one-out cross-validation): each out-sample is just a single observation, and you fit the model each time on the \(n-1\) other observations. A brief code sketch of the \(k\)-fold procedure is given below.
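This sketch again assumes the simulated dataf from above; \(k = 5\) is an arbitrary choice, and setting k equal to nrow(dataf) gives the leave-one-out variant just described:
k <- 5
folds <- sample(rep(1:k, length.out=nrow(dataf)))   # randomly assign each row to one of k chunks
mse_fold <- numeric(k)
for (j in 1:k) {
  ins  <- dataf[folds != j, ]                       # all chunks except chunk j
  outs <- dataf[folds == j, ]                       # the held-out chunk
  fit  <- lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8, data=ins)
  yhat <- predict(fit, newdata=outs)
  mse_fold[j] <- mean((outs$y - yhat)^2)            # out-sample MSE for this fold
}
mean(mse_fold)                                      # average accuracy over the k folds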