Overview
This lesson introduces Support Vector Machines.
Objectives
After completing this module, students should be able to:
Explain how a support vector machine separates two classes using a maximally separating hyperplane.
Fit a linear-kernel SVM in R with the e1071 package and tune the cost parameter via cross-validation.
Evaluate a classifier out-of-sample using a train/test split and a confusion matrix.
Fit a radial-kernel SVM and compare its predictive accuracy against a logistic regression baseline.
Reading
Recommended: James et al., An Introduction to Statistical Learning, Sections 9.1-9.2
As with logistic regression, we need slightly different methods for predicting binary variables when we have lots of independent variables. We could actually use our existing logit methods with a shrinkage aspect as per our previous lesson, but the machine learning community has already developed a large number of methods designed specifically for categorical prediction. This turns out to be an important class of problem – is a traveller a terrorist or not; is a picture of a cat or a dog; will a customer like this movie or not; etc – and thus decades of work have been dedicated to building better models for making binary decisions.
One of the fastest and most effective of these methods is the support vector machine (SVM), which was originally designed for simple binary classification with good out-of-sample predictive results, though it has since been extended to multiple categories and many other domains.
The fundamental idea behind a SVM can be illustrated using our clustering example from the previous module.
Imagine the data plotted above comprise three variables: the x1 value (horizontal axis), the x2 value (vertical axis), and the y value, which is a 1 or a 0. Observations where y=1 are plotted in blue, and observations where y=0 are plotted in red. The question is: if we were given a new data point, would we classify it as a 0 or a 1? How would we decide? Well, the obvious answer is that if that point was closer to the centroid of the blue cluster, we would guess that point was a 1, and if it was closer to the centroid of the red cluster, we would guess it to be a 0. This is equivalent to saying that if it falls on one side of the line between the two clusters, it’s a 1, and vice versa if it’s on the other side:
An SVM is just a method for finding this best line given the X variables and a binary Y variable. As we can see from the picture, this is equivalent to defining two clusters, but it turns out that we can find a better dividing line (“better” in the sense of better out-of-sample accuracy) than just by calculating the centroid of the cluster defined by Y=1 and the centroid of the cluster defined by Y=0 and using the midway line as our cutoff.
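To make this centroid idea concrete, here is a minimal R sketch of nearest-centroid classification on some made-up two-cluster data (the simulated values and object names here are purely illustrative, not the data plotted in this lesson):
set.seed(1)
toy <- data.frame(x1 = c(rnorm(20, 0), rnorm(20, 3)),
                  x2 = c(rnorm(20, 0), rnorm(20, 3)),
                  y  = c(rep(0, 20), rep(1, 20)))
centroid0 <- colMeans(toy[toy$y == 0, c("x1","x2")])   # centroid of the y=0 cluster
centroid1 <- colMeans(toy[toy$y == 1, c("x1","x2")])   # centroid of the y=1 cluster
newpoint <- c(x1 = 1.2, x2 = 1.8)                      # a hypothetical new observation
dist0 <- sqrt(sum((newpoint - centroid0)^2))           # distance to each centroid
dist1 <- sqrt(sum((newpoint - centroid1)^2))
ifelse(dist1 < dist0, 1, 0)                            # classify as whichever centroid is closer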
In more than 2 dimensions (ie, with more than two independent variables), the dividing line is actually a “hyperplane”. If you visualize points in three dimensions clustered into two groups, you can see that the divider separating points closer to cluster 1 vs those closer to cluster 2 is a flat 2-dimensional plane, rather than the 1-dimensional line from our previous example. And this continues: if the data are \(k\) dimensional, the separating “hyperplane” is \(k-1\) dimensional. This may sound complicated, but it’s actually algebraically identical to what we were doing with multiple regression. The equation for Y in a multiple regression is actually a hyperplane:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\]
That equation defines a three-dimensional hyperplane in the four-dimensional space made up of Y and the three X variables. You can’t visualize it, but that doesn’t matter as long as you have the core idea. If our binary data are in \(k\) dimensions, therefore, the plane separating them is just:
\[\beta_0 + \beta_1 X_1 + ... + \beta_k X_k = 0\]
although of course these \(\beta\) coefficients will not be the same as the ones you get regressing Y on the X variables. The job of the SVM is to find the best \(\beta\) coefficients that define the plane that best separates the Y=1 points from the Y=0 points.
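Once those coefficients are chosen, classification just amounts to checking which side of the plane a new observation falls on:
\[\hat{y} = 1 \text{ if } \beta_0 + \beta_1 X_1 + ... + \beta_k X_k > 0, \qquad \hat{y} = 0 \text{ if } \beta_0 + \beta_1 X_1 + ... + \beta_k X_k < 0\]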
The SVM differs from the cluster-centroid approach described above in how it defines “best”. The SVM defines the best plane as the plane that is farthest from our data. That is, there are a lot (an infinite number, in fact) of planes that will separate our two clusters:
The SVM just chooses the plane that is farthest from the data: that is, the one with the largest distance to the closest data point. It turns out this method works remarkably well for out-of-sample prediction, much better than just using the plane defined by the midway point between the two centroids. However, as you can imagine, in many cases the two sets of data are not nearly so well separated – in fact, if you imagine those two clusters closer together, there will often not be any line that cleanly separates the two groups. In the real world this is almost always the case, and we just have to choose the plane that does the best possible job of separating the two classes of points.
Like the shrinkage approaches, the SVM has a parameter you need to tune to get the best results. That parameter (often denoted \(C\)) controls how tolerant you are of data points being on the wrong side of the classification line. In most cases, getting some points wrong is inevitable, but there is still a tradeoff: a low tolerance level means you try to find the plane with the fewest mis-classifications; a higher tolerance level means you find a plane with perhaps more mis-classifications, but which may do a better overall job of keeping the bulk of the 1’s on one side and the bulk of the 0’s on the other.
The details of how we find this plane given a set tolerance level go beyond what we will cover here, but are well described in An Introduction to Statistical Learning, sections 9.1 and 9.2. The important practical advantage of the SVM is that not only is it quite accurate, but there are also fast, exact methods for finding this best cutting plane, so it is very quick to estimate.
For our purposes, we will content ourselves with interpreting and judging the results of the SVM. Let’s begin with our simple dataset, but make it a bit harder, and thus somewhat more interesting. Here’s the revised data:
As you can see, now there is no easy way to draw the best line separating the two groups, and indeed any line will have to make a certain number of errors (points on the wrong side).
We run the SVM using the svm function in the inexplicably-named e1071 package. First we have to prep the data to get it into a format svm likes, which in this case is a data frame with y as a factor. Then we run the svm, specifying the dependent and independent variables (y~. just means y as a function of all the other variables; this is a general shortcut you can use for lm and many other models), the cost \(C\), and the “kernel,” which we return to shortly.
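The simdat object used below was simulated earlier in the lesson; as a rough illustration of the kind of prep svm wants (the particular simulated values here are only a stand-in, not the lesson’s actual data), it might look something like:
set.seed(1)
x <- matrix(rnorm(40*2), ncol=2)           # two independent variables
x[1:20,] <- x[1:20,] + 2                   # shift one group so the two classes overlap somewhat
y <- c(rep(1,20), rep(0,20))               # binary outcome
simdat <- data.frame(x=x, y=as.factor(y))  # svm() wants a data frame with y as a factor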
# install.packages("e1071")
library(e1071)
svmfit <- svm(y~., data=simdat, cost=10, kernel="linear")
We can plot the output:
plot(svmfit,simdat)
This shows the cutting line and which regions are assigned to Y=1 or Y=0. It also shows the data points and the “support vectors” (the x’s), which are those data points close to the dividing line that affect its position (the ones far from the line don’t affect its position, because the algorithm only cares about maximizing the distance to the closest points). The red x’s are those that have been mis-classified (either 1’s in the 0 zone, or 0’s in the 1 zone).
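If you want to look at the support vectors directly rather than just in the plot, the fitted object stores them; for example:
svmfit$index          # which rows of simdat ended up as support vectors
length(svmfit$index)  # how many support vectors there are
summary(svmfit)       # also reports the number of support vectors in each class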
So how do we choose our cost C? Once again, we want the best out-of-sample accuracy, to make sure we aren’t fitting the model to arbitrary aspects of our sample. So as we did with cv.glmnet, we want to test a variety of cost levels using cross-validation: divide the sample into \(k\) folds, fit the model at a bunch of different cost levels on all but one fold, and test on the held-out fold; then choose the cost level that produces the most accurate fit out-of-sample. We do this using the tune() function that’s part of the e1071 package, which works very much like the cv.glmnet function and by default runs 10-fold cross-validation internally.
costvalues <- 10^seq(-3,2,1)
tuned.svm <- tune(svm, y~., data=simdat, ranges=list(cost=costvalues), kernel="linear")
summary(tuned.svm)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.075
- Detailed performance results:
cost error dispersion
1 1e-03 0.580 0.03496029
2 1e-02 0.080 0.05868939
3 1e-01 0.075 0.06346478
4 1e+00 0.080 0.05868939
5 1e+01 0.080 0.05868939
6 1e+02 0.080 0.05868939
Note that at the best cost setting (0.1) we get only 7.5% of the points incorrectly classified, which is not surprising given that, even in this harder case, the two groups are fairly well separated in space.
Let us again use the Hitters baseball dataset as a test example, but this time, let’s turn Y into a binary variable, equal to 1 if the player earns more than $500K a year, and 0 if not.
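The code below assumes the Hitters data and the matrix of predictors x were already set up, as in the previous lesson; the exact prep is not shown here, but it presumably looked something like:
library(ISLR)                               # the Hitters data come from the ISLR package
Hitters <- na.omit(Hitters)                 # drop players with missing salary values
x <- model.matrix(Salary~., Hitters)[,-1]   # predictor matrix, dropping the intercept column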
y <- ifelse(Hitters$Salary > 500,1,0)
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4259 1.0000 1.0000
Note that only about 43% of players earn that much, at least in the era of this dataset. Thus if we were in the game of predicting \(y\) for any given player and we blindly guessed that no, they do not earn more than $500K, we would get almost 58% of the players correct. So whatever predictive success the SVM (or any other method) achieves has to be judged relative to that baseline, which is the equivalent of our baseline guess of \(\bar{y}\) in the continuous case.
dat <- data.frame(x=x, y=as.factor(y))
tuned.svm <- tune(svm,y~., data=dat, ranges=list(cost=costvalues), kernel="linear")
summary(tuned.svm)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
100
- best performance: 0.1747863
- Detailed performance results:
cost error dispersion
1 1e-03 0.2891738 0.06862293
2 1e-02 0.2243590 0.10759623
3 1e-01 0.2049858 0.10460461
4 1e+00 0.1974359 0.09438616
5 1e+01 0.1821937 0.09242430
6 1e+02 0.1747863 0.08492771
So here our best performance is about 17.5% wrong, or 82.5% correct. This is much better than we would do just by guessing that everyone is below $500K – a definite improvement.
Let’s do a proper test: as in the previous lesson, we divide up our data into two halves, fit our model on one half, and then test it on the other.
set.seed(1)
trainingrows <- sample(1:nrow(x),nrow(x)/2)
testrows <- setdiff(1:nrow(x),trainingrows)
traindat <- dat[trainingrows,]
testdat <- dat[testrows,]
Now we fit the model on the training data:
tuned.svm <- tune(svm,y~., data=traindat, ranges=list(cost=costvalues), kernel="linear")
And then predict y using the test data:
yhat <- predict(tuned.svm$best.model,newdata=testdat)
table(predicted=yhat,truth=testdat$y)
truth
predicted 0 1
0 59 16
1 15 42
Note how the cross-validated tune output contains within it the best svm model, under $best.model – we could also have just used svmfit, the direct output of svm with a pre-set cost, but that would generally not do as well as using the internal cross-validation to find the best cost level.
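Besides $best.model, the tune() output also stores the chosen cost and its cross-validated error directly, which is handy for reporting:
tuned.svm$best.parameters   # the cost value chosen by cross-validation
tuned.svm$best.performance  # its cross-validated error rate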
The table above is called the confusion matrix: it shows the 1’s that were correctly predicted to be 1’s, the 1’s that were incorrectly predicted to be 0’s, and so forth. We can get the overall percentage correct with:
sum(yhat==testdat$y)/length(testdat$y)
[1] 0.7651515
This is not as good as we did purely in-sample, but that’s why out-of-sample testing is so rigorous: if we over-fit the in-sample data, we are called out on it when we go out of sample. The cross-validation inside tune() might have led us to be a little over-optimistic about our model, but about 77% is still much better than we would have gotten just by guessing 0 for everyone:
sum(testdat$y==0)/length(testdat$y)
[1] 0.5606061
As you might have noticed, one of the options in svm was the “kernel.” The kernel is actually one of the major secrets of the SVM’s success: it acts, effectively, to warp the space that our data exist in. As you might guess, sometimes a flat line or hyperplane is not the best way to divide two categories. Imagine that one cluster is shaped like pacman, and the other like the pellet that pacman is about to eat: in that case, a straight line does a much worse job of separating the two clusters than a curved boundary between pacman and the pellet would. The kernel bends the space before the SVM is fit, so that the hyperplane effectively becomes a curve, or many other different sorts of shapes.
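One of the most widely used kernels, and the one we try next, is the “radial” kernel, which scores the similarity between two points \(x\) and \(x'\) as
\[K(x, x') = \exp(-\gamma \|x - x'\|^2)\]
so that nearby points count as very similar and distant points as essentially unrelated; the parameter \(\gamma\) controls how quickly that similarity dies off with distance.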
This radial kernel is one of the most powerful of the kernels, precisely because of the way it curves the space. Let’s test our model using the radial kernel:
tuned.svm <- tune(svm,y~., data=traindat, ranges=list(cost=costvalues), kernel="radial")
yhat <- predict(tuned.svm$best.model,newdata=testdat)
sum(yhat==testdat$y)/length(testdat$y)
[1] 0.8257576
Much better! The SVM with the radial kernel (the default setting in R) is a very powerful binary modeling method.
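One refinement worth knowing about: with the radial kernel there is a second tuning parameter, \(\gamma\) (gamma in e1071), which controls how sharply the kernel bends the space, and tune() can cross-validate over cost and gamma at the same time. A sketch (the gamma grid here is just an illustrative choice, not a value from this lesson):
gammavalues <- 10^seq(-3,1,1)      # illustrative grid of gamma values
tuned.svm.radial <- tune(svm, y~., data=traindat,
                         ranges=list(cost=costvalues, gamma=gammavalues),
                         kernel="radial")
tuned.svm.radial$best.parameters   # best cost/gamma combination found by cross-validation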
How about we compare with a logistic regression, just as a baseline? Remember, the logistic regression is similar to a linear regression, just with a binary dependent variable and the logistic function wrapped around \(\beta_0 + \beta_1x_1 +...\) to make sure all predicted y values are between 0 and 1.
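In other words, the logit models the probability that \(y = 1\) as
\[P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ...)}}\]
which is always between 0 and 1 no matter what the \(X\) values are.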
logit.out <- glm(y~.,data=traindat,family = "binomial")
yhat.con <- predict(logit.out,newdata=testdat,type="response")
yhat <- round(yhat.con)
# Note how we use the "response" option to get the predicted probabilities,
# and then we must round, so that any predicted probability > 0.5 becomes a 1, and vice versa for 0.
sum(yhat==testdat$y)/length(testdat$y)
[1] 0.780303
Not terrible – much better than just guessing 0, and about as good as the linear kernel SVM. But the radial kernel SVM is still much better.
One drawback of the SVM, though, particularly with the radial kernel, is that it is much harder to assess which variables are playing the largest role in allowing us to predict the dependent variable. While the logit may predict worse, we can at least interpret its coefficients directly to see which variables are playing what role. There are of course ways of doing this with the SVM as well (such as repeating the process with various variables omitted, as sketched below, or looking more closely at the support vectors), but they are trickier. The tradeoff is a common one: the logit predicts less well but is more interpretable; the SVM is less interpretable, but predicts very well indeed.
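As a hedged sketch of the omitted-variable approach mentioned above: refit the tuned SVM without one predictor and see how much out-of-sample accuracy drops (the column name x.Hits below is just an illustrative choice of variable to drop):
svm.full <- tune(svm, y~., data=traindat, ranges=list(cost=costvalues), kernel="radial")
acc.full <- mean(predict(svm.full$best.model, newdata=testdat) == testdat$y)
dropped  <- setdiff(names(traindat), "x.Hits")    # keep every column except the one we drop
svm.drop <- tune(svm, y~., data=traindat[, dropped],
                 ranges=list(cost=costvalues), kernel="radial")
acc.drop <- mean(predict(svm.drop$best.model, newdata=testdat) == testdat$y)
acc.full - acc.drop   # a bigger drop in accuracy suggests a more important variable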
And like the logit, the SVM can also be readily extended to dependent variables with more than two categories – but delving into that is beyond what we can cover here.