I. Maximal Margin Classifiers
The maximal margin classifier, or the optimal separating hyperplane, is the separating hyperplane that is farthest from the training observations. This classifier depends directly on only a small subset of the training observations, called support vectors.
The maximal margin hyperplane is the solution to the optimization problem:

$$\underset{\beta_0, \beta_1, \ldots, \beta_p, M}{\text{maximize}} \quad M$$

$$\text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \quad \text{for all } i = 1, \ldots, n$$

- $M$ is the margin, the minimal orthogonal distance from the training observations to the hyperplane.
- The first condition guarantees a unique solution, in which the orthogonal distance from the $i$th observation to the hyperplane is given by the left-hand side of the second condition.
- The second condition guarantees that each observation is on the correct side of the hyperplane, at least a distance $M$ away.
We then classify each test observation based on which side of the hyperplane derived from the optimization problem above it lies on, i.e., the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*$.
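As a minimal sketch of this classification rule (using hypothetical coefficients, not ones fit by the actual optimization), we check the sign of $f(x^*)$ and can also compute the orthogonal distance to the hyperplane:

```python
import math

# Hypothetical coefficients of a separating hyperplane in two dimensions
# (illustrative values only, not the solution to any real optimization).
beta0, beta = -1.0, [2.0, -3.0]

def f(x):
    """Evaluate the hyperplane function f(x) = beta0 + beta . x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def classify(x):
    """Assign class +1 or -1 based on which side of the hyperplane x lies."""
    return 1 if f(x) > 0 else -1

def distance(x):
    """Orthogonal distance from x to the hyperplane: |f(x)| / ||beta||."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(f(x)) / norm

print(classify([2.0, 0.5]))           # f = -1 + 4 - 1.5 = 1.5 > 0, so +1
print(round(distance([2.0, 0.5]), 3))  # 0.416
```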
II. Support Vector Classifiers
If no separating hyperplane exists, then there is no maximal margin classifier. In the non-separable case, the generalization of the maximal margin classifier is called the support vector classifier.
The support vector classifier, or the soft margin classifier, provides greater robustness to individual observations, and better classification of most of the training observations than the maximal margin classifier.
The support vector classifier is the solution to the optimization problem:

$$\underset{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} \quad M$$

$$\text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i),$$

$$\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C$$

- Where $\epsilon_1, \ldots, \epsilon_n$ are the slack variables whose sum is less than or equal to the nonnegative tuning parameter $C$.
- If $\epsilon_i = 0$, then the $i$th observation is on the correct side of the margin.
- If $\epsilon_i > 0$, then the $i$th observation is on the wrong side of the margin.
- If $\epsilon_i > 1$, then the $i$th observation is on the wrong side of the hyperplane.
- $C$ is a “budget” for how much the margin can be violated. No more than $C$ observations can be on the wrong side of the hyperplane at once.
- Choose $C$ with cross-validation.
- A higher $C$ means the model is more tolerant of margin violations.
- If $C = 0$, then the solution is equivalent to that of the maximal margin classifier.
- A smaller $C$ corresponds with high variance and low bias; a larger $C$ corresponds with high bias and low variance.
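The slack variables can be made concrete with a small sketch. For a hypothetical hyperplane and margin $M$ (illustrative values, not the solution of the actual optimization), each $\epsilon_i = \max(0,\, 1 - y_i f(x_i)/M)$ follows from the constraint $y_i f(x_i) \ge M(1 - \epsilon_i)$, and their sum is checked against the budget $C$:

```python
# Hypothetical hyperplane coefficients and margin M (illustrative only).
beta0, beta, M = 0.0, [1.0, 1.0], 1.0

def f(x):
    """Evaluate the hyperplane function f(x) = beta0 + beta . x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def slack(x, y):
    """epsilon_i = max(0, 1 - y_i f(x_i) / M), from y_i f(x_i) >= M(1 - eps_i)."""
    return max(0.0, 1.0 - y * f(x) / M)

data = [([2.0, 1.0], 1),     # y*f = 3.0  -> eps = 0: correct side of the margin
        ([0.5, 0.3], 1),     # y*f = 0.8  -> 0 < eps < 1: inside the margin
        ([-0.5, -0.5], 1)]   # y*f = -1.0 -> eps = 2 > 1: wrong side of hyperplane

eps = [slack(x, y) for x, y in data]
C = 3.0
print(eps)            # approximately [0.0, 0.2, 2.0]
print(sum(eps) <= C)  # budget satisfied -> True
```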
The support vectors in this model (the only observations that affect the hyperplane) are the observations that lie on the margin or that violate the margin. Support vector classifiers are robust to observations far away from the hyperplane, unlike methods such as linear discriminant analysis.
III. Support Vector Machines
For non-linear decision boundaries, we can modify the condition in the support vector classifier to include non-linear functions of the predictors, enlarging the feature space.
The solution to the linear support vector classifier involves only the inner products of the observations, $\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}$. It can be shown that the linear support vector classifier can be represented by:

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle$$

- There are $\binom{n}{2}$ inner products between all pairs of training observations needed to estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$.
- $\mathcal{S}$ is the collection of indices of support points; since $\alpha_i$ is non-zero only for support vectors, the sum can be taken over $i \in \mathcal{S}$: $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle$.
We can generalize the support vector classifier with a kernel, $K(x_i, x_{i'})$, a function that quantifies the similarity of two observations, in place of the inner product (the linear kernel).

Support vector machines use the support vector classifier with a non-linear kernel:

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)$$
Examples of different kernels:
- Linear Kernel (inner product): $K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$
- Polynomial Kernel of degree $d$: $K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d$
- Radial Kernel (very local behavior): $K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$, with $\gamma > 0$
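The three kernels above can be sketched directly. The last lines also verify the kernel-trick idea for the degree-2 polynomial kernel with $p = 2$ predictors: $(1 + \langle x, z \rangle)^2$ equals an ordinary inner product after an explicit feature expansion $\phi$ (a standard identity, written out here as an assumption-free check):

```python
import math

def linear_kernel(x, z):
    """K(x, z) = sum_j x_j z_j (the plain inner product)."""
    return sum(xj * zj for xj, zj in zip(x, z))

def poly_kernel(x, z, d=2):
    """Polynomial kernel of degree d: K(x, z) = (1 + <x, z>)^d."""
    return (1 + linear_kernel(x, z)) ** d

def radial_kernel(x, z, gamma=1.0):
    """Radial kernel: K(x, z) = exp(-gamma * ||x - z||^2); very local behavior."""
    sq_dist = sum((xj - zj) ** 2 for xj, zj in zip(x, z))
    return math.exp(-gamma * sq_dist)

def phi(x):
    """Explicit degree-2 feature map for p = 2, so <phi(x), phi(z)> = (1 + <x, z>)^2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(x, z))              # (1 + 1)^2 = 4.0
print(linear_kernel(phi(x), phi(z)))  # same value, via the explicit feature map
```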
SVMs are computationally nice, since they only need to calculate $K(x_i, x_{i'})$ for the $\binom{n}{2}$ distinct pairs $i, i'$. This can be done without explicitly working in an enlarged feature space. The feature space of the radial kernel is implicit and infinite-dimensional, so a kernel is the only way to work with it.
An ROC curve, which plots the true positive rate against the false positive rate, can help graphically compare the performance of SVMs.
Extension to more than two classes
- One-versus-one classification: construct $\binom{K}{2}$ SVMs, each of which compares a pair of classes.
  - Classify a test observation with each of the $\binom{K}{2}$ classifiers, tally the number of times it is assigned to each class, and assign it to the class to which it was most frequently assigned.
- One-versus-all classification: fit $K$ SVMs. For each fit, compare one of the $K$ classes to the remaining $K - 1$ classes.
  - Assign a test observation $x^*$ to the class $k$ for which $\beta_{0k} + \beta_{1k} x_1^* + \cdots + \beta_{pk} x_p^*$ is largest.
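The one-versus-one voting step can be sketched as follows. The pairwise SVMs are stood in for by a hypothetical lookup table (a stand-in for real fitted classifiers, purely for illustration); the tally-and-assign logic is the part being demonstrated:

```python
from collections import Counter
from itertools import combinations

classes = ["a", "b", "c"]

def pairwise_predict(pair, x):
    """Hypothetical pairwise classifier: a fixed lookup table stands in for
    a fitted SVM that compares the two classes in `pair`."""
    decisions = {("a", "b"): "b", ("a", "c"): "c", ("b", "c"): "b"}
    return decisions[pair]

def one_vs_one(x):
    """Run all K-choose-2 pairwise classifiers, tally the votes, and
    assign x to the most frequently predicted class."""
    votes = Counter(pairwise_predict(pair, x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

print(one_vs_one([0.0, 0.0]))  # "b" wins 2 of the 3 pairwise votes
```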
Note: Support vector regression also exists.
Relationship to logistic regression
The “loss + penalty” function of SVMs, shown below, is very similar to the loss function of logistic regression:

$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}} \left\{ \sum_{i=1}^{n} \max\left[0,\, 1 - y_i f(x_i)\right] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

$\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large, the coefficients are small, more violations of the margin are tolerated, and the fit has lower variance and higher bias. This equation takes the “Loss + Penalty” form that many other models take, and the loss function of an SVM, $\max[0,\, 1 - y_i f(x_i)]$, is known as “hinge loss.”
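A minimal sketch of the two loss functions makes the comparison concrete: hinge loss is exactly zero once an observation is on the correct side of the margin ($y_i f(x_i) \ge 1$), while the logistic loss decays toward zero but never reaches it:

```python
import math

def hinge_loss(y, fx):
    """SVM hinge loss: max(0, 1 - y * f(x)); exactly zero beyond the margin."""
    return max(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):
    """Logistic regression loss: log(1 + exp(-y * f(x))); never exactly zero."""
    return math.log(1.0 + math.exp(-y * fx))

# Compare the losses for a y = +1 observation at several values of f(x).
for fx in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    print(fx, hinge_loss(1, fx), round(logistic_loss(1, fx), 4))
```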
SVMs perform better when classes are well-separated, while logistic regression performs better with overlapping classes.
A graphical comparison of SVM hinge loss vs. logistic regression loss shows that the two loss functions are quite similar.