# Support Vector Machines

Support vector machines (SVMs) are often considered one of the best 'out of the box' classifiers. The simple maximal margin classifier can be generalized to the support vector classifier, which can be further generalized to the support vector machine.

May 2, 2020 - 13 minute read -

### I. Maximal Margin Classifiers

The maximal margin classifier, or the optimal separating hyperplane, is the separating hyperplane1 that is farthest away from the training observations. This classifier depends directly on only a small subset of the training observations, called support vectors.

The maximal margin hyperplane is the solution to the optimization problem:

Where:

• $M$ is the margin, the minimal orthogonal distance from the training observations to the hyperplane
• The first condition guarantees a unique solution where the orthogonal distance from the $i$th observation to the hyperplane fulfills the second condition.
• The second condition guarantees that each observation is on the correct side of the hyperplane.

We then classify each observation based on where it lies in relation to the hyperplane derived from $\beta_0, \beta_1, ..., \beta_p$ from the optimization problem above.

### II. Support Vector Classifiers

If no separating hyperplane exists, then there is no maximal margin classifier. In the non-separable case, the generalization of the maximal margin classifier is called the support vector classifier.

The support vector classifier, or the soft margin classifier, provides greater robustness to individual observations, and better classification of most of the training observations than the maximal margin classifier.

Where:

• Where $\epsilon_i$ are the slack variables whose sum is less than or equal to the tuning parameter $C$.
• If $\epsilon_i = 0$, then the $i$th observation is on the correct side of the margin
• If $\epsilon_i > 0$, then the $i$th observation is on the wrong side of the margin
• If $\epsilon_i > 1$, then the $i$th observation is on the wrong side of the hyperplane
• $C$ is a “budget” for how much the margin can be violated. No more than $C$ observations can violate the hyperplane at once.
• Choose with cross validation.
• A higher $C$ means the model is more tolerant of margin violations.
• If $C = 0$, then the solution is equivalent to that of the maximal margin classifier.
• A smaller $C$ corresponds with high variance and low bias; a larger $C$ corresponds with high bias and low variance.

The support vectors in this model (the only observations that affect the hyperplane) are the observations that are on the margin or that violate the margin. Support vector classifiers are robust to observations far away from the hypersplane, unlike methods such as linear discriminant analysis.

### III. Support Vector Machines

In non-linear decision boundaries, we can modify the $y_i \bigl( f(x) \bigr) \geq M (1 - \epsilon_i)$ condition in the support vector classifier to include non-linear functions.

The solution to the linear support vector classifier involves only the inner products of the observations, $\langle x, x_i \rangle = \sum_{j = 1}^p x_{ij} x_{i'j}$. It can be shown that the linear support vector classifier can be represented by:

Where:

• There are $n \choose 2$ inner products between all pairs of training observations needed to estimate the parameters $\alpha_1, ..., \alpha_n, \beta_0$.
• $S$ is the collection of indices of support points, since $\alpha_i$ is only non-zero for support vectors.

We can generalize the support vector classifier with a kernel, $K(x_i, x_{i'})$, a function that quantifies the similarity of two observations, instead of the inner product (linear kernel).

Support vector machines use the support vector classifier with a non-linear kernel:

Examples of different kernels:

• Linear Kernel (inner product):
• Polynomial Kernel of degree d
• Radial Kernel (very local behavior)

SVMs are computational nice, since they only need to calculate $K(x_i, x_{i'})$ for $n \choose 2$ distinct pairs $i, i'$. This can be done without explicitly working in an enlarged feature space. The radial kernel feature space is implicit and infinite-dimensional, so we need a kernel to use it.

An ROC curve of the false positive rate against the true positive rate can help graphically compare the performance of SVMs.

Extension to more than two classes

One-versus-one classification

• Constructs $k \choose 2$ SVMs, each of which compares a pair of classes.
• Classify a test observation into each of the $k \choose 2$ pairs, tally the number of times it’s classified into each of those $K$ classes, and assign it to the class for which it was the most frequently assigned.

One-versus-all classification

• Fit $K$ SVMs. For each fit, compare one of the $K$ classes to the other $K - 1$ classes.
• Assign observations to the class for which $\beta_{0k} + \beta_{1k}x_1^* + ... + \beta_{pk} x_p^*$ is largest.

Note: Support vector regression also exists.

Relationship to logistic regression

The loss + penalty function of SVMs, shown below, is very similar to the loss function of logistic regression.

$\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large, then the coefficients are small and more violations to the margin are tolerated for lower variance and higher bias. This equation takes the “Loss + Penalty” form that many other models take, and the loss function of an SVM is known as “hinge loss.”

SVMs perform better when classes are well-separated, while logistic regression performs better with overlapping classes.

A graphical comparison of SVM loss vs. logistic regression loss is shown below.2 ### IV. Applications in R

1. A hyperplane in $p$-dimensional space is a flat affine subspace of dimension $p - 1$, and is defined by $\beta_0 + \beta_1 X_1 + ... + \beta_p X_p = 0$

2. Source: James et. al, Intro to Statistical Learning 7th edition, pg. 358