Linear Model Selection and Regularization

An alternative fitting method to least squares, such as subset selection, shrinkage (ridge regression, lasso), and dimension reduction techniques (principle components analysis, partial least squares) can help prediction accuracy and model interpretability.

April 27, 2020 - 20 minute read -

I. Overview

Why use an alternative fitting model to least squares?

• Prediction Accuracy
• If $n$ is not much larger than $p$, then least squares can have a lot of variability
• If $% $, then there is no longer a unique least squares coefficient estimate, and we must constrain or shrink the coefficients
• Model Interpretability
• Remove irrelevant variables with feature/variable selection

Alternatives to Least Squares

1. Subset selection: select a subset of predictors before fitting least squares
2. Shrinkage: fit model with all predictors, with coefficients shrunk toward 0 relative to the least squares estimate
3. Dimension reduction: projecting predictors onto an $M$-dimensional sbspace where $% $ by computing projections/linear combinations of the variables. These $M$ projections are then used as predictors for least squares.

II. Subset Selection

Best Subset Selection

1. Let $M_0$ denote the null model.
2. For $k = 1, 2, ..., p$, fit all models with $k$ predictors, and choose the best one (smallest RSS/largest $R^2$), $M_k$.
3. Select the best model from $M_1, ..., M_k$ using a model selection metric such as adjusted $R^2$, AIC, BIC, or cross-validated prediction error.

Note: this is very computationally expensive, and infeasible for $p > 40$, since there are $2^p$ total models to check. In addition this method has high variance which can lead to overfitting.

Forward Stepwise Selection

1. Let $M_0$ denote the null model.
2. For $k = 0, 2, ..., p - 1$, consider all $p - k$ models that add one additional predictor to $M_k$, then choose the best one, $M_{k+1}$.
3. Select the best model from $M_0, ..., M_p$ using a model selection metric.

Note: this is much more computationally feasible than best subset selection with $1 + \frac{p(p+1)}{2}$ total models to check, but it’s not guaranteed to yield the “best” model for the training data. When $p>n$, this is the only feasible subset selection method.

Backward Stepwise Selection

1. Let $M_0$ denote the null model.
2. For $k = p, p-1, ..., 1$, consider all $k$ models that contain all but 1 of the predictors in $M_k$, a total of $k-1$ predictors, then choose the best one, $M_{k+1}$.
3. Select the best model from $M_0, ..., M_p$ using a model selection metric.

Note: Like forward selection, this only searches through $1 + \frac{p(p+1)}{2}$ models, and is not guaranteed to yield the best model. It cannot be used with $p>n$. Hybrid approaches (forward-backward) may be used as well.

Model selection statistics

We can choose the optimal model either by estimate the test error indirectly by adjusting the training error to account for overfitting bias, or directly by using a validation set or cross-validation.

• $C_p = \frac{1}{n} (\text{RSS} + 2 d \hat \sigma^2)$ ;
• Penalizes the model based on the number of predictors, $d$. Lower is better.
• If $\hat \sigma^2$ is an unbiased estimator of $\sigma^2$, then $C_p$ is an unbiased estimator of the test MSE.
• $\text{AIC} = \frac{1}{n \hat \sigma^2} (\text{RSS} + 2 d \hat \sigma^2)$ ;
• Use this for models fit by maximum likelihood.
• With gaussian errors, maximum likelihood = least squares.
• $\text{BIC} = \frac{1}{n \hat \sigma^2} (\text{RSS} + \log {n} d \hat \sigma^2)$ ;
• Use this for least squares model with $d$ predictors.
• Has a heavier penalty for models with many variables.
• $\text{Adj. } R^2 = 1 - \frac{\text{RSS} / (n - d - 1)} {TSS / (n-1)}$ ;
• Larger is better.

Note: These formulas are all for linear least squares, and can be generalized for additional model types.

Cross Validation or validation set error:

• Makes fewer assumptions about the underlying model, and can be used when the number of predictors is unknown, or $\hat \sigma^2$ cannot be estimated.
• One Standard Error Rule: choose the smallest model for which the estimated test MSE is within one standard error of the lowest point on the curve.

III. Shrinkage

Shrinking coefficient estimates can significantly reduce model variance.

Ridge Regression

While least squares minimizes RSS, ridge regression picks coefficient estimates $\hat \beta^R$ to minimize:

Where:

• $\text{RSS} = \sum_{i = 1}^n \Bigl( _i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \Bigr) ^ 2$ ;
• $\lambda \geq 0$ is a tuning parameter
• $\lambda \sum_j \beta_j^2$ is the shrinkage penalty, and grows as $\lambda \to \infty$

It is best to perform ridge regression after standardizing the predictors (note the denomenator is the standard deviation of the $j$th predictor):

Improvements over least squares:

• Bias-variance tradeoff: as $\lambda$ increases, variance decreases and bias increases.
• In cases where relationships between predictors and response is approximate linear, then least squares will have low bias but may have high variance, in particular when the number of variables is almost as big, or bigger, than the sample size.
• Ridge regression works best when the least squares estimate has high variance.
• Computationally superior to best subset selection, since solving for all $\lambda$ is almost computationally equivalent to fitting least squares.

Drawback:

• Ridge regression will include all $p$ predictors in the final model, which presents challenges for interpretability.

Lasso

The lasso coefficients $\hat \beta_{\lambda}^L$ minimize the equation:

Where $\text{RSS } = \sum_{i = 1}^n \Bigl( _i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \Bigr) ^ 2$

Notes:

• Bias-variance tradeoff: as $\lambda$ increases, variance decreases and bias increases.
• Lasso uses an $l_1$ penalty instead of the $l_2$ penalty of ridge regression.
• The $l_1$ norm of coefficient vector $\beta$ is $\lVert \beta \rVert _1 = \sum \vert \beta_j \vert$.
• Performs variable selection by forcing some coefficients to 0, and yields sparse models.

Comparison of Ridge Regression and Lasso

Ridge Regression:

• $\min_{\beta} \text{\{RSS\}} \text{ subject to } \sum_{j = 1}^p \beta_j^2 \leq s$ ;
• uses $l_2$ penalty
• computationally feasible alternative to best subset selection
• outperforms lasso when all variables are related to the response, or when there are many predictors of relatively equal size.

Lasso:

• $\min_{\beta} \text{\{RSS\}} \text{ subject to } \sum_{j = 1}^p \vert \beta_j \vert \leq s$ ;
• uses $l_1$ penalty
• computationally feasible alternative to best subset selection
• outperforms ridge when only a few predictors have substantial coefficients, and the rest are very small or 0. It also has advantages in that it performs variable selection and is easier to interpret.

The tuning parameter $\lambda$ should be selected with cross validation by selecting a grid of $\lambda$ values, computing the cross-validation error for each $\lambda$, and selecting $\lambda$ for which cv error is smallest.

The classic image below1 illustrates the difference between ridge regression and lasso in two dimensions. Points along each red ellipse have equal RSS, and the blue areas represent the $l_1$ and $l_2$ constraints. The ridge and lasso solutions are where the red ellipses touch the blue areas.

Further intuition for a simple case is given below2 – we see that ridge regression shrinks every dimension of data by the same proportion, while lasso shrinks all coefficients toward 0 by a similar amount, and sufficiently small coefficients are shrunk to 0.

Extra Theory: a bayesian interpretation for ridge regression and lasso

Suppose that:

• a coefficient vector $\beta$ has a prior distribution $p(\beta)$, where $\beta = (\beta_0, ..., \beta_p)$
• the likelihood of the data can be written as $f(Y \vert X, \beta)$, where $X = (X_1, ..., X_p)$

Then, multipling the prior distribution by the likelihood gives the posterior distribution up to a proportionality constant:

Assume:

• $Y = \beta_0 + X_1 \beta_1 + ... + X_p \beta_p + \epsilon$ ; where $\epsilon \sim \text{iid} N(\mu, \sigma)$
• $p(\beta) = \Pi_{j = 1}^P g(\beta_j)$ for some density function $g$

Then,

• If $g$ is a gaussian distribution with mean 0 and standard deviation that is a function of $\lambda$, then the posterior mode (most likely value for $\beta$ given the data) for $\beta$ is given by the ridge regression solution.
• If $g$ is a double-exponential (Laplace) distribution with mean 0 and scale that is a function of $\lambda$, then the posterior mode for $\beta$ is given by the lasso solution.
• However, note that the lasso solution is not the posterior mean, and the posterior mean doesn’t yield a sparse coefficient vector.

A visualization of the gaussian and double-exponential distributions:3

IV. Dimension Reduction Techniques

These techniques transform the predictors, before using least squares to fit a model with the transformed predictors. When the number of predictors is large relative to the sample size, then significantly reducing the number of dimensions can significantly reduce variance. In more formal terms…

Let $Z_1, ..., Z_m$ represent $% $ linear combinations of the original $p$ predictors:

Then using least squares, fit the linear regression model: $y_i = \theta_0 + \sum_{m = 1}^M \theta_m z_{im} + \epsilon_i$

• The regression coefficients are $\theta_0, ..., \theta_m$
• Dimensions are reduced from $p+1$ to $M+1$
• Coefficients are restrained by $\beta_j = \sum_{i = 1}^M \theta \phi_{jm}$

Principle Components Analysis

The first principle components vector is chosen from all combinations of predictors such that:

• $\phi_{11}^2 + ... + \phi_{pm}^2 = 1$ ;
• captures the highest variance of predictors
• defines the line that is as close as possible to the data
• the first principle components score for the $i$th observation is the distance in the $x$-direction of the $i$th projection from 0, the mean point of all predictors.
• all principle components vectors are orthogonal to each other

The second principle component $Z_2$ is a linear combination of the variables that is uncorrelated with $Z_1$ and has the largest variance, which means it must be orthogonal to the first principal component direction. Other principle components can be calculated in the same way.

Principle components regression

• assume that the directions in which $X_1, ..., X_p$ show the most variation are the directions that are associated with $Y$
• using more principle components means lower bias, higher variance
• this is NOT a feature selection method, since it combines all $p$ original features
• closely related to ridge regression

Notes:

• choose the number of principle components $M$ using cross validation
• standardize each predictor before generating principle components, so that high variance variables don’t dominate

Partial Least Squares

Partial Least Squares (PLS) is a supervised alternative to PCA that identifies new features $Z_1, ..., Z_m$ that are combinations of the original predictors and related to the response. The biggest drawback of PCA is that its components are not guaranteed to be the ones that best explain the response, since it’s unsupervised.

• compute the first PLS direction, $Z_1 = \sum_{j = 1}^p \phi_{j1} X_j$, by setting each $\phi_{j1}$ to the simple linear regression coefficients of $Y \sim X_j$.
• this places the highest weights on variables most strongly correlated with the response when calculating $Z_1$
• does not fit the predictors as well as PCA, but does a better job explaining the response
• compute the second PLS direction, $Z_2$, by regressing each variable on $Z_1$ and taking the residuals, then compute $Z_2$ by using the orthogonalized data in the same way as $Z_1$.
• Choose the number of predictors $M$ with cross-validation.

This method is popular in chemometrics, but in practice often performs no better than PCR or ridge regression. Relative to PCR, PLS has lower bias and higher variance.

Considerations in High Dimensions

What goes wrong when $p \geq n$?

• be careful of overfitting – always evaluate perfromance on an independent test set
• $C_p$, AIC, BIC cannot be used because estimating $\hat \sigma^2$ is problematic. Adjusted $R^2$ also can’t be used.

Tips:

• forward stepwise selection, ridge regression, lasso, and PCR are useful
• regularization/shrinkage plays a big role
• tuning parameter selection is very important
• test error will increase as dimensions increase unless additional features are truly predictive (curse of dimensionality)

Interpreting results

• In high dimensions, any variable can be written as a linear combination of the others. For this reason, we can’t know which variables are truly predictive of the outcome, or which coefficients are best.
• All models are only one of many possible models for prediction, and must be further validated on independent data sets.
• Never use the sum of squared errors, p-values, $R^2$, or others on training data as evidence of good fit. Instead, use either cross-validation results or test on an independent test set.

V. Applications in R

1. Source: James et. al, Intro to Statistical Learning 7th edition, pg. 222

2. Source: James et. al, Intro to Statistical Learning 7th edition, pg. 226

3. Source: James et. al, Intro to Statistical Learning 7th edition, pg. 227