Linear Model Selection and Regularization

Alternative fitting methods to least squares, such as subset selection, shrinkage (ridge regression, the lasso), and dimension reduction techniques (principal components analysis, partial least squares), can improve prediction accuracy and model interpretability.

April 27, 2020 - 20 minute read -
academic

I. Overview

Why use an alternative fitting method to least squares?

  • Prediction Accuracy
    • If $n$ is not much larger than $p$, then least squares can have a lot of variability
    • If $p > n$, then there is no longer a unique least squares coefficient estimate, and we must constrain or shrink the coefficients
  • Model Interpretability
    • Remove irrelevant variables with feature/variable selection

Alternatives to Least Squares

  1. Subset selection: select a subset of predictors before fitting least squares
  2. Shrinkage: fit model with all predictors, with coefficients shrunk toward 0 relative to the least squares estimate
  3. Dimension reduction: project the $p$ predictors onto an $M$-dimensional subspace, where $M < p$, by computing $M$ projections/linear combinations of the variables. These projections are then used as predictors for least squares.



II. Subset Selection

Best Subset Selection

  1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
  2. For $k = 1, 2, \ldots, p$: fit all $\binom{p}{k}$ models that contain exactly $k$ predictors, and choose the best one (smallest RSS / largest $R^2$), $\mathcal{M}_k$.
  3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using a model selection metric such as adjusted $R^2$, AIC, BIC, or cross-validated prediction error.

Note: this is very computationally expensive, and infeasible for large $p$, since there are $2^p$ total models to check. In addition, searching such a large model space gives high variance and can lead to overfitting.

Forward Stepwise Selection

  1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
  2. For $k = 0, 1, \ldots, p - 1$: consider all $p - k$ models that add one additional predictor to $\mathcal{M}_k$, then choose the best one, $\mathcal{M}_{k+1}$.
  3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using a model selection metric.

Note: this is much more computationally feasible than best subset selection, with only $1 + p(p+1)/2$ total models to check, but it is not guaranteed to yield the "best" model for the training data. When $n < p$, this is the only feasible method of the three subset selection approaches described here.

Backward Stepwise Selection

  1. Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
  2. For $k = p, p - 1, \ldots, 1$: consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k - 1$ predictors each, then choose the best one, $\mathcal{M}_{k-1}$.
  3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using a model selection metric.

Note: Like forward selection, this only searches through $1 + p(p+1)/2$ models, and is not guaranteed to yield the best model. It cannot be used when $p > n$, since the full model cannot be fit. Hybrid (forward-backward) approaches may be used as well.
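All three procedures can be run with regsubsets() from the leaps package (the College worked example in Section V uses the same function). A minimal sketch on the built-in mtcars data, which is just a stand-in here; method = "exhaustive" is the default (best subset), while "forward" and "backward" give the stepwise variants:

library(leaps)

# best subset, forward stepwise, and backward stepwise on a small built-in data set
fit_best     <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)                       # exhaustive search (default)
fit_forward  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
fit_backward <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")

summary(fit_best)$bic    # BIC for the best model of each size
coef(fit_best, id = 3)   # coefficients of the best 3-predictor model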

Model selection statistics

We can choose the optimal model either by estimating the test error indirectly, adjusting the training error to account for overfitting bias, or by estimating it directly with a validation set or cross-validation.

Adjusting the training error:

  • $C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$
    • Penalizes the model based on the number of predictors, $d$. Lower is better.
    • If $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$, then $C_p$ is an unbiased estimator of the test MSE.
  • $\mathrm{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$
    • Use this for models fit by maximum likelihood.
    • With Gaussian errors, maximum likelihood = least squares, so $C_p$ and AIC are proportional for least squares models.
  • $\mathrm{BIC} = \frac{1}{n \hat{\sigma}^2}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right)$
    • Use this for a least squares model with $d$ predictors.
    • Places a heavier penalty on models with many variables, since $\log n > 2$ for $n > 7$.
  • $\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$
    • Larger is better.

Note: These formulas are for linear models fit by least squares, but they can be generalized to other model types.

Cross-validation or validation set error:

  • Makes fewer assumptions about the underlying model, and can be used when the model's degrees of freedom are hard to determine or $\sigma^2$ cannot be estimated.
  • One standard error rule: choose the smallest model for which the estimated test MSE is within one standard error of the lowest point on the CV error curve (see the sketch below).
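Both routes are easy to compute; a minimal sketch on the built-in mtcars data (a stand-in for illustration). summary() of a leaps::regsubsets fit reports $C_p$, BIC, and adjusted $R^2$ for each model size, while cv.glmnet reports both the CV-minimizing $\lambda$ (lambda.min) and the one-standard-error choice (lambda.1se).

library(leaps)
library(glmnet)

# indirect estimates: Cp, BIC, adjusted R2 from a regsubsets fit
fit   <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
stats <- summary(fit)
which.min(stats$cp)     # model size minimizing Cp
which.min(stats$bic)    # model size minimizing BIC
which.max(stats$adjr2)  # model size maximizing adjusted R2

# direct estimate via cross-validation, plus the one standard error rule
x  <- model.matrix(mpg ~ ., data = mtcars)[, -1]
cv <- cv.glmnet(x, mtcars$mpg, alpha = 1)
cv$lambda.min  # lambda with the smallest CV error
cv$lambda.1se  # largest lambda whose CV error is within one SE of the minimum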



III. Shrinkage

Shrinking coefficient estimates can significantly reduce model variance.

Ridge Regression

While least squares minimizes RSS, ridge regression picks coefficient estimates $\hat{\beta}^R_\lambda$ to minimize:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$

Where:

  • $\lambda \ge 0$ is a tuning parameter
  • $\lambda\sum_{j=1}^{p}\beta_j^2$ is the shrinkage penalty, which is small when the $\beta_j$ are close to zero and whose influence grows as $\lambda$ increases

It is best to perform ridge regression after standardizing the predictors so that they are all on the same scale (note the denominator is the estimated standard deviation of the $j$th predictor):

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}}$$

Improvements over least squares:

  • Bias-variance tradeoff: as $\lambda$ increases, variance decreases and bias increases.
  • When the relationship between the predictors and the response is approximately linear, least squares will have low bias but may have high variance, particularly when the number of variables is almost as large as, or larger than, the sample size.
  • Ridge regression works best when the least squares estimate has high variance.
  • Computationally superior to best subset selection, since solving for $\hat{\beta}^R_\lambda$ across all values of $\lambda$ is nearly computationally equivalent to a single least squares fit (see the sketch below).

Drawback:

  • Ridge regression will include all predictors in the final model, which presents challenges for interpretability.
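A minimal ridge sketch with glmnet, again on mtcars as a stand-in: alpha = 0 selects the ridge penalty, glmnet standardizes the predictors by default, and it fits the entire $\lambda$ path in one call, which is the computational advantage noted above.

library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix, intercept column dropped
y <- mtcars$mpg

ridge_fit <- glmnet(x, y, alpha = 0)   # whole lambda path in a single fit
plot(ridge_fit, xvar = "lambda")       # coefficients shrink toward 0 as lambda grows
coef(ridge_fit, s = 1)                 # coefficients at lambda = 1: shrunk, but none exactly 0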

Lasso

The lasso coefficients $\hat{\beta}^L_\lambda$ minimize:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$

Where $\lambda \ge 0$ is a tuning parameter.

Notes:

  • Bias-variance tradeoff: as $\lambda$ increases, variance decreases and bias increases.
  • The lasso uses an $\ell_1$ penalty instead of the $\ell_2$ penalty of ridge regression.
  • The $\ell_1$ norm of the coefficient vector is $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$.
  • Performs variable selection by forcing some coefficients to exactly 0, which yields sparse models (see the sketch below).
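To see the sparsity in practice, a small contrast between the $\ell_1$ and $\ell_2$ penalties at the same $\lambda$ (mtcars again as a stand-in; how many coefficients are zeroed depends on the data and on $\lambda$):

library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

lasso_fit <- glmnet(x, y, alpha = 1)   # l1 penalty (lasso)
ridge_fit <- glmnet(x, y, alpha = 0)   # l2 penalty (ridge)

coef(lasso_fit, s = 1)  # several coefficients are exactly 0 (sparse model)
coef(ridge_fit, s = 1)  # coefficients are shrunk but remain nonzero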

Comparison of Ridge Regression and Lasso

Ridge Regression:

  • $\hat{\beta}^R_\lambda$ minimizes $\mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$ (equivalently, minimizes RSS subject to $\sum_{j=1}^{p}\beta_j^2 \le s$)
  • uses an $\ell_2$ penalty
  • computationally feasible alternative to best subset selection
  • outperforms the lasso when all variables are related to the response, or when there are many predictors with coefficients of roughly equal size.

Lasso:

  • $\hat{\beta}^L_\lambda$ minimizes $\mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$ (equivalently, minimizes RSS subject to $\sum_{j=1}^{p}|\beta_j| \le s$)
  • uses an $\ell_1$ penalty
  • computationally feasible alternative to best subset selection
  • outperforms ridge when only a few predictors have substantial coefficients and the rest are very small or zero. It also performs variable selection, which makes the resulting model easier to interpret.

The tuning parameter $\lambda$ should be selected with cross-validation: choose a grid of $\lambda$ values, compute the cross-validation error for each $\lambda$, and select the $\lambda$ for which the CV error is smallest.
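A minimal sketch of that workflow with cv.glmnet, which performs the grid search and k-fold cross-validation in one call (the Section V code below does the same thing on the College data; the grid here is arbitrary and mtcars is a stand-in):

library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

grid   <- 10 ^ seq(2, -3, length = 100)                # candidate lambda values
cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = grid)    # 10-fold CV over the grid
plot(cv_fit)                                           # CV error curve vs log(lambda)
best_lambda <- cv_fit$lambda.min                       # lambda with the smallest CV error
coef(cv_fit, s = best_lambda)                          # coefficients at the selected lambda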

The classic image below [1] illustrates the difference between ridge regression and the lasso in two dimensions. Points along each red ellipse have equal RSS, and the blue areas represent the lasso constraint ($|\beta_1| + |\beta_2| \le s$) and the ridge constraint ($\beta_1^2 + \beta_2^2 \le s$). The ridge and lasso solutions are where the red ellipses first touch the blue areas.

[Figure: Lasso vs Ridge]

Further intuition for a simple case is given below [2]: ridge regression shrinks every coefficient by the same proportion, while the lasso shrinks all coefficients toward zero by a similar constant amount, and coefficients smaller than that amount are shrunk all the way to zero.

[Figure: Lasso vs Ridge 2]

Extra Theory: a Bayesian interpretation of ridge regression and the lasso

Suppose that:

  • the coefficient vector $\beta$ has a prior distribution $p(\beta)$, where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$
  • the likelihood of the data can be written as $f(Y \mid X, \beta)$, where $X = (X_1, \ldots, X_p)$

Then, multiplying the prior distribution by the likelihood gives the posterior distribution, up to a proportionality constant:

$$p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta)$$

Assume:

  • $Y = \beta_0 + X_1\beta_1 + \cdots + X_p\beta_p + \epsilon$, where the errors are independent and normally distributed
  • $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$ for some density function $g$

Then,

  • If $g$ is a Gaussian distribution with mean zero and standard deviation that is a function of $\lambda$, then the posterior mode (the most likely value of $\beta$ given the data) is given by the ridge regression solution.
  • If $g$ is a double-exponential (Laplace) distribution with mean zero and scale parameter that is a function of $\lambda$, then the posterior mode for $\beta$ is given by the lasso solution.
    • However, note that the lasso solution is not the posterior mean, and the posterior mean does not yield a sparse coefficient vector.
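To spell out the connection (a sketch following the setup above, with $c$ and $b$ denoting the prior scale parameters, which are assumptions of this sketch): up to additive constants, the log-posterior is the log-likelihood plus the log-prior, so maximizing it is equivalent to minimizing a penalized RSS.

$$\log p(\beta \mid X, Y) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \sum_{j=1}^{p}\log g(\beta_j)$$

With a Gaussian prior $g(\beta_j) \propto \exp\!\big(-\beta_j^2/(2c)\big)$, the last term contributes $-\frac{1}{2c}\sum_j \beta_j^2$, so the posterior mode minimizes $\mathrm{RSS} + \lambda\sum_j \beta_j^2$ with $\lambda = \sigma^2/c$. With a Laplace prior $g(\beta_j) \propto \exp\!\big(-|\beta_j|/b\big)$, it contributes $-\frac{1}{b}\sum_j |\beta_j|$, so the posterior mode minimizes $\mathrm{RSS} + \lambda\sum_j |\beta_j|$ with $\lambda = 2\sigma^2/b$.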

A visualization of the Gaussian and double-exponential priors: [3]

[Figure: Lasso vs Ridge 3]



IV. Dimension Reduction Techniques

These techniques transform the predictors and then use least squares to fit a model on the transformed predictors. When the number of predictors is large relative to the sample size, reducing the number of dimensions can significantly reduce the variance of the fit. In more formal terms…

Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of the original $p$ predictors:

$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$$

Then, using least squares, fit the linear regression model:

$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \ldots, n$$

  • The regression coefficients are $\theta_0, \theta_1, \ldots, \theta_M$
  • Dimensions are reduced from $p + 1$ to $M + 1$ coefficients
  • Coefficients are constrained by $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}$

Principal Components Analysis

The first principal component vector is chosen from all normalized linear combinations of the predictors such that:

  • $Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$, with $\sum_{j=1}^{p} \phi_{j1}^2 = 1$
  • $Z_1$ captures the highest variance among all such combinations of the predictors
  • it defines the line that is as close as possible to the data
  • the first principal component score for the $i$th observation, $z_{i1}$, is the distance in the $Z_1$ direction of the $i$th projection from zero, the mean point of the predictors
  • all principal component vectors are orthogonal to each other

The second principal component, $Z_2$, is a linear combination of the variables that is uncorrelated with $Z_1$ and has the largest variance subject to that constraint, which means its direction must be orthogonal to the first principal component direction. The remaining principal components are computed in the same way.

Principal components regression

  • assumes that the directions in which $X_1, \ldots, X_p$ show the most variation are the directions associated with $Y$
  • using more principal components means lower bias, higher variance
  • this is NOT a feature selection method, since each component is a linear combination of all of the original features
  • closely related to ridge regression

Notes:

  • choose the number of principal components with cross-validation (a manual sketch follows this list)
  • standardize each predictor before generating the principal components, so that high-variance variables don't dominate
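A manual PCR sketch with prcomp (mtcars as a stand-in): standardize the predictors, take the first few principal component scores, and fit least squares on those scores. In practice pcr() from the pls package, used in Section V, does this plus the cross-validation in one call.

# manual principal components regression
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

pca <- prcomp(x, scale. = TRUE)   # standardize, then compute principal components
colSums(pca$rotation ^ 2)         # each loading vector has unit length
scores <- pca$x[, 1:3]            # scores z_im for the first M = 3 components
pcr_fit <- lm(y ~ scores)         # least squares on the component scores
summary(pcr_fit)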

Partial Least Squares

Partial Least Squares (PLS) is a supervised alternative to PCA that identifies new features that are combinations of the original predictors and related to the response. The biggest drawback of PCA is that its components are not guaranteed to be the ones that best explain the response, since it’s unsupervised.

  • compute the first PLS direction, $Z_1 = \sum_{j=1}^{p} \phi_{j1} X_j$, by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$ (a manual sketch follows this list).
    • this places the highest weights on the variables most strongly correlated with the response when calculating $Z_1$
    • $Z_1$ does not fit the predictors as well as the first principal component does, but it does a better job of explaining the response
  • compute the second PLS direction, $Z_2$, by regressing each variable $X_j$ on $Z_1$ and taking the residuals, then computing $Z_2$ from this orthogonalized data in the same way as $Z_1$.
  • Choose the number of PLS directions $M$ with cross-validation.
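A manual sketch of the first PLS direction, assuming standardized predictors and a centered response (mtcars as a stand-in); plsr() in the pls package, used in Section V, implements the full algorithm.

# first PLS direction by hand: weights are simple linear regression slopes of y on each x_j
x <- scale(model.matrix(mpg ~ ., data = mtcars)[, -1])   # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)                       # centered response

phi1 <- apply(x, 2, function(xj) coef(lm(y ~ xj))[2])    # slope of y ~ x_j for each predictor
z1   <- x %*% phi1                                       # scores on the first PLS direction

fit1 <- lm(y ~ z1)   # regress the response on the first PLS component
summary(fit1)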

This method is popular in chemometrics, but in practice often performs no better than PCR or ridge regression. Relative to PCR, PLS has lower bias and higher variance.

Considerations in High Dimensions

What goes wrong when $p > n$?

  • be careful of overfitting – always evaluate performance on an independent test set
  • $C_p$, AIC, and BIC cannot be used, because estimating $\hat{\sigma}^2$ is problematic. Adjusted $R^2$ also can't be used.

Tips:

  • forward stepwise selection, ridge regression, lasso, and PCR are useful
  • regularization/shrinkage plays a big role
  • tuning parameter selection is very important
  • test error will increase as dimensions increase unless additional features are truly predictive (curse of dimensionality)

Interpreting results

  • In high dimensions, any variable can be written as a linear combination of the others. For this reason, we can’t know which variables are truly predictive of the outcome, or which coefficients are best.
  • All models are only one of many possible models for prediction, and must be further validated on independent data sets.
  • Never use the sum of squared errors, p-values, $R^2$, or other traditional measures of fit on the training data as evidence of a good fit. Instead, report cross-validation results or performance on an independent test set.



V. Applications in R

### exercise!!!! 
### using COLLEGE data - predict Apps (# applications)
library("ISLR")
library("pls")
library("glmnet")
library("purrr")
library("leaps")
library("tidyverse")
library("patchwork")
library("ggplot2")
# prepare data
college <- College
set.seed(1)
split <- sample(nrow(College), nrow(College) * .7)
train <- college[split,]
test <- college[-split,]
train_matrix <- model.matrix(Apps ~ ., data = train)
test_matrix  <- model.matrix(Apps ~ ., data = test)

mse <- function(pred, test) {
  return(mean((pred - test) ^ 2))
}

# R-squared on the test set: 1 - SS_residual / SS_total
rsq <- function(pred, test) {
  return(1 - sum((pred - test) ^ 2) / sum((test - mean(test)) ^ 2))
}

# least squares
model_lsq <- lm(Apps ~ ., data = train)
pred_lsq <- predict(model_lsq, test)

mse(pred_lsq, test$Apps) # MSE: 1261630
rsq(pred_lsq, test$Apps) # R-Squared: .9135

# least squares - best subset
model_lsq_best <- regsubsets(Apps ~ ., data = train, nvmax = 17)
model_lsq_best_summ <- summary(model_lsq_best)
model_lsq_best_selection <- tibble(index = seq(1,length(model_lsq_best_summ$rss)),
                                   rss = model_lsq_best_summ$rss,
                                   adjr2 = model_lsq_best_summ$adjr2,
                                   cp = model_lsq_best_summ$cp,
                                   bic = model_lsq_best_summ$bic)

# model performance by RSS, adjusted R2, Cp, and BIC
p1 <- ggplot(model_lsq_best_selection, aes(index, rss)) +
  geom_line()

max_adjr2_loc <- which.max(model_lsq_best_selection$adjr2)
p2 <- ggplot(model_lsq_best_selection, aes(index, adjr2)) +
  geom_line() +
  annotate("point", x = max_adjr2_loc, y = model_lsq_best_selection$adjr2[max_adjr2_loc], color = "blue")

min_cp_loc <- which.min(model_lsq_best_selection$cp)
p3 <- ggplot(model_lsq_best_selection, aes(index, cp)) +
  geom_line() +
  annotate("point", x = min_cp_loc, y = model_lsq_best_selection$cp[min_cp_loc], color = "blue")

min_bic_loc <- which.min(model_lsq_best_selection$bic)
p4 <- ggplot(model_lsq_best_selection, aes(index, bic)) +
  geom_line() +
  annotate("point", x = min_bic_loc, y = model_lsq_best_selection$bic[min_bic_loc], color = "blue")

# combine the four panels with patchwork
(p1 + p2) / (p3 + p4)
    
errors <- rep(NA, 17) # test set error for each subset size
for (i in 1:17) {
  coefi <- coef(model_lsq_best, id = i)
  pred <- test_matrix[, names(coefi)] %*% coefi
  errors[i] <- mse(pred, test$Apps)
}
which.min(errors) # 12 - MSE: 1256814
plot(errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="red")
    
coef(model_lsq_best, which.min(errors))


# least squares - forward selection
model_lsq_forward <- regsubsets(Apps ~ ., data = train, nvmax = 18, method = "forward")

errors <- rep(NA, 17) # test set error for each subset size
model_rsq <- rep(NA, 17)
for (i in 1:17) {
  coefi <- coef(model_lsq_forward, id = i)
  pred <- test_matrix[, names(coefi)] %*% coefi
  errors[i] <- mse(pred, test$Apps)
  model_rsq[i] <- rsq(pred, test$Apps)
}
which.min(errors) # 12 - MSE: 1256814
model_rsq[12] # R-Squared: .9138
plot(errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="red")

coef(model_lsq_forward, which.min(errors))

  ## ALTERNATE WAY - select automatically by AIC
model_null <- lm(Apps ~ 1, data = train)
model_full <- lm(Apps ~ ., data = train)

model_lsq_forward_2 <- step(model_null, scope = list(lower = model_null, upper = model_full), direction = "forward")
pred_lsq_forward_2 <- predict(model_lsq_forward_2, test)

mse(pred_lsq_forward_2, test$Apps) # MSE: 1259145
rsq(pred_lsq_forward_2, test$Apps) # R-squared: .9136

# least squares - backward selection
model_lsq_backward <- regsubsets(Apps ~ ., data = train, nvmax = 18, method = "backward")

errors <- rep(NA, 17) # test set error for each subset size
for (i in 1:17) {
  coefi <- coef(model_lsq_backward, id = i)
  pred <- test_matrix[, names(coefi)] %*% coefi
  errors[i] <- mse(pred, test$Apps)
}
which.min(errors) # 12 - MSE: 1256814
plot(errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="red")

coef(model_lsq_backward, which.min(errors))

# ridge regression
grid <- 10 ^ seq(4, -2, length = 100)

cv_ridge <- cv.glmnet(train_matrix, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
best_lambda <- cv_ridge$lambda.min

model_ridge <- glmnet(train_matrix, train$Apps, alpha = 0, lambda = best_lambda, thresh = 1e-12)
pred_ridge  <- predict(model_ridge, s = best_lambda, newx = test_matrix)

mse(pred_ridge, test$Apps) # MSE: 1261599
rsq(pred_ridge, test$Apps) # R-Squared: .9135

# lasso
cv_lasso <- cv.glmnet(train_matrix, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
best_lambda_lasso <- cv_lasso$lambda.min

model_lasso <- glmnet(train_matrix, train$Apps, alpha = 1, lambda = best_lambda_lasso, thresh = 1e-12)
pred_lasso  <- predict(model_lasso, s = best_lambda_lasso, newx = test_matrix)
coef(model_lasso)

mse(pred_lasso, test$Apps) # MSE: 1261591
rsq(pred_lasso, test$Apps) # R-Squared: .9135

# pcr
model_pcr <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(model_pcr, val.type = "MSEP")

pred_pcr <- predict(model_pcr, test, ncomp = 17)

mse(pred_pcr, test$Apps) # MSE: 1261630
rsq(pred_pcr, test$Apps) # R-Squared: .9135

# pls
model_pls <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(model_pls, val.type = "MSEP")

pred_pls <- predict(model_pls, test, ncomp = 9)

mse(pred_pls, test$Apps) # MSE: 1286077
rsq(pred_pls, test$Apps) # R-Squared: .9118
  1. Source: James et al., Intro to Statistical Learning, 7th edition, p. 222.

  2. Source: James et al., Intro to Statistical Learning, 7th edition, p. 226.

  3. Source: James et al., Intro to Statistical Learning, 7th edition, p. 227.
