Model Tuning and Overfitting

An overview of the model tuning process, including data splitting, resampling techniques, and recommendations for choosing parameters and models.

May 13, 2020 - 7 minute read

Note: Because I covered similar material in my Resampling Methods post, I won’t be going into great detail about the techniques described here.

I. Model Tuning

Behold! A nice picture of the parameter tuning process. [1]

[Figure: Model Tuning Overview]

Overfitting occurs when a model learns the noise unique to a particular sample in addition to the underlying signal, so it predicts new samples poorly. Splitting the data into training and test sets, or using resampling techniques to estimate error, helps detect and mitigate overfitting because performance is measured on data the model has not seen.
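To make this concrete, here is a minimal sketch in base R. Everything in it is made up for illustration (the simulated sine-wave data, the polynomial degrees, the `rmse` helper); it is not from the book. It fits increasingly flexible polynomials and reports the error on the training rows next to the error on held-out rows.

```r
# Simulated data: a smooth signal plus noise (all names here are illustrative).
set.seed(42)
n <- 60
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Hold out a third of the rows as a test set.
testIdx <- sample(n, size = 20)
trainDat <- dat[-testIdx, ]
testDat <- dat[testIdx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Fit polynomials of increasing flexibility and record both error estimates.
degrees <- c(1, 3, 9, 12)
overfitDemo <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = trainDat)
  c(degree = d,
    trainRMSE = rmse(trainDat$y, predict(fit, newdata = trainDat)),
    testRMSE = rmse(testDat$y, predict(fit, newdata = testDat)))
}))
print(overfitDemo)
```

Because the models are nested, the training error can only go down as the degree grows. The held-out error has no such guarantee, and a widening gap between the two columns is the usual sign of overfitting.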

Data Splitting

When the number of samples, n, is not large, a strong case can be made for using cross-validation instead of a single training/test split, which lets the model learn from more data points. Types of data splitting include:

  • A simple random sample of the data
  • Stratified random sampling of the data, for balanced outcomes in the test dataset
  • Maximum dissimilarity sampling: initialize the test set with a single sample, and allocate the most dissimilar unallocated sample (must specify a dissimilarity measure). Repeat until the target test set size is achieved.
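Maximum dissimilarity sampling is the least familiar of the three, so here is a rough base R sketch of the greedy loop described above. Two choices are my own assumptions: Euclidean distance as the dissimilarity measure, and scoring each candidate by its distance to the closest point already in the test set. `maxDissimSample` and the toy matrix are hypothetical names for illustration; caret's `maxDissim()` is the function to reach for in practice.

```r
# Greedy maximum dissimilarity sampling on a numeric matrix.
# Assumptions: Euclidean distance, and "dissimilarity to the test set" scored
# as the distance to the closest point already in the set.
maxDissimSample <- function(x, size, start = 1) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  selected <- start                  # initialize with a single sample
  while (length(selected) < size) {
    candidates <- setdiff(seq_len(nrow(x)), selected)
    # For each candidate, distance to its nearest already-selected point.
    score <- apply(d[candidates, selected, drop = FALSE], 1, min)
    selected <- c(selected, candidates[which.max(score)])
  }
  selected
}

set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)  # 100 hypothetical samples, 2 predictors
testRows <- maxDissimSample(toy, size = 20)
trainRows <- setdiff(seq_len(nrow(toy)), testRows)
```

The resulting test set tends to spread across the predictor space rather than cluster where the data are densest.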

Resampling Techniques

  • K-fold cross-validation
    • In practice, k = 5 or k = 10 are commonly used.
    • A bigger k leads to lower bias and higher variance.
    • Generally has high variance compared to other validation methods.
    • Leave-one-out cross-validation is a special case where k = n.
  • Repeated Training/Test Splits
    • AKA, “Monte Carlo Cross-Validation” or “Leave-group-out Cross-Validation”
    • Rule of thumb is to use around 75-80% of the data for training splits.
    • Use a large number of repetitions (say, 50 to 200+)
  • Bootstrap error
    • Fit the model on a bootstrap sampling of the data, then predict the out-of-bag data points for an error estimate
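    • In general, the bootstrap estimate has less uncertainty than k-fold CV, with bias similar to k-fold CV with k ≈ 2.
    • On average, 63.2% of the data points are represented at least once in the bootstrap sample.
    • An alternate “632 method” for estimating error is (0.632 × simple bootstrap estimate) + (0.368 × apparent error rate), where the apparent error rate comes from predicting the same data used to fit the model.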
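The 63.2% figure comes from the chance that a given point is drawn at least once in n draws with replacement: 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows. Below is a small base R sketch that checks this empirically and computes the out-of-bag and 632 estimates. The simulated regression data, the choice of 100 bootstrap samples, and the use of RMSE as the error measure are my own assumptions.

```r
set.seed(123)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)        # hypothetical regression problem

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

B <- 100                             # number of bootstrap samples
inBagFrac <- numeric(B)
oobError <- numeric(B)
for (b in seq_len(B)) {
  bootRows <- sample(n, size = n, replace = TRUE)   # bootstrap sample
  oobRows <- setdiff(seq_len(n), bootRows)          # out-of-bag rows
  fit <- lm(y ~ x, data = dat[bootRows, ])
  inBagFrac[b] <- length(unique(bootRows)) / n
  oobError[b] <- rmse(dat$y[oobRows], predict(fit, newdata = dat[oobRows, ]))
}

# Apparent error: fit and evaluate on the same full data set.
fullFit <- lm(y ~ x, data = dat)
apparentError <- rmse(dat$y, predict(fullFit, newdata = dat))

simpleBootError <- mean(oobError)
boot632Error <- 0.632 * simpleBootError + 0.368 * apparentError

mean(inBagFrac)   # should land close to 1 - exp(-1)
```

The 632 estimate always sits between the optimistic apparent error and the simple bootstrap estimate, which is the point of the weighting.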

II. Recommendations

Choosing the final tuning parameters:

  1. Choose the model associated with the numerically best performance estimates.
  2. Choose the simplest model whose performance is within a single standard error of the numerically best value (one-standard-error method).
  3. Choose the simplest model that is within a certain tolerance of the numerically best value (the percent decrease in performance from the numerically optimal value, O, can be calculated by (X − O)/O, where X is the performance value of the candidate model).
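Here is a small base R sketch of the three rules side by side. The `tuneResults` table is invented for illustration (rows ordered from simplest to most complex model), and the 3% tolerance is an arbitrary choice. The tolerance rule uses the percent change (X − O)/O, where O is the numerically optimal value and X is a candidate's value.

```r
# Hypothetical resampling summary: one row per candidate value of a tuning
# parameter, ordered from simplest to most complex model.
tuneResults <- data.frame(
  complexity = 1:6,
  accuracy   = c(0.70, 0.74, 0.76, 0.77, 0.775, 0.772),
  se         = c(0.012, 0.011, 0.010, 0.010, 0.011, 0.012)
)

# 1. Numerically best performance.
bestIdx <- which.max(tuneResults$accuracy)

# 2. One-standard-error rule: simplest model within one SE of the best.
threshold <- tuneResults$accuracy[bestIdx] - tuneResults$se[bestIdx]
oneSEIdx <- min(which(tuneResults$accuracy >= threshold))

# 3. Tolerance rule: simplest model within `tol` percent of the best.
tol <- 3
pctChange <- 100 * (tuneResults$accuracy - tuneResults$accuracy[bestIdx]) /
  tuneResults$accuracy[bestIdx]
tolIdx <- min(which(pctChange >= -tol))

tuneResults$complexity[c(bestIdx, oneSEIdx, tolIdx)]
```

With these made-up numbers the three rules pick three different models, each one simpler than the last. In caret, similar rules are available through the `selectionFunction` argument of `trainControl()`.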

Data Splitting and Resampling:

  • For small datasets, use repeated 10-fold cross-validation.
  • No resampling method is uniformly better than another.
  • For large sample sizes, differences become less pronounced and computational efficiency becomes more important.
  • For choosing between models, consider using one of the bootstrap procedures since they have very low variance.

Choosing between models:

  1. Start with the most flexible models (boosted trees, SVMs) to get a sense of the empirically optimum results.
  2. Investigate simpler, more interpretable models (MARS, PLS, GAMs, Naïve Bayes).
  3. Consider using the simplest model that reasonably approximates the performance of the more complex models.
  • Note: A paired t-test can be used to evaluate the hypothesis that the models have equivalent average accuracies
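As a rough illustration of that paired t-test, here is a base R sketch. The accuracy vectors are simulated stand-ins for two models scored on the same 10 resamples; the names `svmAcc` and `glmAcc` are hypothetical.

```r
# Hypothetical accuracies for two models evaluated on the same 10 resamples.
set.seed(7)
svmAcc <- round(runif(10, 0.72, 0.80), 3)
glmAcc <- round(svmAcc - rnorm(10, mean = 0.005, sd = 0.02), 3)

# Pairing matters: each difference compares the two models on identical
# held-out data, which removes resample-to-resample variation.
pairedTest <- t.test(svmAcc, glmAcc, paired = TRUE)
pairedTest$estimate   # mean difference in accuracy
pairedTest$p.value    # a large p-value gives no evidence of a real difference
```

The test is only meaningful when both models were evaluated on exactly the same resamples, so the resampling indices have to match between the two fits.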

III. Applications in R

# Data Splitting ----------------------------------------------------------

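library(AppliedPredictiveModeling) # provides twoClassData
library(caret)                     # createDataPartition, createFolds, knn3, train

data(twoClassData) # loads `predictors` and `classes`

# stratified random sampling (seed fixed so the split is reproducible)
set.seed(1)
trainingRows <- createDataPartition(classes, p = .8, list = FALSE)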

trainPredictors <- predictors[trainingRows, ]
trainClasses <- classes[trainingRows]
testPredictors <- predictors[-trainingRows, ]
testClasses <- classes[-trainingRows]

# maximum dissimilarity sampling
# see caret::maxDissim()

# Resampling --------------------------------------------------------------


# repeated training/test splits
repeatedSplits <- createDataPartition(trainClasses, p = .8, times = 3)

# indicators for 10-fold CV
cvSplits <- createFolds(trainClasses, k = 10, returnTrain = TRUE)

fold1 <- cvSplits[[1]]
cvPredictors1 <- trainPredictors[fold1, ]
cvClasses1 <- trainClasses[fold1]

# Model Building ----------------------------------------------------------


trainPredictors <- as.matrix(trainPredictors)
(knnFit <- knn3(x = trainPredictors, y = trainClasses, k = 5))

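testPredictions <- predict(knnFit, newdata = testPredictors, type = "class")
# pass predictions as `data` and the truth as `reference` so that
# sensitivity and specificity are not swapped
confusionMatrix(data = testPredictions, reference = testClasses) # about 66% accuracy in the original run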

# Determining Tuning Parameters -------------------------------------------
# NOTE: the code in this section is an edited version of code from the 'chapters'
# directory of the AppliedPredictiveModeling package that is included  
# for reference/educational purposes only.


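## Assume we have completed some data pre-processing and split the 
## data into GermanCreditTrain and GermanCreditTest training/test sets

library(kernlab) # provides sigest()

## sigest() returns the 10%, 50%, and 90% quantiles of a sigma estimate;
## the first (lowest) value is used for the grid
set.seed(231)
sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))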

set.seed(1056) # fix the seed so the resampling folds are reproducible
svmFit <- train(Class ~ .,
                data = GermanCreditTrain,
                method = "svmRadial",
                preProc = c("center", "scale"),
                tuneGrid = svmTuneGrid,
                trControl = trainControl(method = "repeatedcv", 
                                         repeats = 5,
                                         classProbs = TRUE))

# note--different resampling methods in trainControl include:
# cv, LOOCV, LGOCV, boot, boot632

## Print the results

svmFit

## A line plot of the average performance. The 'scales' argument is actually an 
## argument to xyplot that converts the x-axis to log-2 units.

plot(svmFit, scales = list(x = list(log = 2)))

## Test set predictions

predictedClasses <- predict(svmFit, GermanCreditTest)

## Use the "type" option to get class probabilities

predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")

# Choosing Between Models -------------------------------------------------
# continued from previous section

set.seed(1056) # for a fair paired comparison, use the same seed before the svmFit call
glmProfile <- train(Class ~ .,
                    data = GermanCreditTrain,
                    method = "glm",
                    trControl = trainControl(method = "repeatedcv", 
                                             repeats = 5))

resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))

modelDifferences <- diff(resamp)

## The actual paired t-test:
modelDifferences$statistics$Accuracy

[1] Source: Kuhn and Johnson, Applied Predictive Modeling (2013), p. 66