# Model Tuning and Overfitting

An overview of the model tuning process, including data splitting, resampling techniques, and recommendations for choosing parameters and models.

May 13, 2020 - 7 minute read -

Note: Because I covered similar material in my Resampling Methods post, I won’t be going into great detail about the techniques described here.

### I. Model Tuning

Behold! A nice picture of the parameter tuning process.1

Overfitting occurs when a model has learned a particular sample's unique noise in addition to its signal, and consequently predicts new samples poorly. Splitting the data into training and test sets, or using resampling techniques to estimate error, mitigates the effect of overfitting.

Data Splitting

When $n$ is not large, a strong case can be made for using cross-validation instead of a single train/test split, since cross-validation lets the model learn from more data points. Types of data splitting include:

• A simple random sample of the data
• Stratified random sampling of the data, for balanced outcomes in the test dataset
• Maximum dissimilarity sampling: initialize the test set with a single sample, then repeatedly allocate the unallocated sample most dissimilar to the test set (a dissimilarity measure must be specified). Repeat until the target test set size is reached.
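The splitting schemes above are mostly index bookkeeping. As a minimal sketch of maximum dissimilarity sampling, assuming a candidate's dissimilarity to the test set is summarized by its minimum dissimilarity to any allocated sample (other summaries, such as the average, are also used):

```python
import random

def max_dissimilarity_split(X, test_size, dissimilarity, seed=0):
    """Allocate a test set by maximum dissimilarity sampling.

    X: list of samples; dissimilarity: function (a, b) -> float.
    Returns (test_indices, train_indices).
    """
    rng = random.Random(seed)
    pool = list(range(len(X)))
    # initialize the test set with a single randomly chosen sample
    test = [pool.pop(rng.randrange(len(pool)))]
    while len(test) < test_size:
        # allocate the unallocated sample farthest from the current test set
        best = max(pool, key=lambda i: min(dissimilarity(X[i], X[j]) for j in test))
        pool.remove(best)
        test.append(best)
    return test, pool

# toy 1-D example with absolute difference as the dissimilarity measure
data = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
test_idx, train_idx = max_dissimilarity_split(data, 3, lambda a, b: abs(a - b))
```

Because each new allocation maximizes distance from the test set, the test samples end up spread across the predictor space rather than clustered.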

Resampling Techniques

• K-fold cross-validation
• In practice, $K=5$ or $K=10$ are commonly used.
• A bigger $K$ leads to lower bias and higher variance.
• Generally has high variance compared to other validation methods.
• Leave-one-out cross-validation is a special case where $K = n$.
• Repeated Training/Test Splits
• AKA, “Monte Carlo Cross-Validation” or “Leave-group-out Cross-Validation”
• A rule of thumb is to use around 75–80% of the data for the training splits.
• Use a large number of repetitions (say, 50 to 200+)
• Bootstrap error
• Fit the model on a bootstrap sampling of the data, then predict the out-of-bag data points for an error estimate
• In general, it has less uncertainty than k-fold CV, with bias similar to k-fold CV at $k \approx 2$
• On average, 63.2% of the distinct data points appear in a bootstrap sample, since each point is left out with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$
• An alternate “632 method” estimates error as $0.632 \times \text{bootstrap error rate} + 0.368 \times \text{apparent error rate}$, where the apparent error rate is the error measured on the training data itself
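As an illustration, here is a minimal sketch of the bootstrap out-of-bag error and the 632 combination, using a toy 1-nearest-neighbor classifier as a stand-in for the model being tuned (the classifier and data are assumptions for the example, not from the text):

```python
import random
import statistics

def one_nn_predict(train_x, train_y, x):
    # 1-nearest-neighbor on 1-D features; a stand-in for any classifier
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def boot_632_error(xs, ys, n_boot=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]        # sample with replacement
        oob = [i for i in range(n) if i not in set(idx)]  # out-of-bag points
        if not oob:
            continue
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        errs = [one_nn_predict(bx, by, xs[i]) != ys[i] for i in oob]
        oob_errors.append(sum(errs) / len(errs))
    boot_err = statistics.mean(oob_errors)
    # apparent (resubstitution) error: fit and predict on the full data
    app_err = sum(one_nn_predict(xs, ys, x) != y for x, y in zip(xs, ys)) / n
    return 0.632 * boot_err + 0.368 * app_err

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
est = boot_632_error(xs, ys)
```

A 1-NN classifier has an apparent error of zero by construction (every point is its own nearest neighbor), which is exactly the optimistic bias the 632 weighting is meant to dampen.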

### II. Recommendations

Choosing the final tuning parameters:

1. Choose the model associated with the numerically best performance estimates.
2. Choose the simplest model whose performance is within a single standard error of the numerically best value (one-standard-error method).
3. Choose the simplest model that is within a certain tolerance of the numerically best value (the % decrease in performance from the numerically optimal value $O$ for a candidate value $X$ is $(X - O) / O$).
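The one-standard-error method (option 2) can be sketched as follows. Here `results` pairs each candidate's complexity with its resampled performance scores, assuming candidates are ordered simplest first and higher scores are better; the numbers are made up for illustration:

```python
import statistics

def one_se_choice(results):
    """Pick the simplest model within one SE of the best mean performance.

    results: list of (complexity, resampled_scores), simplest first.
    """
    summaries = [(c, statistics.mean(s), statistics.stdev(s) / len(s) ** 0.5)
                 for c, s in results]
    # mean and standard error of the numerically best model
    _, best_mean, best_se = max(summaries, key=lambda t: t[1])
    threshold = best_mean - best_se
    for complexity, mean, _ in summaries:   # scan from simplest to most complex
        if mean >= threshold:
            return complexity

# made-up resampled accuracies for three candidates of increasing complexity
results = [(1, [0.79, 0.81, 0.80]),
           (2, [0.83, 0.85, 0.84]),
           (3, [0.90, 0.80, 0.85])]
choice = one_se_choice(results)
```

Here the most complex model is numerically best (mean 0.85), but its standard error is large enough that the middle model (mean 0.84) falls within one SE, so the simpler candidate is chosen.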

Data Splitting and Resampling:

• For small datasets, use 10-fold cross-validation.
• No resampling method is uniformly better than another.
• For large sample sizes, differences become less pronounced and computational efficiency becomes more important.
• For choosing between models, consider using one of the bootstrap procedures since they have very low variance.
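Generating the 10-fold splits recommended above is pure index bookkeeping; a minimal sketch (model fitting left out):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(kfold_indices(20))
```

Each observation lands in exactly one test fold, so every data point is predicted exactly once across the k fits.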

Choosing between models:

1. Start with the most flexible models (boosted trees, SVMs) to get a sense of the empirically optimum results.
2. Investigate simpler, more interpretable models (MARS, PLS, GAMs, Naïve Bayes).
3. Consider using the simplest model that reasonably approximates the performance of the more complex models.
• Note: A paired t-test can be used to evaluate the hypothesis that the models have equivalent average accuracies.
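The paired t-test note can be sketched directly. The statistic below would be compared against a t distribution with $n - 1$ degrees of freedom (the p-value lookup is omitted, since the standard library lacks the t CDF); the accuracies are made-up illustration values, and the pairing assumes both models were evaluated on the same resamples:

```python
import math
import statistics

def paired_t(acc_a, acc_b):
    """Paired t statistic for two models' resampled accuracies.

    acc_a, acc_b: per-resample accuracies, paired by resample
    (i.e., both models evaluated on the same folds/splits).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    # mean difference divided by its standard error
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# hypothetical per-resample accuracies for models A and B on the same folds
t = paired_t([0.80, 0.82, 0.84, 0.86], [0.78, 0.80, 0.81, 0.83])
```

A large positive statistic favors model A; note that resampled accuracies share training data across resamples, so the test's independence assumption holds only approximately.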

### III. Applications in R

1. Source: Kuhn and Johnson, *Applied Predictive Modeling* (2013), pg. 66