Note: Because I covered similar material in my Resampling Methods post, I won’t be going into great detail about the techniques described here.
I. Model Tuning
Behold! A nice picture of the parameter tuning process.^{1}
Overfitting occurs when a model learns the unique noise of a particular sample, and it will therefore predict poorly on new samples. Splitting the data into training and test sets, or using resampling techniques to measure error, mitigates overfitting.
Data Splitting
When the number of samples is not large, a strong case can be made for using cross-validation instead of splitting the data, since it allows the model to learn from more data points. Types of data splitting include:
- A simple random sample of the data
- Stratified random sampling of the data, for balanced outcomes in the test dataset
- Maximum dissimilarity sampling: initialize the test set with a single sample, then repeatedly allocate the most dissimilar unallocated sample (a dissimilarity measure must be specified) until the target test set size is reached
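The maximum dissimilarity procedure can be sketched in a few lines. This is a hypothetical illustration using 1-D data with absolute difference as the dissimilarity measure; any distance function could be swapped in.

```python
import random

def max_dissimilarity_split(points, test_size, seed=0):
    """Greedy maximum-dissimilarity sampling (illustrative sketch).

    Starts the test set with one random sample, then repeatedly moves
    the remaining point whose minimum distance to the test set is
    largest. Absolute difference on 1-D data is the dissimilarity here.
    """
    rng = random.Random(seed)
    remaining = list(points)
    test = [remaining.pop(rng.randrange(len(remaining)))]
    while len(test) < test_size:
        # pick the unallocated point farthest from its nearest test point
        idx = max(range(len(remaining)),
                  key=lambda i: min(abs(remaining[i] - t) for t in test))
        test.append(remaining.pop(idx))
    return test, remaining  # (test set, training set)

test, train = max_dissimilarity_split([1.0, 2.0, 3.0, 10.0, 11.0, 50.0], 3)
```

Note how the outlying point (50.0) is pulled into the test set early, which is exactly the behavior that makes this sampling scheme produce diverse test sets.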
Resampling Techniques
- K-fold cross-validation
    - In practice, k = 5 or k = 10 are commonly used.
    - A bigger k leads to lower bias and higher variance.
    - Generally has high variance compared to other validation methods.
    - Leave-one-out cross-validation is the special case where k = n.
- Repeated training/test splits
    - AKA "Monte Carlo cross-validation" or "leave-group-out cross-validation."
    - A rule of thumb is to use around 75-80% of the data for the training split.
    - Use a large number of repetitions (say, 50 to 200+).
- Bootstrap error
    - Fit the model on a bootstrap sample of the data, then predict the out-of-bag data points for an error estimate.
    - In general, has less uncertainty than k-fold CV, with bias similar to k-fold CV at k ≈ 2.
    - On average, 63.2% of the data points are represented at least once in the bootstrap sample.
    - An alternate ".632 method" estimates the error as (0.632 × simple bootstrap estimate) + (0.368 × apparent error rate).
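The 63.2% figure can be checked empirically. The simulation below (a plain-Python sketch, not tied to any library) draws bootstrap samples and measures what fraction of the distinct original points appear in each.

```python
import random
import statistics

def bootstrap_coverage(n, reps, seed=0):
    """Average fraction of distinct points appearing in a bootstrap
    sample of size n drawn with replacement from n points.

    Illustrates that roughly 63.2% (i.e. 1 - 1/e) of the data is
    in-bag on average, leaving ~36.8% out-of-bag for error estimation.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(reps):
        in_bag = {rng.randrange(n) for _ in range(n)}  # sampled indices
        fractions.append(len(in_bag) / n)
    return statistics.mean(fractions)

cov = bootstrap_coverage(n=1000, reps=200)
# cov should land close to 1 - exp(-1) ≈ 0.632
```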
II. Recommendations
Choosing the final tuning parameters:
- Choose the model associated with the numerically best performance estimate.
- Choose the simplest model whose performance is within a single standard error of the numerically best value (the one-standard-error method).
- Choose the simplest model that is within a certain tolerance of the numerically best value: if O is the numerically optimal value and X is a candidate's performance, the percent decrease in performance is (X − O)/O × 100.
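The tolerance rule above can be sketched as follows. The complexity values and accuracies are hypothetical, and a higher score is assumed to be better.

```python
def pick_within_tolerance(results, tolerance_pct):
    """Pick the simplest model within a tolerance of the best score.

    `results` maps a complexity value (lower = simpler) to a
    performance score where higher is better. A candidate X is kept if
    its percent decrease from the optimum O, (X - O) / O * 100, is
    within `tolerance_pct` in magnitude.
    """
    best = max(results.values())
    for complexity in sorted(results):  # simplest candidates first
        x = results[complexity]
        pct_decrease = (x - best) / best * 100  # zero or negative
        if abs(pct_decrease) <= tolerance_pct:
            return complexity

# hypothetical tuning grid: SVM cost vs. cross-validated accuracy
acc = {0.25: 0.870, 0.5: 0.895, 1.0: 0.910, 2.0: 0.912, 4.0: 0.913}
choice = pick_within_tolerance(acc, tolerance_pct=1.0)
# a 1% tolerance skips cost 0.25 and 0.5, settling on cost 1.0
```

Loosening the tolerance trades a little performance for a simpler model, which is the whole point of the rule.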
Data Splitting and Resampling:
- For small datasets, use 10-fold cross-validation.
- No resampling method is uniformly better than another.
- For large sample sizes, differences between methods become less pronounced and computational efficiency becomes more important.
- For choosing between models, consider using one of the bootstrap procedures, since they have very low variance.
Choosing between models:
- Start with the most flexible models (boosted trees, SVMs) to get a sense of the empirically optimal results.
- Then investigate simpler, more interpretable models (MARS, PLS, GAMs, naïve Bayes).
- Consider using the simplest model that reasonably approximates the performance of the more complex models.
- Note: a paired t-test can be used to evaluate the hypothesis that two models have equivalent average accuracies.
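The paired t-test can be computed directly from per-resample accuracies, pairing the two models on the same folds. The accuracy vectors below are hypothetical, and 2.262 is the two-sided 5% critical value of the t-distribution with 9 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """Paired t-statistic for per-resample accuracy differences.

    Each list holds one accuracy per resample; because both models are
    evaluated on the same folds, the differences are paired.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # std error of mean diff
    return statistics.mean(diffs) / se

# hypothetical 10-fold CV accuracies for two competing models
model_a = [0.81, 0.83, 0.80, 0.85, 0.82, 0.84, 0.79, 0.83, 0.82, 0.81]
model_b = [0.78, 0.80, 0.79, 0.81, 0.80, 0.82, 0.77, 0.80, 0.79, 0.78]
t = paired_t_statistic(model_a, model_b)
# |t| > 2.262 (two-sided 5% cutoff, 9 df) suggests a real difference
```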
III. Applications in R

1. Source: Kuhn and Johnson, *Applied Predictive Modeling* (2013), pg. 66