Note: Because I covered similar material in my Resampling Methods post, I won’t be going into great detail about the techniques described here.
I. Model Tuning
Behold! A nice picture of the parameter tuning process.1
Overfitting occurs when a model has learned the characteristics of a particular sample’s unique noise and will therefore predict new samples poorly. Splitting the data into training and test sets, or using resampling techniques to estimate error, mitigates the effect of overfitting.
When $n$ is not large, a strong case can be made for using cross-validation instead of splitting the data, since it allows the model to learn from more data points. Types of data splitting and resampling include (a caret-based sketch follows this list):
- A simple random sample of the data
- Stratified random sampling of the data, for balanced outcomes in the test dataset
- Maximum dissimilarity sampling: initialize the test set with a single sample, and allocate the most dissimilar unallocated sample (must specify a dissimilarity measure). Repeat until the target test set size is achieved.
- K-fold cross-validation
- In practice, $k = 5$ or $k = 10$ is commonly used.
- A bigger $k$ leads to lower bias and higher variance.
- Generally has high variance compared to other validation methods.
- Leave-one-out cross-validation is a special case where $k = n$.
- Repeated Training/Test Splits
- AKA, “Monte Carlo Cross-Validation” or “Leave-group-out Cross-Validation”
- A rule of thumb is to use around 75-80% of the data for the training split.
- Use a large number of repetitions (say, 50 to 200+)
- Bootstrap error
- Fit the model on a bootstrap sampling of the data, then predict the out-of-bag data points for an error estimate
- In general, has less uncertainty than k-fold CV, with bias similar to k-fold CV with $k \approx 2$.
- On average, 63.2% of the data points are represented at least once in the bootstrap sample.
- An alternative “632 method” estimates the error as $(0.632 \times \text{simple bootstrap estimate}) + (0.368 \times \text{apparent error rate})$.
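
As a rough sketch (not code from the book), here is how these splitting and resampling options might look with the caret package; the iris data, the 80/20 split, the 30-point dissimilarity-based test set, and the repetition counts are all placeholder choices.

```r
library(caret)

set.seed(42)
df <- iris  # stand-in dataset; substitute your own predictors and outcome

## Simple random sample: hold out 20% of the rows at random
train_rows <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))

## Stratified random sample on the outcome, so class frequencies are
## preserved in both the training and test sets
strat_rows <- createDataPartition(df$Species, p = 0.8, list = FALSE)

## Maximum dissimilarity sampling: seed the test set with one random point,
## then repeatedly add the candidate most dissimilar to the points chosen so far
seed_row  <- sample(seq_len(nrow(df)), 1)
pool      <- setdiff(seq_len(nrow(df)), seed_row)
picked    <- maxDissim(df[seed_row, 1:4, drop = FALSE], df[pool, 1:4], n = 29)
test_rows <- c(seed_row, pool[picked])

## Resampling schemes are declared via trainControl() and passed to train()
ctrl_cv    <- trainControl(method = "cv", number = 10)                 # 10-fold CV
ctrl_loocv <- trainControl(method = "LOOCV")                           # leave-one-out (k = n)
ctrl_lgocv <- trainControl(method = "LGOCV", p = 0.80, number = 100)   # repeated training/test splits
ctrl_boot  <- trainControl(method = "boot632", number = 50)            # bootstrap with the 632 correction
```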
Choosing the final tuning parameters:
- Choose the model associated with the numerically best performance estimates.
- Choose the simplest model whose performance is within a single standard error of the numerically best value (one-standard-error method).
- Choose the simplest model that is within a certain tolerance of the numerically best value (the percent decrease in performance from the numerically optimal value $O$, for a candidate performance value $X$, can be calculated as $(X - O)/O \times 100$). Both selection rules are sketched below.
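
As a minimal sketch of those two selection rules, caret exposes them through the selectionFunction argument of trainControl(); the k-nearest-neighbors model, 10-fold CV, and 3% tolerance below are arbitrary illustrative choices.

```r
library(caret)

## "best" (the default) picks the numeric optimum; "oneSE" keeps the simplest
## candidate within one standard error of it; "tolerance" keeps the simplest
## candidate within a given percent of it.
ctrl_oneSE <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")

set.seed(42)
knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 tuneLength = 10,       # candidate numbers of neighbors to try
                 trControl = ctrl_oneSE)

## The tolerance rule applied by hand: row index of the simplest candidate
## whose resampled accuracy is within 3% of the numerically best value
chosen <- tolerance(knn_fit$results, metric = "Accuracy", tol = 3, maximize = TRUE)
knn_fit$results[chosen, ]
```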
Data Splitting and Resampling:
- For small datasets, use 10-fold cross-validation.
- No resampling method is uniformly better than another.
- For large sample sizes, differences become less pronounced and computational efficiency becomes more important.
- For choosing between models, consider using one of the bootstrap procedures since they have very low variance.
Choosing between models:
- Start with the most flexible models (boosted trees, SVMs) to get a sense of the empirically optimal results.
- Investigate simpler, more interpretable models (MARS, PLS, GAMs, Naïve Bayes).
- Consider using the simplest model that reasonably approximates the performance of the more complex models.
- Note: A paired t-test can be used to evaluate the hypothesis that the models have equivalent average accuracies (see the sketch below).
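
As a hedged sketch of that comparison, caret's resamples() and diff() pair the fold-by-fold performance values of models fit on identical resamples and run paired t-tests on the differences; the radial-kernel SVM and kNN fits below are stand-ins for whatever models are being compared.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)

## Set the same seed before each fit so both models see identical CV folds,
## which is what makes the comparison a paired one
set.seed(42)
svm_fit <- train(Species ~ ., data = iris, method = "svmRadial",
                 tuneLength = 5, trControl = ctrl)

set.seed(42)
knn_fit <- train(Species ~ ., data = iris, method = "knn",
                 tuneLength = 5, trControl = ctrl)

## Collect the resampled accuracies and test whether their mean difference is zero
resamps <- resamples(list(SVM = svm_fit, kNN = knn_fit))
summary(diff(resamps))  # paired t-tests on the fold-by-fold differences
```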
III. Applications in R
1. Source: Kuhn and Johnson, *Applied Predictive Modeling* (2013), pg. 66