I. Introduction to Modeling
The foundation of an effective predictive model is intuition and knowledge of the problem context, relevant data, and a computational toolbox of techniques for data pre-processing, visualization, and modeling. There is often a tradeoff between predictive accuracy and interpretability: more flexible models tend to predict better but are harder to explain.
Why predictive models fail:
- Inadequate pre-processing of data
- Inadequate model validation
- Unjustified extrapolation
- Overfitting the model to the existing data
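A minimal sketch of why overfitting and inadequate validation go together, using only the Python standard library. The data (a noisy line), the 70/30 split, and the helper names `one_nn_predict` and `mse` are all assumptions made up for illustration. A 1-nearest-neighbor model memorizes its training set, so its training error says nothing about how it will do on new data.

```python
import random

def one_nn_predict(train, x):
    """Predict y for x by copying the y of the closest training x (1-nearest neighbor)."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y

def mse(train, data):
    """Mean squared error of the 1-NN model fitted on `train`, evaluated on `data`."""
    return sum((one_nn_predict(train, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

# Hypothetical data: y = 2x plus Gaussian noise, with distinct x values.
points = [(i / 10, 2 * (i / 10) + rng.gauss(0, 1)) for i in range(100)]
rng.shuffle(points)
train, holdout = points[:70], points[70:]

train_mse = mse(train, train)      # each training point is its own nearest neighbor
holdout_mse = mse(train, holdout)  # unseen points expose the memorized noise

print(f"training MSE: {train_mse:.3f}")
print(f"holdout MSE:  {holdout_mse:.3f}")
```

The training error is exactly zero because every training point is its own nearest neighbor, while the holdout error is not. Judging the model by its training error alone would hide the problem, which is why performance should be estimated on data the model has not seen.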
It is important to understand the characteristics of the data set in order to properly construct a model:
- Recognizing the distribution of the response variable is necessary for splitting data into training and testing sets
  - For continuous response data, is the distribution of the response symmetric or skewed?
  - For categorical response data, is the distribution balanced or unbalanced?
- Recognizing the characteristics of the predictors is necessary for data pre-processing and model selection
  - Are there missing values?
  - Are they measured on different scales?
  - Are they highly correlated with other predictors?
  - Are the predictors sparse (only a few contain unique information)?
  - Are their distributions symmetric or skewed (continuous), or balanced or unbalanced (categorical)?
  - Do they have an underlying relationship with the response at all?
- The relationship between the number of samples and the number of predictors is important for model selection
  - Take computational time into account
  - Dimension reduction techniques can be useful
  - Techniques such as recursive partitioning and K-nearest neighbors can be used
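Several of the checks above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a prescribed workflow: the helper names (`skewness`, `pearson`, `missing_counts`, `stratified_split`) and all sample data are hypothetical, and a real project would more likely use a data-analysis library.

```python
import math
import random
from collections import Counter, defaultdict

def skewness(values):
    """Third standardized moment: about 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def missing_counts(rows):
    """Count None entries per column in a list of dict rows."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return dict(counts)

def stratified_split(labels, test_fraction, seed=0):
    """Split indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical examples of each check.
skewed_response = [1, 1, 2, 2, 3, 20]          # long right tail
height_cm = [150, 160, 170, 180, 190]
height_in = [h / 2.54 for h in height_cm]      # same information on another scale
rows = [{"age": 30, "income": None}, {"age": None, "income": 50}, {"age": 41, "income": None}]
labels = ["yes"] * 20 + ["no"] * 80            # unbalanced classes

train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
test_balance = Counter(labels[i] for i in test_idx)

print("response skewness:", skewness(skewed_response))
print("height correlation:", pearson(height_cm, height_in))
print("missing values:", missing_counts(rows))
print("test-set class counts:", dict(test_balance))
```

Splitting within each class keeps the 20/80 balance in both sets, which a purely random split does not guarantee for unbalanced responses. The near-perfect correlation between the two height columns is the kind of redundancy that motivates removing predictors or applying dimension reduction.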
II. Overview of Topics
- Data Pre-processing
- Overfitting and Model Tuning
- Linear Regression
- Partial Least Squares
- L1 regularization (Lasso)
- Neural Networks
- Multivariate adaptive regression splines (MARS)
- Support vector machines (SVMs)
- Tree-based models
  - Regression trees
  - Bagged trees
  - Random forests
- Discriminant analysis (linear, quadratic, regularized, partial least squares)
- Penalized methods for classification
- Flexible discriminant analysis
- Neural networks
- Naive Bayes
- Nearest Shrunken Centroids
- Tree-based models
- Measuring predictor importance
- Feature selection techniques
- Factors that can affect model performance