This series of posts is a concise overview of important modelling concepts, intended for someone who has learned these concepts before but would like a refresher on the most important bits (e.g. myself).
Statistical learning involves building models to understand data.
Why estimate the function, f, that connects the input and output?
- Prediction: estimate the output with a black-box function to minimize the reducible error
- Inference: which predictors are associated with the response?
- Inference: what is the relationship between the response and each predictor?
- Inference: can the relationship between Y and each predictor be adequately summarized using a linear equation?
How do we estimate f?
- Parametric Methods make an assumption about functional form (e.g. linear), then use training data to fit or train the model.
- Pro: Simplifies the problem down to estimating a set of parameters, and results are easily interpretable
- Con: The chosen model will likely not match the unknown form of f
- Non-parametric Methods do not make explicit assumptions about the functional form of f.
- Pro: Potential to accurately fit a wider range of possible shapes
- Con: Needs a very large number of observations to obtain an accurate estimate; a brief sketch contrasting the two approaches follows this list.
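For concreteness, here is a minimal sketch contrasting the two approaches, using scikit-learn and synthetic data (the data-generating function, sample size, and neighbor count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # nonlinear true f

# Parametric: assumes f is linear, so only an intercept and slope to estimate.
linear = LinearRegression().fit(X, y)

# Non-parametric: no functional form assumed; needs enough data per neighborhood.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

x_new = np.array([[5.0]])
print(linear.predict(x_new), knn.predict(x_new))  # KNN tracks the sine curve far better
```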
What are the two types of statistical learning?
- Supervised learning has predictor measurements and associated response measurements.
- Unsupervised learning has observed measurements but no associated response; the goal is often to understand relationships between variables or between observations.
- Note: semi-supervised learning, where only a limited number of response observations are available, is outside the scope of this book.
How do we assess model accuracy?
- Bias-Variance Tradeoff
- The goal is to develop a model that balances inflexible methods (large bias, small variance) against flexible methods (small bias, large variance); the expected test error decomposes as $E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$.
- Regression: Mean squared error, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$
- Classification: Error rate, $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$, where $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if the prediction is incorrect and 0 if correct.
- The Bayes classifier on average minimizes the test error rate by assigning each observation to the most likely class given its predictor values, i.e. the class $j$ that maximizes $\Pr(Y = j \mid X = x_0)$.
- The Bayes decision boundary is the separating boundary between classes (note: K-nearest neighbors often gets very close to the optimal Bayes classifier).
- The Bayes error rate is the lowest possible test error rate: $1 - E\left[\max_j \Pr(Y = j \mid X)\right]$ (a small simulation follows).
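The error rate and the Bayes classifier can be made concrete with a small simulation in which the class-conditional distributions are known, so the Bayes rule can be written down exactly (the priors, means, and variances below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Two equally likely classes with Gaussian class-conditional densities at +/-1.
y = rng.binomial(1, 0.5, size=n)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Bayes classifier: assign class 1 when Pr(Y=1 | X=x) > 0.5, which for these
# symmetric Gaussians reduces to the midpoint rule x > 0.
y_hat = (x > 0).astype(int)

# Error rate: the average of the indicator I(y_i != y_hat_i).
error_rate = np.mean(y != y_hat)
print(f"estimated Bayes error rate: {error_rate:.3f}")  # ~0.159, i.e. Phi(-1)
```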
Details are covered in the linked posts.
Regression: Predicting or explaining a continuous (quantitative) output (a brief example appears after the list)
- Simple Linear Regression
- Multiple Linear Regression
- K-Nearest Neighbors Regression
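A quick sketch of the three regression methods on one synthetic dataset (the coefficients, noise level, and neighbor count are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))  # three predictors
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

simple = LinearRegression().fit(X[:, [0]], y)      # simple: one predictor
multiple = LinearRegression().fit(X, y)            # multiple: all predictors
knn = KNeighborsRegressor(n_neighbors=15).fit(X, y)

print(multiple.intercept_, multiple.coef_)  # near 2.0 and [1.5, 0.0, -0.5]
```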
Classification: Predicting or explaining a categorical (qualitative) output (illustrated briefly after the list)
- K-Nearest Neighbors
- Logistic Regression
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
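And the four classifiers above, fit to one synthetic dataset for comparison (the dataset and the use of training accuracy are illustrative shortcuts, not a proper evaluation):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=3)

for model in (KNeighborsClassifier(n_neighbors=10),
              LogisticRegression(max_iter=1000),
              LinearDiscriminantAnalysis(),
              QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # training accuracy
```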
Resampling Methods: Repeatedly drawing samples from the training data to refit a model, most often to estimate test error or assess model variability (illustrated below with cross-validation)
- Cross validation
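A minimal sketch of k-fold cross-validation, assuming a linear model and five folds (both arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=200)

# 5-fold CV: fit on four folds, score on the held-out fold, then average.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print("CV estimate of test MSE:", -scores.mean())
```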
Linear Model Selection and Regularization: Subset selection, shrinkage, and dimension reduction techniques (the shrinkage methods are sketched after the list)
- Linear Regression with forward, backward, and best subset selection
- L2 Regularization (Ridge Regression)
- L1 Regularization (Lasso)
- Principal Components Regression
- Partial Least Squares
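A small sketch of the two shrinkage methods; the penalty strengths (alpha) are arbitrary illustrative values, not tuned choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)  # only 2 true signals

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: can set coefficients exactly to zero

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))  # most entries should be exactly 0
```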
Moving Beyond Linearity: Removing the linearity assumption (polynomial regression is sketched after the list)
- Polynomial Regression and Step Functions
- Regression Splines
- Smoothing Splines
- Local Regression
- Generalized Additive Models
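Polynomial regression, the simplest of these extensions, in a minimal sketch (degree 3 and the cubic true function are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=1.0, size=200)

# Still a linear model, but in the transformed features x, x^2, x^3.
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
print(poly.predict(np.array([[1.5]])))  # near 1.5^3 - 2*1.5 = 0.375
```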
Tree-based Methods: Stratifying or segmenting the predictor space into regions (a brief example follows the list)
- Regression and Classification Trees
- Random Forests
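A brief sketch of a single regression tree versus a random forest (tree depth, forest size, and the step-function data are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 2))
y = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(scale=1.0, size=300)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # axis-aligned splits
forest = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, y)

print(tree.predict([[7.0, 2.0]]), forest.predict([[7.0, 2.0]]))  # both near 10
```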
Support Vector Machines: Separating classes with hyperplanes, possibly in an enlarged feature space (a brief example follows the list)
- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machines (linear, polynomial, and radial kernel)
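A minimal sketch of support vector machines with the three kernels named above (C, the noise level, and the moons dataset are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=8)

for kernel in ("linear", "poly", "rbf"):  # rbf is the radial kernel
    svm = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, svm.score(X, y))  # the nonlinear kernels should fit better
```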
Unsupervised Learning: Finding subgroups among variables, or grouping individuals according to observed characteristics (a brief example follows the list)
- Principal Components Analysis
- K-means Clustering
- Hierarchical Clustering
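Finally, a short sketch of the three unsupervised methods on one synthetic dataset (the component and cluster counts are arbitrary, chosen to match the three generated blobs):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=9)

pca = PCA(n_components=2).fit(X)  # directions of maximal variance
print(pca.explained_variance_ratio_)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=9).fit(X)  # partitioning
hier = AgglomerativeClustering(n_clusters=3).fit(X)              # bottom-up merging
print(kmeans.labels_[:10], hier.labels_[:10])
```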