I. K-Nearest Neighbors
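Given a positive integer $K$ and a test observation $x_0$, the K-nearest neighbors (KNN) classifier identifies the $K$ points in the training data closest to $x_0$, represented by $\mathcal{N}_0$, and then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$$
KNN has high bias and low variance when $K$ is large, and low bias and high variance when $K$ is small. While the training error rate generally decreases as $K$ decreases, the test error takes on a characteristic U-shape: it falls at first as the fit becomes more flexible, then rises again once the classifier starts to overfit.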
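The estimate above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `knn_class_probabilities`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest
    training points whose label equals j."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x0 to every training point (assumed metric).
    distances = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    # Indices of the k closest points: the neighborhood N_0.
    neighbors = np.argsort(distances, kind="stable")[:k]
    classes = np.unique(y_train)
    return {c.item(): float(np.mean(y_train[neighbors] == c)) for c in classes}

# Made-up training data: three points near the origin, two far away.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 0, 1, 1]

probs = knn_class_probabilities(X_train, y_train, x0=[0.5, 0.5], k=3)
print(probs)
```

With $K$ equal to the full training set size, the estimate collapses to the overall class proportions, which is the high-bias, low-variance extreme described above.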
II. Logistic Regression
Logistic Regression with One Predictor
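Logistic regression models the probability that $Y$ belongs to a particular category, $p(X) = \Pr(Y = 1 \mid X)$, using the logistic function:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
We can rewrite this equation to represent the odds, $\frac{p(X)}{1 - p(X)}$, and the log-odds (logit) of $p(X)$:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}, \qquad \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
 Note that because the logistic function takes an S-shape, the effect of a one-unit increase in $X$ on $p(X)$ depends on the current value of $X$, even though the same increase always changes the log-odds by exactly $\beta_1$.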
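A small numerical check of the note above, using hypothetical coefficients (the values of `beta0` and `beta1` are made up for illustration): the same one-unit increase in $X$ shifts the log-odds by $\beta_1$ everywhere, but shifts the probability by different amounts depending on where $X$ starts.

```python
import math

def logistic_probability(x, beta0, beta1):
    """p(X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_odds(p):
    """Logit of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

beta0, beta1 = -3.0, 1.0  # hypothetical coefficients, for illustration only

changes = {}
for x in (0.0, 3.0):
    p_before = logistic_probability(x, beta0, beta1)
    p_after = logistic_probability(x + 1.0, beta0, beta1)
    changes[x] = (p_after - p_before, log_odds(p_after) - log_odds(p_before))
    print(f"x={x}: change in p = {changes[x][0]:.3f}, "
          f"change in log-odds = {changes[x][1]:.3f}")
```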
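We estimate the coefficients $\beta_0$ and $\beta_1$ using estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that maximize the likelihood that the logistic model produced the observed data; in other words, plugging the estimates into the logistic model should result in a number close to 1 for all observations with $y_i = 1$, and close to 0 for all observations with $y_i = 0$. This is accomplished by maximizing the likelihood function, $\ell(\beta_0, \beta_1)$:
$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right)$$
We can conduct a hypothesis test of whether the probability of $Y$ depends on $X$, with null hypothesis $H_0: \beta_1 = 0$, by using a z-statistic:
$$z = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$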
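As a rough sketch of how the estimates and the z-statistic could be computed by hand, the following maximizes the log-likelihood with Newton-Raphson and takes standard errors from the inverse of the information matrix. Newton-Raphson is one common way to do the maximization, assumed here for illustration; the function name `fit_logistic` and the data are made up.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Maximize the logistic log-likelihood by Newton-Raphson.

    Returns (beta_hat, standard_errors) for the intercept and slope.
    """
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # fitted probabilities
        score = X.T @ (y - p)                             # gradient of the log-likelihood
        information = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(information, score)
    # Standard errors from the inverse information at the final estimate.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    information = X.T @ (X * (p * (1 - p))[:, None])
    standard_errors = np.sqrt(np.diag(np.linalg.inv(information)))
    return beta, standard_errors

# Made-up, non-separable data (hypothetical, for illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat, se = fit_logistic(x, y)
z = beta_hat[1] / se[1]
print(f"beta1_hat = {beta_hat[1]:.3f}, SE = {se[1]:.3f}, z = {z:.3f}")
```

A large absolute value of `z` is evidence against $H_0: \beta_1 = 0$. Note that the maximum likelihood estimate does not exist when the classes are perfectly separated, so this sketch assumes overlapping data.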
Multiple Logistic Regression
We can extend the basic logistic regression model to include multiple predictors:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
Be careful of confounding variables! The results obtained using one predictor may be different from those obtained with multiple predictors, especially if predictors are correlated with each other.
 Note: Logistic regression is mostly used when the response variable has two classes. It is possible to use logistic regression for a response variable with more than two classes; however, linear discriminant analysis is much more often used in that scenario.
III. Linear Discriminant Analysis
Linear discriminant analysis (LDA) focuses on maximizing the separability among categories by modeling the distribution of the predictors $X$ separately in each response class, $\Pr(X = x \mid Y = k)$, then using Bayes’ theorem to flip these into estimates of $\Pr(Y = k \mid X = x)$.
Motivation
Why use LDA instead of logistic regression?
 When the classes are well-separated, the parameter estimates for LDA are more stable than for logistic regression.
 If $n$ is small and the distribution of predictors is approximately normal in each class, then LDA is more stable than logistic regression.
Let us define the following quantities:
 $\pi_k$ is the prior probability that a random observation comes from the $k$th class
 $f_k(x) = \Pr(X = x \mid Y = k)$ is the density function of $X$ for an observation from the $k$th class
 $p_k(x) = \Pr(Y = k \mid X = x)$ is the posterior probability that an observation $X = x$ belongs to the $k$th class
Then Bayes’ theorem states that:
$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
The goal is to estimate $f_k(x)$ to develop a classifier that approximates the Bayes classifier.
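To make the role of the prior and the density concrete, here is a minimal sketch of the posterior calculation. The function name `posterior_probabilities` and the numbers are hypothetical, chosen only to show that a class with a smaller prior can still win when its density at $x$ is high enough.

```python
import numpy as np

def posterior_probabilities(priors, densities):
    """Bayes' theorem: p_k(x) = pi_k * f_k(x) / sum_l pi_l * f_l(x)."""
    priors = np.asarray(priors, dtype=float)
    densities = np.asarray(densities, dtype=float)
    joint = priors * densities
    return joint / joint.sum()

# Hypothetical values: class 0 is more common, but x is more typical of class 1.
posterior = posterior_probabilities(priors=[0.7, 0.3], densities=[0.1, 0.4])
print(posterior)
```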
LDA for one predictor
Assumptions:
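 $f_k(x)$ is normal (Gaussian), with density $f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$
 The variance is shared across all $K$ classes: $\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2$
We assign each observation to the class that maximizes the discriminant function, which we obtain by plugging the density $f_k(x)$ into Bayes’ theorem, taking logs, and simplifying:
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
The parameters $\mu_k$, $\sigma^2$, and $\pi_k$ are estimated as follows, where $n$ is the total number of training observations and $n_k$ is the number in the $k$th class:
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2, \qquad \hat{\pi}_k = \frac{n_k}{n}$$
Thus, the LDA classifier plugs these estimates, together with the observation value $x$, into the discriminant function and assigns the observation to the class for which $\hat{\delta}_k(x)$ is largest.
 Note that for the two-class scenario with $\pi_1 = \pi_2$, the Bayes decision boundary corresponds to the point where $x = \frac{\mu_1 + \mu_2}{2}$.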
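The one-predictor procedure can be sketched as follows: estimate the class means, the pooled variance, and the priors, then evaluate the discriminant for each class. The function names and the toy data are assumptions for illustration only.

```python
import numpy as np

def fit_lda_1d(x, y):
    """Estimate class means, pooled variance, and priors for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, K = len(x), len(classes)
    means = np.array([x[y == c].mean() for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    # Pooled within-class sum of squares, divided by n - K.
    within_ss = sum(((x[y == c] - m) ** 2).sum() for c, m in zip(classes, means))
    variance = within_ss / (n - K)
    return classes, means, variance, priors

def lda_discriminants(x0, means, variance, priors):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x0 * means / variance - means ** 2 / (2.0 * variance) + np.log(priors)

# Made-up data: two classes with equal priors, centered at 2 and 6.
x = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
y = [0, 0, 0, 1, 1, 1]

classes, means, variance, priors = fit_lda_1d(x, y)
for x0 in (3.0, 5.0):
    scores = lda_discriminants(x0, means, variance, priors)
    print(x0, "->", classes[np.argmax(scores)])
```

Because the priors are equal in this toy data, the two discriminants tie at the midpoint of the estimated means, matching the note on the two-class decision boundary.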
LDA for multiple predictors
Assumption:
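 $X = (X_1, \ldots, X_p)$ is drawn from a multivariate Gaussian distribution with a class-specific mean vector and common covariance matrix, $X \sim N(\mu_k, \Sigma)$, with density:
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$
If observations in the $k$th class are drawn from $N(\mu_k, \Sigma)$, then we can plug in the density function to find that the Bayes classifier assigns an observation $X = x$ to the class that maximizes the discriminant function:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
The unknown parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\Sigma$ are estimated with formulas similar to those in the one-dimensional case. To assign a new observation $X = x$, LDA plugs the parameter estimates into the discriminant function and classifies it to the class for which $\hat{\delta}_k(x)$ is largest.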
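A minimal multivariate sketch, assuming the pooled covariance estimate divides by $n - K$ as in the one-dimensional case. The names `fit_lda` and `lda_predict` and the toy data are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, the pooled covariance matrix, and priors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    pooled = np.zeros((p, p))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        pooled += centered.T @ centered
    covariance = pooled / (n - len(classes))
    return classes, means, covariance, priors

def lda_predict(x0, classes, means, covariance, priors):
    """Return the class with the largest linear discriminant, plus all scores."""
    x0 = np.asarray(x0, dtype=float)
    # Columns of this (p, K) matrix are Sigma^{-1} mu_k.
    precision_means = np.linalg.solve(covariance, means.T)
    linear = x0 @ precision_means                                # x^T Sigma^{-1} mu_k
    quadratic = 0.5 * np.sum(means.T * precision_means, axis=0)  # 0.5 mu_k^T Sigma^{-1} mu_k
    scores = linear - quadratic + np.log(priors)
    return classes[np.argmax(scores)], scores

# Made-up data: two well-separated square clusters.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 4], [5, 4], [4, 5], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

params = fit_lda(X, y)
label, scores = lda_predict([4.0, 3.0], *params)
print(label, scores)
```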
Considerations when using LDA
 The higher the ratio of parameters $p$ to sample size $n$, the more likely LDA will overfit the data.
 LDA can have low sensitivity (power), because it approximates the Bayes classifier, which minimizes the total error rate by assigning an observation to class $k$ if $\Pr(Y = k \mid X = x) > 0.5$ (in the two-class case), regardless of which class the errors fall in.
A confusion matrix is a table that displays predicted vs. actual classes, and it can be used to calculate the following important measures for classification and diagnostic testing, where TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives:
 Sensitivity (aka power) = $\frac{TP}{TP + FN}$
 Specificity = $\frac{TN}{TN + FP}$
 Type I Error is the false positive rate, also 1 - Specificity
 Type II Error is the false negative rate, also 1 - Sensitivity
 Note: check out Wikipedia for a nice visualization of a 2x2 confusion matrix and all of its related statistics.
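The measures above follow directly from the four cells of a 2x2 confusion matrix. The sketch below computes them from made-up labels; the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, tn, fn

def classification_rates(tp, fp, tn, fn):
    """Sensitivity, specificity, and the two error rates derived from them."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "type_I_error": 1 - specificity,   # false positive rate
        "type_II_error": 1 - sensitivity,  # false negative rate
    }

# Made-up labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
rates = classification_rates(*counts)
print(counts, rates)
```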
We can address the low-sensitivity problem by setting a new threshold for assigning an observation to a certain class, based on domain knowledge. Setting a new threshold can be visualized with an ROC (receiver operating characteristic) curve. An example is shown below.^{1}
The ROC curve graphs the false positive rate (Type I error) on the x-axis against the true positive rate (power/sensitivity) on the y-axis. Ideally, it should hug the top left corner of the plot. Increasing the threshold will move results to the lower left along the ROC curve, and decreasing the threshold will move results to the upper right.
AUC is the area under the ROC curve, and it summarizes how well the model distinguishes between classes. The higher the AUC value, the better the model, with a maximum value of 1.
 High AUC (close to 1): model classifies observations well
 Close to 0.5: model has no separation ability (no better than chance)
 Low AUC (close to 0): model classifies the opposite way
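To see how the threshold traces out the curve, the sketch below sweeps the threshold from high to low over some made-up predicted probabilities, records the false and true positive rates at each step, and integrates with the trapezoid rule. The function names and the scores are hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """False/true positive rates as the threshold sweeps from high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Start above every score so the first point is (0, 0).
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    fpr, tpr = [], []
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append(np.sum(predicted_pos & (labels == 1)) / n_pos)
        fpr.append(np.sum(predicted_pos & (labels == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_points(scores, labels)
area = auc(fpr, tpr)
print(list(zip(fpr, tpr)), area)
```

The first points in the output, produced by the highest thresholds, sit in the lower left of the plot; lowering the threshold moves toward the upper right.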
IV. Quadratic Discriminant Analysis
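Quadratic discriminant analysis (QDA) assumes that each class has its own covariance matrix, so an observation in the $k$th class is drawn from $X \sim N(\mu_k, \Sigma_k)$. Plugging this density into Bayes’ theorem gives the discriminant function:
$$\delta_k(x) = -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log |\Sigma_k| + \log \pi_k$$
We assign an observation $X = x$ to the class where $\delta_k(x)$ is largest. Note that $\delta_k(x)$ is a quadratic function of $x$, as opposed to a linear function.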
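A minimal sketch of the QDA rule, using hypothetical class parameters rather than estimates from data. The second class is given a wider covariance so that the quadratic term matters: a point far from both means, on the far side of class 0, is still assigned to class 1, which a single linear boundary between the two means would not do.

```python
import numpy as np

def qda_discriminant(x0, mean, covariance, prior):
    """delta_k(x) = -0.5 (x-mu)^T Sigma_k^{-1} (x-mu) - 0.5 log|Sigma_k| + log(pi_k)."""
    diff = np.asarray(x0, dtype=float) - mean
    _, logdet = np.linalg.slogdet(covariance)
    mahalanobis = diff @ np.linalg.solve(covariance, diff)
    return -0.5 * mahalanobis - 0.5 * logdet + np.log(prior)

# Hypothetical parameters: (mean, covariance, prior) per class.
class_params = {
    0: (np.array([0.0, 0.0]), np.eye(2), 0.5),
    1: (np.array([3.0, 3.0]), 4.0 * np.eye(2), 0.5),
}

def qda_classify(x0):
    """Assign x0 to the class with the largest quadratic discriminant."""
    scores = {k: qda_discriminant(x0, *params) for k, params in class_params.items()}
    return max(scores, key=scores.get)

for point in ([0.0, 0.0], [3.0, 3.0], [-4.0, -4.0]):
    print(point, "->", qda_classify(point))
```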
Comparing LDA and QDA:
 When there are $p$ predictors, assuming separate covariance matrices with QDA means estimating $K p(p+1)/2$ covariance parameters. LDA, by assuming a common covariance matrix, only requires estimating $Kp$ linear coefficients.
 LDA has higher bias and lower variance
 QDA has lower bias and higher variance
 LDA: useful when there are fewer training observations, when reducing variance is important
 QDA: useful when the training set is very large, or if you can’t assume a common covariance matrix
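The gap between the two parameter counts grows quickly with $p$. A quick arithmetic check, with hypothetical function names and an arbitrary choice of $K = 2$ classes and $p = 50$ predictors:

```python
def qda_covariance_parameters(K, p):
    """K separate covariance matrices, each with p(p+1)/2 free entries."""
    return K * p * (p + 1) // 2

def lda_linear_coefficients(K, p):
    """K discriminant functions, each linear in p predictors."""
    return K * p

K, p = 2, 50  # arbitrary sizes, for illustration
print(qda_covariance_parameters(K, p), lda_linear_coefficients(K, p))
```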
V. Comparison of Classification Methods
Logistic regression and LDA both produce linear decision boundaries. While logistic regression estimates its parameters with maximum likelihood, LDA estimates its parameters with the estimated mean and variance from a normal distribution. Therefore, LDA tends to do better when we can assume that observations are drawn from a Gaussian distribution with a common covariance matrix, and logistic regression tends to do better when those assumptions are not met.
QDA can model a wider range of data than logistic regression and LDA, but it is not as flexible as KNN. However, QDA performs better than KNN with limited training observations.
KNN tends to dominate when the true decision boundary is highly nonlinear, but it has a major drawback: it doesn’t show which predictors are important in obtaining the result.
Summary of Each Method
Logistic Regression
 Linear decision boundary
 Provides interpretability with odds ratios
 High bias, low variance
Linear Discriminant Analysis
 Linear decision boundary
 Provides interpretability through how well each predictor separates the classes
 Assumes all observations are drawn from normal distributions
 Assumes observations in all classes share a covariance matrix
 Stable when classes are “well-separated”
 Stable when sample size is small
 Commonly used over logistic regression when response is >2 classes
 Can adjust the assignment-probability threshold to trade off sensitivity against specificity, guided by the ROC curve
 High bias, low variance
Quadratic Discriminant Analysis
 Quadratic decision boundary
 Assumes observations are drawn from a multivariate normal distribution
 Assumes different covariance matrices for each class
 Performs well with many training observations compared to LDA
 Medium bias, medium variance
K-Nearest Neighbors
 Nonparametric decision boundary
 No interpretability
 Requires careful selection of $K$, which controls the smoothness of the decision boundary
 Bad when the number of predictors is large, due to the curse of dimensionality
 Low bias, high variance
VI. Applications in R

Image source: James et al., Introduction to Statistical Learning, 7th Ed., pg. 148