Contents
I. Simple Linear Regression
The Basics – simple linear regression takes the form:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
To find the line that is closest to the data, we use the least squares method and minimize the residual sum of squares (RSS), the amount of unexplained variability in the response after the regression:
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$
Solving for the coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we find that:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
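These closed-form estimates are easy to verify numerically. A minimal sketch in Python (rather than R, which these notes use later), with made-up data:

```python
# Least squares estimates for simple linear regression (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly y = 2x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta1_hat, beta0_hat)
```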
- Note: we can also prove that these estimates are unbiased, i.e. $E[\hat{\beta}_0] = \beta_0$ and $E[\hat{\beta}_1] = \beta_1$
Measuring accuracy of the coefficient estimates – using the general variance formula, $\mathrm{Var}(\hat{\mu}) = \sigma^2/n$, we can solve for the standard error of the least squares coefficients:
$$SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
- Note: $\sigma$ is estimated by the residual standard error (RSE), given by $RSE = \sqrt{RSS/(n-2)}$.
These standard errors can be used to compute confidence intervals; for example, $\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$ is approximately a 95% confidence interval for $\beta_1$. A 95% confidence interval means that if repeated samples were taken and a 95% confidence interval were computed for each sample, 95% of those intervals are expected to contain the true parameter value.
To perform hypothesis testing, we compute a t-statistic, given by $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$. This has an associated p-value, which represents the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in our sample.
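Continuing the Python sketch above (made-up data), the RSE, the standard error of the slope, the t-statistic, and an approximate 95% interval can be computed directly:

```python
# Standard error and t-statistic for the slope (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
beta0 = y_bar - beta1 * x_bar

# RSE = sqrt(RSS / (n - 2)) estimates sigma
rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
rse = (rss / (n - 2)) ** 0.5

se_beta1 = rse / sxx ** 0.5        # SE(beta1_hat)
t_stat = beta1 / se_beta1          # t-statistic for H0: beta1 = 0
ci = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)  # approx. 95% CI
print(t_stat, ci)
```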
Measuring accuracy of the model
- The residual standard error (RSE) is a measure of the lack of fit of the model, in units of the response
- The $R^2$ statistic is the proportion of variability in $Y$ that can be explained using $X$, and is calculated with the formula:
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
where the total sum of squares, $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$, measures the total variance in the response before the regression.
- Note: in simple linear regression with one predictor, the $R^2$ statistic is equal to the squared correlation between $X$ and $Y$.
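A short Python check of both facts (made-up data): $R^2 = 1 - RSS/TSS$, and in simple regression it matches the squared correlation:

```python
# R^2 = 1 - RSS/TSS, and its equality with Cor(X, Y)^2 in simple regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
tss = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - rss / tss

cor = sxy / (sxx * tss) ** 0.5    # sample correlation of X and Y
assert abs(r_squared - cor ** 2) < 1e-12
print(r_squared)
```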
II. Multiple Linear Regression
The Basics – multiple linear regression takes the form:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
Again, the least squares solution involves minimizing the residual sum of squares (RSS).
Is there a relationship between the response and predictors?
We can use a hypothesis test of $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ to investigate whether all of the regression coefficients are zero, using an F-statistic:
$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$
If the linear model assumptions are correct, then $E[RSS/(n-p-1)] = \sigma^2$, and if there is no relationship between the response and predictors, then $E[(TSS-RSS)/p] = \sigma^2$ as well. Therefore, when there is no relationship between the response and predictors, the F-statistic takes on a value close to 1, but when there is a relationship, the F-statistic will be greater than 1.
- Note: the smaller the sample size $n$, the larger the F-statistic needs to be to reject the null hypothesis.
On the other hand, if we want to test whether a particular subset of $q$ coefficients are zero, we fit a second model that uses all of the variables except those last $q$, and note the residual sum of squares for that model, $RSS_0$. Then the null hypothesis and F-statistic are:
$$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0, \qquad F = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)}$$
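The arithmetic of this partial F-test is simple; the values below ($n$, $p$, $q$, and both RSS figures) are made up for illustration:

```python
# Partial F-test: does dropping the last q predictors hurt the fit?
n, p, q = 100, 5, 2
rss_full = 400.0     # RSS of the full model (hypothetical)
rss_reduced = 480.0  # RSS_0 of the model without the last q predictors (hypothetical)

# F = ((RSS_0 - RSS)/q) / (RSS/(n - p - 1))
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))
print(f_stat)
```

A value well above 1 (here it is) suggests the dropped predictors carried real explanatory power.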
How do we decide on important variables?
Variables can be selected through three classical approaches:
- Forward selection
  - Start with the null model, fit new models with each of the $p$ predictors, then add the variable that resulted in the lowest RSS. Refit the new base model with each remaining predictor, and repeat until a stopping condition is met.
- Backward selection
  - Start with all variables in the model, and remove the variable with the largest p-value. Refit the model, and repeat until a stopping condition is met.
  - Cannot be used if $p > n$
- Mixed selection
  - Start with no variables in the model. Add variables that provide the best fit, but remove any variable whose p-value rises above a certain threshold. Repeat until a stopping condition is met.
How well does the model fit the data?
- For multiple linear regression, $R^2 = \mathrm{Cor}(Y, \hat{Y})^2$. Note that $R^2$ will always increase on the training data set when adding more variables, so be sure to evaluate the model on a separate test data set.
- The residual standard error is defined generally as:
$$RSE = \sqrt{\frac{RSS}{n - p - 1}}$$
The quality of a model can be judged by:
- Mallow's $C_p$
- AIC (Akaike information criterion)
- BIC (Bayesian information criterion)
- Adjusted $R^2$
To compare two models, we can perform ANOVA (analysis of variance) using an F-test in order to test the null hypothesis that the simpler model $M_1$ is sufficient to explain the data against the alternative hypothesis that the more complex model $M_2$ is required. These must be nested models: the predictors in $M_1$ must be a subset of the predictors in $M_2$.
How accurate are our model predictions?
There are three types of uncertainty associated with prediction:
1. The least squares plane, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$, is only an estimate for the true population regression plane, $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$. This inaccuracy in the coefficient estimates is related to the reducible error of the model, and we can estimate this uncertainty with a confidence interval.
2. Another source of reducible error comes from the model bias, since we are estimating the best linear approximation to the true surface, which may not be linear.
3. Even if we knew $f(X)$, the response value can't be predicted perfectly because of the random error $\epsilon$ in the model. This is the irreducible error. We can estimate how much $Y$ will vary from $\hat{Y}$ with prediction intervals, which are always wider than confidence intervals because they incorporate both the error in the estimate for $f(X)$ and the uncertainty as to how much an individual point will differ from the population regression plane.
III. Considerations and Potential Problems
Considerations
- Qualitative Predictors
  - For predictors with two values/levels, we use a dummy variable that takes the value of either 1 or 0. Note that the assignment of values will affect the interpretation of the coefficients.
  - For predictors with more than two levels, we can create additional dummy variables (a total of one fewer than the number of levels); the level with no dummy variable acts as the baseline.
  - We can also code qualitative variables in other ways to measure particular contrasts between levels, which will lead to equivalent models with different coefficients and interpretations.
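Dummy coding is mechanical; a small Python sketch (the three region levels here are invented for illustration), with the unlisted level as baseline:

```python
# Dummy (one-hot minus baseline) coding for a 3-level qualitative predictor.
# "North" is the baseline: it gets no dummy variable and maps to (0, 0).
levels = ["East", "West"]  # hypothetical levels; one fewer dummy than levels

def dummy_code(region):
    # Returns (x_east, x_west) for the given level.
    return tuple(1 if region == lvl else 0 for lvl in levels)

print(dummy_code("East"))
print(dummy_code("North"))
```

Each level's coefficient is then interpreted relative to the baseline level.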
- Extensions of the Linear Model
  - Linear regression assumes that the relationship between response and predictor is additive and linear.
  - Non-additive relationships can be represented by using an interaction effect $X_1 X_2$ in addition to the main effects. Be sure to abide by the hierarchical principle – if using an interaction term, always include the main effects as well, since $X_1 X_2$ is often correlated with $X_1$ and $X_2$.
  - Non-linear relationships can be represented using polynomial regression, among other methods.
Potential Problems
- Non-linearity of the response-predictor relationships
  - check residual plots of $e_i = y_i - \hat{y}_i$ against the fitted values $\hat{y}_i$ to make sure the errors are approximately random
  - if residuals exhibit any trends, then try using non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, or $X^2$
- Correlation of error terms
  - $\epsilon_i$ should provide no information about $\epsilon_{i+1}$; otherwise the estimated standard errors will underestimate the true standard errors, confidence intervals will not be wide enough, and p-values will be lower than they should be
  - in time series data, residuals plotted against time may exhibit tracking if adjacent residuals are similar; there are many methods (not shown here) that can be used to address this
  - outside of time series data, good experimental design can mitigate the risk of correlated errors
- Non-constant variance of error terms
  - heteroskedasticity, or non-constant $\mathrm{Var}(\epsilon_i) = \sigma_i^2$, looks like a funnel shape in residual plots
  - potential solution: try transforming the response variable with $\log Y$ or $\sqrt{Y}$
  - potential solution: weighted least squares; for example, if the $i$th response is an average of $n_i$ raw observations that are uncorrelated with variance $\sigma^2$, then their average has variance $\sigma_i^2 = \sigma^2 / n_i$, and the non-constant variance is solved by using weighted least squares with observation weights $w_i = n_i$
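For simple linear regression, weighted least squares also has a closed form (weighted means replace plain means). A Python sketch with invented data and weights $w_i = n_i$:

```python
# Weighted least squares for simple linear regression (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
ws = [10, 5, 2, 1]  # hypothetical n_i: earlier responses average more raw observations

sw = sum(ws)
x_w = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
y_w = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y

# Minimize sum(w_i * (y_i - b0 - b1*x_i)^2)
beta1 = sum(w * (x - x_w) * (y - y_w) for w, x, y in zip(ws, xs, ys)) / \
        sum(w * (x - x_w) ** 2 for w, x in zip(ws, xs))
beta0 = y_w - beta1 * x_w
print(beta1, beta0)
```

Observations backed by more raw data pull the fitted line more strongly.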
- Outliers
  - outliers are points with unusual $y$ values; they can affect the fitted model, the RSE, confidence intervals, and $R^2$
  - check for outliers with studentized residual plots, where each residual $e_i$ is divided by its estimated standard error. If studentized residuals are greater than 3 in absolute value, then the observations are potential outliers
- High-leverage Points
  - points with an unusual $x$ value are high-leverage points that can adversely affect the model
  - leverage can be quantified with the leverage statistic $h_i$^{1}; this is always between $1/n$ and 1, and the average over all observations is $(p+1)/n$
  - if a given observation's $h_i$ greatly exceeds $(p+1)/n$, then the corresponding point may have high leverage
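The simple-regression leverage formula from the footnote is easy to compute; in the made-up data below, the last point has an unusual $x$ value and its leverage stands out:

```python
# Leverage statistic h_i for simple linear regression (footnote formula):
# h_i = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2)
xs = [1.0, 2.0, 3.0, 4.0, 20.0]  # the last point has an unusual x value

n = len(xs)
x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# The average leverage is (p + 1)/n = 2/5 here (p = 1 predictor),
# so the leverages always sum to p + 1 = 2.
print(leverage)
```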
- Collinearity
  - when two or more predictors are highly correlated, it reduces the accuracy of the coefficient estimates – the standard error of $\hat{\beta}_j$ increases, the t-statistic decreases, and the power (the chance of correctly detecting a non-zero coefficient) of the hypothesis test goes down
  - potential solutions include dropping or combining collinear variables
  - use a correlation matrix to detect collinearity between two variables
  - use the variance inflation factor (VIF), shown below, to assess multicollinearity:
$$VIF(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}}$$
  - $R^2_{X_j | X_{-j}}$ is the $R^2$ of the regression of $X_j$ onto all of the other predictors
  - note that the smallest possible value of 1 indicates no collinearity; if the VIF exceeds 5 or 10, then collinearity may be a problem
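With only two predictors, $R^2_{X_1|X_2}$ is just the $R^2$ of a simple regression, so the VIF can be computed by hand. A Python sketch with made-up, nearly collinear predictors:

```python
# VIF for two predictors: regress X1 on X2, then VIF = 1 / (1 - R^2).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1]  # nearly collinear with x1 (made-up data)

n = len(x1)
m1 = sum(x1) / n
m2 = sum(x2) / n
beta = sum((a - m2) * (b - m1) for a, b in zip(x2, x1)) / \
       sum((a - m2) ** 2 for a in x2)
alpha = m1 - beta * m2

rss = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x2, x1))
tss = sum((b - m1) ** 2 for b in x1)
r2 = 1 - rss / tss        # R^2 of x1 regressed on x2
vif = 1 / (1 - r2)
print(vif)
```

Here the VIF far exceeds 10, flagging the (deliberate) collinearity.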
IV. Comparison to K-Nearest Neighbors Regression
K-nearest neighbors regression takes the form:
$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i$$
In other words, given a value for $K$ and a prediction point $x_0$, KNN regression identifies the $K$ training observations that are closest to $x_0$, represented by $\mathcal{N}_0$, and estimates $f(x_0)$ using the average of all the training responses in $\mathcal{N}_0$.
- The optimal $K$ depends on the bias-variance tradeoff: a bigger $K$ leads to a smoother, less flexible fit with low variance and high bias, while a smaller $K$ leads to a bumpier fit with high variance and low bias.
- A parametric approach will outperform a non-parametric approach if the parametric form is close to the true form of $f$, or when there is a small number of observations per predictor.
- KNN regression suffers from the curse of dimensionality:
  - more predictors (dimensions) means neighbors are farther from each other
  - KNN performs poorly with lots of predictors
  - in higher dimensions, KNN often performs worse than linear regression
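The KNN estimate above can be sketched in a few lines of Python for one predictor (made-up training data):

```python
# KNN regression in one dimension (illustrative training data).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0]

def knn_predict(x0, k):
    # Find the k training points closest to x0 (the neighborhood N_0)
    # and average their responses.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

print(knn_predict(2.5, 2))
```

With more predictors, the absolute distance would be replaced by, e.g., Euclidean distance, which is where the curse of dimensionality bites.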
V. Applications in R

^{1} For simple linear regression, the leverage statistic is given by $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$.