Contents
I. Basis Functions
Basis functions are a family of functions or transformations, $b_1(X), b_2(X), \ldots, b_K(X)$, applied to $X$, and take the form:

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i$$
Two very common types of basis functions are polynomial regression, with $b_j(x_i) = x_i^j$, and piecewise constant step functions, with $b_j(x_i) = I(c_j \le x_i < c_{j+1})$.
Polynomial Regression
This is a simple way to extend the linearity assumption. However, it’s unusual to use powers of $X$ greater than 3 or 4, because high-degree polynomials behave erratically near the boundaries of the data.
Note that the variance of the fit, $\mathrm{Var}[\hat{f}(x_0)]$, can be computed with the variance estimates for each of the coefficients and the covariances between pairs of the coefficient estimates.^{1}
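As a quick sketch of the idea (in Python with NumPy; the simulated data and variable names are my own, not from the text), we can fit a degree-4 polynomial by least squares and compute the pointwise variance of the fit as in footnote 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): a cubic trend plus noise.
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 0.5 * x - 0.3 * x**2 + 0.1 * x**3 + rng.normal(0, 0.5, n)

# Degree-4 polynomial basis: columns 1, x, x^2, x^3, x^4.
X = np.vander(x, 5, increasing=True)

# Least squares fit of the coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated covariance matrix of the coefficients: sigma^2 (X'X)^{-1}.
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
C = sigma2 * np.linalg.inv(X.T @ X)

# Pointwise variance of the fit at x0: l0' C l0, with l0 = (1, x0, ..., x0^4).
x0 = 1.0
l0 = np.array([1, x0, x0**2, x0**3, x0**4])
var_f0 = l0 @ C @ l0  # variance of f_hat(x0); always positive
```

The square root of `var_f0` is the pointwise standard error used to draw the familiar ±2 SE bands around a polynomial fit.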
Step Functions
Step functions break $X$ into bins, so it becomes an ordered categorical variable. To create a step function, create cutpoints $c_1, c_2, \ldots, c_K$ in the range of $X$, then create $K + 1$ new variables:
 $C_0(X) = I(X < c_1)$;
 $C_1(X) = I(c_1 \le X < c_2)$;
 …
 $C_K(X) = I(c_K \le X)$;
$I(\cdot)$ is an indicator function that returns a 1 if the condition is true, and 0 otherwise. We use least squares to fit a linear model using $C_1(X), \ldots, C_K(X)$ as predictors ($C_0(X)$ is dropped because it is redundant with the intercept):

$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i$$
These functions are popular in biostatistics and epidemiology, but one downside is that step functions can miss important action in the data unless there are natural breakpoints.
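A minimal sketch of the step-function fit (my own toy data and cutpoints, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative): a nonlinear trend plus noise.
n = 300
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

# K = 3 cutpoints, giving 4 bins.
cuts = np.array([2.5, 5.0, 7.5])

def step_basis(x, cuts):
    """Intercept plus indicators C_1(X), ..., C_K(X); C_0 is absorbed
    by the intercept."""
    cols = [np.ones_like(x)]
    edges = np.concatenate([cuts, [np.inf]])
    for lo, hi in zip(cuts, edges[1:]):
        cols.append(((x >= lo) & (x < hi)).astype(float))
    return np.column_stack(cols)

X = step_basis(x, cuts)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fit is piecewise constant: beta[0] in the first bin,
# beta[0] + beta[j] in bin j.
yhat = X @ beta
```

Because every observation in a bin shares the same row of indicators, `yhat` takes at most 4 distinct values — the defining feature (and limitation) of a step function.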
II. Regression Splines
Regression splines are also a class of basis functions that extend the polynomial regression and piecewise constant regression methods in the previous section, but I’m giving them their own section because they are more complicated and deserve some more space!
Splines of degree $d$ are piecewise degree-$d$ polynomials, with continuity in derivatives up to degree $d - 1$ at each knot (knots are the breakpoints where the polynomial coefficients change).
A cubic spline with $K$ knots uses a total of $4 + K$ degrees of freedom, with continuity of the first two derivatives. This is because a piecewise cubic function has a total of $K + 1$ sections, each section adds 4 degrees of freedom (one for each coefficient), and each knot has 3 constraints (continuity of the function, its 1st derivative, and its 2nd derivative): $(K + 1) \times 4 - 3K = K + 4$.
Example: A Cubic Spline Representation
This is a representation of a cubic spline with two knots, $\xi_1$ and $\xi_2$. We add one truncated power basis function per knot to guarantee our continuity constraints,^{2} for a total of $K + 3$ basis functions. In this case, 2 knots + 3 degrees of the polynomial = 5 total basis functions:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 h(x_i, \xi_1) + \beta_5 h(x_i, \xi_2) + \epsilon_i$$
The truncated power basis is defined as:

$$h(x, \xi) = (x - \xi)_+^3 = \begin{cases} (x - \xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise} \end{cases}$$
In other words, fitting a cubic spline means that we’re performing least squares regression with the predictors $X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \ldots, h(X, \xi_K)$. Note that there are a total of $K + 4$ coefficients, including the intercept.
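A sketch of this construction for $K = 2$ knots (toy data and knot locations are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data.
n = 250
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

def trunc_power(x, knot):
    """Truncated power basis h(x, xi) = (x - xi)^3 for x > xi, else 0."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

# Cubic spline basis with K = 2 knots: 1, x, x^2, x^3, h(x, xi1), h(x, xi2)
# -> K + 4 = 6 coefficients in total.
knots = [1 / 3, 2 / 3]
X = np.column_stack(
    [np.ones_like(x), x, x**2, x**3] + [trunc_power(x, k) for k in knots]
)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because $h(x, \xi)$ and its first two derivatives are zero at the knot, the fitted function is automatically continuous with continuous first and second derivatives — no explicit constraints are needed in the regression.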
A natural spline is a regression spline that is required to be linear at the boundary, which provides more stable estimates at the boundaries. The additional boundary constraints lead to relatively narrower confidence intervals at the boundaries.
Important Information about Regression Splines:
 In practice, it’s common to place knots at uniform quantiles of the data
 Use cross-validation to choose the number of knots, based on the value that minimizes the cross-validated RSS.
 In general, regression splines give superior results to polynomial regression, and also give more stable estimates at the boundaries.
III. Smoothing Splines
In minimizing RSS, we want to find some function $g(x)$ that fits the observed data well, but we don’t want it to interpolate all of the data, since that overfits the data by being too flexible. Instead, we want to find a $g$ that makes RSS small, but is also sufficiently smooth.
We ensure that $g$ is smooth by minimizing the loss function:

$$\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$
where $\lambda$ is a nonnegative tuning parameter that controls the bias-variance tradeoff. If $\lambda = 0$, then the smoothing spline will interpolate the data exactly (a 100% fit), and as $\lambda \to \infty$, the model will approach a straight line.
This takes on the “Loss + Penalty” form that we also see in ridge regression and the lasso. The second term in the equation above is a penalty that penalizes variability in the model, since the second derivative measures the “roughness” of $g$: $\int g''(t)^2 \, dt$ is large if $g$ is wiggly.
Special Properties of the Smoothing Spline
The smoothing spline is a natural cubic spline. Compared to the basis function approach of a regression spline, the smoothing spline is a shrunken version, where $\lambda$ controls the level of shrinkage.
 it is a piecewise cubic polynomial with continuous 1st and 2nd derivatives
 knots at the unique values of $x_1, \ldots, x_n$
 linear in regions outside of the extreme knots
Choosing the tuning parameter $\lambda$: As $\lambda$ increases from $0$ to $\infty$, the effective degrees of freedom $df_\lambda$ decreases from $n$ to $2$. We use effective degrees of freedom since, although a smoothing spline has $n$ parameters and hence $n$ nominal degrees of freedom, those parameters are heavily constrained or shrunk down. $df_\lambda$ is a better measure of the flexibility of a smoothing spline.
We define the effective degrees of freedom as $df_\lambda = \sum_{i=1}^{n} \{S_\lambda\}_{ii}$, the sum of the diagonal elements of the smoother matrix $S_\lambda$, where $\hat{\mathbf{g}}_\lambda = S_\lambda \mathbf{y}$ is the vector of fitted values for a particular $\lambda$.
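The trace formula is easy to check numerically for any linear smoother. Below is a sketch using a ridge-penalized polynomial smoother of my own construction as a stand-in for $S_\lambda$ (it is not the exact smoothing-spline smoother; because this toy penalty shrinks every coefficient, including the linear part, its df decreases toward 0 rather than 2):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100
x = np.sort(rng.uniform(-1, 1, n))
X = np.vander(x, 6, increasing=True)  # 6 polynomial basis functions

def smoother_matrix(X, lam):
    """S_lam such that the fitted values are g_hat = S_lam @ y for the
    penalized least squares fit with penalty lam."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Effective degrees of freedom: sum of the diagonal of S_lam.
dfs = {lam: np.trace(smoother_matrix(X, lam)) for lam in (0.0, 1.0, 100.0)}
```

With no penalty, the trace equals the number of basis functions (6 here, the usual degrees of freedom of an unpenalized fit); as `lam` grows, the trace shrinks, quantifying the loss of flexibility.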
Leave-one-out cross-validation is very efficient (the same cost as computing a single fit) for choosing $\lambda$:

$$RSS_{cv}(\lambda) = \sum_{i=1}^{n} \left( y_i - \hat{g}_\lambda^{(-i)}(x_i) \right)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right]^2$$
where $\hat{g}_\lambda^{(-i)}(x_i)$ indicates the fitted value for this smoothing spline evaluated at $x_i$, where the fit uses all of the training observations except for the $i$th observation $(x_i, y_i)$. Note that we can compute all of the LOOCV fits using only $\hat{g}_\lambda$, the original fit to all of the data!
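The LOOCV shortcut holds exactly for linear smoothers of this "loss + quadratic penalty" form, and can be verified numerically. The sketch below (my own setup, again using a ridge-style penalized smoother rather than an actual smoothing spline) compares brute-force leave-one-out residuals to the shortcut formula:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, lam = 60, 5, 2.0
x = np.sort(rng.uniform(-1, 1, n))
X = np.vander(x, p, increasing=True)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)

def fit(Xa, ya):
    """Penalized least squares coefficients for data (Xa, ya)."""
    return np.linalg.solve(Xa.T @ Xa + lam * np.eye(p), Xa.T @ ya)

# Shortcut: (y_i - g(x_i)) / (1 - S_ii), from the single full-data fit.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - X @ fit(X, y)
shortcut = resid / (1 - np.diag(S))

# Brute force: refit n times, leaving out one observation each time.
brute = np.array([
    y[i] - X[i] @ fit(np.delete(X, i, axis=0), np.delete(y, i))
    for i in range(n)
])
# shortcut and brute agree to machine precision.
```

The agreement is exact (a Sherman-Morrison identity), which is why LOOCV for smoothing splines costs no more than a single fit.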
IV. Local Regression
For local regression, we compute the fit at a target point $x_0$ using only the nearby training observations.
Algorithm:
 Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$
 Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in the neighborhood, so that the closest point has the highest weight, the farthest of the $k$ nearest neighbors has weight 0, and all points outside the neighborhood get weight 0.
 Fit a weighted least squares regression of $y_i$ onto $x_i$ by finding the $\hat{\beta}_0, \hat{\beta}_1$ that minimize $\sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$
 The fitted value at $x_0$ is $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$
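The steps above can be sketched as follows (the tricube kernel and variable names are my own choices; production loess/lowess implementations add robustness iterations on top of this):

```python
import numpy as np

def local_linear_fit(x, y, x0, span=0.3):
    """Local linear regression at a single target point x0."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))  # number of neighbors in the span

    # 1. Gather the k training points closest to x0.
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    d_max = d[idx].max()

    # 2. Tricube weights: closest point weighted most, farthest neighbor
    #    gets weight 0, everything outside the neighborhood gets weight 0.
    w = np.zeros(n)
    w[idx] = (1 - (d[idx] / d_max) ** 3) ** 3

    # 3. Weighted least squares of y onto x.
    W = np.diag(w)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # 4. Fitted value at x0.
    return beta[0] + beta[1] * x0

# Usage: on exactly linear data the local fit recovers the line exactly.
x = np.linspace(0, 10, 50)
y = 2 * x + 1
f0 = local_linear_fit(x, y, 4.2)  # ≈ 2 * 4.2 + 1 = 9.4
```

Repeating this at a grid of target points traces out the full local regression curve.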
Notes:
 This is a memory-based procedure, which means it needs all the training data for each calculation of a prediction.
 The span, $s$, is the tuning parameter that we choose based on cross-validation. A small span means the fit is more local and wiggly, with lower bias but higher variance, while a large span will lead to a more global fit with higher bias but lower variance.
 Local regression can be used for 1, 2, or 3 variables, but performs poorly when there are more than 3 or 4 variables due to the curse of dimensionality.
 This can also be generalized to a multiple linear regression with some global and some local variables. These are called varying coefficient models.
V. Generalized Additive Models
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing nonlinear functions of each of the variables while maintaining additivity.
Pros:
 Allows fitting a nonlinear $f_j$ to each $X_j$ to model nonlinear relationships. This means we don’t have to try transformations on every variable by hand
 Can make more accurate predictions of the response $Y$
 Can examine the effect of each $X_j$ on $Y$ individually, which is good for inference and interpretability
 Smoothness of $f_j$ for variable $X_j$ can be summarized with degrees of freedom.
Cons:
 Models are restricted to being additive, so important interactions can be missed. However, we can add interaction terms or interaction functions manually to fix this problem.
For regression problems, models can combine natural splines, smoothing splines, local regression, polynomial regression, etc. This is a compromise between linear models and fully nonparametric models, and can be represented by:

$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$$
For classification problems, logistic GAMs can be represented by:

$$\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$
We can use ANOVA to test whether certain terms in a generalized additive model are significant.
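An additive model can be fit with the backfitting algorithm: cycle through the variables, smoothing the partial residuals for each variable against that variable. A toy sketch (my own construction, using simple polynomial smoothers in place of the splines or local regressions a real GAM library would use):

```python
import numpy as np

rng = np.random.default_rng(5)

# Additive truth: y = f1(x1) + f2(x2) + noise (illustrative data).
n = 500
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)

def poly_smooth(x, r, degree=5):
    """Smoother: regress r on a degree-5 polynomial in x, return fits."""
    B = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(B, r, rcond=None)
    return B @ coef

# Backfitting: repeatedly smooth partial residuals for each variable.
f = np.zeros((n, 2))
beta0 = y.mean()
for _ in range(20):
    for j in range(2):
        partial = y - beta0 - f[:, 1 - j]        # remove the other component
        f[:, j] = poly_smooth(X[:, j], partial)  # update f_j
        f[:, j] -= f[:, j].mean()                # center for identifiability

yhat = beta0 + f.sum(axis=1)  # fitted values track y closely here
```

Each column of `f` is an estimate of one additive component, which is exactly what makes GAMs easy to inspect variable by variable.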
VI. Applications in R

For the degree 4 case, $\mathrm{Var}[\hat{f}(x_0)] = \ell_0^\top \hat{C} \ell_0$, where $\hat{C}$ is the 5x5 covariance matrix of the $\hat{\beta}_j$ and $\ell_0^\top = (1, x_0, x_0^2, x_0^3, x_0^4)$ ⤴

It is possible to prove that adding a term of the form $\beta_4 h(x, \xi)$ to the model for a cubic polynomial will lead to a discontinuity in only the third derivative at $\xi$, and the function will remain continuous, with continuous first and second derivatives, at each of the knots. ⤴