Contents
I. Overview of Decision Trees
Trees involve stratifying or segmenting the predictor space into a number of simple regions, where the prediction of an observation is typically the mean or mode of the training observations in the region to which it belongs.
Pros:
 Easy to explain
 Can be displayed graphically and are easily interpreted
 Can easily handle qualitative predictors without dummy variables
 May more closely mirror human decision-making than least squares or other classical models
Cons:
 Not very robust: a small change in the data can cause a large change in the final estimated tree
 In general, lower predictive accuracy than other regression and classification approaches unless we use bagging, random forests, and boosting, which increase predictive performance but decrease interpretability
II. Regression Trees
A regression tree is typically drawn upside down, with the terminal nodes (leaves) at the bottom. The predictor space is split at a number of internal nodes connected by branches. To build a regression tree:
 Divide the predictor space (the set of all possible values for $X_1, X_2, \dots, X_p$) into $J$ distinct, non-overlapping regions $R_1, R_2, \dots, R_J$.
 For every observation that falls into region $R_j$, predict the response as the mean of the response values for the training observations in $R_j$.
We choose the regions to minimize the RSS: $\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$, where $\hat{y}_{R_j}$ is the mean response of the training observations in the $j$th box. This is a top-down, greedy approach called recursive binary splitting.^{1}
Recursive Binary Splitting
To perform recursive binary splitting, we consider all predictors $X_1, \dots, X_p$ and all possible cutpoints $s$ for each predictor, then choose the predictor and cutpoint to minimize the RSS. In formal terms, we define the pair of half-planes $R_1(j, s) = \{X \mid X_j < s\}$ and $R_2(j, s) = \{X \mid X_j \geq s\}$, and seek the values of $j$ and $s$ that minimize
$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,$
where $R_1(j, s)$ represents the region of the predictor space in which $X_j$ takes a value less than $s$, and $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ are the mean responses for the training observations in $R_1(j, s)$ and $R_2(j, s)$, respectively. We repeat this process by splitting one of the two previously identified regions until a stopping condition is reached.
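To make the greedy search concrete, here is a minimal Python sketch of a single split search (hypothetical helper names, not from the text, which uses R for applications): scan every predictor $j$ and cutpoint $s$, and keep the pair with the lowest total RSS.

```python
def rss(ys):
    """Residual sum of squares of ys around its mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X, y):
    """Greedy search over all predictors j and cutpoints s.
    X is a list of rows (observations), y the responses.
    Returns (j, s, total_rss) for the split minimizing RSS."""
    p = len(X[0])
    best = None
    for j in range(p):
        for s in sorted(set(row[j] for row in X)):
            left = [y[i] for i, row in enumerate(X) if row[j] < s]
            right = [y[i] for i, row in enumerate(X) if row[j] >= s]
            if not left or not right:
                continue  # skip splits that leave a region empty
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best
```

Recursive binary splitting would then call this on each resulting region in turn until the stopping condition is reached.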
Cost Complexity Pruning
The recursive binary splitting process is likely to create an overly complex tree and overfit the data. To address this, we can lower the variance with cost-complexity pruning (also called weakest-link pruning), which uses a nonnegative tuning parameter $\alpha$:
For each value of $\alpha$, there corresponds a subtree $T \subseteq T_0$ such that the following quantity is minimized: $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$, where:
 $|T|$ is the number of terminal nodes
 $R_m$ is the subset of the predictor space corresponding to the $m$th terminal node
 $\hat{y}_{R_m}$ is the predicted response for $R_m$ (the mean of the training observations in $R_m$)
 $\alpha$ controls the tradeoff between subtree complexity and fit to the training data; as $\alpha \to \infty$, a small subtree is chosen
We choose $\alpha$ with $K$-fold cross-validation, averaging the results for each value of $\alpha$ and selecting the $\alpha$ that minimizes the average error.
Summarized Algorithm:
 Use recursive binary splitting to grow a large tree, stopping only when each terminal node reaches some minimum number of observations.
 Apply cost-complexity pruning to obtain a sequence of best subtrees as a function of $\alpha$.
 Use $K$-fold cross-validation to choose $\alpha$ by repeating the first two steps on all but the $k$th fold, evaluating the MSE on the held-out $k$th fold, and averaging the MSEs over the folds.
 Return the subtree from step 2 for the chosen best value of $\alpha$.
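The penalized criterion can be illustrated with a small Python sketch (hypothetical subtree summaries rather than fitted trees): given the training RSS and leaf count of each candidate subtree, pick the one minimizing $RSS + \alpha|T|$.

```python
def penalized_cost(total_rss, n_leaves, alpha):
    """Cost-complexity criterion: training RSS plus alpha times tree size |T|."""
    return total_rss + alpha * n_leaves

def best_subtree(subtrees, alpha):
    """Pick the subtree minimizing the penalized cost for a given alpha.
    subtrees: list of (name, total_rss, n_leaves) tuples."""
    return min(subtrees, key=lambda t: penalized_cost(t[1], t[2], alpha))
```

Larger $\alpha$ shifts the minimizer toward smaller subtrees, exactly as the notes describe.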
III. Classification Trees
Classification trees are similar to regression trees; they are grown with recursive binary splitting (refer to the previous section). The predicted class is chosen by the most common class in the region.
However, we need an alternative to RSS to use as the criterion for making the binary splits. There are several types of error we can choose to minimize when building classification trees:
 Classification error
 $E = 1 - \max_k(\hat{p}_{mk})$, where $\hat{p}_{mk}$ is the proportion of training observations in region $m$ from class $k$.
 this type of error is not sensitive enough to node purity to work well for growing trees
 Gini Index
 $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
 this is a measure of total variance across the classes
 measures node purity; a smaller value means that the node contains predominantly observations from a single class
 node purity is important because it is a marker for prediction certainty
 Entropy
 $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
 note that entropy is always nonnegative, since $0 \leq \hat{p}_{mk} \leq 1$ implies $-\hat{p}_{mk} \log \hat{p}_{mk} \geq 0$
 this is small if the $\hat{p}_{mk}$'s are all near 0 or all near 1 (i.e. if the $m$th node is pure)
 similar to the Gini index, entropy can be used as an alternative to classification error
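All three measures can be computed directly from a node's class proportions; a small Python sketch (hypothetical helper names, not from the text):

```python
from math import log

def class_proportions(labels):
    """p-hat_mk: the proportion of each class k among a node's training labels."""
    n = len(labels)
    return [labels.count(k) / n for k in sorted(set(labels))]

def classification_error(labels):
    """E = 1 - max_k(p-hat_mk)."""
    return 1.0 - max(class_proportions(labels))

def gini(labels):
    """G = sum_k p-hat_mk * (1 - p-hat_mk); small for pure nodes."""
    return sum(p * (1.0 - p) for p in class_proportions(labels))

def entropy(labels):
    """D = -sum_k p-hat_mk * log(p-hat_mk); zero terms for p = 0 are dropped."""
    return -sum(p * log(p) for p in class_proportions(labels) if p > 0)
```

On a pure node all three are 0; on a 50/50 two-class node the error and Gini index are both 0.5 and the entropy is $\log 2$.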
IV. Method: Bagging
Bagging, or bootstrap aggregation, reduces the variance of a statistical learning method. Decision trees suffer from high variance; since we generally cannot take many training sets from the population, we instead take repeated bootstrap samples from the single training set.
Generate $B$ different bootstrapped data sets, train the method on the $b$th data set to obtain $\hat{f}^{*b}(x)$, then average all of the predictions to obtain the function obtained from bagging: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
For qualitative responses, the overall prediction of the bagged tree is the most commonly occurring class among the $B$ predictions.
$B$, the number of trees, is not a critical parameter! In practice, we just need to use a large enough $B$ that the error has settled down, which is typically a value greater than 100.
Out-of-bag error estimation is a straightforward way to estimate the test error of a bagged model.
 Each bagged tree makes use of about 2/3 of the observations.
 For the $i$th observation, there are about $B/3$ predictions in which observation $i$ is out-of-bag.
 Obtain a single prediction for the $i$th observation by averaging (if regression) or majority vote (if classification)
 Calculate the out-of-bag MSE to estimate the test error. For large $B$, this is virtually equivalent to LOOCV.
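The bootstrap-and-average mechanics, including the out-of-bag bookkeeping, can be sketched in Python. To keep it short, this sketch uses a trivial base learner (the sample mean of the bootstrap sample) in place of a fitted tree; the bootstrap and OOB logic is the same.

```python
import random

def bagged_means(y, B=200, seed=0):
    """Sketch of bagging with a trivial base learner (the sample mean),
    plus out-of-bag bookkeeping. Real bagging would fit a tree per sample."""
    rng = random.Random(seed)
    n = len(y)
    oob_preds = [[] for _ in range(n)]  # OOB predictions collected per observation
    fits = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size n
        in_bag = set(idx)
        fit = sum(y[i] for i in idx) / n            # "train" the base learner
        fits.append(fit)
        for i in range(n):
            if i not in in_bag:                     # observation i is out-of-bag
                oob_preds[i].append(fit)
    bagged = sum(fits) / B                          # averaged bagged prediction
    oob_mse = sum((sum(p) / len(p) - y[i]) ** 2
                  for i, p in enumerate(oob_preds) if p) / n
    return bagged, oob_mse
```

Each observation lands out-of-bag in roughly a third of the $B$ rounds, which is where the "about $B/3$ predictions" figure comes from.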
Variable importance measures
 Note that bagging increases prediction accuracy, but lowers interpretability.
 To measure variable importance for regression, take the average over all $B$ trees of the total amount that the RSS decreases due to splits over a given predictor.
 To measure variable importance for classification, take the average over all $B$ trees of the total amount that the Gini index decreases due to splits over a given predictor.
V. Method: Random Forests
Random forests are an extension of bagging, where we generate $B$ different bootstrapped data sets. However, each time a split is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. Usually, $m \approx \sqrt{p}$.
 If there is 1 very strong predictor, then bagging will not substantially reduce the variance of the tree fit, because most of the bagged trees split on that predictor first and are therefore highly correlated.
 In a random forest, about $(p - m)/p$ of the splits will not even consider the strong predictor, which decorrelates the trees, making the average less variable and more reliable. In particular, use a small value of $m$ when there are a large number of correlated predictors.
 Like bagging, a random forest with any value of $B$ will not overfit the data, so choose $B$ large enough that the error has settled down.
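The decorrelation step amounts to subsampling the predictors at each split; a one-function Python sketch (hypothetical helper name) of choosing $m \approx \sqrt{p}$ split candidates:

```python
import math
import random

def candidate_predictors(p, seed=None):
    """At each split, sample m (about sqrt(p)) of the p predictors as the
    only split candidates. This is the step that distinguishes a random
    forest from bagging; bagging corresponds to m = p."""
    m = max(1, round(math.sqrt(p)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(p), m))
```

A tree-growing loop would call this once per split and restrict the greedy RSS (or Gini) search to the returned predictors.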
VI. Method: Boosting
Boosting is a general approach that can be used for many statistical learning methods, including decision trees. It works similarly to bagging, except trees are grown sequentially, and fit on a modified version of the original data set.
There are three tuning parameters with boosting:
 The number of trees, $B$.
 choose $B$ with cross-validation, since a very large $B$ can lead to overfitting (slowly)
 The shrinkage parameter, $\lambda$, which controls the rate at which boosting learns.
 typically set to 0.01 or 0.001
 note that a very small $\lambda$ can require a very large $B$ for good performance
 The number of splits in each tree, $d$, which controls the complexity of the boosted trees
 often, $d = 1$ works well; this is when each tree is a stump with a single split
 $d$ is also the interaction depth, and controls the interaction order of the boosted model, since $d$ splits can involve at most $d$ variables
Algorithm: Boosting for Regression Trees
 Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set
 For $b = 1, 2, \dots, B$:
 Fit a tree $\hat{f}^b$ with $d$ splits ($d + 1$ terminal nodes) to the training data $(X, r)$
 Update $\hat{f}$ by adding a shrunken version of the new tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$
 Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$
 Output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$
Boosting learns slowly by fitting each decision tree to the residuals of the current model rather than to the outcome $Y$. Trees are grown sequentially, and the shrinkage parameter $\lambda$ slows down the process further.
Note: smaller trees can help with interpretability. For example, using stumps ($d = 1$) yields an additive model, since each term involves only a single variable.
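The boosting algorithm above can be sketched in Python with stumps ($d = 1$) as base learners (hypothetical helper names; one predictor for brevity, and the stump fit assumes at least two distinct $x$ values):

```python
def fit_stump(x, r):
    """Fit a single-split tree (stump) to residuals r on one predictor x.
    Returns a predict function mapping a value of x to a leaf mean."""
    best = None
    for s in sorted(set(x)):
        left = [r[i] for i in range(len(x)) if x[i] < s]
        right = [r[i] for i in range(len(x)) if x[i] >= s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or rss < best[0]:
            best = (rss, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi < s else mr

def boost(x, y, B=500, lam=0.01):
    """Boosting for regression with stumps: fit each tree to the current
    residuals, shrink it by lam, and accumulate into the final model."""
    r = list(y)  # residuals start at y, since f-hat starts at 0
    stumps = []
    for _ in range(B):
        g = fit_stump(x, r)
        stumps.append(g)
        r = [r[i] - lam * g(x[i]) for i in range(len(x))]
    return lambda xi: sum(lam * g(xi) for g in stumps)
```

With $\lambda = 0.01$ the fit approaches the data only after hundreds of trees, illustrating why a small $\lambda$ requires a large $B$.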
VII. Applications in R

Recursive binary splitting is top-down because it begins at the top of the tree, where all observations belong to a single region. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in a future step. ⤴