# Tree-Based Methods

Decision trees, which divide the predictor space into regions, are simple and useful for interpretation. Their predictive power can be improved with bagging, random forests, and boosting.

May 1, 2020 · 13 minute read

### I. Overview of Decision Trees

Trees involve stratifying or segmenting the predictor space into a number of simple regions, where the prediction of an observation is typically the mean or mode of the training observations in the region to which it belongs.

Pros:

• Easy to explain
• Can be displayed graphically and are easily interpreted
• Can easily handle qualitative predictors without dummy variables
• May more closely mirror human decision-making than least squares or other classical models

Cons:

• Not very robust: a small change in the data can cause a large change in the final estimated tree
• In general, lower predictive accuracy than other regression and classification approaches unless we use bagging, random forests, and boosting, which increase predictive performance but decrease interpretability

### II. Regression Trees

A regression tree is typically drawn upside down, with the terminal nodes (leaves) at the bottom. The predictor space is split at a number of internal nodes connected by branches. To build a regression tree:

1. Divide the predictor space, the set of all possible values for $X_1, X_2, ..., X_p$, into $J$ distinct, non-overlapping regions, $R_1, R_2, ..., R_J$.
2. For every observation that falls into region $R_j$, predict the response as the mean of the response values for the training observations in $R_j$.

We choose the regions to minimize the RSS: $\sum_{j = 1}^J \sum_{i \in R_j} (y_i - \hat y_{R_j}) ^ 2$, where $\hat y_{R_j}$ is the mean response of the training observations in the $j$th box. Considering every possible partition of the predictor space is computationally infeasible, so we use a top-down, greedy approach called recursive binary splitting.¹
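The two-step procedure above can be sketched in a few lines of Python; the data, the single predictor, and the hand-chosen cutpoint are illustrative, not from the text:

```python
# Minimal sketch: predict with hand-chosen regions and compute the RSS.
# Regions are intervals between consecutive cutpoints on one predictor.

def region_means(x, y, cutpoints):
    """Mean response of the training observations in each region."""
    edges = [float("-inf")] + sorted(cutpoints) + [float("inf")]
    means = []
    for lo, hi in zip(edges, edges[1:]):
        ys = [yi for xi, yi in zip(x, y) if lo <= xi < hi]
        means.append(sum(ys) / len(ys))
    return edges, means

def predict(xi, edges, means):
    """The prediction is the mean of the region the observation falls into."""
    for (lo, hi), m in zip(zip(edges, edges[1:]), means):
        if lo <= xi < hi:
            return m

def rss(x, y, edges, means):
    return sum((yi - predict(xi, edges, means)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 10, 11, 12]
y = [5, 6, 7, 20, 21, 22]
edges, means = region_means(x, y, cutpoints=[5])
print(means)                    # [6.0, 21.0] -- one mean per region
print(rss(x, y, edges, means))  # 4.0
```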

Recursive Binary Splitting

To perform recursive binary splitting, we consider $X_1, ..., X_p$ and all possible cutpoints $s$ for each predictor, then choose the predictor and cutpoint to minimize RSS. In formal terms, we define the pair of half-planes

$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \geq s\}$$

and seek the values of $j$ and $s$ that minimize

$$\sum_{i:\, x_i \in R_1(j, s)} (y_i - \hat y_{R_1})^2 + \sum_{i:\, x_i \in R_2(j, s)} (y_i - \hat y_{R_2})^2$$

where $R_1(j, s)$ is the region of the predictor space in which $X_j$ takes a value less than $s$, and $\hat y_{R_1}$ and $\hat y_{R_2}$ are the mean responses for the training observations in $R_1(j, s)$ and $R_2(j, s)$, respectively. We repeat this process, splitting one of the two previously identified regions, until a stopping condition is reached.
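One step of this search can be sketched as an exhaustive scan over predictors and cutpoints; the data below is invented for illustration:

```python
# A sketch of one step of recursive binary splitting: search every
# predictor j and cutpoint s for the pair minimizing the summed RSS
# of the two resulting half-planes.

def rss_halves(rows, y, j, s):
    """RSS of R1(j, s) = {X | X_j < s} and R2(j, s) = {X | X_j >= s}."""
    left = [yi for xi, yi in zip(rows, y) if xi[j] < s]
    right = [yi for xi, yi in zip(rows, y) if xi[j] >= s]
    if not left or not right:
        return float("inf")  # reject splits that leave a region empty
    total = 0.0
    for half in (left, right):
        mean = sum(half) / len(half)
        total += sum((yi - mean) ** 2 for yi in half)
    return total

def best_split(rows, y):
    p = len(rows[0])
    best = None  # (rss, j, s)
    for j in range(p):
        for s in {xi[j] for xi in rows}:  # observed values as candidate cutpoints
            cand = (rss_halves(rows, y, j, s), j, s)
            if best is None or cand < best:
                best = cand
    return best

rows = [(1, 50), (2, 60), (8, 55), (9, 65)]
y = [3, 4, 30, 31]
rss_val, j, s = best_split(rows, y)
print(j, s)  # splits on predictor 0 at cutpoint 8
```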

Cost Complexity Pruning

The recursive binary splitting process is likely to create an overly complex tree and overfit the data. To address this, we can lower the variance with cost-complexity pruning or weakest link pruning with a nonnegative tuning parameter $\alpha$:

For each value of $\alpha$, $\exists T \subset T_0$ such that the following quantity is minimized:

$$\sum_{m = 1}^{\vert T \vert} \sum_{i:\, x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha \vert T \vert$$

• $\vert T \vert$ is the number of terminal nodes
• $R_m$ is the subset of predictor space corresponding to the $m$th terminal node
• $\hat y_{R_m}$ is the predicted response (the mean of the training observations) in $R_m$
• $\alpha$ controls the tradeoff between subtree complexity and fit to the training data; as $\alpha \to \infty$, a smaller subtree is chosen

We choose $\alpha$ with K-fold cross-validation, averaging the results for each value of $\alpha$ and selecting the $\alpha$ that minimizes the average error.

Summarized Algorithm:

1. Use recursive binary splitting to grow a large tree, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost-complexity pruning to obtain a sequence of best subtrees as a function of $\alpha$.
3. Use K-fold cross-validation to choose $\alpha$: repeat the first two steps on all but the $k$th fold, evaluate the MSE on the held-out $k$th fold, and average the results over the folds.
4. Return the subtree from step 2 for the chosen best value of $\alpha$.
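The pruning criterion itself can be illustrated on hand-built leaf partitions; the "subtrees" and data below are invented (a real implementation would derive the subtree sequence by weakest-link pruning):

```python
# A sketch of the cost-complexity criterion: score candidate subtrees
# (represented here simply by their leaf partitions of the training
# responses) by RSS + alpha * |T| and keep the minimizer.

def leaf_rss(leaves):
    """RSS of a partition: each leaf predicts its own mean."""
    total = 0.0
    for ys in leaves:
        mean = sum(ys) / len(ys)
        total += sum((y - mean) ** 2 for y in ys)
    return total

def cost_complexity(leaves, alpha):
    return leaf_rss(leaves) + alpha * len(leaves)

# Two nested candidates: the full tree T0 (3 leaves) and a pruned
# subtree (2 leaves) obtained by collapsing the last two leaves.
t0 = [[5, 6], [20, 21], [40, 41]]
pruned = [[5, 6], [20, 21, 40, 41]]

for alpha in (0.0, 500.0):
    best = min((t0, pruned), key=lambda t: cost_complexity(t, alpha))
    print(alpha, len(best))  # large alpha favors the smaller subtree
```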

### III. Classification Trees

Classification trees are similar to regression trees; they are also grown with recursive binary splitting (refer to the previous section). The predicted class for an observation is the most commonly occurring class among the training observations in its region.

However, we need an alternative to RSS as the criterion for making the binary splits. There are several types of error we can choose to minimize when building classification trees:

• Classification error
  • $E = 1 - \max_k (\hat p_{mk})$, where $\hat p_{mk}$ is the proportion of training observations in region $m$ from class $k$
  • in practice, classification error is not sensitive enough to node purity to work well for tree-growing
• Gini index
  • $G = \sum_{k = 1}^K \hat p_{mk} (1 - \hat p_{mk})$
  • this is a measure of total variance across the $K$ classes
  • measures node purity: a smaller value means that the node contains predominantly observations from a single class
  • node purity is important because it is a marker for prediction certainty
• Entropy
  • $D = - \sum_{k = 1}^K \hat p_{mk} \log \hat p_{mk}$
  • entropy is nonnegative, since $0 \le \hat p_{mk} \le 1$ implies $- \hat p_{mk} \log \hat p_{mk} \ge 0$
  • this is small if the $\hat p_{mk}$'s are all near 0 or all near 1 (i.e. if the $m$th node is pure)
  • like the Gini index, a common alternative to classification error
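All three measures can be computed directly from a node's class proportions; the proportions below are illustrative:

```python
import math

# The three node-impurity measures, computed from the class
# proportions p_mk within a single node.

def classification_error(p):
    return 1 - max(p)

def gini(p):
    return sum(pk * (1 - pk) for pk in p)

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

pure = [0.9, 0.1]    # node dominated by one class
mixed = [0.5, 0.5]   # maximally impure node

# All three measures are smaller for the purer node:
print(gini(pure), gini(mixed))         # 0.18 vs 0.5
print(entropy(pure) < entropy(mixed))  # True
print(classification_error(pure))      # 0.1 (up to float rounding)
```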

### IV. Method: Bagging

Bagging, or bootstrap aggregation, reduces the variance of a statistical learning method. Decision trees suffer from high variance, and averaging over many training sets would reduce it; since we rarely have access to multiple training sets, we instead bootstrap by taking repeated samples from the single training set.

Generate $B$ different bootstrapped data sets, train the method on the $b$th data set to obtain $\hat f ^ {*b} (x)$, then average all of the predictions to obtain the bagged model:

$$\hat f_{\text{bag}}(x) = \frac{1}{B} \sum_{b = 1}^B \hat f^{*b}(x)$$

For qualitative responses, the overall prediction of the bagged model is the most commonly occurring class among the $B$ predictions.

$B$, the number of trees, is not a critical parameter! In practice, we just need to use a large B so that the error has settled down, which is typically a value greater than 100.
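A minimal sketch of the idea, using the sample mean as a stand-in base learner (a real implementation would fit a full tree to each bootstrap sample; the data and $B$ are illustrative):

```python
import random

# Bagging for regression: draw B bootstrap samples, fit a model to
# each (here, trivially, the sample mean), and average the B predictions.

random.seed(0)
y = [3, 4, 5, 30, 31, 32]
B = 200

predictions = []
for _ in range(B):
    boot = random.choices(y, k=len(y))  # sample n observations with replacement
    predictions.append(sum(boot) / len(boot))

f_bag = sum(predictions) / B  # average of the B fitted models
print(round(f_bag, 1))        # close to the overall mean, 17.5
```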

Out-of-bag Error Estimation is a straightforward way to estimate the test error of a bagged model.

• Each bagged tree makes use of about 2/3 of the observations; the remaining 1/3 are out-of-bag (OOB) for that tree.
• For the $i$th observation, about $B/3$ trees did not use $i$, yielding about $B/3$ OOB predictions for $i$.
• Obtain a single prediction for the $i$th observation with an average (if regression) or majority vote (if classification) of its OOB predictions.
• Calculate the out-of-bag MSE to estimate the test error. With $B$ sufficiently large, this is virtually equivalent to LOOCV.
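The "about 2/3" figure can be checked empirically; the sample sizes below are illustrative:

```python
import random

# On average a bootstrap sample of size n contains roughly
# 1 - (1 - 1/n)^n ≈ 63% of the distinct observations,
# leaving about 1/3 out-of-bag.

random.seed(0)
n, B = 100, 500
oob_fracs = []
for _ in range(B):
    in_bag = set(random.choices(range(n), k=n))  # one bootstrap sample
    oob_fracs.append((n - len(in_bag)) / n)      # fraction never drawn

avg_oob = sum(oob_fracs) / B
print(round(avg_oob, 2))  # ≈ 0.37, i.e. about 1/3 of observations are OOB
```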

Variable importance measures

• Note that bagging increases prediction accuracy but lowers interpretability, since the result is no longer a single tree.
• To measure variable importance for regression, take the average over all $B$ trees of the total amount that RSS decreases due to splits over a given predictor.
• To measure variable importance for classification, take the average over all $B$ trees of the total amount that the Gini index decreases due to splits over a given predictor.
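A sketch of this aggregation, with invented per-split RSS decreases standing in for fitted trees:

```python
# Variable importance for bagged regression trees: record the RSS
# decrease of every split in every tree, then average the
# per-predictor totals over the B trees.

def importance(trees, n_predictors):
    """trees: list of trees; each tree is a list of (predictor, rss_decrease)."""
    totals = [0.0] * n_predictors
    for splits in trees:
        for j, decrease in splits:
            totals[j] += decrease
    return [t / len(trees) for t in totals]

trees = [
    [(0, 50.0), (1, 5.0)],   # tree 1: split on X0, then X1
    [(0, 40.0), (2, 3.0)],   # tree 2
    [(0, 60.0), (1, 4.0)],   # tree 3
]
print(importance(trees, n_predictors=3))  # X0 dominates: [50.0, 3.0, 1.0]
```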

### V. Method: Random Forests

Random Forests are an extension of bagging, where we generate $B$ different bootstrapped data sets. However, each time a split is considered, a random sample of $m$ predictors is chosen as split candidates. Usually, $m \approx \sqrt p$.

• If there is one very strong predictor, most bagged trees will use it in the top split, so their predictions are highly correlated and bagging will not substantially reduce the variance of the tree fit.
• In random forests, $\frac{p - m}{p}$ predictors will not consider the strong predictor, which decorrelates the trees, making the average less variable and more reliable. In particular, use a small value of $m$ when there are a large number of correlated variables.
• Like bagging, increasing $B$ does not cause overfitting, so choose $B$ large enough that the error has settled down.
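The only mechanical change from bagging is the candidate set at each split; a sketch with an illustrative predictor count:

```python
import math
import random

# The one change random forests make to bagging: at each split, only a
# random subset of m ≈ sqrt(p) predictors is eligible as a split candidate.

random.seed(0)
p = 16
m = round(math.sqrt(p))  # the usual default; m = 4 here

def split_candidates(p, m):
    """Predictors eligible at one split of one tree."""
    return random.sample(range(p), m)  # m distinct predictors, chosen uniformly

print(m)                       # 4
print(split_candidates(p, m))  # e.g. a random 4 of the 16 predictors
# Over many splits, each predictor is excluded with probability (p - m) / p,
# which decorrelates the trees.
```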

### VI. Method: Boosting

Boosting is a general approach that can be used for many statistical learning methods, including decision trees. It works similarly to bagging, except trees are grown sequentially, and fit on a modified version of the original data set.

There are three tuning parameters with boosting:

1. The number of trees, $B$.
• choose with cross validation, since a large $B$ can lead to overfitting (slowly)
2. The shrinkage parameter, $\lambda$, which controls the rate at which boosting learns.
• typically set to 0.01 or 0.001
• note that a very small $\lambda$ can require a very large $B$ for good performance
3. The number of splits in each tree, $d$, which controls the complexity of the boosted trees
• often, $d = 1$ works well; this is when each tree is a stump with a single split
• $d$ is also the interaction depth, and controls the interaction order of boosted models

Algorithm: Boosting for Regression Trees

1. Set $\hat f(x) = 0$ and $r_i = y_i$ for all $i$ in the training set
2. For $b = 1, 2, ..., B$:
• Fit a tree $\hat f^b$ with $d$ splits ($d+1$ terminal nodes) to the training data $(X, r)$
• Update $\hat f$ by adding a shrunken version of the new tree: $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^b(x)$
• Update the residuals: $r_i \leftarrow r_i - \lambda \hat f^b(x_i)$
3. Output the boosted model: $\hat f(x) = \sum_{b = 1}^B \lambda \hat f^b(x)$
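The algorithm above can be sketched with stumps ($d = 1$) on a tiny one-dimensional data set; the data, $\lambda$, and $B$ are illustrative:

```python
# Boosting for regression with stumps: each round fits a stump to the
# current residuals and adds a lambda-shrunken copy to the model.

def fit_stump(x, r):
    """Best single split of 1-D data by RSS on the residuals r."""
    best = None
    for s in set(x):
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or rss < best[0]:
            best = (rss, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi < s else rm

x = [1, 2, 3, 10, 11, 12]
y = [5, 6, 7, 20, 21, 22]
lam, B = 0.1, 200

stumps = []
r = list(y)                # step 1: f = 0, residuals r_i = y_i
for _ in range(B):         # step 2: grow B stumps sequentially
    f_b = fit_stump(x, r)
    stumps.append(f_b)
    r = [ri - lam * f_b(xi) for xi, ri in zip(x, r)]

def f_hat(xi):             # step 3: f(x) = sum_b lambda * f_b(x)
    return sum(lam * f_b(xi) for f_b in stumps)

print(f_hat(2), f_hat(11))  # near the true responses, 6 and 21
```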

Boosting learns slowly by fitting decision trees to the residuals of the current model. Trees are grown sequentially, and the shrinkage parameter $\lambda$ slows down the process further.

Note: smaller trees can help with interpretability. For example, fitting stumps yields an additive model, since each term involves only a single variable.

### VII. Applications in R

1. Recursive binary splitting is top-down because it begins at the top of the tree where all observations begin at a single region. It is greedy because at each step of the tree-building process, the best split is made at that step, rather than looking ahead and picking a split that will lead to a better tree in a future step.