20 Classification and Regression Trees
Classification and regression tree models (CART) are computationally intensive methods used when there are many explanatory variables and we would like guidance about which of them, if any, to include in the model: classification trees apply when the outcome is discrete, regression trees when the outcome is continuous. Often there are so many explanatory variables that we simply could not investigate them all, even if we wanted to invest the huge amount of time that would be necessary to complete such a complicated multiple regression exercise. The great virtues of tree models are as follows:
- they are very simple to implement, understand, and interpret;
- they are excellent for initial data inspection;
- they give a very clear picture of the structure of the data;
- they provide a highly intuitive insight into the kinds of interactions between variables.
Let us begin by looking at a tree model in action, before thinking about how it works. Here is an air pollution example that we might otherwise analyse as a multiple regression: the outcome is continuous (Pollution) and the covariates are self-explanatory, although the units used are a little opaque in places. We will begin by using the package tree, then illustrate the more modern package rpart (Ripley, 2019), whose name stands for recursive partitioning, which is exactly what these models do. The regression tree is displayed in Figure 20.1:
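Before looking at the figure, it may help to see the kind of call that produces such a tree. The following is a minimal sketch, assuming a data frame named pollution whose response column is Pollution and whose remaining columns are the candidate explanatory variables (the data frame name is illustrative, not taken from the text):

```r
# Hedged sketch: `pollution` is an assumed data frame with a continuous
# response column Pollution and several candidate explanatory variables.
library(rpart)

# method = "anova" requests a regression tree (continuous outcome);
# Pollution ~ . offers every other column as a potential splitting variable
model <- rpart(Pollution ~ ., data = pollution, method = "anova")

print(model)   # text summary: one line per split, with node means
plot(model)    # draw the tree skeleton
text(model)    # add the split rules and terminal-node values
```

The print output alone is often enough for initial data inspection: each line shows the splitting rule, the number of cases reaching that node, and the mean response there, which is precisely the "clear picture of the structure of the data" praised above.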