Chapter 23

Tree Models

Tree models are computationally intensive methods that are used in situations where there are many explanatory variables and we would like guidance about which of them to include in the model. Often there are so many explanatory variables that we simply could not test them all, even if we wanted to invest the huge amount of time that would be necessary to complete such a complicated multiple regression exercise. Tree models are particularly good at tasks that might in the past have been regarded as the realm of multivariate statistics (e.g. classification problems). The great virtues of tree models are as follows:

  • They are very simple.
  • They are excellent for initial data inspection.
  • They give a very clear picture of the structure of the data.
  • They provide a highly intuitive insight into the kinds of interactions between variables.

It is best to begin by looking at a tree model in action, before thinking about how it works. Here is an air pollution example that we might otherwise analyze as a multiple regression. We begin by using the tree function (from the tree package), then illustrate the more modern function rpart (which stands for ‘recursive partitioning’).

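library(tree)   # provides the tree function
Pollute <- read.table("c:\\temp\\Pollute.txt",header=TRUE)
names(Pollute)
[1] "Pollution" "Temp" "Industry" "Population" "Wind"
[6] "Rain" "Wet.days"
model <- tree(Pollute)   # first column (Pollution) is taken as the response
plot(model)   # draw the branches of the tree
text(model)   # label the splits and the leaf means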

You follow a path from the top of the tree (called, in defiance of gravity, the root ...
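For readers without Pollute.txt to hand, the same kind of fit can be sketched with rpart. This is a minimal illustration only, under two assumptions: the rpart package is installed (it is normally distributed with R as a recommended package), and the built-in airquality data frame stands in for Pollute. It is a different data set, so the splits it produces say nothing about the pollution example above.

```r
library(rpart)   # recursive partitioning; a recommended package shipped with R

# Stand-in data (an assumption of this sketch, not the Pollute data):
# airquality is a built-in data frame of daily New York air measurements.
data(airquality)

# Regression tree with Ozone as the response and every other column
# offered as a candidate explanatory variable.
# Rows with a missing response are dropped by rpart's default na.action.
fit <- rpart(Ozone ~ ., data = airquality, method = "anova")

print(fit)   # text listing of the splits, starting from the root
plot(fit)    # draw the branches of the tree
text(fit)    # label each split and the mean response in each leaf
```

The formula interface makes the choice of response explicit, which is the main practical difference from passing a whole data frame to tree.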
