21

Tree Models

Tree models are computationally intensive methods that are used when there are many explanatory variables and we would like guidance about which of them to include in the model. Often there are so many explanatory variables that we simply could not test them all, even if we were willing to invest the huge amount of time needed to complete such a complicated multiple regression exercise. Tree models are particularly good at tasks that might in the past have been regarded as the realm of multivariate statistics (e.g. classification problems). The great virtues of tree models are as follows:

  • They are very simple.
  • They are excellent for initial data inspection.
  • They give a very clear picture of the structure of the data.
  • They provide a highly intuitive insight into the kinds of interactions between variables.

It is best to begin by looking at a tree model in action, before thinking about how it works. Here is the air pollution example that we have worked on already as a multiple regression (see p. 311):

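install.packages("tree")   # needed only once per R installation
library(tree)
# adjust the path to wherever Pollute.txt is stored on your machine
Pollute<-read.table("c:\\temp\\Pollute.txt",header=TRUE)
attach(Pollute)
names(Pollute)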

[1] "Pollution" "Temp" "Industry" "Population" "Wind"
[6] "Rain"    "Wet.days"

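# given a dataframe, tree() takes the first column (Pollution) as the response
model<-tree(Pollute)
plot(model)   # draw the branches
text(model)   # label the splits and the leaves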

You follow a path from the top of the tree (called, in defiance of gravity, the root) and proceed to one of the terminal nodes (called a leaf) by following a succession of rules (called splits). The ...
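The walk from root to leaf can be sketched in a few lines of base R. This is only an illustration, not the internals of the tree package: the helper names (leaf, split_node, predict_path) are hypothetical, and the thresholds and leaf values in toy_tree are invented rather than taken from the fitted Pollute model. The sketch assumes the usual convention that the left branch is taken when the rule "variable < threshold" is true.

```r
# Hypothetical toy tree: variable names come from Pollute, but the
# thresholds and leaf values are made up for illustration only.
leaf <- function(value) list(leaf = TRUE, value = value)
split_node <- function(var, threshold, left, right) {
  list(leaf = FALSE, var = var, threshold = threshold,
       left = left, right = right)
}

toy_tree <- split_node("Industry", 750,
  split_node("Population", 200, leaf(10), leaf(25)),
  leaf(70))

# Start at the root and apply one split rule at a time until a leaf is
# reached; go left when "var < threshold" holds, otherwise go right.
predict_path <- function(node, obs) {
  path <- character(0)
  while (!node$leaf) {
    go_left <- obs[[node$var]] < node$threshold
    path <- c(path, sprintf("%s %s %g", node$var,
                            if (go_left) "<" else ">=", node$threshold))
    node <- if (go_left) node$left else node$right
  }
  list(path = path, value = node$value)
}

# One made-up observation to follow down the tree
obs <- list(Industry = 400, Population = 350)
result <- predict_path(toy_tree, obs)
print(result$path)    # the succession of rules that were followed
print(result$value)   # the value stored at the terminal node
```

Each element of result$path is one split, and result$value is what sits at the leaf where the observation ends up.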
