7.4 CLASSIFICATION AND REGRESSION TREES

7.4.1 Overview

In Chapter 6, decision trees were described as a way of grouping observations based on specific values or ranges of descriptor variables. For example, the tree in Figure 7.19 organizes a set of observations based on the number of cylinders (Cylinders) of the car. The tree was built with MPG (miles per gallon) as the response variable, which guided how the splits were chosen, so the resulting groups characterize car fuel efficiency. The terminal nodes of the tree (A, B, and C) partition the cars into sets with good (node A), moderate (node B), and poor (node C) fuel efficiency.


Figure 7.19. Decision tree classifying cars

Each terminal node is a mutually exclusive set of observations; that is, no observation belongs to more than one of nodes A, B, and C. The criteria for inclusion in each node are defined by the branch points used to partition the data. For example, terminal node B contains the observations where Cylinders is greater than or equal to five and less than seven.
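The inclusion criteria can be written as a chain of tests, which makes the mutual exclusivity easy to see: each observation falls through to exactly one node. The following is a minimal sketch, not the book's implementation. Only the rule for node B comes from the text; the rules for nodes A and C (fewer than five cylinders, seven or more cylinders) are assumptions inferred from it, and the function name terminal_node and the sample cylinder counts are hypothetical.

```python
def terminal_node(cylinders):
    """Return the terminal node label for a car with the given cylinder count."""
    if cylinders < 5:
        return "A"  # assumed branch: fewer than five cylinders
    if cylinders < 7:
        return "B"  # from the text: 5 <= Cylinders < 7
    return "C"      # assumed branch: seven or more cylinders


# Hypothetical cylinder counts, used only to illustrate the partition.
cars = [4, 6, 8, 5, 4]
nodes = [terminal_node(c) for c in cars]
print(nodes)  # ['A', 'B', 'C', 'B', 'A']
```

Because each test is reached only when the previous ones fail, no cylinder count can be assigned to two nodes.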

Decision trees can be used as both classification and regression prediction models. Decision trees that are built to predict a continuous response variable are called regression trees, and decision trees built to predict a categorical response are called classification trees. During the learning ...
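The difference between the two kinds of tree shows up in what a terminal node predicts. The sketch below assumes a common convention that this excerpt does not confirm: a regression tree predicts the mean response of the observations in the node, and a classification tree predicts the most frequent class. The function names and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def regression_leaf_prediction(responses):
    """Predict a continuous value: the mean response in the terminal node."""
    return mean(responses)


def classification_leaf_prediction(labels):
    """Predict a category: the most frequent class in the terminal node."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical observations that ended up in one terminal node.
mpg_in_leaf = [30.0, 32.0, 34.0]
labels_in_leaf = ["good", "good", "moderate"]

print(regression_leaf_prediction(mpg_in_leaf))         # 32.0
print(classification_leaf_prediction(labels_in_leaf))  # good
```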

