Chapter 7
Decision Trees
Decision trees are one of the most powerful directed data mining techniques because you can use them on such a wide range of problems and because they produce models that explain how they work. Decision trees are related to table lookup models. In the simple table lookup models described in Chapter 6, such as RFM cubes, the cells are defined in advance by splitting each dimension into an arbitrary number of evenly spaced partitions. Then, something of interest (a response rate or average order size, for instance) is measured in each cell. New records are scored by determining which cell they belong to and assigning them the value measured for that cell.
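To make the table lookup idea concrete, here is a minimal sketch in Python. It is an illustration only, not the procedure from Chapter 6: the function names (`cell_of`, `fit_lookup`, `score`), the two dimensions, the value ranges, and the toy customer data are all assumptions made for this example. Each dimension is cut into evenly spaced bins fixed in advance, the average response is measured per cell, and a new record receives the value of the cell it lands in.

```python
from collections import defaultdict

def cell_of(record, ranges, n_bins):
    """Return the grid cell (a tuple of bin indexes) that a record falls into."""
    cell = []
    for value, (low, high) in zip(record, ranges):
        width = (high - low) / n_bins                 # evenly spaced partitions
        index = int((value - low) / width)
        cell.append(min(max(index, 0), n_bins - 1))   # clamp to the edge bins
    return tuple(cell)

def fit_lookup(records, targets, ranges, n_bins):
    """Measure the average target value in every occupied cell."""
    totals, counts = defaultdict(float), defaultdict(int)
    for record, target in zip(records, targets):
        cell = cell_of(record, ranges, n_bins)
        totals[cell] += target
        counts[cell] += 1
    return {cell: totals[cell] / counts[cell] for cell in counts}

def score(table, record, ranges, n_bins, default=None):
    """Score a new record with the value measured for its cell."""
    return table.get(cell_of(record, ranges, n_bins), default)

# Hypothetical customers: (recency in days, number of purchases); 1 = responded.
ranges = [(0, 100), (0, 10)]
records = [(10, 8), (20, 9), (30, 2), (80, 1), (90, 2)]
responses = [1, 0, 1, 0, 0]

table = fit_lookup(records, responses, ranges, n_bins=2)
print(score(table, (15, 7), ranges, n_bins=2))  # recent, frequent customer
print(score(table, (70, 9), ranges, n_bins=2))  # lands in a cell with no training data
```

Notice that the bin boundaries never look at the responses, and that a cell with no training records has nothing to report. Both limitations follow from defining the cells in advance.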
Decision trees extend this idea in two ways. First, decision trees recursively split data into smaller and smaller cells that are increasingly “pure” in the sense of having similar values of the target. The decision tree algorithm treats each cell independently. To find a new split for a cell, the algorithm tests candidate splits on all available variables and keeps the one that does the best job of separating the target values. In doing so, decision trees choose the most important variables for the directed data mining task. This means that you can use decision trees for variable selection as well as for building models.
Second, the decision tree uses the target variable to determine how each input should be partitioned, rather than relying on partitions fixed in advance. In the end, the decision tree breaks the data into segments defined by the splitting rules at each step. Taken together, the rules for all the segments form the decision tree model.
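The two extensions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production algorithm: it handles only numeric inputs and a 0/1 target, it measures purity with Gini impurity (one common choice; the text so far has not committed to a particular measure), and the names `gini`, `best_split`, and `grow`, along with the toy data, are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Test every variable and every midpoint threshold; keep the purest split."""
    best = None  # (weighted impurity, variable index, threshold)
    for var in range(len(rows[0])):
        values = sorted(set(r[var] for r in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for r, y in zip(rows, labels) if r[var] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[var] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, var, threshold)
    return best

def grow(rows, labels, names, conditions=(), max_depth=3):
    """Split each cell independently; return (rule, target rate, size) per segment."""
    split = best_split(rows, labels) if max_depth > 0 and gini(labels) > 0 else None
    if split is None or split[0] >= gini(labels):
        rule = " and ".join(conditions) or "all records"
        return [(rule, sum(labels) / len(labels), len(labels))]
    _, var, t = split
    segments = []
    for op, keep in (("<=", lambda r: r[var] <= t), (">", lambda r: r[var] > t)):
        subset = [(r, y) for r, y in zip(rows, labels) if keep(r)]
        segments += grow([r for r, _ in subset], [y for _, y in subset], names,
                         conditions + (f"{names[var]} {op} {t}",), max_depth - 1)
    return segments

# Hypothetical customers: (recency in days, number of purchases); 1 = responded.
names = ("recency_days", "frequency")
rows = [(10, 8), (20, 9), (30, 7), (15, 8), (80, 9),
        (20, 1), (60, 2), (90, 1), (25, 2)]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

segments = grow(rows, labels, names)
for rule, rate, size in segments:
    print(f"{rule}: response rate {rate:.2f} ({size} records)")
```

On this toy data the first split is on frequency, because that is the variable that best separates responders from non-responders, and only the high-frequency cell is split again, this time on recency. The thresholds come from the target values, not from evenly spaced bins, and the printed rules, one per segment, together make up the model.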
A model that can be expressed as a collection ...