305
Chapter 15
Post-Pruning in Decision
Tree Induction Using
Multiple Performance
Measures
Kweku-Muata Osei-Bryson
Contents
Introduction ..................................................................................................... 306
Overview of Performance Measures for DTs ..................................................... 308
Accuracy ...................................................................................................... 308
Stability ........................................................................................................309
Simplicity .....................................................................................................309
Simplicity Based on Number of Leaves (SIMPL
Leaf
) .................................309
Simplicity Based on Average Chain Length (SIMPL
Rule
) ........................... 310
Concave, Piecewise Linear Simplicity Value Functions ............................. 310
Discriminatory Power ................................................................................... 311
Multiobjective Approach to Ranking the Sub-Trees ........................................... 312
Overview of MCDM Problems ....................................................................312
Weighting Model ..........................................................................................313
Estimation of Weights ..................................................................................313
MIP for Post-Pruning with Multiple Measures .................................................. 314
Multiobjective MIP Formulation ..................................................................314
Procedure for Generating the Optimal Sub-Tree ........................................... 316
306Kweku-Muata Osei-Bryson
Abstract: e decision tree (DT) induction process has two major phases: the growth
phase and the pruning phase. e pruning phase aims to generalize the DT that was
generated in the growth phase by generating a sub-tree that avoids overtting to the
training data. Most post-pruning methods essentially address post-pruning as if it
were a single-objective problem (i.e., maximize validation accuracy), and address the
issue of simplicity (in terms of the number of leaves) only in the case of a tie. However,
it is well known that apart from accuracy there are other performance measures (e.g.,
stability, simplicity, and interpretability) that are important for evaluating the DT
quality. In this chapter, we propose that multiobjective evaluation be done during the
post-pruning phase in order to select the best sub-tree, and propose a procedure for
obtaining the optimal sub-tree based on user provided preferences and values.
Keywords: Decision Tree, post-pruning, performance measures, multi-criteria
decision analysis, mixed integer programming, analytic hierarchy process
Introduction
e decision tree (DT) induction is one of the most popular data mining (DM)
technique (e.g., Witten and Frank 2000), with applications in various elds. A DT is
a tree structure representation of the given decision problem such that each nonleaf
node is associated with one of the decision variables, each branch from a nonleaf node
is associated with a subset of the values of the corresponding decision variable, and
each leaf node is associated with a value of the target (or dependent) variable. ere
are two main types of DTs: (1) classication trees and (2) regression trees (RTs). For
a classication tree, the target variable takes its values from a discrete domain, and for
each leaf node, the DT associates a probability or each class (i.e., value of the target
variable). e class that is assigned to a given leaf node of the classication tree results
from a form of majority voting in which the winning class is the one that provides
the largest class probability. On the other hand, the target variable of an RT takes its
values from a continuous domain (numeric), and for each leaf, the RT associates the
mean value of the target variable. In this chapter, we will focus on the classication
tree, which is the most commonly used type of DT, and henceforth, in this chapter,
whenever we use the term “decision tree” we will be referring to a classication tree.
ere are two major phases of the DT induction process: the growth phase and
the pruning phase (e.g., Kim and Koehler 1995). e growth phase involves a recur-
sive partitioning of the training data resulting in a DT such that either each leaf
node is associated with a single class or further partitioning of the given leaf would
Illustrative Example ...........................................................................................317
Conclusion ........................................................................................................320
Acknowledgment ..............................................................................................323
References .........................................................................................................323
Appendix ...........................................................................................................325

Get Knowledge Discovery Process and Methods to Enhance Organizational Performance now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.