CHAPTER 6
Data-Mining Bias: The Fool’s Gold of Objective TA
In rule data mining, many rules are back tested and the rule with the best observed performance is selected. That is to say, data mining involves a performance competition that leads to a winning rule being picked. The problem is that the winning rule’s observed performance, the very figure that allowed it to be picked over all other rules, systematically overstates how well the rule is likely to perform in the future. This systematic error is the data-mining bias.
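A small simulation can make the bias concrete. The sketch below is an illustration only, not a method taken from this chapter: it assumes every rule is worthless (a true expected return of zero) and that each back-test return is an independent draw from a standard normal distribution. The function names and parameter values are hypothetical.

```python
import random
import statistics


def best_observed_mean(n_rules, n_obs, rng):
    """Back-test n_rules worthless rules; return the winner's observed mean return."""
    best = float("-inf")
    for _ in range(n_rules):
        # Assumption: every rule has a true expected return of zero,
        # so any observed performance is pure sampling noise.
        returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, statistics.fmean(returns))
    return best


def average_winner_performance(n_rules, n_obs=50, n_trials=200, seed=0):
    """Average the winner's observed mean over many repeated mining runs."""
    rng = random.Random(seed)
    winners = [best_observed_mean(n_rules, n_obs, rng) for _ in range(n_trials)]
    return statistics.fmean(winners)


# With a single rule there is no competition, so nothing is overstated on average.
bias_1 = average_winner_performance(n_rules=1)
# With 50 competing rules, the winner looks profitable although none of them is.
bias_50 = average_winner_performance(n_rules=50)

print(f"average observed mean, 1 rule tested:   {bias_1:+.3f}")
print(f"average observed mean, best of 50 rules: {bias_50:+.3f}")
```

Under these assumptions, the gap between the two printed figures is an estimate of the data-mining bias: the true expected return is zero in both cases, and only the act of selecting the best performer changes.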
Despite this problem, data mining is a useful research approach. It can be proven mathematically that, out of all the rules tested, the rule with the highest observed performance is the rule most likely to do the best in the future, provided a sufficient number of observations are used to compute performance statistics.1 In other words, it pays to data mine even though the best rule’s observed performance is positively biased. This chapter explains why the bias occurs, why it must be taken into account when making inferences about the future performance of the best rule, and how such inferences can be made.
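The role played by the number of observations can be illustrated with a rough simulation. The following sketch is not the mathematical proof referred to above; it assumes ten hypothetical rules, one with a true mean return of 0.2 and nine with a true mean of zero, and normally distributed returns with unit standard deviation. All names and values are made up for illustration.

```python
import random
import statistics


def winner_is_truly_best(true_means, n_obs, rng):
    """Back-test each rule on n_obs noisy returns and report whether
    the observed winner is the rule with the highest true mean."""
    observed = [
        statistics.fmean([rng.gauss(mu, 1.0) for _ in range(n_obs)])
        for mu in true_means
    ]
    winner = observed.index(max(observed))
    return winner == true_means.index(max(true_means))


def hit_rate(true_means, n_obs, n_trials=200, seed=1):
    """Fraction of mining runs in which the selected rule is truly the best."""
    rng = random.Random(seed)
    hits = sum(winner_is_truly_best(true_means, n_obs, rng) for _ in range(n_trials))
    return hits / n_trials


# Assumption: one genuinely good rule among nine worthless ones.
true_means = [0.2] + [0.0] * 9

hit_rate_small = hit_rate(true_means, n_obs=10)
hit_rate_large = hit_rate(true_means, n_obs=400)

print(f"winner is the truly best rule, 10 observations:  {hit_rate_small:.2f}")
print(f"winner is the truly best rule, 400 observations: {hit_rate_large:.2f}")
```

With few observations the competition is decided mostly by noise. With many, the observed winner is usually the rule that really is best, even though its observed performance remains overstated.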
I begin by introducing this somewhat abstract topic with several anecdotes, only one of which is related to rule data mining. They are appetizers intended to make later material more digestible. Readers who want to start on the main course immediately may choose to skip to the section titled “Data Mining.”
The following definitions will be used throughout this chapter and ...