Chapter 20

Too Much of a Good Thing? Techniques for Reducing the Number of Variables

In data mining, having more data is often better. More data helps simplify problems by making it possible to build models and test their effectiveness without relying on sophisticated statistics or assumptions about distributions. More data helps avoid the problem of missing data, by making it possible to build more models. More variables give models more power, by making it possible to capture more nuances of customer behavior and to build stable models.

As any lover of dessert knows, more is not always better. The same may be true of data mining, particularly with regard to the number of variables. This chapter explicitly covers various methods for reducing the number of variables.

When you have many variables, the input data is likely to be sparse, meaning that many columns are dominated by just one or two values (such as zero or null). Some data mining techniques do not work well with a large number of variables, and having many variables increases the risk of overfitting. These are some of the problems that can arise with too many input variables.
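One way to see how sparse a set of inputs is would be to measure, for each column, the share of rows taken up by its most frequent values. The sketch below is only an illustration of that idea: the function names, the 0.95 threshold, and the sample table are all hypothetical and are not taken from the chapter.

```python
from collections import Counter


def dominant_share(values, top_n=1):
    """Return the fraction of rows covered by the top_n most frequent values.

    None is counted like any other value, so a mostly-null column
    scores as high as a mostly-zero one.
    """
    if not values:
        return 0.0
    counts = Counter(values)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(values)


def sparse_columns(table, threshold=0.95, top_n=1):
    """Return the names of columns whose dominant share reaches the threshold.

    `table` maps a column name to a list of values. The threshold is an
    arbitrary illustrative cutoff, not a recommended setting.
    """
    return [
        name
        for name, values in table.items()
        if dominant_share(values, top_n) >= threshold
    ]


# Hypothetical sample data: one mostly-zero column and one well-spread column.
example_table = {
    "num_returns": [0] * 18 + [None, 3],
    "age": list(range(20, 40)),
}

# Counting zero and null together (top_n=2) flags only the first column.
flagged = sparse_columns(example_table, threshold=0.95, top_n=2)
```

Setting `top_n=2` matches the "one or two values" wording, but it also flags every two-valued column, including an evenly split flag, so the default here is `top_n=1`.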

A common approach to reducing the number of variables is to select the best input variables based on their ability to model the target. Some purists may consider using the target to be cheating a little, but the method works well in practice and produces stable models. Many of these techniques have been touched on in earlier chapters. ...
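As a rough illustration of selecting inputs by their relationship to the target, the sketch below ranks numeric variables by the absolute value of their Pearson correlation with a numeric target. Correlation is only one possible score and is an assumption made here for the example; the excerpt does not say which measures the chapter uses. The function names and sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation of two equal-length numeric lists.

    Returns 0.0 when either list is constant, since the correlation
    is undefined in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_inputs(inputs, target, keep=None):
    """Rank input variables by absolute correlation with the target.

    `inputs` maps a variable name to a list of numeric values.
    Returns (name, score) pairs, best first; `keep` limits how many
    are returned, and None returns all of them.
    """
    scored = [
        (name, abs(pearson(values, target)))
        for name, values in inputs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


# Hypothetical sample data with a 0/1 target.
example_inputs = {
    "tenure": [1, 2, 3, 4],
    "constant": [1, 1, 1, 1],
    "noise": [4, 1, 3, 2],
}
example_target = [0, 0, 1, 1]

best = rank_inputs(example_inputs, example_target, keep=1)
```

A score like this looks at each variable on its own, so it cannot see variables that are useful only in combination with others.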

From Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Third Edition.
