9.4. Constructing the Training, Validation, and Test Sets

At this point, Mi-Ling has learned enough to believe that she can build a strong classification model. She is ready to move on to the Model Relationships step of the Visual Six Sigma Data Analysis Process. However, she anticipates that the marketing study will result in a large and unruly data set, probably with many outliers, some missing values, irregular distributions, and some categorical data. It will not be nearly as small or as clean as her practice data set. So, she wants to consider modeling from a data-mining perspective.

From her previous experience, Mi-Ling knows that some data-mining techniques, such as recursive partitioning and neural nets, fit highly parameterized nonlinear models that have the potential to fit the anomalies and noise in a data set, as well as the signal. These data-mining techniques do not allow for variable selection based on hypothesis tests, which, in classical modeling, help the analyst choose models that do not overfit or underfit the data.

To balance the competing forces of overfitting and underfitting in data-mining efforts, one often divides the available data into at least two and sometimes three distinct sets. Since the tendency to overfit data may introduce bias into models fit and validated using the same data, just a portion of the data, called the training set, is used to construct several potential models. One then assesses the performance of these ...
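The idea of dividing the data into distinct sets can be illustrated with a minimal Python sketch. It is not the book's procedure and does not use the software that Mi-Ling works with. The function name `split_rows`, the 60/20/20 proportions, the row count of 500, and the fixed seed are all assumptions chosen only for illustration.

```python
import random


def split_rows(n_rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly assign row indices 0..n_rows-1 to three disjoint sets.

    The fractions are illustrative assumptions, not values from the text.
    Whatever is left after the training and validation sets becomes the test set.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")

    indices = list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)

    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)

    train = indices[:n_train]                   # used to fit candidate models
    valid = indices[n_train:n_train + n_valid]  # used to compare candidate models
    test = indices[n_train + n_valid:]          # held back until a model is chosen
    return train, valid, test


# Hypothetical data set of 500 rows.
train, valid, test = split_rows(500)
print(len(train), len(valid), len(test))
```

Because each row index lands in exactly one of the three lists, a model fit on the training rows is never scored on data that it has already seen.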
