Chapter 2. Overview of the Data Mining Process
In this chapter we give an overview of the steps involved in data mining, starting from a clear goal definition and ending with model deployment. The general steps are shown schematically in Figure 2.1. We also discuss issues related to data collection, cleaning, and preprocessing. We explain the notion of data partitioning, where methods are trained on a set of training data and then their performance is evaluated on a separate set of validation data, and how this practice helps avoid overfitting. Finally, we illustrate the steps of model building by applying them to data.
Figure 2.1. SCHEMATIC OF THE DATA MODELING PROCESS
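The idea of data partitioning mentioned above can be made concrete with a small sketch. The Python snippet below is only an illustration, not a procedure prescribed by this book: the function name partition, the 60/40 split, the seed value, and the toy records are all assumptions chosen for the example. It shuffles the records and cuts them into a training set and a validation set that share no rows.

```python
import random


def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into (training, validation) lists.

    Hypothetical helper for illustration; the 60/40 default is an assumption.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for ten rows of a data set (assumed data, not from the text).
records = list(range(10))
train, valid = partition(records)
print(len(train), len(valid))
```

A model would be fitted using only train and then scored on valid. Because the validation rows play no part in fitting, a model that merely memorizes the training data tends to show up as a gap between the two scores, which is how partitioning helps detect overfitting.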
In Chapter 1 we saw some very general definitions of data mining. In this chapter we introduce the variety of methods sometimes referred to as data mining. The core of this book focuses on what has come to be called predictive analytics, the tasks of classification and prediction that are becoming key elements of a "business intelligence" function in most large firms. These terms are described and illustrated below.
Not covered in this book to any great extent are two simpler database methods that are sometimes considered to be data mining techniques: (1) OLAP (online analytical processing) and (2) SQL (structured query language). OLAP and SQL searches on databases are descriptive in nature ("find all ...