Chapter 2 DATA PREPROCESSING
- 2.1 Why do We Need to Preprocess the Data?
- 2.2 Data Cleaning
- 2.3 Handling Missing Data
- 2.4 Identifying Misclassifications
- 2.5 Graphical Methods for Identifying Outliers
- 2.6 Measures of Center and Spread
- 2.7 Data Transformation
- 2.8 Min-Max Normalization
- 2.9 Z-Score Standardization
- 2.10 Decimal Scaling
- 2.11 Transformations to Achieve Normality
- 2.12 Numerical Methods for Identifying Outliers
- 2.13 Flag Variables
- 2.14 Transforming Categorical Variables into Numerical Variables
- 2.15 Binning Numerical Variables
- 2.16 Reclassifying Categorical Variables
- 2.17 Adding an Index Field
- 2.18 Removing Variables that are Not Useful
- 2.19 Variables that Should Probably Not Be Removed
- 2.20 Removal of Duplicate Records
- 2.21 A Word About ID Fields
Chapter 1 introduced data mining and the CRISP-DM standard process for data mining model development. In Phase 1 of that process, business understanding or research understanding, businesses and researchers first enunciate project objectives, then translate these objectives into a data mining problem definition, and finally prepare a preliminary strategy for achieving them.
In this chapter, we examine the next two phases of the CRISP-DM standard process: data understanding and data preparation. We will show how to evaluate the quality of the data, clean the raw data, deal with missing data, and perform transformations on ...
Excerpt from Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition.