3.4 DATA PREPARATION

3.4.1 Overview

With a preliminary characterization of the data complete, the next task is to examine the data set more closely and transform it before starting any analysis. The data must be cleaned and translated into a form suitable for data analysis and data mining. This process also builds familiarity with the data, which will be beneficial to the analysis performed in step 3 (the implementation of the analysis). The following sections review some of the criteria and analyses that can be applied.

3.4.2 Cleaning the Data

Since the data available for analysis may not have been originally collected with this project's goal in mind, it is important to spend time cleaning the data. It is also beneficial to understand how accurately the data was collected and to correct any errors.

For variables measured on a nominal or ordinal scale (where there is a fixed number of possible values), it is useful to inspect all possible values to uncover mistakes or inconsistencies. Any assumptions made concerning the values that the variable can take should be tested. For example, a variable Company may include a number of different spellings for the same company such as:

General Electric Company

General Elec. Co

GE

Gen. Electric Company

General electric company

G.E. Company

These different terms, where they refer to the same company, should be consolidated into one for analysis. In addition, subject matter expertise may be needed in ...
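
The inspection and consolidation described above might be sketched in Python as follows. This is a minimal illustration, not a method prescribed by the text: the names CANONICAL, normalize, and consolidate are hypothetical, and the mapping itself is an assumption that would need review by someone who knows the data (for instance, to confirm that "GE" really refers to the same company in every record).

```python
from collections import Counter

# Hypothetical lookup table: normalized spelling -> single canonical name.
# The entries are assumptions for illustration and should be reviewed
# by a subject matter expert before use.
CANONICAL = {
    "general electric company": "General Electric Company",
    "general elec. co": "General Electric Company",
    "ge": "General Electric Company",
    "gen. electric company": "General Electric Company",
    "g.e. company": "General Electric Company",
}


def normalize(value):
    """Lowercase the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def consolidate(values, mapping=CANONICAL):
    """Replace known spellings with their canonical name.

    Returns the cleaned list plus a Counter of values that had no
    entry in the mapping, so they can be inspected by hand.
    """
    cleaned = []
    unmapped = Counter()
    for value in values:
        key = normalize(value)
        if key in mapping:
            cleaned.append(mapping[key])
        else:
            cleaned.append(value)  # left unchanged for manual review
            unmapped[value] += 1
    return cleaned, unmapped


if __name__ == "__main__":
    company = [
        "General Electric Company",
        "General Elec. Co",
        "GE",
        "Gen. Electric Company",
        "General electric company",
        "G.E. Company",
    ]
    # Step 1: list every distinct value and how often it occurs.
    print(Counter(company))
    # Step 2: consolidate the spellings and report anything unmapped.
    cleaned, unmapped = consolidate(company)
    print(Counter(cleaned))
    print(unmapped)
```

Counting the distinct values first makes unexpected spellings visible; values missing from the mapping are returned unchanged rather than guessed at, so they can be resolved manually.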
