Having performed a preliminary data characterization, the next task is to examine the data set more closely and transform it before starting any analysis. The data must be cleaned and translated into a form suitable for data analysis and data mining. This process also enables us to become familiar with the data, and this familiarity will be beneficial to the analysis performed in step 3 (the implementation of the analysis). The following sections review some of the criteria and analyses that can be applied.
Since the data available for analysis may not have been originally collected with this project's goal in mind, it is important to spend time cleaning the data. It is also beneficial to understand how accurately the data was collected and to correct any errors.
For variables measured on a nominal or ordinal scale (where there is a fixed number of possible values), it is useful to inspect all possible values to uncover mistakes or inconsistencies. Any assumptions made concerning the values that the variable can take should be tested. For example, a variable Company may include a number of different spellings for the same company such as:
General Electric Company
General Elec. Co
Gen. Electric Company
General electric company
These different terms, where they refer to the same company, should be consolidated into one for analysis. In addition, subject matter expertise may be needed in ...
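The consolidation of the Company spellings can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method. The sample list, the `CANONICAL` mapping, and the `consolidate` helper are assumptions made for the example. In practice the mapping would be built by inspecting the full list of distinct values and reviewed by someone who knows the data.

```python
from collections import Counter

# Hypothetical sample of the Company variable, using the spellings listed above.
companies = [
    "General Electric Company",
    "General Elec. Co",
    "Gen. Electric Company",
    "General electric company",
    "General Electric Company",
]

# Step 1: list every distinct value with its frequency to expose inconsistencies.
value_counts = Counter(companies)
for value, count in sorted(value_counts.items()):
    print(f"{count:3d}  {value}")

# Step 2: map each observed variant to one canonical name.
# The mapping is hand-built for illustration (an assumption, not a general rule).
CANONICAL = {
    "general electric company": "General Electric Company",
    "general elec. co": "General Electric Company",
    "gen. electric company": "General Electric Company",
}


def consolidate(name):
    """Return the canonical spelling for a known variant, else the name unchanged."""
    # Lowercase and collapse whitespace so trivial differences share one key.
    key = " ".join(name.lower().split())
    return CANONICAL.get(key, name)


cleaned = [consolidate(name) for name in companies]
cleaned_counts = Counter(cleaned)
print(cleaned_counts)
```

Values that are not in the mapping pass through unchanged, so an unexpected spelling is not silently merged and still appears when the distinct values are listed again.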