Preparing the data is one of the most time-consuming parts of a data analysis or data mining project. This chapter outlines the concepts and steps necessary to prepare a data set before beginning data analysis or data mining. The way in which the data is collected and prepared is critical to the confidence with which decisions can be made. The data must first be merged into a single table, which may involve integrating data from multiple sources. Once the data is in a tabular format, it should be fully characterized as discussed in the previous chapter. The data should then be cleaned by resolving ambiguities and errors, removing redundant and problematic data, and eliminating columns that are irrelevant to the analysis. New columns of data may need to be calculated. Finally, the table should be divided, where appropriate, into subsets that either simplify the analysis or allow specific questions to be answered more easily.
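The sequence of steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, the field names (`id`, `name`, `age`, `spend`), the derived `high_spender` column, the spending threshold, and the policy of dropping rows with missing values are all invented for the example and are not taken from the text.

```python
# Hypothetical raw sources; field names and values are invented for illustration.
customers = [
    {"id": 1, "name": "Ann", "age": "34"},
    {"id": 2, "name": "Bob", "age": ""},      # missing age
    {"id": 1, "name": "Ann", "age": "34"},    # duplicate of the first row
    {"id": 3, "name": "Cy", "age": "51"},
]
orders = {1: 120.0, 2: 80.0, 3: 45.5}  # customer id -> total spend


def prepare(customers, orders, threshold=100.0):
    """Merge the two sources into one table, clean it, and derive a column."""
    seen_ids = set()
    table = []
    for row in customers:
        # Remove redundant data: keep only the first row for each id.
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        # Remove problematic data: here, rows with a missing age or no
        # matching order (one possible policy; others may suit better).
        if row["age"] == "" or row["id"] not in orders:
            continue
        spend = orders[row["id"]]  # integrate the second source
        table.append({
            "id": row["id"],
            "age": int(row["age"]),              # resolve text to a number
            "spend": spend,
            "high_spender": spend >= threshold,  # newly calculated column
            # "name" is left out: assumed irrelevant to this analysis.
        })
    return table


def split(table, column):
    """Divide the table into subsets keyed by the values of one column."""
    subsets = {}
    for row in table:
        subsets.setdefault(row[column], []).append(row)
    return subsets


table = prepare(customers, orders)
subsets = split(table, "high_spender")
```

Keeping each step as a small, separate operation makes it easier to review the choices made at each stage and to change one of them, such as the rule for missing values, without disturbing the rest.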
In addition to preparing the data, it is important to record the steps that were taken and the reasons for taking them. This record not only documents the activities performed so far, but also provides a methodology that can be applied to similar data sets in the future. When the results are validated, these records will also be important for recalling the assumptions made about the data.
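One possible way to keep such a record is a small log that stores each step alongside its rationale. The sketch below is an assumption about how this might be done, not a prescribed method; the class name `PreparationLog`, its fields, and the example entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PreparationLog:
    """Hypothetical record of preparation steps and the reasons for them."""
    entries: list = field(default_factory=list)

    def record(self, step, reason, rows_before, rows_after):
        # Store what was done, why, and its effect on the table size.
        self.entries.append({
            "step": step,
            "reason": reason,
            "rows_before": rows_before,
            "rows_after": rows_after,
        })

    def report(self):
        # Render the log as a numbered list for later review.
        lines = []
        for number, entry in enumerate(self.entries, start=1):
            lines.append(
                f"{number}. {entry['step']} "
                f"({entry['rows_before']} -> {entry['rows_after']} rows): "
                f"{entry['reason']}"
            )
        return "\n".join(lines)


log = PreparationLog()
log.record("Removed duplicate rows", "Same observation was loaded twice", 4, 3)
log.record("Dropped rows with missing age",
           "Assumed age is required for the analysis", 3, 2)
print(log.report())
```

Recording the row counts before and after each step gives a quick check, at validation time, on how much data each assumption removed.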
The following chapter outlines the process of preparing data for analysis. It includes methods for identifying ...