Chapter 7 Data Handling in Practice

In our experience, a data cleaning project arises out of a modeling or data exploration problem. We are given some data (or perhaps a description of data that the project sponsor plans to eventually provide) and, usually, a problem to be solved. There is no fixed method for undertaking a data cleaning project, but we think of the process as having four parts: acquiring and reading the data, cleaning the data, combining the data (when it comes from multiple sources), and preparing the data for analysis. (We sometimes use “data cleaning” both in a broad sense and as the name of a specific set of tasks. Here, we use “data handling” as the umbrella term for all four parts.) Of course, the “cleaning” part is never really finished, and often the most important cleaning tasks are discovered as data sets are combined, or even as the modeling proceeds. In this chapter, we describe the tasks associated with each of the four parts of data handling. We then emphasize the importance of reproducibility and documentation, and we close with a detailed example.

7.1 Acquiring and Reading Data

Acquiring data is, of course, the act of actually taking delivery of the data. Very often the final data, the data that will be used for building models, will come not in one big file but from a number of sources and in varying formats. So it is important for the data cleaner to be prepared to read in text, spreadsheet data, XML, JSON, and to handle ...

