Chapter 7. Cleaning, Transforming, and Augmenting Data

Most of the time, the data that we initially find, collect, or acquire doesn’t quite suit our needs in one way or another. The format is awkward, the data structure is wrong, or its units need to be adjusted. The data itself might contain errors, inconsistencies, or gaps. It may contain references we don’t understand or hint at additional possibilities that aren’t realized. Whatever the limitation may be, if we want to use data as a source of insight, we will inevitably have to clean, transform, or augment it in some way to get the most out of it.

Up until now, we have put off most of this work because we had more urgent problems to solve. In Chapter 4, our focus was on getting data out of a tricky file format and into something more accessible; in Chapter 6, our priority was thoroughly assessing the quality of our data, so we could make an informed decision about whether it was worth the investment of augmentation and analysis at all.

Now, however, it’s time to roll up our sleeves and begin what I think of as the second phase of data wrangling and quality work: preparing the data we have for the analysis we want to perform. Our data is in the table-type format we need, and we’ve determined that it’s of high enough quality to yield some useful insights, even if they are not precisely the ones we first imagined.

Since it’s obviously impossible to identify and address every possible problem or technique ...
