Chapter 5. Scrubbing Data

Two chapters ago, in the first step of the OSEMN model for data science, we looked at obtaining data from a variety of sources. This chapter is all about the second step: scrubbing data. You see, it’s quite rare that you can move directly from obtaining data to exploring or even modeling the data. There’s a plethora of reasons why your data first needs some cleaning, or scrubbing.

For starters, the data might not be in the desired format. For example, you may have obtained some JSON data from an API, but you need it to be in CSV format to create a visualization. Other common formats include plain text, HTML, and XML. Most command-line tools work with only one or two formats, so it’s important that you’re able to convert data from one format to another.
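
As a minimal sketch of such a conversion, the following one-liner uses jq to turn a hypothetical users.json (an array of objects with name and age keys; the filename and keys are assumptions for illustration) into CSV:

    $ jq -r '.[] | [.name, .age] | @csv' users.json > users.csv

The -r flag makes jq emit raw text rather than JSON strings, and @csv formats each per-object array as a line of comma-separated values.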

Once the data is in the desired format, there could still be issues like missing values, inconsistencies, weird characters, or unnecessary parts. You can fix these by applying filters, replacing values, and combining multiple files. The command line is especially well suited for these kinds of transformations, because there are many specialized tools available, most of which can handle large amounts of data. In this chapter I’ll discuss classic tools such as grep and awk, and newer tools such as jq and pup.
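
To give a flavor of what such a cleaning step can look like, here is a small sketch that chains three of these tools. The filename raw.csv, the sentinel value -999 for missing measurements, and the expectation of five comma-separated fields per row are all assumptions for illustration:

    $ grep -v '^#' raw.csv | sed 's/-999//g' | awk -F, 'NF == 5' > clean.csv

Here grep drops comment lines, sed blanks out the missing-value sentinel, and awk keeps only the rows that have the expected number of fields.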

Sometimes you can use the same command-line tool to perform several operations or multiple tools to perform the same operation. This chapter is structured more like a cookbook in that it focuses on the ...
