In Chapter 3, we looked at the first step of the OSEMN model for data science: obtaining data from a variety of sources. It’s not uncommon for this data to have missing values, inconsistencies, errors, weird characters, or uninteresting columns. Sometimes we only need a specific portion of the data. And sometimes we need the data to be in a different format. In those cases, we have to clean, or scrub, the data before we can move on to the third step: exploring data.
The data we obtained in Chapter 3 can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. Because most command-line tools operate on one format only, it is worthwhile to be able to convert data from one format to another.
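As a minimal sketch of such a conversion (the sample data here is invented for illustration), a CSV file can be turned into a tab-separated one with nothing more than tr. This naive approach assumes that no field contains an embedded comma:

```shell
# Convert CSV to TSV by replacing every comma with a tab.
# Only safe when fields contain no embedded commas or quotes.
printf 'name,age\nalice,31\nbob,27\n' | tr ',' '\t'
```

For real-world CSV with quoted fields, a dedicated tool is needed; this one-liner just shows how cheap a format change can be on the command line.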
CSV, the main format we’re working with in this chapter, is actually not the easiest format to work with. Unlike XML and JSON, CSV has no single standard syntax, so many CSV data sets are broken or incompatible with one another.
Once our data is in the format we want it to be, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kinds of operations, as many powerful command-line tools are optimized for handling large amounts of data. Tools that we’ll discuss in this chapter include classic ones such as:
cut (Ihnat, MacKenzie, & Meyering, 2012) and
sed (Fenlason, Lord, Pizzini, & Bonzini, 2012), and newer ones such ...
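To give a first taste of the two classic tools above, here is a minimal sketch (the sample data and the substitution are invented for illustration): cut selects one column from delimited input, and sed performs a textual replacement on it.

```shell
# Select the second comma-delimited column with cut,
# then use sed to substitute one value for another.
# Assumes fields contain no embedded commas or quotes.
printf 'name,city\nalice,berlin\nbob,paris\n' |
cut -d , -f 2 |
sed 's/berlin/amsterdam/'
```

Both tools read from standard input and write to standard output, which is what makes them so easy to chain into pipelines like this one.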