Chapter 4. Data Munging: String Manipulation, Regular Expressions, and Data Cleaning

This chapter is about the pathologies that you will see in real-world data. It covers some of the most common (and notorious!) ones, where they come from, and how they can be addressed.

Data pathologies come in roughly two types. The first type is formatting issues: inconsistent capitalization, extraneous whitespace, and things of that nature. These are often straightforward to solve with appropriate preprocessing of the data. The second type involves the actual content of the data. Duplicate entries, major outliers, and NULL values are all examples. It often takes some detective work to figure out what these issues mean in a particular situation and hence how they should be addressed.
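To make the two categories concrete, here is a minimal Python sketch using only the standard library. The sample values and the `normalize` helper are hypothetical, invented for illustration. It first repairs formatting issues (stray whitespace and inconsistent capitalization), then surfaces two content issues (missing values and duplicates) so they can be investigated.

```python
import re

def normalize(value):
    """Trim the ends, collapse internal whitespace runs, and lowercase."""
    if value is None:
        return None  # leave missing values alone; they are a content issue
    return re.sub(r"\s+", " ", value.strip()).lower()

# Hypothetical raw column: the same city written three ways, plus a NULL.
raw = ["  New York", "new york ", "NEW   YORK", None, "Boston"]

# Formatting issues: fixed mechanically by preprocessing.
cleaned = [normalize(v) for v in raw]

# Content issues: counted and exposed, not silently resolved.
missing = sum(v is None for v in cleaned)
unique = list(dict.fromkeys(v for v in cleaned if v is not None))

print(cleaned)
print(missing, unique)
```

Note that the duplicates only become visible after the formatting pass. Whether they should actually be merged, and what the missing value means, still depends on the situation, which is where the detective work comes in.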

My goals in this chapter are twofold. Firstly, I want to give you an appreciation for the breadth of issues that can be present in real-world data and equip you to quickly identify and diagnose problems. Secondly, I want to teach you tools that can be used to solve the problems. Specifically, I will discuss various types of string manipulation.

Manipulating strings of text might seem boring at first glance, but it's one of the most powerful tools a data scientist can have. I would put it on par with machine learning itself. String manipulation can be used to address any data formatting problems, and in many cases, it is the only suitable solution. But it is also invaluable for creating ...
