4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
This chapter is about some of the pathologies that you will see in real‐world data. It talks about some of the most common (and notorious!) ones, where they come from, and how they can be addressed.
Data pathologies come in roughly two types. The first type is formatting issues: inconsistent capitalization, extraneous whitespace, and things of that nature. These are often straightforward to solve with appropriate preprocessing of the data. The second type involves the actual content of the data. Duplicate entries, major outliers, and NULL values are all examples. It often requires some detective work to figure out what these issues mean in a particular situation and, hence, how they should be addressed.
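To make the two categories concrete, here is a minimal Python sketch. The sample records and the `normalize` helper are hypothetical, invented purely for illustration. It first repairs formatting issues (capitalization and stray whitespace) and then surfaces content issues (missing values and duplicates), which still need human judgment to resolve.

```python
import re
from collections import Counter

# Hypothetical raw records: the same people entered inconsistently,
# plus one missing value.
raw_names = ["  Alice Smith", "alice  smith ", "BOB JONES", None, "Bob Jones"]

def normalize(name):
    """Fix formatting only: collapse whitespace runs, trim, and lowercase."""
    if name is None:
        return None  # a missing value is a content issue, not a formatting one
    return re.sub(r"\s+", " ", name).strip().lower()

cleaned = [normalize(n) for n in raw_names]

# Content issues become visible only after the formatting is consistent.
missing = sum(1 for n in cleaned if n is None)
counts = Counter(n for n in cleaned if n is not None)
duplicates = sorted(n for n, c in counts.items() if c > 1)

print(cleaned)
print("missing:", missing)
print("duplicates:", duplicates)
```

Note that the code can only flag the duplicates and the missing entry. Whether two "alice smith" rows are really the same person, or what a missing name means, depends on where the data came from.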
My goals in this chapter are twofold. First, I want to give you an appreciation for the breadth of issues that can be present in real‐world data and equip you to quickly identify and diagnose problems. Second, I want to teach you tools that can be used to solve the problems. Specifically, I will discuss various types of string manipulation.
Manipulating strings of text might seem boring at first glance, but it’s one of the most powerful tools a data scientist can have. I would put it on par with machine learning itself. String manipulation can be used to address any data formatting problem, and in many cases, it is the only suitable solution. It is also invaluable for creating scripts ...