Chapter 12. Wrangle and Mangle Data

If you torture the data enough, nature will always confess.

Ronald Coase

Up to this point, we’ve talked mainly about the Python language itself: its data types, code structures, syntax, and so on. The rest of this book is about applying these to real-world problems.

In this chapter, you’ll learn many practical techniques for taming data. Sometimes, this is called data munging, or the more businesslike ETL (extract/transform/load) of the database world. Although programming books usually don’t cover the topic explicitly, programmers spend a lot of time trying to mold data into the right shape for their purposes.

The specialty called data science has become very popular in the past few years. A Harvard Business Review article called data scientist the “sexiest job of the 21st century.” If that meant in demand and well paid, then okay, but the job also involves more than enough drudgery. Data science goes beyond the ETL requirements of databases, often involving machine learning to unearth insights that were not visible to human eyes.

I’ll start with basic data formats and then work up to the most useful new tools for data science.

Data formats fall roughly into two categories: text and binary. Python strings are used for text data, and this chapter includes string information that we’ve skipped so far:

  • Unicode characters

  • Regular expression pattern matching
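
As a small preview of these two topics, here is a minimal sketch that uses only the standard library's unicodedata and re modules. The sample strings and variable names are made up for illustration; they are not taken from the chapter's own examples.

```python
import re
import unicodedata

# A text string whose last character is non-ASCII (e with an acute accent).
word = "caf\u00e9"
print(len(word))                   # counts characters, not bytes
print(unicodedata.name(word[-1]))  # official Unicode name of that character

# A regular expression that pulls every run of digits out of a line of text.
line = "order 66 shipped 1138 units"
numbers = re.findall(r"\d+", line)
print(numbers)
```

Note that re.findall() returns the matches as strings, so a further step such as int() would be needed before doing arithmetic with them.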

Then, we jump to binary data, and two more of Python’s built-in types:

  • Bytes for ...
