Chapter 11. Working with Dirty Data

So far in this book, I’ve ignored the problem of badly formatted data by using generally well-formatted data sources and dropping data entirely if it deviated from what was expected. In web scraping, however, you often can’t be too picky about where you get your data or what it looks like.

Errant punctuation, inconsistent capitalization, line breaks, and misspellings make dirty data a big problem on the web. This chapter covers a few tools and techniques that help you in two ways: preventing the problem at the source by changing the way you write code, and cleaning the data after it’s in the database.
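To make the kinds of problems listed above concrete, here is a minimal sketch of a cleaning step for a single scraped string. The function name `normalize_text` and the specific rules it applies are assumptions chosen for illustration, not the approach this chapter goes on to develop. It handles line breaks, stray punctuation, and capitalization, and makes no attempt to fix misspellings.

```python
import re
import string
import unicodedata


def normalize_text(raw: str) -> str:
    """Return a cleaned, lowercased version of a scraped string.

    Hypothetical helper for illustration only.
    """
    # Fold Unicode compatibility forms (a non-breaking space becomes a plain space).
    text = unicodedata.normalize("NFKC", raw)
    # Collapse line breaks, tabs, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Strip ASCII punctuation from both ends of each word, keeping internal marks
    # such as the apostrophe in "it's".
    words = [word.strip(string.punctuation) for word in text.split(" ")]
    # Drop words that consisted only of punctuation, then lowercase the rest.
    return " ".join(word for word in words if word).lower()


if __name__ == "__main__":
    print(normalize_text('  Hello,\n  WORLD!!  -- it\'s   "dirty"\tdata. '))
```

Whether lowercasing or punctuation stripping is appropriate depends on what the data is used for; names and identifiers, for example, may need their original case preserved.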

This is the chapter where web scraping intersects with its close relative, data science. While the job title of “data scientist” might conjure mental images of cutting-edge programming techniques and advanced mathematics, the truth is that a lot of it is grunt work. Someone has to clean and normalize these millions of records before they can be used to build a machine learning model, and that person is the data scientist.
