So far in this book, you’ve sidestepped the problem of badly formatted data by using generally well-formatted data sources and dropping any data that deviated from what you expected. But in web scraping, you often can’t be too picky about where your data comes from or what it looks like.
Because of errant punctuation, inconsistent capitalization, line breaks, and misspellings, dirty data can be a big problem on the web. This chapter covers a few tools and techniques to help you prevent the problem at the source by changing the way you write code, and clean the data after it’s in the database.
Just as you write code to handle overt exceptions, you should practice defensive coding to handle the unexpected.
In linguistics, an n-gram is a sequence of n words used in text or speech. When doing natural language analysis, it can often be handy to break up a piece of text by looking for commonly used n-grams, or recurring sets of words that are often used together.
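To make the definition concrete, here is a minimal sketch that builds n-grams from a short string. It assumes words are separated by whitespace and does no cleanup of punctuation or capitalization; the helper name `ngrams` and the sample sentence are illustrative only.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    # Assumption: splitting on whitespace is a good enough notion of a "word".
    words = text.split()
    # Slide a window of width n across the word list, one word at a time.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sample = "the quick brown fox jumps"
print(ngrams(sample, 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox'], ['fox', 'jumps']]
```

A text with fewer than n words yields an empty list, because the window never fits.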
This section focuses on obtaining properly formatted n-grams rather than on using them for analysis. Later, in Chapter 9, you can see 2-grams and 3-grams in action for text summarization and analysis.
The following returns a list of 2-grams found in the Wikipedia article on the Python programming language: