Chapter 7. Cleaning Your Dirty Data
So far in this book we’ve ignored the problem of badly formatted data by using generally well-formatted data sources, dropping data entirely if it deviated from what we were expecting. But often, in web scraping, you can’t be too picky about where you get your data from.
Due to errant punctuation, inconsistent capitalization, line breaks, and misspellings, dirty data can be a big problem on the Web. In this chapter, I’ll cover a few tools and techniques that help you prevent the problem at the source by changing the way you write code, and that help you clean the data once it’s in the database.
Cleaning in Code
Just as you write code to handle overt exceptions, you should practice defensive coding to handle the unexpected.
In linguistics, an n-gram is a sequence of n words used in text or speech. When doing natural-language analysis, it can often be handy to break up a piece of text by looking for commonly used n-grams, or recurring sets of words that are often used together.
In this section, we will focus on obtaining properly formatted n-grams rather than using them to do any analysis. Later, in Chapter 8, you can see 2-grams and 3-grams used for text summarization and analysis.
The following will return a list of 2-grams found in the Wikipedia article on the Python programming language:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getNgrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input ...
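The listing above breaks off in this excerpt, so the following is only a minimal, self-contained sketch of the general idea of collecting n-grams with a sliding window over a list of words. It is not the author's full listing: the function name ngrams_sketch and the sample sentence are hypothetical, it works on a plain string rather than a fetched Wikipedia page, and it assumes splitting on any run of whitespace rather than on single spaces.

```python
def ngrams_sketch(text, n):
    """Return every run of n consecutive words in text, in order."""
    # Assumption: str.split() with no argument splits on any run of
    # whitespace (spaces, tabs, newlines), unlike split(' ') above.
    words = text.split()
    # A window of n words fits at len(words) - n + 1 starting positions;
    # if the text has fewer than n words, the range is empty.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


# Hypothetical sample input, used only for illustration.
sample = "the quick brown fox jumps"
for gram in ngrams_sketch(sample, 2):
    print(gram)
```

Each result is a list of n words, so a five-word sentence yields four 2-grams. Real page text would still carry the punctuation, capitalization, and line-break problems described earlier in this chapter, which this sketch does not attempt to clean.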