Web Scraping with Python

by Ryan Mitchell

July 2015

Intermediate to advanced

256 pages

6h 28m

English

O'Reilly Media, Inc.

What Is Web Scraping?; Why Web Scraping?; About This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably
You Don’t Always Need a Hammer; Another Serving of BeautifulSoup; find() and findAll() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions; Beyond BeautifulSoup
Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet; Crawling with Scrapy
How APIs Work; Common Conventions; Methods; Authentication; Responses; API Calls; Echo Nest; A Few Examples; Twitter; Getting Started; A Few Examples; Google APIs; Getting Started; A Few Examples; Parsing JSON; Bringing It All Back Home; More About APIs
Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; “Six Degrees” in MySQL; Email
Document Encoding; Text; Text Encoding and the Global Internet; CSV; Reading CSV Files; PDF; Microsoft Word and .docx
Cleaning in Code; Data Normalization; Cleaning After the Fact; OpenRefine

Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources
Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication; Other Form Problems
A Brief Introduction to JavaScript; Common JavaScript Libraries; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Handling Redirects; A Final Note on JavaScript
Overview of Libraries; Pillow; Tesseract; NumPy; Processing Well-Formatted Text; Scraping Text from Images on Websites; Reading CAPTCHAs and Training Tesseract; Training Tesseract; Retrieving CAPTCHAs and Submitting Solutions
A Note on Ethics; Looking Like a Human; Adjust Your Headers; Handling Cookies; Timing Is Everything; Common Form Security Features; Hidden Input Field Values; Avoiding Honeypots; The Human Checklist
An Introduction to Testing; What Are Unit Tests?; Python unittest; Testing Wikipedia; Testing with Selenium; Interacting with the Site; Unittest or Selenium?
Why Use Remote Servers?; Avoiding IP Address Blocking; Portability and Extensibility; Tor; PySocks; Remote Hosting; Running from a Website Hosting Account; Running from the Cloud; Additional Resources; Moving Forward
Installation and “Hello, World!”
Trademarks, Copyrights, Patents, Oh My!; Copyright Law; Trespass to Chattels; The Computer Fraud and Abuse Act; robots.txt and Terms of Service; Three Web Scrapers; eBay versus Bidder’s Edge and Trespass to Chattels; United States v. Auernheimer and The Computer Fraud and Abuse Act; Field v. Google: Copyright and robots.txt

Content preview from Web Scraping with Python

Chapter 7. Cleaning Your Dirty Data

So far in this book we’ve ignored the problem of badly formatted data by using generally well-formatted data sources, dropping data entirely if it deviated from what we were expecting. But often, in web scraping, you can’t be too picky about where you get your data from.

Errant punctuation, inconsistent capitalization, line breaks, and misspellings make dirty data a big problem on the Web. In this chapter, I’ll cover a few tools and techniques that help you prevent the problem at the source, by changing the way you write code, and clean up the data once it’s in the database.

Cleaning in Code

Just as you write code to handle overt exceptions, you should practice defensive coding to handle the unexpected.

In linguistics, an n-gram is a sequence of n words used in text or speech. When doing natural-language analysis, it can often be handy to break up a piece of text by looking for commonly used n-grams, or recurring sets of words that are often used together.
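To make the definition concrete, here is a minimal sketch of the idea on a short, hardcoded sentence. The function name `word_ngrams` and the sample text are assumptions made for illustration only; they are not taken from the book's listing, and no cleaning of punctuation or capitalization is attempted.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, as a list of lists."""
    # Hypothetical helper for illustration; splits on any whitespace.
    words = text.split()
    # A text of len(words) words contains len(words) - n + 1 windows of size n.
    # If n exceeds the word count, the range is empty and so is the result.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sample = "the quick brown fox jumps"
print(word_ngrams(sample, 2))
```

For the five-word sample, asking for 2-grams yields four overlapping pairs, starting with `['the', 'quick']` and ending with `['fox', 'jumps']`.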

In this section, we will focus on obtaining properly formatted n-grams rather than using them to do any analysis. Later, in Chapter 8, you can see 2-grams and 3-grams used for text summarization and analysis.

The following will return a list of 2-grams found in the Wikipedia article on the Python programming language:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def getNgrams(input, n):
  input = input.split(' ')
  output = []
  for i in range(len(input ...