Web Scraping with Python

by Ryan Mitchell

July 2015

Intermediate to advanced

256 pages

6h 28m

English

O'Reilly Media, Inc.

What Is Web Scraping?; Why Web Scraping?; About This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably
You Don’t Always Need a Hammer; Another Serving of BeautifulSoup; find() and findAll() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions; Beyond BeautifulSoup
Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet; Crawling with Scrapy
How APIs Work; Common Conventions; Methods; Authentication; Responses; API Calls; Echo Nest; A Few Examples; Twitter; Getting Started; A Few Examples; Google APIs; Getting Started; A Few Examples; Parsing JSON; Bringing It All Back Home; More About APIs
Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; “Six Degrees” in MySQL; Email
Document Encoding; Text; Text Encoding and the Global Internet; CSV; Reading CSV Files; PDF; Microsoft Word and .docx
Cleaning in Code; Data Normalization; Cleaning After the Fact; OpenRefine

Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources
Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication; Other Form Problems
A Brief Introduction to JavaScript; Common JavaScript Libraries; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Handling Redirects; A Final Note on JavaScript
Overview of Libraries; Pillow; Tesseract; NumPy; Processing Well-Formatted Text; Scraping Text from Images on Websites; Reading CAPTCHAs and Training Tesseract; Training Tesseract; Retrieving CAPTCHAs and Submitting Solutions
A Note on Ethics; Looking Like a Human; Adjust Your Headers; Handling Cookies; Timing Is Everything; Common Form Security Features; Hidden Input Field Values; Avoiding Honeypots; The Human Checklist
An Introduction to Testing; What Are Unit Tests?; Python unittest; Testing Wikipedia; Testing with Selenium; Interacting with the Site; Unittest or Selenium?
Why Use Remote Servers?; Avoiding IP Address Blocking; Portability and Extensibility; Tor; PySocks; Remote Hosting; Running from a Website Hosting Account; Running from the Cloud; Additional Resources; Moving Forward
Installation and “Hello, World!”
Trademarks, Copyrights, Patents, Oh My!; Copyright Law; Trespass to Chattels; The Computer Fraud and Abuse Act; robots.txt and Terms of Service; Three Web Scrapers; eBay versus Bidder’s Edge and Trespass to Chattels; United States v. Auernheimer and The Computer Fraud and Abuse Act; Field v. Google: Copyright and robots.txt

Content preview from Web Scraping with Python

Chapter 6. Reading Documents

It is tempting to think of the Internet primarily as a collection of text-based websites interspersed with newfangled web 2.0 multimedia content that can mostly be ignored for the purposes of web scraping. However, this ignores what the Internet most fundamentally is: a content-agnostic vehicle for transmitting files.

Although the Internet has been around in some form or another since the late 1960s, HTML didn’t debut until 1992. Until then, the Internet consisted mostly of email and file transmission; the concept of web pages as we know them today didn’t really exist. In other words, the Internet is not a collection of HTML files. It is a collection of information, with HTML files often being used as a frame to showcase it. Without being able to read a variety of document types, including text, PDF, images, video, email, and more, we are missing out on a huge part of the available data.

This chapter covers working with documents, whether you’re downloading them to a local folder or reading them and extracting data. We’ll also look at the various types of text encoding, an understanding of which makes it possible to read even foreign-language HTML pages.

Document Encoding

A document’s encoding tells applications (whether your computer’s operating system or your own Python code) how to read it. The encoding can usually be deduced from the file extension, although the encoding does not mandate any particular extension. I could, for example, save ...
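The gap between a file's extension and its actual encoding can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the helper names `guess_by_extension` and `sniff_signature`, the sample filenames, and the sample bytes are all hypothetical, and the signature table covers only three well-known formats. The standard-library `mimetypes` module guesses a type from the filename alone, while the second helper inspects the leading bytes of the content itself.

```python
import mimetypes

# Leading byte signatures ("magic numbers") for a few common formats.
# Deliberately tiny table for illustration; real files come in many more types.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def guess_by_extension(filename):
    """Guess a MIME type from the filename alone (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def sniff_signature(data):
    """Guess a MIME type from the first bytes of the content (hypothetical helper)."""
    for signature, mime_type in SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None  # unknown or plain text; the bytes alone don't say


# Made-up sample: PNG-signed bytes stored under a misleading .txt name.
sample_name = "notes.txt"
sample_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

print("by extension:", guess_by_extension(sample_name))
print("by content:  ", sniff_signature(sample_bytes))
```

Here the filename suggests plain text while the bytes carry a PNG signature, which is why a scraper that trusts extensions alone can misread what it downloads.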