Web Scraping with Python

by Ryan Mitchell

July 2015

Intermediate to advanced

256 pages

6h 28m

English

O'Reilly Media, Inc.

Content preview from Web Scraping with Python

Chapter 2. Advanced HTML Parsing

When Michelangelo was asked how he could sculpt a work of art as masterful as his David, he is famously reported to have said: “It is easy. You just chip away the stone that doesn’t look like David.”

Although web scraping is unlike marble sculpting in most other respects, we must take a similar attitude when it comes to extracting the information we’re seeking from complicated web pages. There are many techniques for chipping away the content that doesn’t look like what we’re searching for until we arrive at the information we want. In this chapter, we’ll take a look at parsing complicated HTML pages in order to extract only the information we’re looking for.

You Don’t Always Need a Hammer

It can be tempting, when faced with a Gordian Knot of tags, to dive right in and use multiline statements to try to extract your information. However, keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!

Let’s say you have some target content. Maybe it’s a name, statistic, or block of text. Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML attributes to be found. Let’s say you dive right in and write something like the following line to attempt extraction:

bsObj.findAll("table")[4].findAll("tr")[2].find(
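
To make the fragility concrete, here is a minimal sketch of what can happen to a purely positional lookup when a page's layout shifts. It is an illustration under stated assumptions, not code from the book: the markup, the names ORIGINAL, REDESIGNED, and positional_extract, and the redesign itself are all hypothetical, and the standard library's xml.etree.ElementTree stands in for BeautifulSoup so the example runs without extra installs. The same reasoning applies to a chain of findAll() calls indexed by position.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the value we want is in the second table, second row.
ORIGINAL = """
<html><body>
  <table>
    <tr><td>Home</td></tr>
    <tr><td>About</td></tr>
  </table>
  <table>
    <tr><td>Name</td></tr>
    <tr><td>Ada Lovelace</td></tr>
  </table>
</body></html>
""".strip()

# The same page after a hypothetical redesign adds a banner table at the top.
REDESIGNED = ORIGINAL.replace(
    "<body>",
    "<body><table><tr><td>Banner</td></tr></table>",
)


def positional_extract(markup):
    """Select the target purely by position: table 2, row 2, first cell."""
    root = ET.fromstring(markup)
    tables = root.findall(".//table")   # every table, in document order
    rows = tables[1].findall("tr")      # rows of the second table
    return rows[1].find("td").text      # first cell of the second row


print(positional_extract(ORIGINAL))    # the value we intended to scrape
print(positional_extract(REDESIGNED))  # a different cell, with no error raised
```

In this sketch the second call does not fail. It returns the text of a navigation cell instead, because every table index has shifted by one. A scraper built this way can keep running while quietly collecting the wrong data, which is part of what makes such code difficult to debug.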

ISBN: 9781491910283
