Web Scraping with Python

by Ryan Mitchell

July 2015

Intermediate to advanced

256 pages

6h 28m

English

O'Reilly Media, Inc.

What Is Web Scraping?; Why Web Scraping?; About This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably
You Don’t Always Need a Hammer; Another Serving of BeautifulSoup; find() and findAll() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions; Beyond BeautifulSoup
Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet; Crawling with Scrapy
How APIs Work; Common Conventions; Methods; Authentication; Responses; API Calls; Echo Nest; A Few Examples; Twitter; Getting Started; A Few Examples; Google APIs; Getting Started; A Few Examples; Parsing JSON; Bringing It All Back Home; More About APIs
Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; “Six Degrees” in MySQL; Email
Document Encoding; Text; Text Encoding and the Global Internet; CSV; Reading CSV Files; PDF; Microsoft Word and .docx
Cleaning in Code; Data Normalization; Cleaning After the Fact; OpenRefine

Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources
Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication; Other Form Problems
A Brief Introduction to JavaScript; Common JavaScript Libraries; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Handling Redirects; A Final Note on JavaScript
Overview of Libraries; Pillow; Tesseract; NumPy; Processing Well-Formatted Text; Scraping Text from Images on Websites; Reading CAPTCHAs and Training Tesseract; Training Tesseract; Retrieving CAPTCHAs and Submitting Solutions
A Note on Ethics; Looking Like a Human; Adjust Your Headers; Handling Cookies; Timing Is Everything; Common Form Security Features; Hidden Input Field Values; Avoiding Honeypots; The Human Checklist
An Introduction to Testing; What Are Unit Tests?; Python unittest; Testing Wikipedia; Testing with Selenium; Interacting with the Site; Unittest or Selenium?
Why Use Remote Servers?; Avoiding IP Address Blocking; Portability and Extensibility; Tor; PySocks; Remote Hosting; Running from a Website Hosting Account; Running from the Cloud; Additional Resources; Moving Forward
Installation and “Hello, World!”
Trademarks, Copyrights, Patents, Oh My!; Copyright Law; Trespass to Chattels; The Computer Fraud and Abuse Act; robots.txt and Terms of Service; Three Web Scrapers; eBay versus Bidder’s Edge and Trespass to Chattels; United States v. Auernheimer and The Computer Fraud and Abuse Act; Field v. Google: Copyright and robots.txt

Content preview from Web Scraping with Python

Chapter 11. Image Processing and Text Recognition

From Google’s self-driving cars to vending machines that recognize counterfeit currency, machine vision is a huge field with far-reaching goals and implications. In this chapter, we will focus on one very small aspect of the field: text recognition, specifically how to recognize and use text-based images found online by using a variety of Python libraries.
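As a rough illustration of what text recognition with Python libraries can look like, here is a minimal sketch. It is not taken from the book: it assumes the third-party Pillow and pytesseract packages plus an installed Tesseract binary, and the function names `normalize_ocr_text` and `image_to_text` are hypothetical helpers introduced only for this example.

```python
import re


def normalize_ocr_text(raw):
    """Collapse the stray whitespace and blank lines that OCR output often contains."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def image_to_text(path):
    """Return cleaned text recognized in the image file at `path`.

    Assumption: Pillow, pytesseract, and the Tesseract binary are installed.
    They are imported lazily so the pure-Python helper above works without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        # Converting to grayscale is a common, optional preprocessing step.
        raw = pytesseract.image_to_string(image.convert("L"))
    return normalize_ocr_text(raw)
```

With those dependencies in place, calling `image_to_text("scan.png")` (a placeholder filename) would return whatever text Tesseract manages to recognize. Accuracy depends heavily on how clean the image is.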

Using an image in lieu of text is a common technique when you don’t want text to be found and read by bots. This is often seen on contact forms where an email address is partially or completely rendered as an image. Depending on how skillfully it is done, it might not even be noticeable to human viewers, but bots have a very difficult time reading these images, and the technique is enough to stop most spammers from acquiring your email address.

CAPTCHAs, of course, take advantage of the fact that users can read security images but most bots can’t. Some CAPTCHAs are more difficult than others, an issue we’ll tackle later in this book.

But CAPTCHAs aren’t the only place on the Web where scrapers need image-to-text translation assistance. Even in this day and age, many documents are simply scanned from hard copies and put on the Web, making these documents inaccessible as far as much of the Internet is concerned, although they are “hiding in plain sight.” Without image-to-text capabilities, the only way to make these documents accessible is for a human to type them up by hand—and ...