Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders

In Chapter 11, you started building your web scraping skills by learning how to decide what, how, and where to scrape. In this chapter, we’ll look at more advanced scrapers, such as browser-based scrapers and spiders, for gathering content.

We’ll also learn about debugging common problems with advanced web scraping and cover some of the ethical questions presented when scraping the Web. To begin, we’ll investigate browser-based web scraping: using a browser directly with Python to scrape content from the Web.

Browser-Based Parsing

Sometimes a site relies heavily on JavaScript or other code that runs after the page loads to populate its pages with content. In these cases, a normal web scraper is of little use: it retrieves only the initial HTML, so what you end up with is a nearly empty page. You’ll have the same problem if you want to interact with pages (e.g., if you need to click a button or enter some search text). In either situation, you’ll want to figure out how to screen read the page. Screen readers work by opening the page in a browser, then reading and interacting with it after it loads.
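To see why a normal scraper comes back nearly empty-handed, here is a minimal sketch using only Python’s standard library. The page source in RAW_HTML is hypothetical, invented for illustration; it stands in for what a plain HTTP request might return from a JavaScript-driven page. The helper names (VisibleTextParser, visible_text) are also assumptions, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP request might return it.
# The "results" div is empty; only a browser running the script fills it in.
RAW_HTML = """
<html>
  <body>
    <h1>Search results</h1>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML = "<p>42 matches</p>";
    </script>
  </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Collect the text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text chunks a static (non-browser) scraper would see."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(RAW_HTML))  # ['Search results']
```

The static parser finds the heading but never the “42 matches” paragraph, because that text exists only after a browser executes the script. Closing that gap is what a browser-based scraper is for.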

Tip

Screen readers work well for tasks that require walking through a series of actions to reach the information you need. For the same reason, screen reader scripts are also an easy way to automate routine web tasks.

The most commonly used screen reading library in Python is Selenium. Selenium is a Java ...
