Chapter 5. Advanced HTML Parsing

When Michelangelo was asked how he could sculpt a work of art as masterful as his David, he is famously reported to have said, “It is easy. You just chip away the stone that doesn’t look like David.”

Although web scraping is unlike marble sculpting in most other respects, you must take a similar attitude when it comes to extracting the information you’re seeking from complicated web pages. In this chapter, we’ll explore various techniques to chip away any content that doesn’t look like the content you want, until you arrive at the information you’re seeking. Complicated HTML pages may look intimidating at first, but just keep chipping!

Another Serving of BeautifulSoup

In Chapter 4, you took a quick look at installing and running BeautifulSoup, as well as selecting objects one at a time. In this section, we’ll discuss searching for tags by attributes, working with lists of tags, and navigating parse trees.

Nearly every website you encounter contains stylesheets. Stylesheets are created so that web browsers can render HTML into colorful and aesthetically pleasing designs for humans. You might think of this styling layer as, at the very least, perfectly ignorable for web scrapers—but not so fast! CSS is, in fact, a huge boon for web scrapers because it requires the differentiation of HTML elements in order to style them differently.
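To make this concrete, here is a minimal sketch of how class attributes added for styling can double as selectors for a scraper. The HTML snippet and the class names `speaker` and `quote` are hypothetical, invented for illustration; a real page will use whatever names its designers chose. The sketch assumes BeautifulSoup is installed as the `bs4` package and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every element is a <span>, so only the CSS class
# (added by the page author for styling) tells speakers apart from quotes.
html = """
<div>
  <span class="speaker">Alice</span>
  <span class="quote">Hello there.</span>
  <span class="speaker">Bob</span>
  <span class="quote">Good morning.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by attribute with a dictionary...
speakers = [tag.get_text() for tag in soup.find_all("span", {"class": "speaker"})]

# ...or with the class_ keyword (trailing underscore because `class`
# is a reserved word in Python).
quotes = [tag.get_text() for tag in soup.find_all("span", class_="quote")]

print(speakers)
print(quotes)
```

Without the class attributes, all four spans would be indistinguishable by tag name alone, and the scraper would have to fall back on position or text content to separate them.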

CSS provides an incentive for web developers to add tags to HTML elements they might have otherwise left with the exact ...
