Searching the Document

When we’re scraping a web page, we’re generally interested in only a small part of it. But a document contains so much information that we need some way of telling Nokogiri which particular bit of the page we’re interested in. As humans, we do this visually. We might look at a page and see a table, grasping its contents from the title above it. We’d scan down the rows to find the particular record we’re interested in, and then across the columns to find the particular data value we’re looking for. At no point do we have anything more than a vague appreciation for the structure of the page we’re viewing; it doesn’t matter to us how the document is represented in HTML.

But, predictably, that’s not how a computer ...
