Getting Started with Pyparsing

by Paul McGuire
October 2007
Intermediate to advanced
65 pages
1h 33m
English
O'Reilly Media, Inc.

Extracting Data from a Web Page

The Internet has become a vast source of freely available data, no further away than the browser window on your home computer. While some resources on the Web are formatted for easy consumption by computer programs, the majority of content is intended for human readers using a browser application, with formatting done using HTML markup tags.

Sometimes you have your own Python script that needs to use tabular or reference data from a web page. If the data has not already been converted to easily processed comma-separated values or some other digestible format, you will need to write a parser that "reads around" the HTML tags and gets the actual text data.
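As a rough illustration of what "reading around" the tags means, the sketch below collects the text of table cells and ignores the markup around them. It uses the standard library's `html.parser` module as a stand-in rather than pyparsing, and the `CellTextCollector` class and the sample table are invented for this example.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text of each table cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per table row
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._in_cell:
            self.rows[-1][-1] += data


# Invented sample data, not taken from any real page.
page = (
    "<table>"
    "<tr><th>City</th><th>Zone</th></tr>"
    "<tr><td><b>Cairo</b></td><td>EET</td></tr>"
    "</table>"
)

collector = CellTextCollector()
collector.feed(page)
collector.close()

# Emit the cells as comma-separated values.
csv_lines = [",".join(row) for row in collector.rows]
print("\n".join(csv_lines))
```

The markup, including the nested `<b>` tag, is skipped, and only the cell text reaches the output.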

It is very common to see postings on Usenet from people trying to use regular expressions for this task. For instance, someone trying to extract image reference tags from a web page might try matching the tag pattern "<img src=quoted_string>". Unfortunately, since HTML tags can contain many optional attributes, and since web browsers are very forgiving in processing sloppy HTML tags, HTML retrieved from the wild can be full of surprises to the unwary web page scraper. Here are some typical "gotchas" when trying to find HTML tags:

Tags with extra whitespace or of varying upper-/lowercase

<img src="sphinx.jpeg">, <IMG SRC="sphinx.jpeg">, and <img src = "sphinx.jpeg" > are all equivalent tags.

Tags with unexpected attributes

The IMG tag will often contain optional attributes, such as align, alt, id, vspace, hspace ...
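The gotchas listed so far can be reproduced in a few lines. This sketch runs a naive regular expression over the three equivalent tags quoted above, plus a fourth tag with extra attributes, and compares it with a tag-aware parser. The parser here is the standard library's `html.parser`, used only as an illustrative stand-in and not as the pyparsing approach this book develops. The `ImgSrcCollector` class and the fourth sample tag are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# The three equivalent tags from the text, plus one with extra attributes.
html = """
<img src="sphinx.jpeg">
<IMG SRC="sphinx.jpeg">
<img src = "sphinx.jpeg" >
<img alt="Sphinx" align="left" src="sphinx.jpeg">
"""

# Naive pattern: lowercase tag, exact spacing, src as the only attribute.
naive = re.compile(r'<img src="([^"]*)">')
naive_hits = naive.findall(html)


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


collector = ImgSrcCollector()
collector.feed(html)
collector.close()

print(len(naive_hits))         # 1: only the first tag matches the naive regex
print(len(collector.sources))  # 4: the tag-aware parser finds every variant
```

The regular expression could be patched to handle each variation, but every patch makes the pattern harder to read, and the next unexpected attribute or spacing quirk breaks it again.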

ISBN: 9780596514235