Reading an Article

In this example, we use SAX to extract information from an XML document and put it to use. The documents our script works with are simple news articles, but they are enough to show how to work with elements, attributes, and textual content.

The trade-offs of using SAX depend on what you’re trying to accomplish and on how the XML is structured. SAX treats XML as a continuous stream, firing events to your handler as the parser encounters each piece of the document. Example 3-1 shows article.xml.

Example 3-1. article.xml

<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive"
           length="3k"
           author="Joe Reporter"
           distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
           announced that NASA has completed building a
           Warp Drive and has parked a ship that uses
           the drive in his back yard.  This individual
           claims that although he hasn't been contacted by
           NASA concerning the parked space vessel, he assumes
           that he will be launching it later this week to
           mount an expedition to the Andromeda Galaxy.
    </body>
</webArticle>
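
Before looking at the structure of article.xml, it may help to see the stream of events that SAX produces. The following is a minimal sketch, not the handler developed in this chapter: the class name EventLogger and the tiny sample document are assumptions made for illustration, and the code targets Python 3's standard xml.sax module. It records each event in the order the parser reports it.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records SAX events in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs is a mapping-like object; copy it into a plain dict.
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


# A made-up sample document, much smaller than article.xml.
sample = b'<note priority="high"><to>Reader</to></note>'

logger = EventLogger()
xml.sax.parseString(sample, logger)

for event in logger.events:
    print(event)
```

The handler never sees the document as a whole. It is told about the opening tag of an element, then any text, then the closing tag, and it must keep whatever state it needs between those calls.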

Example 3-1 structures its information in a few different ways: the root element carries attributes, the empty header element holds all of its data in attributes, and the body element holds free text. That variety makes it an interesting document to parse via SAX. We need to understand how article.xml is structured before writing a handler to parse it, so the handler ends up tightly coupled to the document’s structure.
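
To make that coupling concrete, here is a rough sketch of a handler that only works because it knows the element and attribute names used in Example 3-1. It is an illustration under stated assumptions rather than the handler this chapter goes on to build: the class name ArticleSketchHandler, its attribute names, and the shortened copy of the article body are all hypothetical, and the code assumes Python 3's standard xml.sax module.

```python
import xml.sax


class ArticleSketchHandler(xml.sax.ContentHandler):
    """Hypothetical handler hard-wired to the layout of article.xml."""

    def __init__(self):
        super().__init__()
        self.category = None
        self.title = None
        self.author = None
        self._in_body = False
        self._body_parts = []

    def startElement(self, name, attrs):
        # Each branch depends on a specific element name in the document.
        if name == "webArticle":
            self.category = attrs.get("category")
        elif name == "header":
            self.title = attrs.get("title")
            self.author = attrs.get("author")
        elif name == "body":
            self._in_body = True

    def characters(self, content):
        # Text can arrive in several chunks, so collect it piece by piece.
        if self._in_body:
            self._body_parts.append(content)

    def endElement(self, name):
        if name == "body":
            self._in_body = False

    @property
    def body(self):
        # Join the chunks and collapse runs of whitespace and newlines.
        return " ".join("".join(self._body_parts).split())


# A shortened copy of Example 3-1; the body text is abbreviated here.
ARTICLE = b"""<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive"
           length="3k"
           author="Joe Reporter"
           distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
           announced that NASA has completed building a
           Warp Drive.
    </body>
</webArticle>"""

handler = ArticleSketchHandler()
xml.sax.parseString(ARTICLE, handler)

print(handler.category)
print(handler.title, "by", handler.author)
print(handler.body)
```

If the document renamed header to heading, or moved the title from an attribute into a child element, this handler would silently collect nothing for it. That is the cost of the tight coupling described above.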

Writing a Simple Handler

You can write the ArticleHandler class to a new file, handlers.py; we’ll keep adding ...
