In this example, we look at how to extract and use information from an XML document using SAX. The documents our script works with are simple news articles, but they let us see how to work with elements, attributes, and textual content.
The trade-offs of using SAX depend on what you’re trying to accomplish and on how the XML is structured. SAX treats XML as a continuous stream, firing an event to your handler each time the parser encounters a start tag, an end tag, or a run of character data. Example 3-1 shows article.xml.
<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
  <header title="NASA Builds Warp Drive"
          length="3k"
          author="Joe Reporter"
          distribution="all"/>
  <body>Seattle, WA - Today an anonymous individual
  announced that NASA has completed building a
  Warp Drive and has parked a ship that uses
  the drive in his back yard. This individual
  claims that although he hasn't been contacted by
  NASA concerning the parked space vessel, he assumes
  that he will be launching it later this week to
  mount an expedition to the Andromeda Galaxy.
  </body>
</webArticle>
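To make the stream of events concrete, the following is a minimal sketch of a tracing handler built on Python 3's standard xml.sax module. The TraceHandler class and the shortened SAMPLE document are hypothetical names introduced only for this illustration; they are not part of the book's scripts.

```python
import xml.sax


class TraceHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records SAX events in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called once for each opening (or empty-element) tag.
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver one run of text in several chunks,
        # so this can fire more than once per element.
        if content.strip():
            self.events.append(("chars", content))

    def endElement(self, name):
        # Called once for each closing tag; an empty element such as
        # <header/> produces a start event followed by an end event.
        self.events.append(("end", name))


# A shortened stand-in for article.xml (assumption, for illustration only).
SAMPLE = (
    b'<webArticle category="news">'
    b'<header title="NASA Builds Warp Drive"/>'
    b"<body>Seattle, WA</body>"
    b"</webArticle>"
)

handler = TraceHandler()
xml.sax.parseString(SAMPLE, handler)
for kind, value in handler.events:
    print(kind, value)
```

The events arrive strictly in document order, and the handler never sees the document as a whole. Any state it needs, such as which element it is currently inside, has to be tracked by the handler itself.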
Example 3-1 structures its data in a few different ways: the root webArticle element and the empty header element carry their information in attributes, while the body element carries it as text. That mix makes the document interesting to parse via SAX. It also means we must understand how a document such as article.xml is structured before writing a handler to parse it, so the handler is tightly coupled to the document’s structure.
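As a rough illustration of that coupling, here is a sketch of a handler that hard-codes the element and attribute names from Example 3-1. It assumes Python 3 and the standard xml.sax module. The class name ArticleSketchHandler, its attributes, and the shortened SAMPLE document are all hypothetical; this is not the ArticleHandler class developed in the text.

```python
import xml.sax


class ArticleSketchHandler(xml.sax.ContentHandler):
    """Hypothetical handler tied to the layout of article.xml."""

    def __init__(self):
        super().__init__()
        self.category = None
        self.header = {}
        self._body_parts = []
        self._in_body = False

    def startElement(self, name, attrs):
        # Each branch depends on a name used in the document.
        if name == "webArticle":
            self.category = attrs.get("category")
        elif name == "header":
            # The header keeps all of its data in attributes.
            self.header = dict(attrs.items())
        elif name == "body":
            self._in_body = True

    def characters(self, content):
        # Collect text only while inside <body>; it may arrive in chunks.
        if self._in_body:
            self._body_parts.append(content)

    def endElement(self, name):
        if name == "body":
            self._in_body = False

    @property
    def body(self):
        return "".join(self._body_parts).strip()


# A shortened stand-in for article.xml (assumption, for illustration only).
SAMPLE = (
    b'<webArticle category="news" subcategory="technical">'
    b'<header title="NASA Builds Warp Drive" author="Joe Reporter"/>'
    b"<body> Seattle, WA </body>"
    b"</webArticle>"
)

sketch = ArticleSketchHandler()
xml.sax.parseString(SAMPLE, sketch)
print(sketch.category, sketch.header["title"], sketch.body)
```

If the document renamed header or moved the title into a child element, this handler would silently collect nothing for it, which is the practical cost of coupling a handler to one document structure.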
You can write the ArticleHandler class to a new file, handlers.py; we’ll keep adding ...