Parsing an XML File with xml.parsers.expat

Credit: Mark Nenadov

Problem

The expat parser is normally used through the SAX interface, but sometimes you may want to use expat directly to extract the best possible performance.

Solution

Python is very explicit about the lower-level mechanisms that its higher-level modules’ packages use. You’re normally better off accessing the higher levels, but sometimes, in the last few stages of an optimization quest, or just to gain better understanding of what, exactly, is going on, you may want to access the lower levels directly from your code. For example, here is how you can use Expat directly, rather than through SAX:

import xml.parsers.expat, sys

class MyXML:
    Parser = ""

    # Prepare for parsing
    def _ _init_ _(self, xml_filename):
        assert xml_filename != ""
        self.xml_filename = xml_filename
        self.Parser = xml.parsers.expat.ParserCreate(  )

        self.Parser.CharacterDataHandler = self.handleCharData
        self.Parser.StartElementHandler = self.handleStartElement
        self.Parser.EndElementHandler = self.handleEndElement

    # Parse the XML file
    def parse(self):
        try:
            xml_file = open(self.xml_filename, "r")
        except:
            print "ERROR: Can't open XML file %s"%self.xml_filename
            raise
        else:
            try: self.Parser.ParseFile(xml_file)
            finally: xml_file.close(  )

    # to be overridden by implementation-specific methods
    def handleCharData(self, data): pass
    def handleStartElement(self, name, attrs): pass
    def handleEndElement(self, name): pass

Discussion

This recipe presents a reusable way to use xml.parsers.expat directly to parse an XML file. SAX is more standardized and rich in functionality, but expat is also usable, and sometimes it can be even lighter than the already lightweight SAX approach. To reuse the MyXML class, all you need to do is define a new class, inheriting from MyXML. Inside your new class, override the inherited XML handler methods, and you’re ready to go.

Specifically, the MyXML class creates a parser object that does callbacks to the callables that are its attributes. The StartElementHandler callable is called at the start of each element, with the tag name and the attributes as arguments. EndElementHandler is called at the end of each element, with the tag name as the only argument. Finally, CharacterDataHandler is called for each text string the parser encounters, with the string as the only argument. The MyXML class uses the handleStartElement, handleEndElement, and handleCharData methods as such callbacks. Therefore, these are the methods you should override when you subclass MyXML to perform whatever application-specific processing you require.

See Also

Recipe 12.2, Recipe 12.3, Recipe 12.4, and Recipe 12.6 for uses of the higher-level SAX API; while Expat was the brainchild of James Clark, Expat 2.0 is a group project, with a home page at http://expat.sourceforge.net/.

Get Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.