The Power of Python and XML

Now that we’ve introduced you to the world of XML, we’ll look at what Python brings to the table. We’ll review the Python features that apply to XML, and then we’ll give some specific examples of Python with XML. As a very high-level language, Python includes many powerful data structures as part of the core language and libraries. The more recent versions of Python, from 2.0 onward, include excellent support for Unicode and an impressive range of encodings, as well as an excellent (and fast!) XML parser that provides character data from XML as Unicode strings. Python’s standard library also contains implementations of the industry-standard DOM and SAX interfaces for working with XML data, and additional support for alternate parsers and interfaces is available.

Of course, this much could be said of other modern high-level languages as well. Java certainly includes an impressive library of highly usable data structures, and Perl offers equivalent data structures also. What makes Python preferable to those languages and their libraries? There are several features, of which we briefly discuss the most important:

  • Python source code is easy to read and maintain.

  • The interactive interpreter makes it simple to try out code fragments.

  • Python is incredibly portable, but does not restrict access to platform-specific capabilities.

  • The object-oriented features are powerful without being obscure.

There are many languages capable of doing what can be done with Python, but it is rare to find all of the “peripheral” qualities of Python in any single language. These qualities do not so much make Python more capable as make it much easier to apply, reducing programming hours. This leaves more time for finding better ways to solve real problems, or simply lets the programmer move on to the next problem. Here we discuss these features in more detail.

Easy to read and maintain

As a programming language, Python exhibits a remarkable clarity of expression. Though some programmers accustomed to other languages view Python’s use of significant whitespace with surprise, most who work with it find that it makes Python source code significantly more readable than code in languages that require additional special characters to mark structure in the source. Python’s structures are not simpler than those of other languages, but the different syntax makes source code “feel” much cleaner in Python.

The use of whitespace also helps avoid minor stylistic differences, such as the placement of structural braces, so there’s a greater degree of visual consistency across code written by different programmers. While this may seem like a minor thing to many programmers, the effect is that maintaining code written by another programmer becomes much easier, simply because it’s easier to concentrate on the actual structure and algorithms of the code. For the individual programmer, this is a nice side benefit, but for a business, this results in lower expenses for code maintenance.

Exploratory programming in an interactive interpreter

Many modern high-level programming languages offer interpreters, but few offer an interactive one as successful as Python’s. Others, such as Java, do not generally offer interpreters at all. Perl, a language that is arguably very capable when used from a command line, is not equipped with a rich interactive interpreter. If we start the Perl interpreter without naming a script, it simply waits for us to type a complete script at the console, and then interprets the script when we’re done. It does allow us to pass a few commands directly on the command line, but there’s no ability to run one statement at a time and inspect the results as we go to determine whether each bit of code is doing exactly what we expect. With Python, the interactive interpreter provides a rich environment for executing individual statements and testing the results.

Portability without restrictions

The Python interpreter is one of the most portable language interpreters available. It is known to run on platforms ranging from PDAs and other embedded systems to some of the most powerful multiprocessor platforms ever built. It can run on more operating systems than perhaps any other interpreter. Moreover, carefully written application code can share much of this portability. Python provides a great array of abstractions that do just enough to hide platform differences while allowing the programmer to use the services of specific platforms when necessary.
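As a small illustration of this balance, the following sketch builds a file path portably and then branches on the platform only where it chooses to. The path components and the `platform_note` variable are made up for the example.

```python
import os
import sys

# Portable: os.path.join uses the separator of the platform we run on.
catalog_path = os.path.join("data", "catalog", "books.xml")

# Platform-specific details remain available when we ask for them.
if sys.platform.startswith("win"):
    platform_note = "running on Windows"
else:
    platform_note = "running on a non-Windows platform"

print(catalog_path)
print(platform_note)
```

The same source file runs unchanged on each platform; only the printed separator and note differ.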

When an application requires access to facilities or libraries that Python does not provide, Python also makes it easy to add extensions that take advantage of these additional facilities. Additional modules can be created (usually in C or C++, but other languages can be used as well) that allow Python code to call on external facilities efficiently.

Powerful but accessible object-orientation

At one time, it was common to hear about how object-oriented programming (OOP) would solve most of the technical problems programmers had to deal with in their code. Of course, programmers knew better, pushed back, and turned the concepts into useful tools that could be applied when appropriate (though how and when it should be applied may always be the subject of debate). Unfortunately, many languages that have strong support for OOP are either very tedious to work with (such as C++ or, to a lesser extent, Java), or they have not been as widely accepted for general use (such as Eiffel).

Python is different. The language supports object orientation without much of the syntactic overhead found in many widely used object-oriented languages, making it very easy to define new object types. Unlike many other languages, Python is highly polymorphic; interfaces are defined in much less stringent ways than in languages such as C++ and Java. This makes it easy to create useful objects without having to write code that exists only to conform to an interface, but that will not actually be used in a particular application. When combined with the excellent advantage taken by Python’s standard library of a variety of common interfaces, the value of creating reusable objects is easily recognized, all while the ease of implementing useful interfaces is maintained.
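A minimal sketch of this loosely defined notion of an interface follows. The names `ListWriter` and `write_titles` are invented for illustration. The function relies only on its argument having a `write()` method, so it works equally well with a standard library object and with a tiny class that declares no interface at all.

```python
import io

class ListWriter:
    """Hypothetical file-like object that implements only write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def write_titles(out, titles):
    # Only out.write() is required; no base class or declared interface.
    for title in titles:
        out.write("<title>%s</title>\n" % title)

titles = ["Making TeX Work"]

string_buffer = io.StringIO()   # a standard library file-like object
collector = ListWriter()        # our own minimal object

write_titles(string_buffer, titles)
write_titles(collector, titles)
```

Neither object needed methods such as `read()` or `close()` that this particular application never calls.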

Python Tools for XML

Three major packages provide Python tools for working with XML. These are, from the most commonly used to the largest:

  1. The Python standard library

  2. PyXML, produced by the Python XML Special Interest Group

  3. 4Suite, provided by Fourthought, Inc.

The Python standard library provides a minimal but useful set of interfaces to work with XML, including an interface to the popular Expat XML parser, an implementation of the lightweight Simple API for XML (SAX), and a basic implementation of the core Document Object Model (DOM). The DOM implementation supports Level 1 and much of Level 2 of the DOM specification from the W3C, but does not implement most of the optional features. The material in the standard library was drawn from material originally in the PyXML package, and additional material was contributed by leading Python XML developers.
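To give a feel for the lowest of these layers, here is a minimal sketch that uses the standard library’s Expat interface (`xml.parsers.expat`) directly. The callback name `start_element` and the `found` list are our own choices for the example; the one-line document is a cut-down stand-in for real data.

```python
import xml.parsers.expat

found = []

def start_element(name, attrs):
    # attrs is a dictionary mapping attribute names to values
    found.append((name, attrs.get("isbn")))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element

# The second argument marks this as the final chunk of input.
parser.Parse('<catalog><book isbn="1-56592-051-1"/></catalog>', True)

print(found)
```

SAX and the DOM, described below, are usually more convenient than working with Expat callbacks directly.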

PyXML is a more feature-laden package; it extends the standard library with additional XML parsers, a much more substantial DOM implementation (including more optional features), adapters that allow more parsers to support the SAX interface, XPath expression parsing and evaluation, XSLT transformations, and a variety of other helper modules. The package is maintained as a community effort by many of the most active Python/XML programmers.

4Suite is not a superset of the other packages, but is intended to be used in addition to PyXML. It offers additional DOM implementations tailored for different applications, support for the XLink and XPointer specifications, and tools for working with Resource Description Framework (RDF) data.

These are the packages used throughout the book; see Appendix A for more information on obtaining and installing them. Still more are available; see Appendix F for brief descriptions of several of these and references to more information online.

The SAX and DOM APIs

The two most basic and broadly used APIs to XML data are the SAX and DOM interfaces. These interfaces differ substantially; learning to determine which of them is appropriate for your application is an important step.

SAX defines a relatively low-level interface that is easy for XML parsers to support, but requires the application programmer to manage more details of using the information in the XML documents and performing operations on it. It offers the advantage of low overhead: no large data structures are constructed unless the application itself actually needs them. This allows many forms of processing to proceed much more quickly than could occur if more overhead were required, and much larger documents can be processed efficiently. It achieves this by being an event-oriented interface; using SAX is more like processing user-input events in a graphical user interface than manipulating a pre-constructed data structure. So how do you get “events” from an XML parser, and what kind of events might there be?

SAX defines a number of handler interfaces that your application can implement to receive events. The methods of these objects are called when the appropriate events are encountered in the XML document being parsed; each method can be thought of as the actual event, which fits well with object-oriented approaches to parsing. Events are categorized as content, document type, lexical, and error events; each category of events is handled using a distinct interface. The application can specify exactly which categories of events it is interested in receiving by providing the parser with the appropriate handlers and omitting those it does not need. Python’s XML support provides base classes that allow you to implement only the methods you’re interested in, just inheriting do-nothing methods for events you don’t need.
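As a rough sketch of this arrangement, the handler below (the class name `ElementCounter` is invented for the example) overrides a single content method and inherits do-nothing versions of all the others. No document type or lexical handler is supplied, so those categories of events are simply not delivered.

```python
import xml.sax
import xml.sax.handler

class ElementCounter(xml.sax.handler.ContentHandler):
    """Counts start tags; all other content events fall through to the
    do-nothing methods inherited from ContentHandler."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attributes):
        self.counts[name] = self.counts.get(name, 0) + 1

counter = ElementCounter()

# Only a content handler is provided; the default error handler is used.
xml.sax.parseString(
    b"<catalog><book><title>A</title></book><book/></catalog>",
    counter,
)

print(counter.counts)
```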

The most commonly used events are the content-related events, of which the most important are startElement, characters, and endElement. We look at SAX in depth in Chapter 3, but now let’s take a quick look at how we might use SAX to extract some useful information from a document. We’ll use a simple document; it’s easy to see how this would extend to something more complex. The document is shown here:

<catalog>
  <book isbn="1-56592-724-9">
    <title>The Cathedral &amp; the Bazaar</title>
    <author>Eric S. Raymond</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Making TeX Work</title>
    <author>Norman Walsh</author>
  </book>
  <!-- imagine more entries here... -->
</catalog>

If we want to create a dictionary that maps the ISBN numbers given in the isbn attribute of the book elements to the titles of the books (the content of the title elements), we would create a content handler (as shown in Example 1-1) that looks at the three events listed previously.

Example 1-1. bookhandler.py

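import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
  def __init__(self):
    xml.sax.handler.ContentHandler.__init__(self)
    self.inTitle = False  # true only while inside a title element
    self.mapping = {}     # maps ISBN to title

  def startElement(self, name, attributes):
    if name == "book":
      # start a fresh title buffer and remember this book's ISBN
      self.buffer = ""
      self.isbn = attributes["isbn"]
    elif name == "title":
      self.inTitle = True

  def characters(self, data):
    # text may arrive in several pieces, so accumulate it
    if self.inTitle:
      self.buffer += data

  def endElement(self, name):
    if name == "title":
      self.inTitle = False
      self.mapping[self.isbn] = self.buffer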

Extracting the information we’re looking for is now trivial. If the code above is in bookhandler.py and our sample document is in books.xml, we could do this in an interactive session:

>>> import xml.sax
>>> import bookhandler
>>> import pprint
>>> 
>>> parser = xml.sax.make_parser()
>>> handler = bookhandler.BookHandler()
>>> parser.setContentHandler(handler)
>>> parser.parse("books.xml")
>>> pprint.pprint(handler.mapping)
{u'1-56592-051-1': u'Making TeX Work',
 u'1-56592-724-9': u'The Cathedral & the Bazaar'}

For reference material on the handler object methods, refer to Appendix C.

The DOM is quite the opposite of SAX. SAX offers a very small window of view that passes over the input document, relying on the application to infer the whole; the DOM gives the whole document to the application, which must then extract the finer details for itself. Instead of reporting individual events to the application as it handles the corresponding syntax in the document, the parser builds an object that represents the entire document as a hierarchical structure and hands that object to the application. Although there is no requirement that the document be completely parsed and stored in memory when the object is provided to the application, most implementations work that way for simplicity. Some implementations avoid this; it is certainly possible to create a DOM implementation that parses the document lazily or uses some kind of persistent storage to keep the parsed document instead of an in-memory structure.
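The standard library’s `xml.dom.pulldom` module offers one related approach. It is not a full lazy DOM, but it lets the application pull events and build DOM subtrees only for the parts it cares about. The sketch below is our own illustration with a cut-down catalog held in a string; the `DOCUMENT` and `titles` names are made up for the example.

```python
from xml.dom import pulldom

DOCUMENT = """<catalog>
  <book isbn="1-56592-724-9"><title>The Cathedral &amp; the Bazaar</title></book>
  <book isbn="1-56592-051-1"><title>Making TeX Work</title></book>
</catalog>"""

titles = {}
events = pulldom.parseString(DOCUMENT)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "book":
        # Build the DOM subtree for this one book element only.
        events.expandNode(node)
        title = node.getElementsByTagName("title")[0]
        # The text may be split across several text nodes, so join them.
        titles[node.getAttribute("isbn")] = "".join(
            child.data
            for child in title.childNodes
            if child.nodeType == child.TEXT_NODE
        )

print(titles)
```

Everything outside the expanded `book` elements is seen only as a stream of events.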

The DOM provides objects called nodes that represent parts of a document to the application. There are several types of nodes, each used for a different kind of construct. It is important to understand that the nodes of the DOM do not directly correspond to SAX events, although many are similar. The easiest way to see the difference is to look at how elements and their content are represented in both APIs. In SAX, an element is represented by start and end events, and its content is represented by all the events that come between the start and the end. The DOM provides a single object that represents the element, and it provides methods that allow the application to get the child nodes that represent the content of the element. Different node types are provided for elements, text, and just about everything else that can exist in an XML document.
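A short sketch using the standard library’s minidom makes the node types concrete. The one-line document and the variable names are chosen for the example. A single object stands for the whole `book` element, and its content is reached through its child nodes.

```python
import xml.dom.minidom
from xml.dom import Node

doc = xml.dom.minidom.parseString(
    '<book isbn="1-56592-051-1"><title>Making TeX Work</title><!-- note --></book>'
)

book = doc.documentElement         # one object for the whole element
title, comment = book.childNodes   # its content, as child nodes
text = title.firstChild            # the character data inside <title>

# Each kind of construct has its own node type.
for node in (book, title, text, comment):
    print(node.nodeType, node.nodeName)
```

In SAX, the same fragment would arrive as a sequence of start, character, and end events rather than as objects that can be revisited.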

We go into more detail and see some extended examples using the DOM in Chapter 4, and a detailed reference to the DOM API is given in Appendix D. For a quick taste of the DOM, let’s write a snippet of code that does the same thing we do with SAX in Example 1-1, but using the basic DOM implementation from the Python standard library, as shown in Example 1-2.

Example 1-2. dombook.py

import pprint

import xml.dom.minidom
from xml.dom.minidom import Node

# parse the whole document into a tree of nodes
doc = xml.dom.minidom.parse("books.xml")

mapping = {}

for book in doc.getElementsByTagName("book"):
  isbn = book.getAttribute("isbn")
  for titleNode in book.getElementsByTagName("title"):
    # the title text may be split across several text nodes
    title = ""
    for child in titleNode.childNodes:
      if child.nodeType == Node.TEXT_NODE:
        title += child.data
    mapping[isbn] = title

# mapping now has the same value as in the SAX example:
pprint.pprint(mapping)

It should be clear that we’re dealing with something very different here! While there’s about the same amount of code in the DOM example, it can be very difficult to develop reusable components with the DOM, whereas experience with SAX often points the way to reusable components with only a small bit of refactoring. It is possible to reuse DOM code, but the mindset required is very different. What the DOM provides to compensate is that a document can be manipulated at arbitrary locations with full knowledge of the complete document, and the document contents can be extracted in different ways by different parts of an application without having to parse the document more than once. For some applications, this proves to be a highly motivating reason to use the DOM instead of SAX.
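To illustrate manipulating a document in place, here is a minimal sketch using minidom. The one-book catalog and the `checked` attribute are invented for the example. The document is parsed once, modified, and then serialized again.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<catalog><book isbn="1-56592-051-1">'
    '<title>Making TeX Work</title></book></catalog>'
)

for book in doc.getElementsByTagName("book"):
    # Add a new child element with some text content.
    author = doc.createElement("author")
    author.appendChild(doc.createTextNode("Norman Walsh"))
    book.appendChild(author)
    # Attributes can be changed just as freely.
    book.setAttribute("checked", "yes")

result = doc.documentElement.toxml()
print(result)
```

Doing the same with SAX would mean re-emitting the whole document as events while splicing in the new content at the right moment.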

More Ways to Extract Information

SAX and the DOM give us some powerful tools for working with XML, but they clearly require a lot of code and attention to detail to use effectively in a large application. In both cases, working with complex data requires a great deal of work just to extract the interesting bits from the XML documents that contain the data. Now, what sorts of tools would we normally turn to when dealing with complex data sets? Two that come to mind are higher-level abstractions (such as APIs that do more work, and specialized task-oriented languages), and preprocessing techniques (transforming data from one form to another more suitable to the task at hand). Fortunately, both of these are available to us when working with XML from Python.

When an XML user wants to specify a portion of a document based on possibly complex criteria, she uses a language which lets her write the specification concisely; that language is called the XML Path Language, or XPath. Support for XPath is available in the 4Suite package, and has recently been added to the PyXML package as well. Using XPath, a query can be written that selects nodes from a DOM tree based on the element names, attribute values, textual content, and relationships between the nodes. We cover XPath in some detail, including how to use it with a DOM tree in Python, in Chapter 5.
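For a quick taste of path-based selection without installing anything, the sketch below uses `xml.etree.ElementTree` from the standard library, which supports only a limited subset of XPath. Note that this is a substitute chosen for the example, not the 4Suite or PyXML API, and the variable names are our own.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<catalog>
  <book isbn="1-56592-724-9"><title>The Cathedral &amp; the Bazaar</title></book>
  <book isbn="1-56592-051-1"><title>Making TeX Work</title></book>
</catalog>""")

# Select the title of the book whose isbn attribute has a given value.
matches = root.findall("./book[@isbn='1-56592-051-1']/title")
selected_titles = [element.text for element in matches]

print(selected_titles)
```

The single path expression replaces the nested loops and node-type tests needed in the SAX and DOM examples.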

Other times, what we’d really like is a new document that either contains less information or arranges it very differently. For this, we need a way to specify a transformation of a document that generates another document. This is provided by Extensible Stylesheet Language Transformations (XSLT). Originally developed as part of a new specification for stylesheets, XSLT is an XML-based language that is used to define transformations from XML to other formats. XSLT is most commonly used with XML or HTML as the output format. Chapter 6 describes this language and shows how to use it in Python.
