O'Reilly logo

Python Cookbook by David Ascher, Alex Martelli

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 12. Processing XML

Introduction

Credit: Paul Prescod, co-author of XML Handbook (Prentice Hall)

XML has become a central technology for all kinds of information exchange. Today, most new file formats that are invented are based on XML. Most new protocols are based upon XML. It simply isn’t possible to work with the emerging Internet infrastructure without supporting XML. Luckily, Python has had XML support since Version 2.0.

Python and XML are perfect complements for each other. XML is an open-standards way of exchanging information. Python is an open source language that processes the information. Python is strong at text processing and at handling complicated data structures. XML is text-based and is a way of exchanging complicated data structures.

That said, working with XML is not so seamless that it takes no effort. There is always somewhat of a mismatch between the needs of a particular programming language and a language-independent information representation. So there is often a requirement to write code that reads (deserializes or parses) and writes (serializes) XML.

Parsing XML can be done with code written purely in Python or with a module that is a C/Python mix. Python comes with the fast Expat parser written in C. This is what most XML applications use. Recipe 12.7 shows how to use Expat directly with its native API.

Although Expat is ubiquitous in the XML world, it is not the only parser available. There is an API called SAX that allows any XML parser to be plugged into a Python program, as anydbm allows any database to be plugged in. This API is demonstrated in recipes that check that an XML document is well-formed, extract text from a document, count the tags in a document, and do some minor tweaking of an XML document. These recipes should give you a good understanding of how SAX works.

Recipe 12.13 shows the generation of XML from lists. Those of you new to XML (and some with more experience) will think that the technique used is a little primitive. It just builds up strings using standard Python mechanisms instead of using a special XML-generation API. This is nothing to be ashamed of, however. For the vast majority of XML applications, no more sophisticated technique is required. Reading XML is much harder than writing it. Therefore, it makes sense to use specialized software (such as the Expat parser) for reading XML, but nothing special for writing it.

XML-RPC is a protocol built on top of XML for sending data structures from one program to another, typically across the Internet. XML-RPC allows programmers to completely hide the implementation languages of the two communicating components. Two components running on different operating systems, written in different languages, can communicate easily. XML-RPC is built into Python 2.2. This chapter does not deal with XML-RPC because, together with its alternatives (which include SOAP, another distributed-processing protocol that also relies on XML), XML-RPC is covered in Chapter 13.

The other recipes are a little bit more eclectic. For example, one shows how to extract information from an XML document in environments where performance is more important than correctness (e.g., an interactive editor). Another shows how to auto-detect the Unicode encoding that an XML document uses without parsing the document. Unicode is central to the definition of XML, so it helps to understand Python’s Unicode objects if you will be doing sophisticated work with XML.

The PyXML extension package has a variety of useful tools for working with XML in more advanced ways. It has a full implementation of the Document Object Model (DOM)—as opposed to the subset bundled with Python itself—and a validating XML parser written entirely in Python. The DOM is an API that loads an entire XML document into memory. This can make XML processing easier for complicated structures in which there are many references from one part of the document to another, or when you need to correlate (e.g., compare) more than one XML document. There is only one really simple recipe that shows how to normalize an XML document with the DOM (Recipe 12.9), but you’ll find many other examples in the PyXML package (http://pyxml.sourceforge.net/).

There are also two recipes that focus on XSLT: Recipe 12.5 shows how to drive two different XSLT engines, and Recipe 12.10 shows how to control XSLT stylesheet loading when using the XSLT engine that comes with the FourThought 4Suite package (http://www.4suite.org/). This package provides a sophisticated set of open source XML tools above and beyond those provided in core Python or in the PyXML package. In particular, this package has implementations of a variety of standards, such as XPath, XSLT, XLink, XPointer, and RDF. This is an excellent resource for XML power users.

For more information on using Python and XML together, see Python and XML by Christopher A. Jones and Fred L. Drake, Jr. (O’Reilly).

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required