Retrieving information from a document is easy using the DOM. Most of the work lies in traversing the document tree and selecting the nodes that are actually interesting for the application. Once that is done, it is usually trivial to call a method of the node (or nodes), or to retrieve the value of an attribute of the node. In order to extract information using the DOM, however, we first need to get a DOM document object.
Perhaps the most glaring hole in the DOM specifications is that there is no facility in the API for retrieving a document object from an existing XML document. In a browser, the document is completely loaded before the DOM client code in the embedded or linked scripts can get to the document, so the document object is placed in a well-known location in the script’s execution environment. For applications that do not live in a web browser, this approach simply does not work, so we need another solution.
Our solution depends on the particular DOM implementation we use. We can always create a document object from a file, a string, or a URL.
Creating a DOM instance to work with is easy in Python. Using 4DOM, we need to call only one function to load a document from an open file:
import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream

doc = FromXmlStream(sys.stdin)
There are two convenient functions in the xml.dom.minidom module that can be used to load a document. The parse function takes a parameter that can be a string containing a filename or URL, or it can be a file object open for reading:
import sys
import xml.dom.minidom

doc = xml.dom.minidom.parse(sys.stdin)
Another function, parseString, can be used to load a document from a buffer containing XML text that has already been loaded into memory:
doc = xml.dom.minidom.parseString("<doc>My tiny document.</doc>")
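Both loading functions can be tried without touching the filesystem. The following is a minimal sketch that assumes Python 3 and the standard library's xml.dom.minidom; the io.BytesIO object merely stands in for an open file, and the sample text is made up for illustration.

```python
import io
import xml.dom.minidom

XML_TEXT = "<doc>My tiny document.</doc>"

# parse() accepts a filename or a file-like object open for reading;
# BytesIO plays the role of the open file here.
doc_from_file = xml.dom.minidom.parse(io.BytesIO(XML_TEXT.encode("utf-8")))

# parseString() accepts XML text that is already in memory.
doc_from_string = xml.dom.minidom.parseString(XML_TEXT)

print(doc_from_file.documentElement.tagName)
print(doc_from_string.documentElement.firstChild.data)
```

Either way, the result is a document object whose documentElement attribute gives access to the root element.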
You can use the constants built in to the DOM to see what type of node you are dealing with. It may be an element, an attribute, a CDATA section, or a host of other things. (All the node type constants are listed in Appendix D.)
To test a node’s type, compare its nodeType attribute to the particular constant you’re looking for. For example, a CDATASection instance has a nodeType equal to CDATA_SECTION_NODE. An Element (with potential children) has a nodeType equal to ELEMENT_NODE. When traversing a DOM tree, you can test a node at any point to determine whether it is what you’re looking for:
for node in nodes.childNodes:
    if node.nodeType == node.ELEMENT_NODE:
        print "Found it!"
The Node interface has other identifying properties, such as its value and name. The nodeName value represents the tag name for elements, while in a text node the nodeName is simply #text. The nodeValue attribute is None for elements, and holds the actual character data for a text node or other leaf-type node.
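To see these identifying properties side by side, here is a small sketch assuming Python 3 and xml.dom.minidom. The describe helper and the sample document are hypothetical, introduced only for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<note>Remember the milk</note>")
element = doc.documentElement
text = element.firstChild


def describe(node):
    """Return (kind, nodeName, nodeValue) for a node (hypothetical helper)."""
    if node.nodeType == node.ELEMENT_NODE:
        kind = "element"
    elif node.nodeType == node.TEXT_NODE:
        kind = "text"
    else:
        kind = "other"
    return kind, node.nodeName, node.nodeValue


print(describe(element))
print(describe(text))
```

The element reports its tag name and no value, while the text node reports #text as its name and the character data as its value.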
When dealing with a DOM tree, you primarily use nodes and node lists. A node list is a collection of nodes. Any level of an XML document can be represented as a node list. Each node in the list can in turn contain other node lists, representing the potential for infinite complexity of an XML document.
The Node interface features two attributes for quickly getting to a specific child node, as well as an attribute that holds a node list containing a node’s children. firstChild refers to the first child node of any given node; it is None if the node has no children. This is handy when you know exactly the structure of the document you’re dealing with. If you are working with a strict content model enforced by a schema or DTD, you may be able to count on the fact that the document is organized in a certain way (provided you included a validation step). But for the most part, it’s best to leverage the spirit of XML and actually traverse the document for the data you’re looking for, rather than assume there is logic to the location of the data. Regardless, firstChild can be very powerful, and is often used to retrieve the first element beneath a document element.
The lastChild attribute is similar to firstChild, but returns the last child node of any given node. Again, this can be handy if you know the exact structure of the document you’re working with, or if you’re trying to just get the last child regardless of the significance of that child.
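A short sketch of firstChild and lastChild, assuming Python 3 and xml.dom.minidom. The element names are invented, and the document deliberately contains no whitespace between elements so that no text nodes appear among the children.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
root = doc.documentElement

first = root.firstChild   # the <a/> element
last = root.lastChild     # the <c/> element

print(first.tagName, last.tagName)

# A node without children has no first or last child.
print(first.firstChild is None, first.lastChild is None)
```

If the same document were pretty-printed with line breaks and indentation, firstChild would typically be a whitespace text node instead of the first element, which is one reason to traverse rather than assume positions.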
The childNodes attribute contains a node list containing all the children of the given node. This attribute is used frequently when working with the DOM. When iterating over children of an element, the childNodes attribute can be used for simple iteration in the same way that you would iterate over a list:
for child in node.childNodes:
    print "Child:", child.nodeName
The value of the childNodes attribute is a NodeList object. For the purpose of retrieving information from the DOM, it behaves like a Python list, but does not support “slicing.” NodeList objects should not be used to modify the content of the DOM as the specific behaviors may differ among DOM implementations.
The NodeList interface features some additional members beyond those provided by lists. These are not commonly used with Python, but are available since the DOM specifies their presence and behavior. The length attribute indicates the number of nodes in the list. Note that length gives the total number of nodes, but that indexing begins at zero. For example, a NodeList with a length of 3 has nodes at indices 0, 1, and 2 (which mirrors the way a list is normally indexed in Python). Most Python programmers prefer to use the len built-in function, which works properly with NodeList objects.
The item method returns the item at the specific index passed in as a parameter. For example, item(1) returns the second node in the NodeList, or None if there are fewer than two nodes. This is distinct from the Python indexing operation, for which a NodeList raises IndexError for an index that is out of bounds.
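The difference between length, len, item, and plain indexing can be checked with a sketch like the following, which assumes Python 3 and xml.dom.minidom and uses an invented three-child document.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
children = doc.documentElement.childNodes

# The DOM-specified attribute and the Python built-in agree.
print(children.length, len(children))

# item() uses zero-based indices and returns None when out of range.
second = children.item(1)
missing = children.item(5)
print(second.tagName, missing)

# Python indexing raises IndexError for the same out-of-range index.
try:
    children[5]
    index_error_raised = False
except IndexError:
    index_error_raised = True
print(index_error_raised)
```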
Since XML documents are hierarchical and the DOM exposes them as a tree, it is reasonable to want to get the siblings of a node as well as its children. This is done using the previousSibling and nextSibling attributes. If a node is the first child of its parent, its previousSibling is None; likewise, if it is the last child, its nextSibling is None. If a node is the only child of its parent, both of these attributes are None, as expected.
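The sibling attributes can be explored with a brief sketch, again assuming Python 3 and xml.dom.minidom; the documents and element names are made up for the example.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
a, b, c = doc.documentElement.childNodes

# The first child has no previous sibling; the last has no next sibling.
print(a.previousSibling is None, c.nextSibling is None)

# Siblings link neighboring children in both directions.
print(a.nextSibling is b, c.previousSibling is b)

# An only child has neither a previous nor a next sibling.
only = xml.dom.minidom.parseString("<p><only/></p>").documentElement.firstChild
print(only.previousSibling is None, only.nextSibling is None)
```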
When combined with the firstChild or lastChild attributes, the sibling attributes can be used to iterate over an element’s children. The required code is slightly more verbose, but is also better suited for use when the document tree is being modified in certain ways, especially when nodes are being added to or removed from the element whose children are being iterated over.
For example, consider how Directory elements could be removed from another Directory element to leave us with a Directory containing only files. If we iterate over the top element using its childNodes attribute and remove child Directory elements as we see them, some nodes are not properly examined. (This happens because Python’s for loops use the index into the list, but we’re also shifting the remaining children to the left when we remove one, so the child that follows a removed node is skipped as the loop advances.) There are many ways to avoid skipping elements, but perhaps the simplest is to use nextSibling to iterate:
child = node.firstChild
while child is not None:
    next = child.nextSibling
    if (child.nodeType == node.ELEMENT_NODE
        and child.tagName == "Directory"):
        node.removeChild(child)
    child = next
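The skipping problem can be reproduced with a small experiment. This sketch assumes Python 3 and xml.dom.minidom, whose childNodes is a live list, so other DOM implementations may behave differently. The File element, the name attribute, and both function names are hypothetical.

```python
import xml.dom.minidom

XML_TEXT = (
    "<Directory name='top'>"
    "<Directory name='a'/><Directory name='b'/>"
    "<File name='f1'/><Directory name='c'/><File name='f2'/>"
    "</Directory>"
)


def remove_naively(node):
    """Remove Directory children while iterating childNodes directly."""
    for child in node.childNodes:
        if child.nodeType == node.ELEMENT_NODE and child.tagName == "Directory":
            node.removeChild(child)


def remove_with_siblings(node):
    """Remove Directory children, saving nextSibling before each removal."""
    child = node.firstChild
    while child is not None:
        following = child.nextSibling
        if child.nodeType == node.ELEMENT_NODE and child.tagName == "Directory":
            node.removeChild(child)
        child = following


def remaining_names(node):
    return [child.getAttribute("name") for child in node.childNodes]


naive_top = xml.dom.minidom.parseString(XML_TEXT).documentElement
remove_naively(naive_top)
naive_remaining = remaining_names(naive_top)

safe_top = xml.dom.minidom.parseString(XML_TEXT).documentElement
remove_with_siblings(safe_top)
safe_remaining = remaining_names(safe_top)

print(naive_remaining)
print(safe_remaining)
```

In the naive version, the directory named b slides into the slot of the removed directory a and is never examined, so it survives; the sibling-based loop leaves only the two files.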
The DOM can provide some advantages over SAX, depending on what you’re trying to do. For starters, when using the DOM, you don’t have to write a separate handler for each type of event, or set flags to group events together as was done earlier with SAX in Example 3-3. Imagine that you have a long record of purchase orders stacked up in XML. Someone has approached you about pulling part numbers, and only part numbers, out of the document for reporting purposes. With SAX, you can write a handler to look for elements with the name used to identify part numbers (sku in the example), and then set a flag to gobble up character events until the parser leaves the part number element. With the DOM, you have a different approach using the getElementsByTagName method of the Document interface.
To show how easy this can make some operations, let’s look at a simple example. Create a new XML file as shown in Example 4-1, po.xml. This document is the sample purchase order for the next script:
Example 4-1. po.xml
<?xml version="1.0"?>
<purchaseOrder>
  <item>
    <name>Mushroom Lamp</name>
    <sku>229-987488</sku>
    <price>$34.99</price>
    <qty>1</qty>
  </item>
  <item>
    <name>Bass Drum</name>
    <sku>228-988347</sku>
    <price>$199.99</price>
    <qty>1</qty>
  </item>
  <item>
    <name>Toy Steam Engine</name>
    <sku>221-388833</sku>
    <price>$19.99</price>
    <qty>1</qty>
  </item>
</purchaseOrder>
Using the DOM, you can easily create a list of nodes that references all nodes of a single element type within the document. For example, you could pull all of the sku elements from the document into a new list of nodes. This list can be used like any other NodeList object, with the difference that the nodes in the list may not share a single parent, as is the case with the childNodes value. Since the DOM works with the structural tree of the XML document, it is able to provide a simple method call to pull a subset of the document out into a separate node list. In Example 4-2, the getElementsByTagName method is used to create a single NodeList of all the sku elements within the document.
Our example shows that sku elements have text nodes as children, but we know that a string of text in the document may be presented in the DOM as multiple text nodes. To make the tree easier to work with, you can use the normalize method of the Node interface to convert all adjacent text nodes into a single text node, making it easy to use the firstChild attribute of the Element class to retrieve the complete text value of the sku elements reliably.
Example 4-2. po.py
#!/usr/bin/env python

from xml.dom.ext.reader.Sax2 import FromXmlStream
import sys

doc = FromXmlStream(sys.stdin)

for sku in doc.getElementsByTagName("sku"):
    sku.normalize()
    print "Sku: " + sku.firstChild.data
Example 4-2 requires considerably less code than what is required if you are implementing a SAX handler for the same task. The extraction can operate independently of other tasks that work with the document. When you run the program, again using po.xml, you receive something similar to the following on standard output:
Sku: 229-987488
Sku: 228-988347
Sku: 221-388833
You can see something similar being done using SAX in Example 3-3.
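For readers without 4DOM installed, roughly the same extraction can be sketched with the standard library. This version assumes Python 3 and xml.dom.minidom, embeds an abbreviated copy of the purchase order as a string, and wraps the loop in a hypothetical extract_skus function.

```python
import xml.dom.minidom

# Abbreviated copy of po.xml, kept inline so the sketch is self-contained.
PO_XML = """<?xml version="1.0"?>
<purchaseOrder>
  <item><name>Mushroom Lamp</name><sku>229-987488</sku></item>
  <item><name>Bass Drum</name><sku>228-988347</sku></item>
  <item><name>Toy Steam Engine</name><sku>221-388833</sku></item>
</purchaseOrder>"""


def extract_skus(xml_text):
    """Return the text of every sku element, in document order."""
    doc = xml.dom.minidom.parseString(xml_text)
    skus = []
    for sku in doc.getElementsByTagName("sku"):
        sku.normalize()  # merge adjacent text nodes into one
        skus.append(sku.firstChild.data)
    return skus


for value in extract_skus(PO_XML):
    print("Sku: " + value)
```

Returning a list rather than printing inside the loop is only a design choice for the sketch; it makes the extraction easy to reuse from other code.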
Let’s look at a program that puts many of these concepts together, and uses the article.xml file from the previous chapter (Example 3-1). Example 4-3 shows a recursive function used to extract text from a document’s elements.
Example 4-3. textme.py
#!/usr/bin/env python

from xml.dom.ext.reader.Sax2 import FromXmlStream
import sys

def findTextNodes(nodeList):
    for subnode in nodeList:
        if subnode.nodeType == subnode.ELEMENT_NODE:
            print "element node: " + subnode.tagName
            # call function again to get children
            findTextNodes(subnode.childNodes)
        elif subnode.nodeType == subnode.TEXT_NODE:
            print "text node: ",
            print subnode.data

doc = FromXmlStream(sys.stdin)
findTextNodes(doc.childNodes)
You can run this script passing article.xml as standard input:
$> python textme.py < article.xml
It should produce output similar to the following:
element node: webArticle
text node:
element node: header
text node:
element node: body
text node: Seattle, WA - Today an anonymous individual announced that NASA has completed building a Warp Drive and has parked a ship that uses the drive in his back yard. This individual claims that although he hasn't been contacted by NASA concerning the parked space vessel, he assumes that he will be launching it later this week to mount an exhibition to the Andromeda Galaxy.
text node:
You can see in the output how whitespace is treated as its own text node, and how contiguous strings of character data are kept together as text nodes as well. The exact output you see may vary from that presented here. Depending on the specific parser you use (consider different versions or different platforms as different parsers since the buffering interactions with the operating system can be relevant), the specific boundaries of text nodes may differ, and you may see contiguous blocks of character data presented as more than one text node.
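When whitespace-only text nodes and split text nodes get in the way, one possible approach is to normalize the tree first and skip text that is empty after stripping. The sketch below assumes Python 3 and xml.dom.minidom; the collect_text function is hypothetical, and the tiny document is only a stand-in for article.xml.

```python
import xml.dom.minidom

# Stand-in document; the real article.xml is not reproduced here.
ARTICLE_XML = (
    "<webArticle>\n"
    "  <header/>\n"
    "  <body>Seattle, WA</body>\n"
    "</webArticle>"
)


def collect_text(node, found=None):
    """Recursively gather non-whitespace text beneath node."""
    if found is None:
        found = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            collect_text(child, found)
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            found.append(child.data.strip())
    return found


doc = xml.dom.minidom.parseString(ARTICLE_XML)
doc.normalize()  # merge adjacent text nodes throughout the tree
texts = collect_text(doc)
print(texts)
```

Stripping whitespace is a lossy choice that suits reporting tasks; applications that care about exact spacing would keep the raw data instead.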
Now that we’ve seen how to examine the hierarchical content of an XML document using the DOM, we need to take a look at how we can use the DOM to retrieve XML’s only nonhierarchical component: attributes. As with all other information in the DOM, attributes are described as nodes. Attribute nodes have a very special relationship with the tree structure of an XML document; we find that the interfaces that allow us to work with them are different as well.
When we looked at the child nodes of elements earlier (as in Example 4-3), we only saw nodes for child elements and textual data. From this, we can reasonably surmise that attributes are not children of the element on which they are included. They are available, however, using some methods specific to Element nodes.
There is an attribute of the Node interface that is used only for attributes of elements.
The easiest way to get the value of an attribute is to use the getAttribute method of the element node. This method takes the name of the attribute as a string and returns a string giving the value of the attribute, or an empty string if the attribute is not present. To retrieve the node object for the attribute, use the getAttributeNode method instead; if the attribute does not exist, it returns None. If you need to test for the presence of an attribute without retrieving the node or attribute value, the hasAttribute method will prove useful.
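The three methods can be compared in a short sketch, assuming Python 3 and xml.dom.minidom. The item element and its sku and qty attributes are invented for the example; po.xml stores those values as child elements instead.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<item sku="229-987488" qty="1"/>')
item = doc.documentElement

# getAttribute returns the value, or an empty string if absent.
sku_value = item.getAttribute("sku")
color_value = item.getAttribute("color")
print(repr(sku_value), repr(color_value))

# getAttributeNode returns the attribute node, or None if absent.
sku_node = item.getAttributeNode("sku")
color_node = item.getAttributeNode("color")
print(sku_node.value, color_node)

# hasAttribute tests for presence without retrieving anything.
print(item.hasAttribute("qty"), item.hasAttribute("color"))
```

Because getAttribute cannot distinguish a missing attribute from one whose value is the empty string, hasAttribute or getAttributeNode is the safer test when that distinction matters.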
Another way to look at attributes is using a structure called a NamedNodeMap. This object is similar in function to a dictionary, and the Python version of this structure shares much of the interface of a dictionary. The Node interface includes an attribute named attributes that is only used for element nodes; it is always set to None for other node types. While the NamedNodeMap supports the item method and length attribute much as the NodeList interface does, the normal way of using it in Python is as a mapping object, which supports most of the interfaces provided by dictionary objects. The keys are the attribute names and the values are the attribute nodes.
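A brief sketch of the mapping-style usage, assuming Python 3 and xml.dom.minidom; the element and attribute names are made up for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<item sku="229-987488" qty="1">text</item>')
item = doc.documentElement
attrs = item.attributes  # a NamedNodeMap

# The DOM-style length and Python's len() agree.
print(attrs.length, len(attrs))

# Keys are attribute names; values are attribute nodes.
names = sorted(attrs.keys())
sku_node = attrs["sku"]
print(names, sku_node.value)

# Membership tests work as they do for dictionaries.
print("qty" in attrs, "color" in attrs)

# Non-element nodes, such as the text child, have no attribute map.
print(item.firstChild.attributes)
```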