Now that we’ve looked at how we can extract information from our documents using the DOM, we probably want to be able to change them. There are really just a few things we need to know to make changes, so we describe the basic operations and then show a few examples. The basic operations involved in modifying a document center around creating new nodes, adding, moving, and removing nodes, and modifying the contents of nodes. Since we often want to add new elements and textual content, we start by looking at creating new nodes.
Most of the time, new nodes need to be created explicitly. Since the DOM is defined as a set of interfaces rather than as concrete classes, the only way to create new nodes is to call methods on the objects we already have in hand. Fortunately, the Document interface includes a large selection of factory methods we can use to create new nodes of most types. (Methods for creating entity and notation nodes are noticeably absent, but most applications should not find themselves constrained by that.)
The most used of these factory methods are very simple, and are used to create new element and text nodes. For elements, use the createElement method, with the tag name of the element to create as the only parameter. Text nodes can be created using the createTextNode method, passing the text of the new node as the parameter. For the details on the other node factory methods, see the reference material in Appendix D.
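As a rough illustration of the two factory methods, the following sketch builds a tiny document from scratch. It assumes the standard library's xml.dom.minidom rather than the 4DOM implementation used in this chapter's examples, and the element names and the size value are only sample data.

```python
from xml.dom.minidom import getDOMImplementation

# Create an empty document whose root element is <files>.
impl = getDOMImplementation()
doc = impl.createDocument(None, "files", None)
root = doc.documentElement

# New nodes come from factory methods on the Document object.
file_elem = doc.createElement("file")
size_elem = doc.createElement("size")
size_text = doc.createTextNode("12570")

# A freshly created node belongs to the document but has no parent yet.
has_parent_before = size_text.parentNode is not None

# Attach the nodes to build <files><file><size>12570</size></file></files>.
size_elem.appendChild(size_text)
file_elem.appendChild(size_elem)
root.appendChild(file_elem)

xml_text = root.toxml()
print(xml_text)
```

Note that creating a node does not place it in the tree; it only becomes part of the document once it is attached with one of the methods described next.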
There are some very handy methods available for moving nodes to different locations on the tree. These methods appear on the basic Node interface, so all DOM nodes provide them. There are constraints on the use of these methods: you cannot use them to construct documents that do not make sense structurally, and the well-formedness of the document is ensured at all times. For example, an exception is raised if you attempt to add a child to a text node, or if you try to add a second child element to the document object.
appendChild(newChild)
Takes a newChild node argument and appends it to the end of the list of children of the node.
insertBefore(newChild, refChild)
Takes the node newChild and inserts it immediately before the refChild node you supply.
replaceChild(newChild, oldChild)
Replaces the oldChild with the newChild, and oldChild is returned to the caller.
removeChild(oldChild)
Removes the node oldChild from the list of children of the node this is called on.
The brief descriptions do not replace the reference documentation for these methods; see Appendix D for more complete information.
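The four methods can be tried out on a small tree. This is a minimal sketch that assumes xml.dom.minidom from the standard library instead of 4DOM; the element names are arbitrary, and the exception class shown (xml.dom.HierarchyRequestErr) is the one minidom raises for the two structural violations mentioned above.

```python
import xml.dom
from xml.dom.minidom import parseString

doc = parseString("<list><a/><c/></list>")
root = doc.documentElement
a, c = root.childNodes

# insertBefore: place a new <b/> immediately before <c/>.
b = doc.createElement("b")
root.insertBefore(b, c)

# appendChild: add <d/> at the end of the child list.
d = doc.createElement("d")
root.appendChild(d)

# replaceChild: swap <a/> for <z/>; the old node is returned.
z = doc.createElement("z")
returned = root.replaceChild(z, a)

# removeChild: detach <c/> from its parent.
root.removeChild(c)

order = [child.tagName for child in root.childNodes]
print(order)

# Structurally invalid changes raise an exception.
text = doc.createTextNode("hello")
try:
    text.appendChild(doc.createElement("e"))
except xml.dom.HierarchyRequestErr as err:
    text_error = err

try:
    doc.appendChild(doc.createElement("secondRoot"))
except xml.dom.HierarchyRequestErr as err:
    doc_error = err
```

After these calls the children of the root element are z, b, and d, in that order.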
Let’s look at how to examine a tree, and how to remove specific nodes on the tree. Example 4-4 uses a few nested loops to dive three levels deep into an XML document created using the index.py script from Example 3-4. The design has its limitations, as it assumes you are only dealing with elements no more than three levels deep, but demonstrates the DOM methods we’re interested in.
Example 4-4. domit.py
#!/usr/bin/env python
import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext import PrettyPrint

# get DOM object
doc = FromXmlStream(sys.stdin)

# remove unwanted nodes by traversing Node tree
for node1 in doc.childNodes:
    for node2 in node1.childNodes:
        node3 = node2.firstChild
        while node3 is not None:
            next = node3.nextSibling
            name = node3.nodeName
            if name in ("contents", "extension", "userID", "groupID"):
                # remove unwanted nodes here via the parent
                node2.removeChild(node3)
            node3 = next

PrettyPrint(doc)
After getting a document from standard input, the script uses two nested for loops and an inner while loop to descend three levels deep into the tree and look for specific tag names. When you run the script against the XML document we created with index.py, your file elements should look like this:
<file name='c:\windows\desktop\G-Force\G-Force.doc'>
  <size>12570</size>
  <lastAccessed>Tue May 09 00:00:00 2000</lastAccessed>
  <lastModified>Tue May 09 11:56:14 2000</lastModified>
  <created>Wed Jan 17 23:31:23 2001</created>
</file>
The whitespace around the removed elements remains in place, which shows up as gaps between elements in the printed output; we did not look for adjacent text nodes, so they remain unaffected. This text was produced by the call to the PrettyPrint function at the end of the script. Of course, the element looks the same regardless of its hierarchical position within the document.

When writing DOM processing code, try to keep it independent of the structure of the document. Instead of using firstChild to get what you’re after, consider enumerating the children and examining each one. This may cost some processing time, but it lets the code tolerate variations in the document’s structure: as long as the target element appears beneath the parent node, it will be found. If you rely on firstChild, you may be setting yourself up for trouble if someone gives you a document with a slightly different structure, such as one in which sibling elements appear in a different order. You can write this type of operation as a recursive function, so that it handles similar structures regardless of their position in the document. If you really don’t care where within the subtree an element is found, you can use the getElementsByTagName method described earlier.
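One way the recursive approach might look is sketched below. It assumes xml.dom.minidom rather than 4DOM, the function name remove_elements is hypothetical, and the inline document is made-up sample data that only imitates the shape of the index.py output.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

UNWANTED = ("contents", "extension", "userID", "groupID")

def remove_elements(node, names):
    """Remove every descendant element of 'node' whose tag is in 'names'."""
    # Iterate over a copy: childNodes changes as nodes are removed.
    for child in list(node.childNodes):
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.tagName in names:
            node.removeChild(child)
        else:
            # Not a match, so look for matches further down the tree.
            remove_elements(child, names)

# Hypothetical sample input with unwanted elements at two depths.
doc = parseString(
    "<Directory name='top'>"
    "<file name='a.doc'><size>12570</size><userID>0</userID></file>"
    "<Directory name='sub'>"
    "<file name='b.doc'><extension>.doc</extension><size>3035</size></file>"
    "</Directory>"
    "</Directory>"
)
remove_elements(doc, UNWANTED)
result = doc.documentElement.toxml()
print(result)
```

Because the function recurses into every element it keeps, it does not matter how deeply the file elements are nested.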
Another common requirement is to locate a node that you know must be a child of a particular node, but not require a specific ordering of the child nodes. A simple loop in a utility function handles this nicely:
from xml.dom import Node

def findChildrenByTagName(parent, tagname):
    """Return a list of 'tagname' children of 'parent'."""
    L = []
    for child in parent.childNodes:
        if (child.nodeType == Node.ELEMENT_NODE
            and child.tagName == tagname):
            L.append(child)
    return L
An even simpler helper function that can come in handy is a function that finds the first child element with a particular tag name, or the first to have one of several tag names. These are all minor variations of the function just presented.
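Such a variation might look like the following sketch. The name findFirstChildByTagName and the choice to accept either a single name or a sequence of names are assumptions made for illustration; the demonstration uses xml.dom.minidom and a made-up file element.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def findFirstChildByTagName(parent, tagnames):
    """Return the first child element of 'parent' named in 'tagnames', or None."""
    # Accept a single tag name as well as a sequence of names.
    if isinstance(tagnames, str):
        tagnames = (tagnames,)
    for child in parent.childNodes:
        if (child.nodeType == Node.ELEMENT_NODE
                and child.tagName in tagnames):
            return child
    return None

# Hypothetical sample input; whitespace text nodes are skipped by the helper.
doc = parseString(
    "<file name='a.doc'>\n"
    "  <size>12570</size>\n"
    "  <created>Wed Jan 17 23:31:23 2001</created>\n"
    "</file>"
)
file_elem = doc.documentElement

created = findFirstChildByTagName(file_elem, "created")
first_of_several = findFirstChildByTagName(file_elem, ("lastModified", "size"))
missing = findFirstChildByTagName(file_elem, "groupID")
```

Returning None when nothing matches lets the caller decide whether a missing child is an error.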
In addition to doing replacements and additions, you can also restructure a document entirely using the DOM.
In Example 4-5, we take the nested loops from the last section and replace them with a recursive function that walks the tree. The script also works with XML output from the index.py script we worked with earlier in this chapter. In this version, however, each file element is replaced by its own size child. This process leaves the document filled with Directory and size elements only.
Example 4-5 shows domit2.py using a recursive function.
Example 4-5. domit2.py
#!/usr/bin/env python
from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext import PrettyPrint
import sys

def makeSize(nodeList):
    for subnode in nodeList:
        if subnode.nodeType == subnode.ELEMENT_NODE:
            if subnode.nodeName == "size":
                subnode.parentNode.parentNode.replaceChild(
                    subnode, subnode.parentNode)
            else:
                makeSize(subnode.childNodes)

# get DOM object
doc = FromXmlStream(sys.stdin)

# call func
makeSize(doc.childNodes)

# display altered document
PrettyPrint(doc)
You can run the script from the command line:
$> python domit2.py < wd.xml
The file wd.xml is an XML file created with the index.py script; you can use any file you like, as long as it has the same structure as the files created by index.py. The output should be something like this:
<Directory name='c:\windows\desktop\gl2'>
  <size>230444</size>
  <size>3035</size>
  <size>8904</size>
  <size>722</size>
  <Directory name='c:\windows\desktop\gl2/Debug'>
    <size>156672</size>
    <size>86016</size>
    <size>3779068</size>
    <size>25685</size>
    <size>17907</size>
    <size>250508</size>
    <size>208951</size>
    <size>402432</size>
  </Directory>
  <size>3509</size>
  <size>33792</size>
  <size>722</size>
  <size>48640</size>
  <size>533</size>
</Directory>