XML Hacks
XML Hacks 100 Industrial-Strength Tips and Tools By Michael Fitzgerald
July 2004
Pages: 479



Table of Contents

Chapter 1: Looking at XML Documents
Just because you can find XML in any nook and cranny you find software these days doesn't mean that everyone is an expert on the subject. That's why the hacks in this chapter were written: they are for readers who are just getting up to speed with XML. If that's you, read on; if that's not you, you can skip ahead to Chapter 2.
These hacks introduce you to the basics of XML: what an ordinary XML document looks like [Hack #1], how to display an XML document in a variety of browsers [Hack #2], how to style an XML document with CSS [Hack #3], how to use character and entity references [Hack #4], how to check an XML document for errors, both online [Hack #8] and on a command line [Hack #9], and how to run Java programs that process XML [Hack #10].
All the files mentioned in this chapter are in the book's file archive, downloadable from http://www.oreilly.com/catalog/xmlhks/. These hacks assume that you have extracted this archive into a working directory where you can exercise the examples.
Read an XML Document
Before you can do much with an XML document, you need to understand its basic parts. This hack explores the most common structures found in XML.
This hack lays the basic groundwork for XML: what it looks like and how it's put together. Example 1-1 shows a simple document (start.xml) that contains some of the most common XML structures: an XML declaration, a comment, elements, attributes, an empty element, and a character reference. start.xml is well-formed, meaning that it conforms to the syntax rules in the XML specification. XML documents must be well-formed.
Example 1-1. start.xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>
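To see these structures from a program's point of view, here is a minimal sketch that parses a copy of start.xml with Python's standard xml.etree.ElementTree module. Python is not part of this hack, and the variable names are only illustrative. Note that this parser skips the comment by default and resolves the character reference for you.

```python
import xml.etree.ElementTree as ET

# A copy of start.xml (Example 1-1), held as bytes so the
# encoding named in the XML declaration is honored.
START_XML = b"""<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>
"""

# fromstring() raises ParseError if the document is not well-formed.
root = ET.fromstring(START_XML)

print("document element:", root.tag)
print("timezone attribute:", root.get("timezone"))

# Walk the child elements, showing their text and attributes.
for child in root:
    print("element:", child.tag, "| text:", child.text, "| attributes:", child.attrib)

# The empty element has no text and no children. The parser has
# already replaced the character reference with the character itself.
atomic = root.find("atomic")
print("symbol code point: U+%04X" % ord(atomic.get("symbol")))
```

If the document were not well-formed (say, a missing end tag), the call to fromstring() would fail before any of the print statements ran.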
The first line of the example contains an XML declaration, which is recommended by the XML spec but is not mandatory. If present, it must appear on the first line of the document. It is a human- and machine-readable flag that states a few facts about the content of the document.
An XML declaration is not a processing instruction, although it looks like one. Processing instructions are discussed in [Hack #3].
In general, an XML declaration provides three pieces of information about the document that contains it: the XML version information; the character encoding in use; and whether the document stands alone or relies on information from an external source.
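The three pieces of information can also be read programmatically. The following is a small sketch, assuming Python and its standard xml.parsers.expat module (neither is used in this hack); the handler name and the dictionary are hypothetical. Expat reports the standalone setting as 1 for yes, 0 for no, and -1 when it is omitted.

```python
import xml.parsers.expat

# A tiny document whose declaration gives version and encoding
# but omits the standalone setting.
DOCUMENT = b'<?xml version="1.0" encoding="UTF-8"?>\n<time timezone="PST"/>\n'

declaration = {}

def on_xml_declaration(version, encoding, standalone):
    # Expat passes standalone as 1 (yes), 0 (no), or -1 (omitted).
    declaration["version"] = version
    declaration["encoding"] = encoding
    declaration["standalone"] = {1: "yes", 0: "no"}.get(standalone, "(omitted)")

parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_xml_declaration
parser.Parse(DOCUMENT, True)  # True marks this as the final chunk of input

for name, value in declaration.items():
    print(name, "=", value)
```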

Section 1.2.1.1: Version information

If you use an XML declaration, it must include version information (as in version="1.0"). Currently, XML Version 1.0 is in the broadest use, but Version 1.1 is also now available (http://www.w3.org/TR/xml11/), so 1.1 is also a possible value for
Display an XML Document in a Web Browser
The most popular web browsers can display and process XML natively. Nowadays, it's just a matter of opening a file.
XML is now mature enough that recent versions of the more popular web browsers support it natively. At the time of writing, the most recent versions of these browsers include:
  • Microsoft Internet Explorer 6 (http://www.microsoft.com/windows/ie/)
  • Mozilla 1.7 and Mozilla Firefox 0.9 (http://www.mozilla.org)
  • Netscape 7.1 (http://channels.netscape.com/ns/browsers/download.jsp)
  • Opera 7.51 (http://www.opera.com)
  • Apple's Safari 1.2 (http://www.apple.com/safari/)
This means that you can display raw, unstyled XML documents (files) directly in web browsers, with varying results.
The browsers use their own internal mechanisms to display XML. Internet Explorer (IE), for example, uses the default stylesheet defaultss.xsl, which is stored in an MSXML dynamic link library (DLL): msxml.dll, msxml2.dll, or msxml3.dll. You can examine this stylesheet in IE by entering res://msxml3.dll/DEFAULTSS.xsl in the address bar. (This works for msxml.dll, msxml2.dll, or msxml3.dll, but not msxml4.dll, the latest version.) If you have Visual Studio (http://msdn.microsoft.com/vstudio/), you can use the Resource Editor to edit and save this stylesheet back in the DLL (http://netcrucible.com/xslt/msxml-faq.htm#Q19).
To open an XML document such as time.xml (similar to start.xml), go to File → Open File or File → Open, depending on the browser, and select the document.
Figure 1-1, Figure 1-2, Figure 1-3, and Figure 1-4 show time.xml displayed in IE, Mozilla, Opera, and Safari, respectively. (Mozilla, Firefox, and Netscape have very similar output, so only Mozilla is shown in Figure 1-2. All three of these browsers do not show the XML declaration of
Apply Style to an XML Document with CSS
Make an in-browser XML document more appealing by applying a CSS stylesheet to it.
Cascading Style Sheets (CSS) is a W3C language for applying style to HTML, XHTML, or XML documents (http://www.w3.org/Style/CSS/). CSS Level 1 or CSS/1 (http://www.w3.org/TR/CSS1) came out of the W3C in 1996 and was later revised in 1999. CSS Level 2 or CSS/2 (http://www.w3.org/TR/CSS2/) became a W3C recommendation in 1998. CSS/3 is under construction (http://www.w3.org/Style/CSS/current-work). Understandably, CSS/1 enjoys the widest support.
To apply CSS to an XML document, you must use the XML stylesheet processing instruction, which is based on another recommendation of the W3C (http://www.w3.org/TR/xml-stylesheet). The XML stylesheet processing instruction is optional unless you are using a stylesheet that you want to associate with an XML document in a standard way.
A processing instruction (PI) is a structure in an XML document that contains an instruction to an application (http://www.w3.org/TR/REC-xml#sec-pi). Generally, PIs can appear anywhere that an element can appear, although the XML stylesheet PI must appear at the beginning of an XML document (though after the XML declaration, if one is present). The beginning part of an XML document, before the document element begins, is called a prolog.
Here is an example of a PI:
<?xml-stylesheet href="time.css" type="text/css"?>
A PI is bounded by <? and ?>. The term immediately following <? is called the target. The target identifies the purpose or name of the PI. Other than the XML stylesheet PI, you can find PIs used in DocBook files [Hack #62] and in XML-format files used by Microsoft Office 2003 applications, such as Word [Hack #14] and Excel [Hack #15].
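To make the idea of a target concrete, here is a short sketch, assuming Python's standard xml.dom.minidom module (not something this hack relies on), that lists the PIs found in the prolog of a small document. The sample document and variable names are made up for illustration. Notice that the XML declaration is not reported as a PI node.

```python
from xml.dom import minidom

# A small document with an XML stylesheet PI in its prolog.
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="time.css" type="text/css"?>
<time timezone="PST">
 <hour>11</hour>
</time>
"""

doc = minidom.parseString(DOCUMENT)

# Collect the PIs in the prolog, i.e., the top-level nodes that
# come before the document element.
prolog_pis = []
for node in doc.childNodes:
    if node is doc.documentElement:
        break
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        prolog_pis.append((node.target, node.data))

for target, data in prolog_pis:
    print("target:", target)
    print("data:  ", data)
```

The data part looks like a pair of attributes, but to the parser it is just a string; it is up to the application (here, a browser) to interpret it.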
Use Character and Entity References
Not all characters are available on the keyboard! This hack shows you how to represent such characters in an XML document by using decimal and hexadecimal character references, and how to represent entities by using entity references.
In XML, character and entity references are formed by surrounding a numerical value or a name with & and ;—for example, &#169; is a decimal character reference and &copy; is an entity reference. This hack shows you how to use both.
According to the third and latest edition of the XML 1.0 specification (http://www.w3.org/TR/REC-xml/), XML processors must accept over 1,000,000 Unicode characters, which the specification lists as ranges of hexadecimal code points (http://www.w3.org/TR/REC-xml/#charsets). It's possible that you won't be able to find all those characters on your keyboard! Don't worry. You can use character references instead.
You can look up the semantics of individual Unicode characters at http://www.unicode.org/charts/.
You can reference characters using either decimal or hexadecimal numbers. Which one you use is a matter of style. The document Namen.xml uses both (Example 1-5); it contains some German names enclosed in German language tags.
Example 1-5. Namen.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="Namen.css" type="text/css"?>

<Namen xml:lang="de">
<Name>
 <Vorname>Marie</Vorname>
 <Nachname>M&#252;ller</Nachname>
 <Geschlecht>&#9792;</Geschlecht>
</Name>
<Name>
 <Vorname>Klaus</Vorname>
 <Nachname>M&#xfc;ller</Nachname>
 <Geschlecht>&#x2642;</Geschlecht>
</Name>
</Namen>
On lines 7 and 8 are the decimal character references &#252; and &#9792;, respectively. The first one refers to the letter u with an umlaut (ü) and the second one is a female sign. Lines 12 and 13 use the hexadecimal character references &#xfc; (ü) and &#x2642; (male sign), respectively. You can see how these character references are rendered in Opera in Figure 1-6.
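Decimal and hexadecimal references are just two spellings of the same code point. The sketch below, written in Python as an illustration only (the char_refs helper is hypothetical, not part of any library), builds both forms for the characters used in Namen.xml and confirms that a parser treats them identically.

```python
import xml.etree.ElementTree as ET

def char_refs(ch):
    """Return the decimal and hexadecimal character references for ch."""
    code_point = ord(ch)
    return "&#%d;" % code_point, "&#x%x;" % code_point

# u with umlaut, female sign, male sign
for ch in ("\u00fc", "\u2640", "\u2642"):
    decimal, hexadecimal = char_refs(ch)
    print("U+%04X" % ord(ch), decimal, hexadecimal)

# Both spellings of the same surname parse to identical text.
marie = ET.fromstring("<Nachname>M&#252;ller</Nachname>")
klaus = ET.fromstring("<Nachname>M&#xfc;ller</Nachname>")
print("same text:", marie.text == klaus.text)
```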
Examine XML Documents in Text Editors
Even plain-text editors offer features that make editing XML documents a pleasure. This hack introduces two options, Vim and Emacs with nXML.
XML has been called "Unicode with pointy brackets." As such, XML documents can be displayed in your average, run-of-the-mill, non-graphical text editor. Of course you could view, create, and edit XML documents in Notepad on Windows, but it's not a very exciting editing environment (see http://tucows.com/htmltext95_default.html for examples of other text editors).
There are a number of text editors that are quite suitable for working with XML. We'll talk about two of them here: Vim (Vi improved, a clone of Vi) and Emacs. Both are free for the downloading.
If you are accustomed to a point-and-click, graphical user interface for editing text [Hack #6], you probably won't like using Vim or Emacs with XML. If, however, you prefer typing at the keyboard over clicking the mouse (like me), this hack is for you.
Vim (http://www.vim.org) is a derivative of the Unix screen editor, Vi. It is currently at Version 6.3 and is developed under the leadership of Bram Moolenaar. You can get flavors of Vim that run on Unix (such as Red Hat, Sun Solaris, or Debian), Windows, MS-DOS, the Mac, OS/2, and even Amiga (downloads available at http://www.vim.org/download.php). If you are running recent versions of Red Hat (http://www.redhat.com) or Cygwin for Windows (http://www.cygwin.com), you likely already have Vim installed on your system.
Vi was developed by Bill Joy et al. in the late 1970s for Unix (http://www.cs.pdx.edu/~kirkenda/joy84.html). Vi was the first screen editor I ever used—back in 1983—and I still use Vim almost every day. Vim is powerful, and without elaborating on all the reasons why I like to use Vim, I will mention just one: syntax highlighting.
Sure, syntax highlighting is available in other editors, but Vim supports over 300 languages with syntax highlighting. Syntax highlighting helps you see clearly that what you are typing is correct because it assigns colors to the correct syntax of a given language, such as XML. This can help you detect typing errors readily. (See a FAQ on Vim syntax highlighting at
Explore XML Documents in Graphical Editors
Text editor not enough for you? This hack looks at XML documents with graphical editors.
Along with XML has come a thundering horde of graphical XML editors that do everything short of buttering your toast. Many editors are readily available (see http://www.xmlsoftware.com/editors.html for a comprehensive though not exhaustive list), but I'll mention only a few safe bets here.
xmlspy 2004 by Altova (http://www.xmlspy.com) is a feature-rich, graphical editor for XML for the Windows environment. xmlspy has also been tested on Red Hat Linux running Wine, and Mac OS X running Microsoft Virtual PC for Mac. The Home Edition of this popular editor is available for free, but you must pay for licenses for the Professional and Enterprise editions. I'll give you a quick feature fly-over of xmlspy, though there are a number of features I won't get around to mentioning.
xmlspy can help you create documents and schemas by hand or from templates (examples), organize work into projects, and import text and database files. You can view documents as text with syntax highlighting or in a grid view, check spelling, validate against DTDs and XML Schema documents, perform XSLT transformations [Hack #33], and evaluate XPath location paths. xmlspy provides support for WSDL (http://www.w3.org/TR/wsdl) and SOAP [Hack #63]. You can also use xmlspy to generate Java, C++, or C# code [Hack #99] from DTDs or XML Schema documents.
Figure 1-10 shows the document valid.xml in xmlspy with helper panes on the right. These panes let you insert elements, attributes, and entities with a single click. The Project pane on the left gives you quick access to all kinds of templates.
Choose Tools for Creating an XML Vocabulary
XML provides the syntax necessary to create your own vocabulary or dialect of XML. Here are a few things you need to know about namespaces and schemas.
One of the best things about XML is that you can create your own tags—a vocabulary or dialect—if you want. To create a vocabulary, you should understand a couple of things about schemas and namespaces. You can use XML without schemas or namespaces, but sometimes you want to use one, the other, or both. This hack explains when you'll want to use schemas and namespaces and when you'll want to avoid them.
XML documents must be well-formed. This means that they must adhere to the syntax defined in the XML specification (http://www.w3.org/TR/REC-xml/). This syntax mandates such things as matching case in tag names, matching quotes around attribute values, restrictions on what Unicode characters may be used, and so on.
An XML document may also be valid. This means that such a document must conform to the restrictions laid out in an associated schema. Basically, a schema declares or defines what elements and attributes are allowed in a valid instance, including in what order the elements may appear. Governing document layout with schemas can greatly increase the reliability, consistency, and accuracy of exchanged documents.
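The difference between the two checks shows up quickly in code. Here is a minimal sketch of a well-formedness check, assuming Python's standard xml.etree.ElementTree parser; the check_well_formed function and the sample strings are hypothetical. This parser is non-validating, so it catches syntax errors such as mismatched case or quotes but says nothing about whether a document conforms to a schema.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None

samples = {
    "well-formed": '<time timezone="PST"><hour>11</hour></time>',
    # Start tag and end tag differ in case.
    "mismatched case": "<time><Hour>11</hour></time>",
    # Attribute value opens with a double quote, closes with a single one.
    "mismatched quotes": "<time timezone=\"PST'><hour>11</hour></time>",
}

for label, text in samples.items():
    problem = check_well_formed(text)
    print(label, "->", "ok" if problem is None else problem)
```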

Section 1.8.1.1: DTD

The native schema language of XML is the document type definition or DTD [Hack #68], which is part of the XML specification and which XML inherited, in simplified form, from SGML. The document valid.xml in Example 1-7 uses a document type declaration (the line beginning with <!DOCTYPE) to associate a DTD with itself.
Example 1-7. valid.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time SYSTEM "time.dtd">
  
<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
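A program can inspect the document type declaration without reading the DTD itself. The following sketch assumes Python's standard xml.dom.minidom module, which is non-validating and, as far as this example goes, does not open time.dtd; it only reports what the declaration says. The variable names are illustrative.

```python
from xml.dom import minidom

# A copy of valid.xml (Example 1-7).
VALID_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time SYSTEM "time.dtd">

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
"""

doc = minidom.parseString(VALID_XML)
doctype = doc.doctype

print("document element named in DOCTYPE:", doctype.name)
print("system identifier:", doctype.systemId)
# The name in the declaration should match the actual document element.
print("names match:", doctype.name == doc.documentElement.tagName)
```

Actually validating the document against time.dtd takes a validating processor, such as the tools covered later in this chapter.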
Test XML Documents Online
Are your XML documents syntactically correct? Find out how and where to check XML documents using online resources.
Several web sites allow you to test your XML documents online to make sure that they are well-formed and/or valid. This hack introduces three such sites: RUWF, RXP, and Brown University's XML validation form.
One site that does well-formedness checks is XML.com's RUWF—Are You Well-Formed? (http://www.xml.com/pub/a/tools/ruwf/check.html)—which is implemented in Perl using XML::Parser (http://www.perl.com/pub/a/1998/11/xml.html). RUWF accepts a URL for an XML document or allows you to paste an XML document into a text box.
Figure 1-14 shows a copy of time.xml pasted into the text box, and Figure 1-15 shows the result of clicking the RUWF? button. (You could also test an online copy of time.xml, http://www.wyeast.net/time.xml, by entering the URL into the "Your URL" text box.)
Figure 1-14: XML.com's RUWF
Figure 1-15: Results of checking time.xml with RUWF
Richard Tobin of the University of Edinburgh has created RXP, a validating XML processor that is available online (http://www.cogsci.ed.ac.uk/~richard/xml-check.html) or from the command line [Hack #9].
As mentioned earlier, the document time.xml is available on my web site at http://www.wyeast.net/time.xml. Figure 1-16 shows you how to check this document for well-formedness using the online version of RXP. Enter the URL in the text box, and then click the button labeled "check it."
The result is displayed as canonical XML (http://www.w3.org/TR/xml-c14n) in Figure 1-17. Canonical XML defines a method for outputting XML in a consistent, reliable way, leaving some things behind in output, such as the XML declaration and, optionally, comments.
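You can get a feel for canonical output without RXP. This is a rough sketch, assuming Python 3.8 or later, whose xml.etree.ElementTree.canonicalize() function implements Canonical XML 2.0, a later version than the specification linked above; the behaviors shown here (dropping the XML declaration, optionally dropping comments, and writing empty elements as start/end tag pairs) should be common to both. The sample document is a shortened time.xml.

```python
import xml.etree.ElementTree as ET

TIME_XML = """<?xml version="1.0" encoding="UTF-8"?>
<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <atomic signal="true"/>
</time>
"""

# Comments are left out by default...
without_comments = ET.canonicalize(TIME_XML)
# ...but can be kept on request.
with_comments = ET.canonicalize(TIME_XML, with_comments=True)

print(without_comments)
print("---")
print(with_comments)
```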
Test XML Documents from the Command Line
A number of free, easy-to-use XML processors are available for use on the command line. This hack shows where to get four such tools and how to use them.
You can check XML documents for well-formedness and validity using tools on the command line or shell prompt. This hack discusses four tools: Richard Tobin's RXP, Elcel's XML Validator (xmlvalid), Daniel Veillard's xmllint, and xmlwf (an application based on James Clark's Expat C library).
You've already seen the online version of RXP [Hack #8] . This hack shows you how to use the command-line version, available free at http://www.cogsci.ed.ac.uk/~richard/rxp.html. For Windows and other platforms, you can download the C source and compile it yourself (ftp://ftp.cogsci.ed.ac.uk/pub/richard/rxp.tar.gz) or, if you are on Windows, you can simply download the executable rxp.exe (ftp://ftp.cogsci.ed.ac.uk/pub/richard/rxp.exe).
Once you've downloaded RXP and placed it in your path, you can check XML documents for well-formedness at a command prompt with this:
rxp time.xml
Upon success, this command will produce the output shown in Example 1-12.
Example 1-12. Output of RXP with time.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- a time instant -->
<time timezone="PST">
    <hour>11</hour>
    <minute>59</minute>
    <second>59</second>
    <meridiem>p.m.</meridiem>
    <atomic signal="true"/>
</time>
You can also check a document for validity by using the -V option, provided it has an accompanying DTD (as valid.xml does):
rxp -V valid.xml
When successful, you will see the output in Example 1-13.
Example 1-13. Output of RXP with valid.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time SYSTEM "time.dtd">
<!-- a time instant -->
<time timezone="PST">
        <hour>11</hour>
        <minute>59</minute>
        <second>59</second>
        <meridiem>p.m.</meridiem>
        <atomic signal="true"/>
</time>
Run Java Programs that Process XML
Open source, command-line Java programs that process XML are abundant. This hack shows you how to use them.
The Java programming language (http://java.sun.com) has been a popular object-oriented language since it was unveiled by Sun in the mid-1990s. One key idea behind Java was that it made it possible to write and compile a program once, and then run it on any machine that supports a Java interpreter ("write once, run anywhere"). Note that it's not a perfect programming language—I've heard Ted Ts'o (http://thunk.org/tytso/) say of Java, "Write once, run screaming."
Nonetheless, Java is widespread and generally well liked, and you'll find many command-line Java programs that can process XML in one way or another. A number of these programs appear in this book, so this hack walks you through how to use them.
This hack assumes that you know little to nothing about Java. If you are entirely new to Java, the information at http://java.sun.com/learning/new2java/ will also help you get up to speed quickly.
To get a Java program to run on your system, you need a Java virtual machine (VM), part of the Java runtime environment (JRE). One may already be on your system, but to get the latest JRE anyway, go to http://java.sun.com and find the link for the Java VM download. (There are alternatives to Sun's VM, such as one offered on http://www.kaffe.org/, but I'm only going to talk about the Sun VM here.) In a few clicks, the new VM will be downloaded to your machine. You should then be able to go to a command prompt and type:
java -version
and get a response that looks something like the following:
java version "1.4.2_03"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_03-b02)
Java HotSpot(TM) Client VM (build 1.4.2_03-b02, mixed mode)
A more recent version may be available, but if you get a reply similar to this, you're in business. If not, consult the installation instructions for Windows (http://java.sun.com/j2se/1.4.2/install-windows.html
Chapter 2: Creating XML Documents
The collection of hacks in this chapter introduces you to different ways to edit and create XML documents. You'll get more detailed introductions to editing XML with <oXygen/> [Hack #11], with Emacs plus nXML [Hack #12], and with Vim [Hack #13]. You'll also get exposure to three Microsoft Office 2003 applications: Word [Hack #14], Excel [Hack #15], and Access [Hack #16].
Several hacks show you how to create XML from plain text [Hack #9] and [Hack #10] and from comma-separated values (CSV) [Hack #21] files. You'll execute an XQuery [Hack #24], and learn about encoding documents [Hack #27] and including text and documents with entities [Hack #6] and XInclude [Hack #26].
Reminder: all the example files mentioned in this chapter are available from the file archive that can be downloaded from http://www.oreilly.com/catalog/xmlhks/.
Edit XML Documents with <oXygen/>
Quickly learn how to edit XML documents with <oXygen/>.
In Chapter 1, you got an introduction to a few graphical editors [Hack #6]. This hack provides more highlights on how to edit documents using the graphical editor <oXygen/> (http://www.oxygenxml.com/). I have chosen <oXygen/> because it runs on multiple platforms, is inexpensive (it has a free trial and its license is less than $100 USD), and offers many useful features.
Figure 2-1 shows <oXygen/> editing time.xml and valid.xml, both part of the project time.xpr. Note the project pane (upper left) and the tabs above the document pane. The lower-left pane shows an outline view of valid.xml (note that the hour element is highlighted in both the outline and document panes). Beneath the document pane is a tabbed pane that shows the result of a transformation of valid.xml with XSLT.
Figure 2-1: <oXygen/>
Like any editor, <oXygen/> allows you to do normal editing tasks, such as undo and redo, spell check, and so forth. Here is a list of some of <oXygen/>'s more important features.
Projects
<oXygen/> can organize files into groups called projects (see the File menu). These projects can be named and saved in simple XML project files that have an .xpr file extension. All the files in a project can be validated in one fell swoop. When you reopen a project, it remembers some state information, such as what file was last opened and whether it had focus.
Document creation
<oXygen/> provides templates for creating XML documents in a variety of vocabularies: DocBook, SMIL, SVG, TEI, VoiceXML, WML, WSDL, and XHTML, to name but a few. It also has syntax highlighting, which can be edited under Options
Edit XML Documents with Emacs and nXML
nXML mode for GNU Emacs provides a powerful environment for creating valid XML documents.
If you've been editing XML from within GNU Emacs using PSGML, here's a tip: get rid of it. That's right, tear it out, dump it, make it disappear—because there's a much better tool available: nXML. (Grab the latest nxml-mode-200nnnnn.tar.gz file from http://www.thaiopensource.com/download/.) nXML was developed by James Clark, the man who brought us groff, expat, sgmls, SP, and Jade, as well as being a driving force behind the development of XPath, XSLT (and before that, DSSSL), and, along with Murata Makoto, RELAX NG (http://www.relaxng.org/).
Which brings us back to what nXML is all about: nXML is a very clever mechanism for doing RELAX NG-driven, context-sensitive, validated editing. What's particularly clever about it is that, unlike PSGML and unlike virtually every other XML editing application available—with the exception of the Topologi Collaborative Markup Editor (http://www.topologi.com/products/tme/)—it provides real-time, automatic visual identification of validity errors.
This hack assumes that you are familiar with Emacs. The README file that comes with nXML states that you must use Emacs version 21.x (preferably 21.3 or later) in order to use nXML. To get nXML to run in Emacs, you must first load the rng-auto.el file. In Emacs, type:
M-x load-file
Then load the file rng-auto.el from the location where you downloaded and extracted the latest version of nXML. This file defines the autoloads for nXML. Now open an XML document (C-x C-f) and enter:
M-x nxml-mode
You are good to go! For help, type:
C-h m
What "automatic visual identification of validity errors" means is that if you create and edit documents using nXML, you never need to manually run a separate validation step to determine whether a document is valid; i.e., if a document contains a validity error, you will know instantly as you edit the document because it will be visually flagged. Here's how it works. As you're editing a document:
Edit XML with Vim
With some special configuration, Vim can become a powerful XML editor.
So you want to edit XML, but Vim is your favorite editor? The good news is that you don't need an XML-specific editor! If you're mortal, you'll soon discover that editing raw XML can become tedious even in Vim (with its default configuration). But Vim is highly customizable and extensible. After a little tailoring, Vim performs excellently as an XML editor, with syntax highlighting, automatic indentation, navigational aids, and automation.
I will assume you have Vim set up the way you like it already on a Unix system, so we won't fiddle much with your .vimrc file. Example 2-1 shows the bare minimum of what you need to make the rest of the hack work properly.
Example 2-1. Minimum .vimrc file
" $HOME/.vimrc
" Don't pretend to be vi
set nocompatible
" Turn on syntax highlighting
syntax on
" Indicate that we want to detect filetypes and want to run filetype
" plugins.
filetype plugin on
Everything else will go in a filetype plug-in. Vim will source this file when it detects that you are editing an XML file (i.e., when the file ends with the .xml suffix or if it has a proper XML declaration). Example 2-2 is a good starter ftplugin. Save it to your home directory as .vim/after/ftplugin/xml.vim. (The file xml.vim is in the book's file archive.) The after segment of the path means that it will be sourced after all the normal scripts, plug-ins, and so on are sourced, which allows you to override defaults and other plug-ins without changing the original scripts. That makes upgrading those scripts easier.
Example 2-2. The ftplugin xml.vim
" $HOME/.vim/after/ftplugin/xml.vim
" Turn on auto-indentation
set autoindent
" Let's use a 2-character indent
set shiftwidth=2
" With smarttab set, we can press tab at the beginning
" of a line and get shiftwidth indent even though
" tabstop is something else (e.g. the default 8)
set smarttab
" A lot of XML looks really bad and gets really confusing if
" screen-wrapped. I prefer to turn off wrapping.
set nowrap
Edit XML Documents with Microsoft Word 2003
Edit, validate, and save XML documents with Microsoft Word 2003.
Microsoft Office 2003 has the best XML support that a version of Office has offered yet. It's not perfect, but in some places it shines. Not all Office 2003 products provide direct XML support, but three of the flagship products do—Microsoft Word 2003, Excel 2003, and Access 2003. This hack will discuss how to "do XML" with Word 2003.
Sadly, not all versions of Word 2003 have full-featured XML support. In order to get the full support, you need to buy Office 2003 Professional, Office 2003 Enterprise, or Word 2003 individually. Word has its own built-in schema called WordprocessingML. If you create a regular document in Word, you can save the document as XML in WordprocessingML. All versions of Word 2003 have this capability.
In the versions of Word 2003 that come with Office 2003 Professional or Office 2003 Enterprise, or that are sold individually, you can attach your own XML Schema [Hack #69] document to an XML document. You can then export Word documents as XML structured according to your own custom schema rather than Word's obscure binary format or its own WordprocessingML, and you can test and validate such documents using external XML tools. In other words, you aren't landlocked if you use the professional, enterprise, or individual versions of Word to produce XML.
You can store or attach XML schemas in Word's schema library, and you can validate XML documents against their schemas. To add a schema to Word's library, go to Tools → Templates and Add-ins and then click the XML Schema tab. Now click the Add Schema button and navigate to the working directory where you will find the schema time.xsd. Click Open. You will be asked to associate a URI with the schema (any URI seems to work). Click "Validate document against attached schema" and "Allow saving as XML even if not valid," then click OK. The result will look like Figure 2-7.
Work with XML in Microsoft Excel 2003
Using table-structured data or spreadsheets? Open, format, and save XML documents with Excel 2003.
Microsoft Excel 2003 offers unprecedented XML support. As with Word, full XML features are not available except with the Microsoft Office Professional Edition 2003, Enterprise Edition 2003, and the individual version. Other versions (the Standard or Small Business editions) won't have XML support except for the ability to save a file in SpreadsheetML format.
Excel 2003 allows you to open an XML document and then save or export data as XML. Choose File → Open and then navigate to the file time.xml in the Open dialog box. Select the file and then click Open. You can open the document in one of three ways: as an XML list, as a read-only workbook, or by using the XML Source task pane.
When you open an XML file as an XML list, Excel automatically creates an XML Schema that corresponds with the XML (it warns you of that). It also maps each of the attribute values and the content from each of the elements to a cell in the spreadsheet. The XML Source task pane lists each of the elements in the imported document in a tree view. As a cell is highlighted in the spreadsheet, the corresponding element or attribute is highlighted in the task pane.
If the task pane does not appear automatically, choose Data → XML → XML Source.
In Figure 2-12, notice that in the XML Source pane, the hour element is highlighted; it is associated through a mapping with cell B1, which is also highlighted. If you were to select cell C1, the minute element in the XML Source pane would be highlighted.
Figure 2-12: Mapping time.xml to fields in Excel 2003
If you open an XML document as a read-only workbook, no mapping or schema generation occurs, but cells are labeled automatically with the names of elements and attributes, and the labels resemble XPath location paths.
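To get a feel for what XPath-like labels look like, here is a minimal Python sketch that flattens a small document into label and value pairs. The SAMPLE string is an assumed stand-in for time.xml (use the real file from the book's archive), flatten is a hypothetical helper name, and the labels Excel actually produces may well differ in detail.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for time.xml; the real file is in the book's archive.
SAMPLE = """<time timezone="PST">
  <hour>11</hour>
  <minute>59</minute>
  <second>59</second>
  <meridiem>p.m.</meridiem>
  <atomic signal="true"/>
</time>"""


def flatten(element, path=""):
    """Return (label, value) pairs with XPath-like labels for an element tree."""
    here = f"{path}/{element.tag}"
    pairs = []
    # Attributes get an @ prefix, as in XPath.
    for name, value in element.attrib.items():
        pairs.append((f"{here}/@{name}", value))
    # Only keep text that is more than indentation whitespace.
    text = (element.text or "").strip()
    if text:
        pairs.append((here, text))
    for child in element:
        pairs.extend(flatten(child, here))
    return pairs


if __name__ == "__main__":
    for label, value in flatten(ET.fromstring(SAMPLE)):
        print(f"{label}\t{value}")
```

Each pair corresponds roughly to one labeled cell: attributes show up with an @ in the label, and element content shows up under the element's own path.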
Work with XML in Microsoft Access 2003
If you are a Microsoft Access user, you'll be happy to know that you can export Access 2003 data as XML.
Microsoft Access 2003 is Office's database application. You can create a table of data—an Access database—and label each field with a name you'd like to use as an XML element name. One way to get started is by importing an existing XML document into Access. Here's how to do it.
Open Access, and then select File → Get External Data → Import. In the Import dialog box, make sure it says XML in the "Files of type" pull-down menu. Navigate to the working directory and click on the file time.xml. Then click Import. Not all of the information in the document is preserved, but most of it is.
You will then see the Import XML dialog box. Click on the Options button, and the dialog will appear as it does in Figure 2-14. You can choose to import the XML structure only (i.e., only the markup) or the structure with data (i.e., the markup and content). You can also choose to append the data to an existing table; i.e., a table with the same name as the original document (in this example, time). If you append the data, the content of the XML document is added to a record of the database file using the same fields that are created from the element names.
Figure 2-14: Import XML dialog box in Access 2003
After you have imported the document, you should see a database table in the navigator view of Access, as shown in Figure 2-15. Click on the table's icon to open it. In Figure 2-16, you can see that the fields are labeled with the names of elements in time.xml.
Figure 2-15: time table in Access 2003 navigator view
Figure 2-16: time.mdb in Access 2003
Convert Microsoft Office Files, Old or New, to XML
Use OpenOffice as a tool to convert Microsoft Office files to XML.
OpenOffice (http://www.openoffice.org/), the free, open source, multiplatform office application suite that provides an alternative to Microsoft Office, uses a documented XML format as its native file format. Put this together with OpenOffice 1.1's ability to read Word, Excel, and PowerPoint files from Office 97, 2000, and XP, plus Word 6.0 files, Word 95 files, and Excel 4.0, 5.0, and 95 files, and you've got a simple way to convert these files to XML.
When you store a document in OpenOffice's own file format [Hack #65] , you'll create a ZIP file with the extension .sxw if you saved it with the OpenOffice Writer word processing program, .sxc if you saved it with the OpenOffice Calc spreadsheet program, or .sxi if you used the OpenOffice Impress slideshow program. The six files that you'll find in these ZIP files have self-explanatory names: mimetype, content.xml, styles.xml, meta.xml, settings.xml, and manifest.xml.
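Because these files are ordinary ZIP archives, you can peek inside one with a few lines of Python. The sketch below is only an illustration: make_sample_archive builds a tiny in-memory stand-in with placeholder contents so the example runs without a real document, and list_members and read_member are hypothetical helper names. Point them at a real .sxw, .sxc, or .sxi file to see the actual member paths, which may include folders.

```python
import io
import zipfile

# The member names listed in the text above.
MEMBERS = ["mimetype", "content.xml", "styles.xml", "meta.xml",
           "settings.xml", "manifest.xml"]


def make_sample_archive():
    """Build a tiny in-memory stand-in for a .sxw file (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in MEMBERS:
            archive.writestr(name, f"<!-- placeholder for {name} -->")
    buffer.seek(0)
    return buffer


def list_members(source):
    """Return the member names of a ZIP file given a path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_member(source, name="content.xml"):
    """Return the text of one member, content.xml by default."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(name).decode("utf-8")


if __name__ == "__main__":
    sample = make_sample_archive()
    print(list_members(sample))
    print(read_member(sample))
```

With a real file you would call read_member("report.sxw"), where report.sxw is whatever document you saved from OpenOffice Writer, and then feed the returned XML to your usual XML tools.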
Unless you're strongly interested in the inner workings of OpenOffice, the file content.xml should hold the most interest. Along with file content, it stores information about the use of built-in styles, styles you defined yourself, and even on-the-fly styling information not tied to defined styles, such as bolding of text with Ctrl-B. For word-processing files, the XML also identifies bulleted and numbered lists and footnotes. XML versions of spreadsheets include information about spanned cells and calculation formulas as well as results, and OpenOffice XML versions of slideshows store separate slides in separate elements, with slide notes in their own elements. (As soon as I found out about that, I wrote an XSLT stylesheet to pull slide titles and slide notes, minus slide content, into a single document that I could print and hold in my hand when giving presentations—something I'd always wanted to do when giving PowerPoint presentations, but could not.)
Create an XML Document from a Text File with xmlspy
How do you get your old stuff into XML? Legacy text files can be translated into XML with xmlspy.
Perhaps you have plain-text files that you'd like to convert to XML so that the data will interoperate with the latest applications. You can do it by hand with a text or XML editor or you can use a tool that will do it for you automatically. xmlspy (Professional or Enterprise edition) is one of those tools. It's easy to figure out xmlspy's text-to-XML interface, so that's the one I'll show you here. (I used the Enterprise edition when testing this.)
First, here is a little plain-text file, time.txt, that just contains data fields separated by semicolons:
timezone;hour;minute;second;meridiem;atomic
PST;11;59;59;p.m.;
The first line defines fields that will be converted to XML markup; the second line defines the content of that markup. A semicolon (;) delimits each of the fields. The second line ends with a field containing a single space, which of course you can't see.
Now open xmlspy and select Convert → Import Text file. The Text import dialog box is shown in Figure 2-18. Click the Choose File button and open the file time.txt. Make sure that the file encoding is Unicode UTF-8, the field delimiter is Semicolon, and that "First row contains field names" is checked.
Click the symbol to the left of the timezone field name in the first row so that it becomes an equals sign. This specifies that the timezone field will be interpreted as an attribute in the output. Then click OK.
Figure 2-18: time.txt in xmlspy's Text Import dialog box
Click the Text label at the bottom of the document pane to see the result in Figure 2-19. The XML declaration and the import and row elements were inserted by xmlspy; the remaining elements were derived from time.txt. You could change the new document by hand to match time.xml (from Chapter 1), or you could apply an XSLT stylesheet to it. XSLT hacks begin in earnest in Chapter 3, but I'll use an XSLT stylesheet here (without going into detail about the stylesheet itself) to show you how to shape this document up.
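If you don't have xmlspy handy, the same kind of import can be approximated in Python with the standard library. This is a rough sketch, not a reproduction of xmlspy's behavior: the import and row element names come from the description above, text_to_xml is a hypothetical helper, and the last field of the sample data is assumed to be a single space.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The two lines of time.txt; the final field is assumed to be one space.
TIME_TXT = "timezone;hour;minute;second;meridiem;atomic\nPST;11;59;59;p.m.; \n"


def text_to_xml(text, attribute_fields=("timezone",), delimiter=";"):
    """Convert delimited text with a header row into an XML string.

    Fields named in attribute_fields become attributes of the row
    element; every other field becomes a child element.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    names = next(reader)  # the first row holds the field names
    root = ET.Element("import")
    for record in reader:
        row = ET.SubElement(root, "row")
        for name, value in zip(names, record):
            if name in attribute_fields:
                row.set(name, value)
            else:
                ET.SubElement(row, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(text_to_xml(TIME_TXT))
```

Passing attribute_fields=("timezone",) plays the same role as clicking the symbol next to the timezone field name in the dialog box: that field ends up as an attribute rather than an element.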
Convert Text to XML with Uphill
This hack is a little different. It shows you how to convert plain text to XML using Dave Pawson's Java program, Uphill. Along the way, Dave also explains how and why he developed the software, which may be helpful for those developing their own text-to-XML packages in Java.
Text without any formatting is boring and repetitive to mark up as XML: just the sort of problem that a computer is good at, except that most text is not regular, which is the cost side of automation. I decided to try to create a solution in which automated conversion would cost less than conversion by hand. That's why I wrote Uphill (http://www.dpawson.co.uk/java/uphill/), a Java program for converting plain text into XML.
The goal for the program was to output a new file containing the XML markup for headings, paragraphs, and acronyms (needed for Braille output). First, I prototyped a solution with Python (http://www.python.org/) because Python has dictionaries that can be preloaded. I had a list of acronyms that I quickly converted into a Python structure to initialize a dictionary. The match I used was:
if acrs.has_key(str[i:i+4]):
I walked the input string, testing for four-letter, then three-letter, then two-letter acronyms. It worked, and though it was weak, it gave me enough confidence to move on.
A line from my acronym file looks like this:
USA:<acr>USA</acr>
That is, the acronym USA is marked up with the acr tag. I realized that some acronyms may be generalized. If the first two letters can be captured, any remaining uppercase letters were probably a part of the acronym. I came up with this as an entry:
BD:*
This tells me that if I spot BD, I can keep on looking for more uppercase letters, up until a terminal.
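The walk described above can be sketched in present-day Python. This is not the original prototype or Uphill's code, only an illustration of the idea: ACRONYMS is a hypothetical table built from the two sample entries, mark_acronyms is an invented name, and a "terminal" is taken to mean the first character that is not an uppercase letter.

```python
# Hypothetical acronym table in the spirit of the entries shown above:
# an exact entry maps to its marked-up form, and "*" marks a prefix
# that may be followed by further uppercase letters.
ACRONYMS = {
    "USA": "<acr>USA</acr>",
    "BD": "*",
}


def mark_acronyms(text, acronyms=ACRONYMS):
    """Walk the string, trying four-, three-, then two-letter matches."""
    out = []
    i = 0
    while i < len(text):
        for size in (4, 3, 2):
            candidate = text[i:i + size]
            if candidate not in acronyms:
                continue
            end = i + len(candidate)
            if acronyms[candidate] == "*":
                # Wildcard entry: keep consuming uppercase letters.
                while end < len(text) and text[end].isupper():
                    end += 1
                out.append(f"<acr>{text[i:end]}</acr>")
            else:
                out.append(acronyms[candidate])
            i = end
            break
        else:
            # No acronym starts here; copy one character through.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(mark_acronyms("The USA and BDFM lines"))
```

Like the prototype, this is weak: it will happily match an acronym in the middle of a longer word, so a real converter needs some notion of word boundaries.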
Download, unzip, and install Uphill in the working directory. Type this command:
java -jar uphill.jar
You will then see this usage information:
Create Well-Formed XML with Minimal Manual Tagging Using an SGML Parser
Convert minimal markup into XML with James Clark's SP.
The problem of converting plain text into basic, well-formed XML occurs over and over again in XML processing. As a general rule, I like to get data into XML as quickly as possible and leave it in XML for as long as possible (preferably forever). The sooner I can get data into XML, the sooner I can bring all my XML-processing tools and knowledge to bear on the data-processing challenges.
When the volume of markup to be created is small, hand-editing using one-off text editor macros is a powerful technique. For higher volumes of markup, a custom program is often the best way to go—Python, Ruby, and Perl, for example, all excel at this sort of work.
Sometimes, the quickest way to get data into XML is by combining judicious use of hand-edits and automatic addition of the markup required using an SGML parser. XML is a subset of a much larger markup technology standard known as SGML (ISO 8879:1986), which has been an international standard since 1986. SGML provides a variety of mechanisms, not found in XML, to minimize the amount of tagging required in documents. Collectively, these techniques are known as markup minimization features. By using an SGML parser to process text, it is possible to take advantage of the tag minimization features to automatically add markup and help create well-formed XML documents.
In these examples, we will use James Clark's SP SGML parser. You can download it from http://www.jclark.com/sp/. The examples in this hack assume that SP has been installed in the working directory for the book's files.
You may already be familiar with some of SGML's tag minimization capabilities, as they are used extensively in HTML. (HTML is an example of an SGML application—by far the most successful SGML application in the world.)
The most common tag minimization technique from SGML used in HTML is known as tag omission. Here is a small HTML document,
Create an XML Document from a CSV File
Want to go from CSV to XML? Use Dave Pawson's CSVToXML tool to convert CSV files to XML with Java.
Dave Pawson's CSVToXML translator converts comma-separated value (CSV) files to XML. CSV is a reliable, plain-text file format for storing the output of a spreadsheet or database.
Suppose you are running Excel 2000 and you want to convert a file, inventory.xls, to XML (see Figure 2-24). Unfortunately, you haven't been able to talk your boss into buying Excel 2003 yet, which could easily output the spreadsheet as XML. Luckily, there is a workaround.
Figure 2-24: inventory.xls in Excel 2000
Save the file as CSV by choosing File → Save As and selecting a CSV file format in the "Save as type" pull-down box. Navigate to the working directory where the other files from this book are, enter the name inventory.csv in the File name text box, and then click the Save button. The CSV file will appear as follows:
line,desc,quan,date
1,Oak chairs,6,31-Dec-04
2,Dining tables,1,31-Dec-04
3,Folding chairs,4,29-Dec-04
4,Couch,1,31-Dec-04
5,Overstuffed chair,1,30-Dec-04
6,Ottoman,1,31-Dec-04
7,Floor lamp,1,20-Dec-04
8,Oak bookshelves,1,31-Dec-04
9,Computer desk,1,31-Dec-04
10,Folding tables,3,31-Dec-04
11,Oak writing desk,1,28-Dec-04
12,Table lamps,5,26-Dec-04
13,Pine night tables,3,26-Dec-04
14,Oak dresser,1,30-Dec-04
15,Pine dressers,1,31-Dec-04
16,Pine armoire,1,31-Dec-04
Download the latest version of CSVToXML from http://www.dpawson.co.uk/java/index.html, extract the JAR file CSVToXML.jar from the ZIP archive, and place it in the working directory. Enter this command:
java -jar CSVToXML.jar
If you see this output, you are ready to roll:
No property File available; Quitting
CSVToXML 1.0 from Dave Pawson
Usage: java CSVToXML [options] {param=value}...
Options:
  -p filename     Take properties from named file
  -o filename     Send output to named file
  -i filename     Take CSV input from named file
  -t              Display version and timing information
  -?              Display this message
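For comparison, the basic CSV-to-XML idea can be sketched with nothing but Python's standard library. This is not how CSVToXML works internally and it does not read a properties file; csv_to_xml is a hypothetical helper, and the inventory and item element names are assumptions chosen for this sketch, not CSVToXML's defaults.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The first rows of inventory.csv as listed above.
INVENTORY_CSV = """line,desc,quan,date
1,Oak chairs,6,31-Dec-04
2,Dining tables,1,31-Dec-04
"""


def csv_to_xml(text, root_name="inventory", row_name="item"):
    """Turn CSV text with a header row into an XML string.

    root_name and row_name are hypothetical element names chosen for
    this sketch; each column heading becomes a child element name.
    """
    root = ET.Element(root_name)
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, row_name)
        for field, value in record.items():
            ET.SubElement(row, field).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(csv_to_xml(INVENTORY_CSV))
```

The column headings must be legal XML names for this to produce well-formed output, which is true of line, desc, quan, and date.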
Convert an HTML Document to XHTML with HTML Tidy
HTML Tidy was initially developed as a tool to clean up HTML, but it is an XML tool, too. This hack shows you how to use HTML Tidy to make your HTML into XHTML.
HTML Tidy was initially developed at the W3C by Dave Raggett (http://www.w3.org/People/Raggett/#tidy). Essentially, it's an open source HTML parser with the stated purpose of cleaning up and pretty-printing HTML, XHTML, and even XML. It is now hosted on SourceForge (http://tidy.sourceforge.net). You can download versions of Tidy for a variety of platforms there.
Example 2-10 shows an HTML document, goodold.html, which we will run through HTML Tidy.
Example 2-10. goodold.html
<HTML>
<HEAD><TITLE>Time</TITLE></HEAD>
<BODY style="font-family:sans-serif">
<H1>Time</H1>
<TABLE style="font-size:14pt" cellpadding="10">
<TR>
 <TH>Timezone</TH>
 <TH>Hour</TH>
 <TH>Minute</TH>
 <TH>Second</TH>
 <TH>Meridiem</TH>
 <TH>Atomic</TH>
</TR>
<TR>
 <TD>PST</TD>
 <TD>11</TD>
 <TD>59</TD>
 <TD>59</TD>
 <TD>p.m.</TD>
 <TD>true</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Assuming that Tidy is properly installed, you can issue the following command to convert goodold.html to the XHTML document goodnew.html using the -asxhtml switch: