XML. These three letters have brought shivers to almost every developer at some point in the last two years. Whether those shivers came from fear of another acronym to memorize, excitement at the promise of a new technology, or annoyance at another source of confusion, they were shivers all the same. Surprisingly, almost every one of these responses is well merited with regard to XML. It is another acronym to memorize, and in fact brings with it a dizzying array of companions: XSL, XSLT, PI, DTD, XHTML, and more. It also brings with it a huge promise: what Java did for portability of code, XML claims to do for portability of data. Sun has even been touting the rather ambitious slogan “Java + XML = Portable Code + Portable Data” in recent months. And yes, XML does bring with it a significant amount of confusion. We will seek to unravel and demystify XML, without being so abstract and general as to be useless, and without diving in so deeply that this becomes just another dry specification to wade through. This is a book for you, the Java developer who wants to understand the hype and use the tools that XML brings to the table.
Today’s web application now faces a wealth of problems that were not even considered ten years ago. Systems that are distributed across thousands of miles must perform quickly and flawlessly. Data from heterogeneous systems, databases, directory services, and applications must be transferred without a single decimal place being lost. Applications must be able to communicate not only with other business components, but other business systems altogether, often across companies as well as technologies. Clients are no longer limited to thick clients, but can be web browsers that support HTML, mobile phones that support the Wireless Application Protocol (WAP), or handheld organizers with entirely different markup languages. Data, and the transformation of that data, has become the crucial centerpiece of every application being developed today.
XML offers a way for programmers to meet all of these requirements. In addition, Java developers have an arsenal of APIs that enable them to use XML and its many companions without ever leaving a Java Integrated Development Environment (IDE). If this sounds a little too good to be true, keep reading. You will walk through the pitfalls of the various Java APIs as well as look at some of the bleeding-edge developments in the XML specification and the Java APIs for XML. Through it all, we will take a developer’s view. This is not a book about why you should use XML, but rather how you should use it. If there are offerings in the specification that are not of much use, details of why will be clearly given and we will move on; if something is of great value, we’ll spend some extra time on it. Throughout, we will focus on using XML as a tool, not using it as a buzzword or for the sake of having the latest toy. With that in mind, let’s begin to talk about what XML is.
XML is the Extensible Markup Language. Like its predecessor SGML, XML is a meta-language used to define other languages. However, XML is much simpler and more straightforward than SGML. XML is a markup language that specifies neither the tag set nor the grammar for that language. The tag set for a markup language defines the markup tags that have meaning to a language parser. For example, HTML has a strict set of tags that are allowed. You may use the tag <TABLE> but not the tag <CHAIR>. While the first tag has a specific meaning to an application using the data, and is used to signify the start of a table in HTML, the second tag has no specific meaning, and although most browsers will ignore it, unexpected things can happen when it appears. That is because when HTML was defined, the tag set of the language was defined with it. With each new version of HTML, new tags are defined. However, if a tag is not defined, it may not be used as part of the markup language without generating an error when the document is parsed. The grammar of a markup language defines the correct use of the language’s tags. Again, let’s use HTML as an example. When using the <TABLE> tag, several attributes may be included, such as the width, the background color, and the alignment. However, you cannot define the TYPE of the table because the grammar of HTML does not allow it.
XML, by defining neither the tags nor the grammar, is completely extensible; thus its name. If you choose to use the tag <TABLE> and then nest within that tag several <CHAIR> tags, you may do so. If you wish to define a TYPE attribute for the <CHAIR> tag, you may do that also. You could even use tags named after your children or co-workers if you so desired! To demonstrate, let’s take a look at the XML file shown in Example 1-1.
Example 1-1. A Sample XML File
<?xml version="1.0"?>
<dining-room>
  <table type="round" wood="maple">
    <manufacturer>The Wood Shop</manufacturer>
    <price>$1999.99</price>
  </table>
  <chair wood="maple">
    <quantity>2</quantity>
    <quality>excellent</quality>
    <cushion included="true">
      <color>blue</color>
    </cushion>
  </chair>
  <chair wood="oak">
    <quantity>3</quantity>
    <quality>average</quality>
  </chair>
</dining-room>
If you have never looked at an XML file, but are familiar with HTML or another markup language, this may look a bit strange to you. That’s because the tags and grammar being used are completely made up. No web page or specification defines the <table>, <chair>, or <cushion> tags (although one could, just as the XHTML specification defines HTML tags in XML); they are completely concocted. This is the power of XML: it allows you to define the content of your data in a variety of ways as long as you conform to the general structure that XML requires. Later we will go into detail on some additional constraints, but for now it is sufficient to realize that XML is built to allow flexibility of data formatting.
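To make this concrete, here is a minimal sketch of how a Java program might read the made-up tags from Example 1-1. It assumes the DOM parser classes bundled with current JDKs (javax.xml.parsers); the class and method names (DiningRoomReader, totalChairs, woodOfChair) are ours and purely illustrative. Notice that the parser needs no prior knowledge of <chair> or <quantity>; only our own code gives those names meaning.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DiningRoomReader {
    // The document from Example 1-1, inlined so the sketch is self-contained.
    static final String XML =
        "<?xml version=\"1.0\"?>"
        + "<dining-room>"
        + "<table type=\"round\" wood=\"maple\">"
        + "<manufacturer>The Wood Shop</manufacturer>"
        + "<price>$1999.99</price>"
        + "</table>"
        + "<chair wood=\"maple\">"
        + "<quantity>2</quantity>"
        + "<quality>excellent</quality>"
        + "<cushion included=\"true\"><color>blue</color></cushion>"
        + "</chair>"
        + "<chair wood=\"oak\">"
        + "<quantity>3</quantity>"
        + "<quality>average</quality>"
        + "</chair>"
        + "</dining-room>";

    /** Parses the text into a DOM tree; checked exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Adds up the <quantity> values of every <chair> element. */
    static int totalChairs(Document doc) {
        NodeList chairs = doc.getElementsByTagName("chair");
        int total = 0;
        for (int i = 0; i < chairs.getLength(); i++) {
            Element chair = (Element) chairs.item(i);
            String quantity = chair.getElementsByTagName("quantity").item(0).getTextContent();
            total += Integer.parseInt(quantity.trim());
        }
        return total;
    }

    /** Returns the wood attribute of the n-th <chair> element (zero-based). */
    static String woodOfChair(Document doc, int n) {
        return ((Element) doc.getElementsByTagName("chair").item(n)).getAttribute("wood");
    }
}
```

For the document in Example 1-1, totalChairs adds the two quantities (2 and 3), and woodOfChair(doc, 1) reads the wood attribute of the second chair.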
Although this flexibility is one of XML’s strongest points, it also creates one of its greatest weaknesses: because XML documents can be processed in so many different ways and for so many different purposes, there are a large number of XML-related standards to handle translation and specification of data. These additional acronyms, and their constant pairing with XML itself, often confuse what XML is and what it is not. More often than not, when you hear “XML,” the speaker is not referring specifically to the Extensible Markup Language, but to all or part of the suite of XML tools. Although sometimes these will be referred to separately, be aware that “XML” does not just mean XML; more often it means “XML and all the great ways there are to manipulate and use it.” With those preliminaries out of the way, we are ready to define some of the most common XML acronyms and give short descriptions of each. These will be fundamental to everything else in the book, so keep this chapter marked for reference. These descriptions should start to help you understand how the XML suite of tools fits together, what XML is, and what it isn’t. Discussion of publishing engines, applications, and tools for XML is avoided; these are discussed later when we talk about specific XML topics. Rather, this section only refers to specifications and recommendations in various stages of consideration. Most of these are initiatives of the W3C, the World Wide Web Consortium. This group defines standards for the XML community that help provide a common base of knowledge for this technology, much as Sun provides standards for Java and related APIs. For more on the W3C, visit http://www.w3.org on the Web.
XML, of course, is the root of all these three- and four-letter acronyms. It defines the core language itself and provides a metadata-type framework. XML by itself is of limited value; it defines only that framework. However, all of the various technologies that rest upon XML provide developers and content managers unprecedented flexibility in data management and transmission. XML is currently a completed W3C Recommendation, meaning it is final and will not change until another version is released. For the complete XML 1.0 Specification, see http://www.w3.org/TR/REC-xml/. As this specification is tough to read through for even the XML-savvy, an excellent annotated version of the specification is available at http://www.xml.com.
As we will spend lots of time going into detail on this subject in future chapters, there are only two basic concepts you need to understand about XML documents right now. The first is that any XML document must be well-formed to be of any use and to be parsed correctly. A well-formed document is one that has every tag closed that is opened, has no tags nested out of order, and is syntactically correct in regard to the specification. You may be wondering: didn’t we say that XML has no syntax rules? Not exactly; we said that it did not have any grammatical rules. While the document can define its own tags and attributes, it still must conform to a general set of principles. These principles are then used by XML-aware applications and parsers to make sense of the document and perform some action with the data, such as finding the price of a chair or creating a PDF file from the data within a document. We will discuss these details in greater depth in Chapter 2.
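A quick way to see what well-formed means in practice is to hand a parser a few strings and watch which ones it rejects. The following is only a sketch under our own assumptions: it uses the JDK's DocumentBuilder, and the class name WellFormedCheck is hypothetical. A parser reports a violation of well-formedness as a fatal error, which surfaces here as a SAXException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormedCheck {
    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Unclosed tags, tags nested out of order, and similar problems end up here.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble is not a statement about the document.
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, "<chair><quantity>2</quantity></chair>" is accepted, while "<chair><quantity>2</chair></quantity>" (tags nested out of order) and "<chair><quantity>2</quantity>" (an unclosed tag) are both rejected, even though no DTD or tag set has been defined anywhere.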
The second basic concept concerning XML documents is that they can be, but are not required to be, valid. A valid document is one that conforms to its document type definition (DTD), which we’ll talk about in a moment. Simply put, a DTD defines the grammar and tag set for a specific XML formatting. If a document specifies a DTD and follows that DTD’s rules, it is said to be a valid XML document. XML documents can also be constrained by a schema, a new way of dictating XML format that will replace DTDs. When a document conforms to a schema, it can be said to be schema valid. Don’t worry if this isn’t all clear yet; we have a long way to go, and we will look at each of these XML-related specifications. First, though, there are some acronyms and specifications that are used within an XML document. Let’s take a look at these now.
A PI in an XML document is a processing instruction. A processing instruction tells an application to perform some specific task. While PIs are a small portion of the XML specification, they are important enough to warrant a section in our discussion of XML acronyms. A PI is distinguished from other XML data because it represents a command to either the XML parser or a program that would use the XML document. For example, in our sample XML document in Example 1-1, the first line, which indicates the version of XML, has the form of a processing instruction. (Strictly speaking, the XML specification calls this line the XML declaration rather than a PI, but it looks the same and serves a similar purpose.) It indicates to the parser what version of XML is being used. Processing instructions are of the form <?target instructions?>. Target names that begin with xml, in any combination of upper- and lowercase, are reserved by the XML specification; instructions that use them are often called XML instructions. PIs can also specify information to be used by applications that may be wrapping the parsing behavior; in this case, the wrapping application might have a keyword (such as “cocoon”) that could be used as the PI’s target.
Processing instructions become extremely important when XML data is used in XML-aware applications. As a more salient example, consider the application that might process our sample XML file and then create advertisements for a furniture store based on what stock is available and listed in the XML document. A processing instruction could let the application know that some furniture is on a “want” list and must be routed to another application, such as an application that sends requests for more inventory, and should not be included in the advertisement, or other application-specific instructions. An XML parser will see PIs with external targets and pass them on unchanged to the external application.
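As a rough illustration of a parser passing PIs through to the application, the sketch below registers a SAX handler and records every processing instruction it is handed. The targets cocoon-process and furniture-ad, their data, and the class name PiCollector are all invented for this example; they are not defined by any specification. Note that SAX does not report the <?xml version="1.0"?> line through this callback, even though it shares the PI syntax.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PiCollector extends DefaultHandler {
    // Each entry has the form "target -> data".
    final List<String> seen = new ArrayList<>();

    /** Called by the parser for every processing instruction it encounters. */
    @Override
    public void processingInstruction(String target, String data) {
        seen.add(target + " -> " + data);
    }

    /** Parses the text and returns the PIs that were passed to the application. */
    static List<String> collect(String xml) {
        try {
            PiCollector handler = new PiCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<?xml version=\"1.0\"?>\n"
            + "<?cocoon-process type=\"xslt\"?>\n"             // hypothetical application PI
            + "<dining-room>"
            + "<?furniture-ad want-list?>"                     // hypothetical application PI
            + "<chair wood=\"maple\"/>"
            + "</dining-room>";
        for (String pi : collect(xml)) {
            System.out.println(pi);
        }
    }
}
```

The parser does not interpret either instruction; it simply hands the target and the data to our handler, which is free to act on them or ignore them.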
A DTD is a document type definition. A DTD establishes a set of constraints for an XML document (or a set of documents). DTD is not a specification on its own, but is defined as part of the XML specification. Within an XML document, a document type declaration can both include markup constraints and refer to an external document with markup constraints. The sum of these two sets of constraints is the document type definition. A DTD defines the way an XML document should be constructed. Consider the XML document in Example 1-1 again. Although we were able to create our own tags, this document is useless to another application, or even another human, who does not understand what our tags mean. Although some common sense can help in determining what the tags mean, there are still ambiguities. Can the <quantity> tag tell us how many chairs are in stock? Can a wood attribute be specified within a <chair> tag? These questions must be answered for the XML document to be properly validated by an XML parser. A document is considered valid when it follows the constraints that the DTD lays out for the formatting of XML data. This is particularly important when trying to transfer data between applications, as there must be an agreed-upon formatting and syntax for different systems to understand each other.
Remember that earlier we said a DTD defined the constraints for a specific XML document or set of documents. A developer or content author also creates this DTD as an additional document referenced in his or her XML files, or includes it within the XML file itself, so it does not in any way limit the XML documents. In fact, the DTD is what gives XML data its portability. It might define that for the wood attribute, only “maple”, “pine”, “oak”, and “mahogany” are acceptable values. This allows a parser to determine if the document is acceptable in its content, preventing data errors. A DTD also defines the order of nesting in tags. It might dictate that the <cushion> tag can only appear nested within the <chair> tag. This allows another application receiving our example XML file to know how to process and search within the received file. The DTD is what adds portability to an XML document’s extensibility, resulting not only in flexible data, but data that can be processed and validated by any machine that can locate the document’s DTD.
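The two constraints just described, an enumerated wood attribute and a <cushion> that may only appear inside a <chair>, can be sketched as a small DTD and checked with a validating parser. Everything in this example is an assumption made for illustration: the DTD is a cut-down one we wrote for a simplified dining room (it is not the DTD for Example 1-1), it is placed in the document's internal subset so nothing has to be fetched, and DtdValidation is a hypothetical class name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidation {
    // Hypothetical DTD: wood is limited to four values, cushion may only appear inside chair.
    static final String DTD =
        "<!DOCTYPE dining-room [\n"
        + "  <!ELEMENT dining-room (chair+)>\n"
        + "  <!ELEMENT chair (quantity, cushion?)>\n"
        + "  <!ATTLIST chair wood (maple|pine|oak|mahogany) #REQUIRED>\n"
        + "  <!ELEMENT quantity (#PCDATA)>\n"
        + "  <!ELEMENT cushion EMPTY>\n"
        + "]>\n";

    /** Returns true if the body, prefixed with the DTD above, is a valid document. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask the parser to check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = { true };
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported as (recoverable) errors.
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return valid[0];
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A chair with wood="oak" passes; the same chair with wood="plastic" fails because the value is not in the list, and a <cushion> placed directly under <dining-room> fails because the DTD only allows it within a <chair>.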
Namespaces is one of the few XML-related concepts that has not been converted into an acronym. It even has a name that describes its purpose! A namespace is a mapping between an element prefix and a URI. This mapping is used for handling namespace collisions and defining data structures that allow parsers to handle collisions. As an example of a possible namespace collision, consider an XML document that might include a <price> tag for a chair, between a <chair> and a </chair> tag. However, we also include in the chair definition a <cushion> tag, which might also have a <price> tag. Also consider that the document may reference another XML document for copyright information. Both documents could reasonably have <date> or possibly <company> tags. Conflicting tags such as these result in ambiguity as to which tag means what. This ambiguity creates significant problems for an XML parser. Should the <price> tag be interpreted differently depending on which element it is within? Or did the content author make a mistake in using it in two contexts? Without additional namespace information, it is impossible to decide if this was an error in the XML document construction, and if not, how to use the data within the conflicting tags.
The XML namespace Recommendation defines a mechanism to qualify these names. This mechanism uses URIs to perform this task, although this is a little beyond what we need to know right now. In qualifying both the correct usage and placement of tags like the <price> tag in our example, an XML document is not forced to use rather foolish naming such as <chair-price> and <cushion-price>. Instead, a namespace is associated with a prefix to an XML element, and results in tags such as <chair:price> and <cushion:price>. An XML parser can then distinguish between these two namespaces without having to use entirely different element names. Namespaces are most often used within XML documents, but are also used in schemas and XSL stylesheets, as well as other XML-related specifications. The Recommendation for namespaces can be found at http://www.w3.org/TR/REC-xml-names.
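Here is a small sketch of how a namespace-aware parser keeps <chair:price> and <cushion:price> apart. The two URIs, the prices, and the class name NamespaceDemo are made up for the example; the only real requirement is that each prefix is bound to a URI with an xmlns declaration and that the parser factory is switched to namespace-aware mode.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {
    // Hypothetical namespace URIs; they only need to be distinct, not resolvable.
    static final String CHAIR_NS = "http://example.com/ns/chair";
    static final String CUSHION_NS = "http://example.com/ns/cushion";

    static final String XML =
        "<chair:chair xmlns:chair=\"" + CHAIR_NS + "\" xmlns:cushion=\"" + CUSHION_NS + "\">"
        + "<chair:price>199.99</chair:price>"
        + "<cushion:cushion>"
        + "<cushion:price>24.99</cushion:price>"
        + "</cushion:cushion>"
        + "</chair:chair>";

    /** Returns the text of the first price element in the given namespace, or null if none. */
    static String priceIn(String namespaceUri, String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK's factory
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Both elements have the local name "price"; the URI tells them apart.
            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, "price");
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Asking for the price in CHAIR_NS yields the chair's price and asking in CUSHION_NS yields the cushion's, so the application never has to guess which <price> it is looking at.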
XSL is the Extensible Stylesheet Language. XSL transforms and translates XML data from one XML format into another. Consider, for example, that the same XML document may need to be displayed in HTML, PDF, and Postscript form. Without XSL, the XML document would have to be manually duplicated, and then converted into each of these three formats. Instead, XSL provides a mechanism of defining stylesheets to accomplish these types of tasks. Rather than having to change the data because of a different representation, XSL provides a complete separation of data, or content, and presentation. If an XML document needs to be mapped to another representation, then XSL is an excellent solution. It provides a method comparable to writing a Java program to translate data into a PDF or HTML document, but supplies a standard interface to accomplish the task.
To perform the translation, an XSL document can contain formatting objects. These formatting objects are specific named tags that can be replaced with appropriate content for the target document type. A common formatting object might define a tag that some processor uses in the transformation of an XML document into PDF; in this case, the tag would be replaced by PDF-specific information. Formatting objects are specific XSL instructions, and although we will lightly discuss them, they are largely beyond the scope of this book. Instead, we will focus more on XSLT, a completely text-based transformation process. Through the process of XSLT (Extensible Stylesheet Language Transformations), an XSL textual stylesheet and a textual XML document are “merged” together, and what results is the XML data formatted according to the XSL stylesheet. To help clarify this difficult concept further, let’s look at another sample XML file, shown in Example 1-2.
Example 1-2. Another Sample XML File
<?xml version="1.0"?>
<?xml-stylesheet href="hello.xsl" type="text/xsl"?>
<!-- Here is a sample XML file -->
<page>
  <title>Test Page</title>
  <content>
    <paragraph>What you see is what you get!</paragraph>
  </content>
</page>
This document defines itself as XML version 1.0, and then defines the location of a corresponding XSL stylesheet, hello.xsl. This is similar to the way in which DTDs are used; just as a DTD can be referenced in XML to define how the data can be structured, an XSL file can be referenced to determine how the data is presented and displayed. Example 1-3 looks at the XSL stylesheet that is referred to.
Example 1-3. The Stylesheet for Example 1-2
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="page">
    <html>
      <head>
        <title>
          <xsl:value-of select="title"/>
        </title>
      </head>
      <body bgcolor="#ffffff">
        <xsl:apply-templates select="content"/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="paragraph">
    <p align="center">
      <i>
        <xsl:apply-templates/>
      </i>
    </p>
  </xsl:template>
</xsl:stylesheet>
This stylesheet is designed to convert our basic XML document and its data into HTML suitable for a web browser. While most of these details are things we will discuss later, concentrate on the <xsl:template match="[element name]"> tags. Any time this type of tag occurs, the element named in the match attribute, for example, paragraph, is replaced by the contents of that template, which in this case results in a <p> tag containing italicized text. What results from the transformation of the XML document by the XSL stylesheet is shown in Example 1-4.
Example 1-4. HTML Result from Example 1-2 and Example 1-3
<html>
  <head>
    <title>
      Test Page
    </title>
  </head>
  <body bgcolor="#ffffff">
    <p align="center">
      <i>
        What you see is what you get!
      </i>
    </p>
  </body>
</html>
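To give a feel for how the “merge” of a document and a stylesheet might be driven from Java, here is a minimal sketch that uses the javax.xml.transform classes bundled with current JDKs; the text does not name an API at this point, so treat that choice, and the class name HelloTransform, as our assumptions. The document and stylesheet are inlined from Example 1-2 and Example 1-3 with two adjustments of ours: a version="1.0" attribute on the stylesheet element, which XSLT 1.0 processors expect, and select="content" on the apply-templates inside <body>, so that the title text is not copied into the body a second time by XSLT's built-in template rules.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class HelloTransform {
    // The document from Example 1-2.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<?xml-stylesheet href=\"hello.xsl\" type=\"text/xsl\"?>\n"
        + "<!-- Here is a sample XML file -->\n"
        + "<page>\n"
        + "  <title>Test Page</title>\n"
        + "  <content>\n"
        + "    <paragraph>What you see is what you get!</paragraph>\n"
        + "  </content>\n"
        + "</page>\n";

    // The stylesheet from Example 1-3, with the two adjustments described above.
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">\n"
        + "  <xsl:template match=\"page\">\n"
        + "    <html>\n"
        + "      <head><title><xsl:value-of select=\"title\"/></title></head>\n"
        + "      <body bgcolor=\"#ffffff\"><xsl:apply-templates select=\"content\"/></body>\n"
        + "    </html>\n"
        + "  </xsl:template>\n"
        + "  <xsl:template match=\"paragraph\">\n"
        + "    <p align=\"center\"><i><xsl:apply-templates/></i></p>\n"
        + "  </xsl:template>\n"
        + "</xsl:stylesheet>\n";

    /** Applies the stylesheet to the document and returns the result as a string. */
    static String transform(String xml, String xsl) {
        try {
            // The stylesheet is passed in explicitly; the xml-stylesheet PI is not consulted here.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSL));
    }
}
```

The exact whitespace, and any extra header elements a particular processor adds for HTML output, may differ from Example 1-4, but the <paragraph> element should come out as a centered, italicized <p>.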
Don’t worry about understanding all of the specifics of XSL and XSLT yet; just realize that using XML and XSL, highly flexible document formats can result from the same set of underlying XML data. We will spend more time on XSL in Chapter 6. XSL is currently a W3C Working Draft. The Recommendations related to XSL may be viewed online at http://www.w3.org/Style/XSL.
XPath (XML Path Language) is a specification in its own right, but is used heavily by XSLT. The XPath specification defines how a specific item within an XML document can be located. This is accomplished through referencing specific nodes in the XML document; here, node refers to any piece of XML data, including elements, attributes, or textual data. In the XPath specification, an XML document is considered a tree of these nodes, where each node can be accessed by specifying the location in the tree at which it is located. We won’t get into details about using XPath until we discuss XSL and XSLT more, but expect to use it anytime you must obtain a reference to a specific piece of data within an XML document. To let you know what to expect, here is a sample XPath expression:
*[not(self::JavaXML:Title)]
This particular expression evaluates to all child elements of the current element, where the child’s name is not JavaXML:Title. For this document fragment:
<JavaXML:Book>
  <JavaXML:Title>Java and XML</JavaXML:Title>
  <JavaXML:Content>
    <!-- Chapters go here -->
  </JavaXML:Content>
  <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>
</JavaXML:Book>
evaluating the expression when the current node is the JavaXML:Book element would yield the JavaXML:Content and JavaXML:Copyright elements. The complete XPath specification is online at http://www.w3.org/TR/xpath.
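The evaluation just described can be tried out with the javax.xml.xpath classes in current JDKs; this is a sketch under several assumptions of ours rather than the way the expression will be used later in the book. The fragment does not declare the JavaXML prefix or the OReillyCopyright entity, so the example binds the prefix to a made-up URI and substitutes placeholder text for the entity. The class name XPathDemo is also hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    static final String NS = "http://example.com/JavaXML"; // hypothetical namespace URI

    static final String XML =
        "<JavaXML:Book xmlns:JavaXML=\"" + NS + "\">\n"
        + "  <JavaXML:Title>Java and XML</JavaXML:Title>\n"
        + "  <JavaXML:Content><!-- Chapters go here --></JavaXML:Content>\n"
        + "  <JavaXML:Copyright>copyright placeholder</JavaXML:Copyright>\n" // stands in for the entity
        + "</JavaXML:Book>";

    /** Evaluates *[not(self::JavaXML:Title)] with the root element as the current node. */
    static List<String> childrenExceptTitle(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Tell the XPath engine which URI the JavaXML prefix in the expression stands for.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "JavaXML".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return NS.equals(uri) ? "JavaXML" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = new ArrayList<>();
                    if (NS.equals(uri)) {
                        prefixes.add("JavaXML");
                    }
                    return prefixes.iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                "*[not(self::JavaXML:Title)]", doc.getDocumentElement(), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Called on the fragment above, the method returns the names JavaXML:Content and JavaXML:Copyright, in document order; the whitespace between elements is not selected because * matches only element nodes.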
XML Schema is designed to replace and amplify DTDs. XML Schema offers an XML-centric means to constrain XML documents. Though we have only looked briefly at DTDs so far, they have some rather critical limitations: they have no knowledge of hierarchy, they have difficulty handling namespace conflicts, and they have no means of specifying allowed relationships between XML documents. This is understandable, as the members of the working group who wrote the specification certainly had no idea that XML would be used in so many different ways! However, the limitations of DTDs have become constricting to XML authors and developers.
The most significant fact about XML Schema is that it brings DTDs back into line with XML itself. That may sound confusing; consider, though, that every acronym we have talked about uses XML documents to define its purpose. XSL stylesheets, namespaces, and the rest all use XML to define specific uses and properties of XML. But a DTD is entirely different. A DTD does not look like XML, it does not share XML’s hierarchical structure, and it does not even represent data in the same way. This makes the DTD a bit of an oddball in the XML world, and because DTDs currently define how XML documents must be constructed, this has been causing some confusion. XML Schema corrects this problem by returning to using XML itself to define XML. We have been talking about “defining data about data” a lot, and XML Schema does this as well. The XML Schema specification moves XML a lot closer to having all of its constructs in the same language, rather than having DTDs as an aberration that has to be dealt with.
Wisely, the W3C and XML contributors realized that to refine DTD would be somewhat of a wasted effort. Instead, XML Schema is being developed to replace DTD, allowing these contributors to correct problems that DTD could not handle, as well as add enhancements in line with the various ways in which XML is currently being used. To learn more about this important W3C draft, visit http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/. A helpful primer on XML Schema is located at http://www.w3.org/TR/xmlschema-0/.
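To show what “using XML itself to define XML” can look like, the sketch below writes a small constraint set as an XML document and validates against it with the javax.xml.validation API found in current JDKs. Several things here are assumptions on our part: the schema syntax is that of the later, final XML Schema Recommendation that this API accepts (it may differ from the draft discussed in the text), the schema itself is invented for a single <chair> element, and SchemaValidation is a hypothetical class name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaValidation {
    // A hypothetical schema: the constraints are themselves written as XML.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xs:element name=\"chair\">\n"
        + "    <xs:complexType>\n"
        + "      <xs:sequence>\n"
        + "        <xs:element name=\"quantity\" type=\"xs:positiveInteger\"/>\n"
        + "      </xs:sequence>\n"
        + "      <xs:attribute name=\"wood\" use=\"required\">\n"
        + "        <xs:simpleType>\n"
        + "          <xs:restriction base=\"xs:string\">\n"
        + "            <xs:enumeration value=\"maple\"/>\n"
        + "            <xs:enumeration value=\"pine\"/>\n"
        + "            <xs:enumeration value=\"oak\"/>\n"
        + "            <xs:enumeration value=\"mahogany\"/>\n"
        + "          </xs:restriction>\n"
        + "        </xs:simpleType>\n"
        + "      </xs:attribute>\n"
        + "    </xs:complexType>\n"
        + "  </xs:element>\n"
        + "</xs:schema>\n";

    /** Compiles the schema above; a broken schema is a programming error, not a validation result. */
    static Schema compileSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema itself is invalid", e);
        }
    }

    /** Returns true if the document conforms to the schema. */
    static boolean isSchemaValid(String xml) {
        Schema schema = compileSchema();
        try {
            // With no error handler set, the validator throws on the first violation.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Besides restricting wood to a list of values, as a DTD could, this schema also requires <quantity> to hold a positive integer, so a quantity of "many" is rejected; that kind of data typing is one of the enhancements a DTD has no way to express.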
XQL is a query language designed to allow XML document formats to easily represent database queries. Although not yet formally adopted by the W3C, XQL’s popularity and usefulness will almost certainly make it the de facto method for specifying access to data stored in a database from an XML document. The structure of a query is defined using XPath concepts, and the result set is defined using standard XML with XQL-specific tags. For example, the following XQL expression would search through the books table and return all records where the title contains “Java”; for each record, the author records (from the authors table) would be displayed:
//book[title contains "Java"] ( .//authors )
The result set from this query might look like the following:
<xql:result>
  <book>
    <author name="Richard Monson-Haefel" location="Minnesota" />
  </book>
  <book>
    <author name="Jason Hunter" location="California" />
    <author name="William Crawford" location="Massachusetts" />
  </book>
</xql:result>
There will most likely be quite a bit of change as the specification matures and is hopefully adopted by the W3C, but XQL is a technology worth keeping an eye on. The current proposal for XQL is at http://metalab.unc.edu/xql/xql-proposal.html. This proposal made its way to the W3C in January of 2000, and current requirements for the XML Query language can be found at http://www.w3.org/TR/xmlquery-req.
You have now been sped through a very brief introduction of some of the major XML-related specifications we will cover. You can probably think of one or two acronyms we didn’t cover, if not more. We have selected only the particular acronyms that are especially relevant to our discussions on handling XML within Java. There are quite a few more, and they are listed here with the URLs for the appropriate recommendations or working drafts:
Resource Description Framework (RDF): http://www.w3.org/TR/PR-rdf-schema/
XLink: http://www.w3.org/TR/xlink/
XPointer: http://www.w3.org/TR/xptr/
This list will probably be outdated by the time you read this chapter, as more XML-based ideas are being examined and proposed every day. Just because these are not given significant time or space in this book, it should not make you think they are somehow less important; they are just not as critical to our discussions on manipulating XML data within Java. A complete understanding and mastery of XML certainly would require these specifications to be absorbed as well as those we have discussed in more detail. We still are likely to run across some of the specifications we have listed here; when that occurs, a definition and discussion will be provided in the text to help you understand what we are talking about.