XML Pocket Reference
XML Pocket Reference, Second Edition Extensible Markup Language

By Robert Eckstein
With Michel Casabianca


Chapter 1: XML Pocket Reference
Introduction
The Extensible Markup Language (XML) is a document-processing standard that is an official recommendation of the World Wide Web Consortium (W3C), the same group responsible for overseeing the HTML standard. Many expect XML and its sibling technologies to become the markup language of choice for dynamically generated content, including nonstatic web pages. Many companies are already integrating XML support into their products.
XML is actually a simplified form of Standard Generalized Markup Language (SGML), an international documentation standard that has existed since the 1980s. However, SGML is extremely complex, especially for the Web. Much of the credit for XML's creation can be attributed to Jon Bosak of Sun Microsystems, Inc., who started the W3C working group responsible for scaling down SGML to a form more suitable for the Internet.
Put succinctly, XML is a meta language that allows you to create and format your own document markups. With HTML, existing markup is static: <HEAD> and <BODY>, for example, are tightly integrated into the HTML standard and cannot be changed or extended. XML, on the other hand, allows you to create your own markup tags and configure each to your liking: for example, <HeadingA>, <Sidebar>, <Quote>, or <ReallyWildFont>. Each of these elements can be defined through your own document type definitions and stylesheets and applied to one or more XML documents. XML schemas provide another way to define elements. Thus, it is important to realize that there are no "correct" tags for an XML document, except those you define yourself.
While many XML applications currently support Cascading Style Sheets (CSS), a more extensible stylesheet specification exists, called the Extensible Stylesheet Language (XSL). With XSL, you ensure that XML documents are formatted the same way no matter which application or platform they appear on.
XSL consists of two parts: XSLT (transformations) and XSL-FO (formatting objects). Transformations, as discussed in this book, allow you to work with XSLT and convert XML documents to other formats such as HTML. Formatting objects are described briefly in Section 1.6.1.
XML Terminology
Before we move further, we need to standardize some terminology. An XML document consists of one or more elements. An element is marked with the following form:
<Body>
This is text formatted according to the Body element
</Body>
This element consists of two tags: an opening tag, which places the name of the element between a less-than sign (<) and a greater-than sign (>), and a closing tag, which is identical except for the forward slash (/) that appears before the element name. As in HTML, the text between the opening and closing tags is considered part of the element and is processed according to the element's rules.
Elements can have attributes applied, such as the following:
<Price currency="Euro">25.43</Price>
Here, the attribute is specified inside of the opening tag and is called currency. It is given a value of Euro, which is placed inside quotation marks. Attributes are often used to further refine or modify the default meaning of an element.
In addition to the standard elements, XML also supports empty elements. An empty element has no text between the opening and closing tags. Hence, both tags can (optionally) be combined by placing a forward slash before the closing marker. For example, these elements are identical:
<Picture src="blueball.gif"></Picture>
<Picture src="blueball.gif"/>
Empty elements are often used to add nontextual content to a document or provide additional information to the application that parses the XML. Note that while the closing slash may not be used in single-tag HTML elements, it is mandatory for single-tag XML empty elements.
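The terminology above can be tried out with a parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module (our choice of tool; the book does not prescribe one) to read the Price and Picture examples and to confirm that both spellings of the empty element parse to the same thing.

```python
import xml.etree.ElementTree as ET

# An element with an attribute: the parser exposes the element name,
# the attribute value, and the text between the opening and closing tags.
price = ET.fromstring('<Price currency="Euro">25.43</Price>')
print(price.tag, price.get("currency"), price.text)

# The two spellings of an empty element yield the same parsed element.
long_form = ET.fromstring('<Picture src="blueball.gif"></Picture>')
short_form = ET.fromstring('<Picture src="blueball.gif"/>')
same = ET.tostring(long_form) == ET.tostring(short_form)
print(same)
```

Because the parser treats both forms alike, the choice between them is a matter of style for the document author.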
Whereas HTML browsers often ignore simple errors in documents, XML applications are not nearly as forgiving. For the HTML reader, there are a few bad habits from which we should dissuade you:
XML is case-sensitive
Element names must be used exactly as they are defined. For example,
XML Reference
Now that you have had a quick taste of working with XML, here is an overview of the more common rules and constructs of the XML language.
These are the rules for a well-formed XML document:
  • All element attribute values must be in quotation marks.
  • An element must have both an opening and a closing tag, unless it is an empty element.
  • If a tag is a standalone empty element, it must contain a closing slash (/) before the end of the tag.
  • All opening and closing element tags must nest correctly.
  • Isolated markup characters are not allowed in text; < or & must use entity references. In addition, the sequence ]]> must be expressed as ]]&gt; when used as regular text. (Entity references are discussed in further detail later.)
  • Well-formed XML documents without a corresponding DTD must have all attributes of type CDATA by default.
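Several of these rules can be checked mechanically by handing a document to any conforming parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is ours, not part of any library. It only tests well-formedness and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    '<a b="1"><c/></a>',     # quoted attribute, correct nesting
    '<a b=1/>',              # attribute value not in quotation marks
    '<a><br></a>',           # <br> has no closing tag and no closing slash
    '<a><b></a></b>',        # tags do not nest correctly
    '<a>1 < 2</a>',          # isolated < in text
    '<a>1 &lt; 2</a>',       # same text using an entity reference
]
results = [is_well_formed(s) for s in samples]
for sample, ok in zip(samples, results):
    print(ok, sample)
```

Only the first and last samples are accepted; each of the others breaks one of the rules listed above.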
XML uses the following special markup constructs.
<?xml ...?>
<?xml version="number" [encoding="encoding"] [standalone="yes|no"] ?>
Although they are not required to, XML documents typically begin with an XML declaration, which must start with the characters <?xml and end with the characters ?>. Attributes include:
version
The version attribute specifies the correct version of XML required to process the document, which is currently
Entity and Character References
Entity references are used as substitutions for specific characters (or any string substitution) in XML. A common use for entity references is to denote document symbols that might otherwise be mistaken for markup by an XML processor. XML predefines five entity references for you, which are substitutions for basic markup symbols. However, you can define as many entity references as you like in your own DTD. (See the next section.)
Entity references always begin with an ampersand (&) and end with a semicolon (;). They cannot appear inside a CDATA section but can be used anywhere else. Predefined entities in XML are shown in the following table:
Entity    Char    Notes
&amp;     &       Do not use inside processing instructions.
&lt;      <       Use in text and inside attribute values.
&gt;      >       Use after ]] in normal text and inside processing instructions.
&quot;    "       Use inside attribute values quoted with ".
&apos;    '       Use inside attribute values quoted with '.
In addition, you can provide character references for Unicode characters with a numeric character reference. A decimal character reference consists of the string &#, followed by the decimal number representing the character, and finally, a semicolon (;). For hexadecimal character references, the string &#x is followed first by the hexadecimal number representing the character and then a semicolon. For example, to represent the copyright character, you could use either of the following lines:
This document is &#169; 2001 by O'Reilly and Assoc.
This document is &#xA9; 2001 by O'Reilly and Assoc.
The character reference is replaced with the "circled-C" (©) copyright character when the document is formatted.
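To see these substitutions happen, here is a small sketch using Python's standard library (an assumption on our part; any XML parser behaves the same way). It parses both forms of the copyright reference, resolves the five predefined entities, and uses xml.sax.saxutils.escape to go in the other direction when writing text into a document. The <note> and <t> element names are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal and hexadecimal character references name the same character.
dec = ET.fromstring("<note>This document is &#169; 2001</note>")
hexa = ET.fromstring("<note>This document is &#xA9; 2001</note>")
print(dec.text == hexa.text)

# The five predefined entity references are replaced by the parser.
predefined = ET.fromstring("<t>&lt; &amp; &gt; &quot; &apos;</t>")
print(predefined.text)

# Going the other way: escape markup characters before placing text in XML.
escaped = escape("AT&T <rocks>")
print(escaped)
```

Note that escape handles only &, <, and > unless it is given a dictionary of extra replacements.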
Document Type Definitions
A DTD specifies how elements inside an XML document should relate to each other. It also provides grammar rules for the document and each of its elements. A document adhering to the XML specifications and the rules outlined by its DTD is considered to be valid. (Don't confuse this with a well-formed document, which adheres only to the XML syntax rules outlined earlier.)
You must declare each of the elements that appear inside your XML document within your DTD. You can do so with the <!ELEMENT> declaration, which uses this format:
<!ELEMENT elementname rule>
This declares an XML element and an associated rule called a content model, which relates the element logically to the XML document. The element name should not include <> characters. An element name must start with a letter or an underscore. After that, it can have any number of letters, numbers, hyphens, periods, or underscores in its name. Element names may not start with the string xml in any variation of upper- or lowercase. You can use a colon in element names only if you use namespaces; otherwise, it is forbidden.
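The naming rules in the preceding paragraph can be written down as a short check. This is a simplified sketch in Python: the function name is_valid_element_name is hypothetical, and the pattern accepts only ASCII letters and digits, whereas the XML specification also admits many non-ASCII letters. The colon is rejected outright because the sketch ignores namespaces.

```python
import re

# First character: a letter or an underscore. Remaining characters:
# letters, digits, hyphens, periods, or underscores. ASCII only (assumption).
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")

def is_valid_element_name(name):
    """Apply the element-name rules described in the text (simplified)."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" in any case are reserved
    return _NAME.match(name) is not None

for candidate in ["title", "_note", "book-title.v2", "2ndEdition", "XmlData", "a:b"]:
    print(candidate, is_valid_element_name(candidate))
```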
The simplest element declaration states that between the opening and closing tags of the element, anything can appear:
<!ELEMENT library ANY>
The ANY keyword allows you to include other valid tags and general character data within the element. However, you may want to restrict an element so that only character data can appear in it. This type of data is better known as parsed character data, or PCDATA. You can specify that an element contain only PCDATA with a declaration such as the following:
<!ELEMENT title (#PCDATA)>
Remember, this declaration means that any character data that is not an element can appear between the element tags. Therefore, it's legal to write the following in your XML document:
<title></title>
<title>XML Pocket Reference</title>
<title>Java Network Programming</title>
The Extensible Stylesheet Language
The Extensible Stylesheet Language (XSL) is one of the most intricate specifications in the XML family. XSL can be broken into two parts: XSLT, which is used for transformations, and XSL Formatting Objects (XSL-FO). While XSLT is currently in widespread use, XSL-FO is still maturing; both, however, promise to be useful for any XML developer.
This section will provide you with a firm understanding of how XSL is meant to be used. For the very latest information on XSL, visit the home page for the W3C XSL working group at http://www.w3.org/Style/XSL/.
As we mentioned, XSL works by applying element-formatting rules that you define for each XML document it encounters. In reality, XSL simply transforms each XML document from one series of element types to another. For example, XSL can be used to apply HTML formatting to an XML document, which would transform it from:
<?xml version="1.0"?>
<OReilly:Book title="XML Comments">
 <OReilly:Chapter title="Working with XML">
  <OReilly:Image src="http://www.oreilly.com/1.gif"/>
  <OReilly:HeadA>Starting XML</OReilly:HeadA>
  <OReilly:Body>
    If you haven't used XML, then ...
  </OReilly:Body>
 </OReilly:Chapter>
</OReilly:Book>
to the following HTML:
<HTML>
  <HEAD>
   <TITLE>XML Comments</TITLE>
  </HEAD>
  <BODY>
   <H1>Working with XML</H1>
   <img src="http://www.oreilly.com/1.gif"/>
   <H2>Starting XML</H2>
   <P>If you haven't used XML, then ...</P>
  </BODY>
</HTML>
If you look carefully, you can see a predefined hierarchy that remains from the source content to the resulting content. To venture a guess, the <OReilly:Book> element probably maps to the <HTML>, <HEAD>, <TITLE>, and <BODY> elements in HTML. The <OReilly:Chapter> element maps to the HTML <H1> element, the <OReilly:Image> element maps to the <img> element, and so on.
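The mapping guessed at above can be spelled out by hand to make the idea concrete. The following Python sketch is not XSLT; it simply walks the source tree with the standard xml.etree.ElementTree module and builds the HTML tree according to the same correspondences. It works on a prefix-free copy of the example (the OReilly: prefix is dropped because the excerpt shows no namespace declaration for it), and the function name book_to_html is ours.

```python
import xml.etree.ElementTree as ET

# Prefix-free copy of the book example (assumption: the namespace prefix
# is omitted because no xmlns declaration appears in the excerpt).
SOURCE = """<Book title="XML Comments">
 <Chapter title="Working with XML">
  <Image src="http://www.oreilly.com/1.gif"/>
  <HeadA>Starting XML</HeadA>
  <Body>
    If you haven't used XML, then ...
  </Body>
 </Chapter>
</Book>"""

def book_to_html(book):
    """Map Book -> HTML/HEAD/TITLE/BODY, Chapter -> H1, and so on."""
    html = ET.Element("HTML")
    head = ET.SubElement(html, "HEAD")
    ET.SubElement(head, "TITLE").text = book.get("title")
    body = ET.SubElement(html, "BODY")
    for chapter in book.findall("Chapter"):
        ET.SubElement(body, "H1").text = chapter.get("title")
        for child in chapter:
            if child.tag == "Image":
                ET.SubElement(body, "img", src=child.get("src"))
            elif child.tag == "HeadA":
                ET.SubElement(body, "H2").text = child.text
            elif child.tag == "Body":
                ET.SubElement(body, "P").text = (child.text or "").strip()
    return html

html = book_to_html(ET.fromstring(SOURCE))
print(ET.tostring(html, encoding="unicode"))
```

An XSLT stylesheet expresses the same correspondences declaratively, as a set of templates, instead of as hand-written traversal code.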
This demonstrates an essential aspect of XML: each document contains a hierarchy of elements that can be organized in a tree-like fashion. (If the document uses a DTD, that hierarchy is well defined.) In the previous XML example, the
XSLT Stylesheet Structure
The general order for elements in an XSL stylesheet is as follows:
<xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:import/>
   <xsl:include/>
   <xsl:strip-space/>
   <xsl:preserve-space/>
   <xsl:output/>
   <xsl:key/>
   <xsl:decimal-format/>
   <xsl:namespace-alias/>
   <xsl:attribute-set>...</xsl:attribute-set>
   <xsl:variable>...</xsl:variable>
   <xsl:param>...</xsl:param>

   <xsl:template match="...">
      ...
   </xsl:template>

   <xsl:template name="...">
      ...
   </xsl:template>

</xsl:stylesheet>
Essentially, this ordering boils down to a few simple rules. First, all XSL stylesheets must be well-formed XML documents, and each XSL element must use the namespace prefix bound by the xmlns declaration in the <xsl:stylesheet> element (commonly xsl:). Second, all XSL stylesheets must begin with the XSL root element tag, <xsl:stylesheet>, and close with the corresponding tag, </xsl:stylesheet>. Within the opening tag, the XSL namespace must be defined:
<xsl:stylesheet
   version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
After the root element, you can import external stylesheets with <xsl:import> elements, which must always be first within the <xsl:stylesheet> element. Any other elements can then be used in any order and in multiple occurrences if needed.
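These structural rules are easy to check mechanically. The sketch below, written in Python with the standard xml.etree.ElementTree module, parses a stylesheet (which fails if it is not well formed), confirms that the root is xsl:stylesheet in the XSLT namespace, and reports any xsl:import that follows another top-level element. The function name check_stylesheet and the file name base.xsl are hypothetical, and the check is deliberately narrow: it is not a full XSLT validator.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def check_stylesheet(text):
    """Return a list of problems found; an empty list means none."""
    root = ET.fromstring(text)  # raises ET.ParseError if not well formed
    problems = []
    if root.tag != "{%s}stylesheet" % XSL_NS:
        problems.append("root element is not xsl:stylesheet")
    seen_other = False
    for child in root:
        if child.tag == "{%s}import" % XSL_NS:
            if seen_other:
                problems.append("xsl:import must come first")
        else:
            seen_other = True
    return problems

GOOD = """<xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:import href="base.xsl"/>
   <xsl:output method="html"/>
   <xsl:template match="para"><p><xsl:apply-templates/></p></xsl:template>
</xsl:stylesheet>"""

BAD = """<xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="html"/>
   <xsl:import href="base.xsl"/>
</xsl:stylesheet>"""

print(check_stylesheet(GOOD))
print(check_stylesheet(BAD))
```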
Templates and Patterns
An XSLT stylesheet transforms an XML document by applying templates for a given type of node. A template element looks like this:
<xsl:template match="pattern"> 
   ...
</xsl:template>
where pattern selects the type of node to be processed.
For example, say you want to write a template to transform a <para> node (for paragraph) into HTML. This template will be applied to all <para> elements. The tag at the beginning of the template will be:
<xsl:template match="para">
The body of the template often contains a mix of "template instructions" and text that should appear literally in the result, although neither is required. In the previous example, we want to wrap the contents of the <para> element in <p> and </p> HTML tags. Thus, the template would look like this:
<xsl:template match="para">
   <p><xsl:apply-templates/></p>
</xsl:template>
The <xsl:apply-templates/> element recursively applies the stylesheet's templates to the children of the <para> element (the current node) while this template is processing. Every stylesheet has at least two templates that apply by default. The first default template processes text and attribute nodes and writes them literally in the document. The second default template is applied to elements and root nodes that have no matching template. In this case, no output is generated, but templates are applied recursively from the node in question.
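The interplay between a user-defined template and the default templates can be imitated in a few lines of ordinary code. The following Python sketch is only an analogy, not an XSLT processor: a dictionary stands in for the stylesheet, the function apply_templates plays the role of <xsl:apply-templates/>, and the two fallback behaviors mimic the default templates (text is copied, unmatched elements produce no output of their own but are processed recursively). All names here are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def apply_templates(node, templates):
    """Process the children of node, in the spirit of <xsl:apply-templates/>."""
    out = [escape(node.text or "")]            # default rule: copy text
    for child in node:
        out.append(process(child, templates))
        out.append(escape(child.tail or ""))   # text following the child
    return "".join(out)

def process(node, templates):
    """Use the template registered for this element name, if there is one."""
    rule = templates.get(node.tag)
    if rule is None:
        # Default rule for elements: no output of its own, just recurse.
        return apply_templates(node, templates)
    return rule(node, templates)

def para(node, templates):
    # Counterpart of: <xsl:template match="para"><p>...</p></xsl:template>
    return "<p>" + apply_templates(node, templates) + "</p>"

templates = {"para": para}
doc = ET.fromstring("<section><para>Hello <em>XML</em> world</para></section>")
result = process(doc, templates)
print(result)
```

Neither <section> nor <em> has a template, so only their text content survives, while the <para> element is rewritten as an HTML paragraph.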
Now that we have seen the principle of templates, we can look at a more complete example. Consider the following XML document:
<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE chapter SYSTEM "example.dtd">

<chapter>
   <title>Sample text</title>
   <section title="First section">
      <para>This is the first section of the text.</para>
   </section>
   <section title="Second section">
      <para>This is the second section of the text.</para>
   </section>
</chapter>
To transform this into HTML, we use the following template:
XSLT Elements
The following list is an enumeration of XSLT elements.
<xsl:apply-imports>
<xsl:apply-imports/>
This styles the current node and each of its children using the imported stylesheet rules, ignoring those in the stylesheet that performed the import. Note that the rules don't apply to the current node's siblings or ancestors.
<xsl:apply-templates>
<xsl:apply-templates [select="node-set-expression"] [mode="mode"]/>
This specifies that the immediate children (default) or the selected nodes of the source element should be processed further. For example:
<xsl:template match="section">
     <B><xsl:apply-templates/></B>
</xsl:template>
This example processes the children of the selected <section> element and wraps the result in a bold tag. The optional select attribute determines which nodes should be processed:
<xsl:template match="section">
   <HR/>
   <xsl:apply-templates
      select="paragraph[@indent]//sidebar"/>
   <HR/>
   <xsl:apply-templates
      select="paragraph[@indent]/quote"/>
   <HR/>
</xsl:template>
This example processes only specific children of the selected <section> element. In this case, the first target is a <sidebar> element that is a descendant of a <paragraph> element that has defined an indent attribute. The second target is a <quote> element that is the direct child of a <paragraph> element that has defined an indent attribute. The optional mode attribute causes only templates with a matching mode to be applied.
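The difference between the two select expressions (descendant versus direct child, both restricted to paragraphs that carry an indent attribute) can be observed outside of XSLT. This sketch uses the limited XPath subset of Python's standard xml.etree.ElementTree module, with the <section> element as the context node; the sample document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

section = ET.fromstring("""
<section>
  <paragraph indent="5">
    <quote>direct child</quote>
    <note><sidebar>nested sidebar</sidebar><quote>nested quote</quote></note>
  </paragraph>
  <paragraph>
    <sidebar>no indent attribute</sidebar>
  </paragraph>
</section>
""")

# <sidebar> at any depth below a <paragraph> that has an indent attribute.
sidebars = [e.text for e in section.findall("paragraph[@indent]//sidebar")]

# <quote> only as a direct child of such a <paragraph>.
quotes = [e.text for e in section.findall("paragraph[@indent]/quote")]

print(sidebars)
print(quotes)
```

The nested quote is skipped by the second expression because it is a grandchild, and the sidebar in the second paragraph is skipped by the first because its paragraph has no indent attribute.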
<xsl:attribute>
<xsl:attribute name="name" [namespace="namespace"]> ... </xsl:attribute>
XPath
XPath is a recommendation of the World Wide Web Consortium (W3C) for locating nodes in an XML document tree. XPath is not designed to be used alone but in conjunction with other tools, such as XSLT or XPointer. These tools use XPath intensively and extend it for their own needs through new functions and new basic types.
XPath provides a syntax for locating a node in an XML document. It takes its inspiration from the syntax used to denote paths in filesystems such as Unix. An XPath expression is evaluated relative to a node, often called the context node, which depends on where the expression appears. For example, the context node of an XPath expression found in an <xsl:template match="para"> template will be the selected <para> element (recall that XSLT templates use XPath expressions). This node can be compared to a Unix shell's current directory.
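The role of the context node can be illustrated with Python's standard xml.etree.ElementTree module, which implements a small subset of XPath (an assumption worth stating: it is not a complete XPath engine). In this sketch the element on which findall is called acts as the context node, much as the current directory does for a relative filesystem path. The sample document is modeled on the chapter example used earlier.

```python
import xml.etree.ElementTree as ET

chapter = ET.fromstring("""
<chapter>
  <title>Sample text</title>
  <section title="First section"><para>One.</para></section>
  <section title="Second section"><para>Two.</para></section>
</chapter>
""")

# Context node: <chapter>. A bare name selects children with that name.
sections = [s.get("title") for s in chapter.findall("section")]
same_sections = [s.get("title") for s in chapter.findall("./section")]

# "*" selects all element children; a path steps down one level per "/".
children = [e.tag for e in chapter.findall("*")]
paras = [p.text for p in chapter.findall("section/para")]

# Changing the context node changes what the same kind of path selects.
first_section = chapter.find("section")
local_paras = [p.text for p in first_section.findall("para")]

print(sections, children, paras, local_paras)
```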
Given our earlier XML examples, it is possible to write the following expressions:
chapter
Selects the <chapter> element children of the context node
chapter/para
Selects the <para> element children of the <chapter> element children of the context node
../chapter
Selects the <chapter> element children of the parent of the context node
./chapter
Selects the <chapter> element children of the context node
*
Selects all element children of the context node
*/para
Selects the
XPointer and XLink
The final pieces of XML we cover are XPointer and XLink. These are separate standards in the XML family dedicated to working with XML links. Before we delve into them, however, we should warn you that the standards described here are not final as of publication time.
It's important to remember that an XML link is only an assertion of a relationship between pieces of documents; how the link is actually presented to a user depends on a number of factors, including the application processing the XML document.
To create a link, we must first have a labeling scheme for XML elements. One way to do this is to assign an identifier to specific elements we want to reference using an ID attribute:
<paragraph id="attack">
Suddenly the skies were filled with aircraft.
</paragraph>
You can think of IDs in XML documents as street addresses: they provide a unique identifier for an element within a document. However, just as there might be an identical address in a different city, an element in a different document might have the same ID. Consequently, you can tie together an ID with the document's URI, as shown here:
http://www.oreilly.com/documents/story.xml#attack
The combination of a document's URI and an element's ID should uniquely identify that element throughout the universe. Remember that an ID attribute does not need to be named id, as shown in the first example. You can name it anything you want, as long as you define it as an XML ID in the document's DTD. (However, using id is preferred in the event that the XML processor does not read the DTD.)
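As a rough illustration of how a URI and an ID combine, the following Python sketch splits the fragment from the URI with urllib.parse.urldefrag and then looks for the element carrying that id. Several things here are assumptions: the <story> wrapper and the first paragraph are invented, the document text is passed in directly instead of being fetched from the URI, the attribute is assumed to be named id, and the function name resolve is ours.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Invented sample standing in for the document at the URI.
STORY = """<story>
  <paragraph id="intro">It was a quiet morning.</paragraph>
  <paragraph id="attack">Suddenly the skies were filled with aircraft.</paragraph>
</story>"""

def resolve(uri, document_text):
    """Return the descendant element whose id matches the URI fragment."""
    _, fragment = urldefrag(uri)
    root = ET.fromstring(document_text)
    return root.find(".//*[@id='%s']" % fragment)

target = resolve("http://www.oreilly.com/documents/story.xml#attack", STORY)
print(target.text)
```

Because the lookup relies on the attribute being called id, it reflects the advice above about preferring that name when the DTD may not be read.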
Should you give an ID to every element in your documents? No. Odds are that most elements will never be referenced. It's best to place IDs on items that a reader would want to refer to later, such as chapter and section divisions, as well as important items, such as term definitions.
The easiest way to refer to an ID attribute is with an