XML provides the syntax necessary to create your own vocabulary or dialect of XML. Here are a few things you need to know about namespaces and schemas.
One of the best things about XML is that you can create your own tags—a vocabulary or dialect—if you want. To create a vocabulary, you should understand a couple of things about schemas and namespaces. You can use XML without schemas or namespaces, but sometimes you want to use one, the other, or both. This hack explains when you’ll want to use schemas and namespaces and when you’ll want to avoid them.
XML documents must be well-formed. This means that they must adhere to the syntax defined in the XML specification (http://www.w3.org/TR/REC-xml/). This syntax mandates such things as matching case in tag names, matching quotes around attribute values, restrictions on what Unicode characters may be used, and so on.
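To make two of these rules concrete, here is a small set of fragments (not a complete document) contrasting markup that an XML parser should reject with well-formed equivalents. The element and attribute names are borrowed from the time examples in this hack; the broken lines are deliberate and are meant only as an illustration.

```xml
<!-- Not well-formed: the end tag's case does not match the start tag -->
<hour>11</Hour>

<!-- Not well-formed: the attribute value opens with " but closes with ' -->
<time timezone="PST'/>

<!-- Well-formed equivalents of the two fragments above -->
<hour>11</hour>
<time timezone="PST"/>
```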
An XML document may also be valid. This means that such a document must conform to the restrictions laid out in an associated schema. Basically, a schema declares or defines what elements and attributes are allowed in a valid instance, including in what order the elements may appear. Governing document layout with schemas can greatly increase the reliability, consistency, and accuracy of exchanged documents.
The native schema language of XML is the document type definition or DTD [Hack #68], which is part of the XML specification and which XML inherited, in simplified form, from SGML. The document valid.xml in Example 1-7 uses a document type declaration (the line beginning with <!DOCTYPE) to associate a DTD with itself.
Example 1-7. valid.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time SYSTEM "time.dtd">
<!-- a time instant -->
<time timezone="PST">
<hour>11</hour>
<minute>59</minute>
<second>59</second>
<meridiem>p.m.</meridiem>
<atomic signal="true"/>
</time>
The document type declaration states that the document element for valid.xml is the time element and that it is an instance of the DTD time.dtd. The SYSTEM keyword indicates that the DTD will be found at the location given by the filename that follows, which is resolved relative to the location of valid.xml (in this case, the same directory). The simple DTD time.dtd is shown in Example 1-8.
Example 1-8. time.dtd
<!ELEMENT time (hour,minute,second,meridiem,atomic)>
<!ATTLIST time timezone CDATA #REQUIRED>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT minute (#PCDATA)>
<!ELEMENT second (#PCDATA)>
<!ELEMENT meridiem (#PCDATA)>
<!ELEMENT atomic EMPTY>
<!ATTLIST atomic signal CDATA #REQUIRED>
The DTD is not written in XML syntax: it has its own structural rules. This DTD uses what are called markup declarations for elements and attributes to spell out how the elements and attributes should appear in an instance. For example, the element declaration on line 1 indicates that the time element will contain only child elements, and that exactly one occurrence of each of these child elements will appear in the exact order hour, minute, second, meridiem, and atomic.
The element declaration on line 3 tells the XML processor that the hour element will contain parsed character data or text (the same goes for the minute, second, and meridiem elements declared on lines 4, 5, and 6). The atomic element is declared empty on line 7 (no content).
Two attributes are declared on lines 2 and 8. These attribute-list declarations, probably so called because you can list more than one attribute at a time, first name the element that is linked to the attribute (time with timezone, atomic with signal), followed by the kind of value allowed for the attribute (CDATA is text, basically). Finally, a token is given that indicates that the attribute is required and must appear on the element (#REQUIRED).
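As a sketch of how one attribute-list declaration can carry more than one attribute, the following document places its DTD in an internal subset rather than in a separate file. The source attribute, its value, and the enumerated (true|false) type for signal are assumptions added for illustration; they are not part of time.dtd.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE atomic [
 <!ELEMENT atomic EMPTY>
 <!-- One attribute-list declaration, two attributes.
      "source" is a hypothetical attribute added for illustration. -->
 <!ATTLIST atomic
           signal (true|false) #REQUIRED
           source CDATA        #IMPLIED>
]>
<atomic signal="true" source="WWVB"/>
```

Here #IMPLIED marks source as optional, while the enumerated type restricts signal to one of the two listed values, something CDATA alone cannot do.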
XML Schema [Hack #69] was developed by the W3C, reaching recommendation status in May 2001 (http://www.w3.org/XML/Schema). Written in XML, it is a grammar-based schema language that aims to provide more expressive power than DTDs, which it succeeds in doing to a degree. One of the most popular features of XML Schema is its extensive set of datatypes (http://www.w3.org/TR/xmlschema-2/). DTDs offer only about 10 types, and for attributes only, but XML Schema provides a broad range of standard types (string, date, boolean, integer, and byte, to name a few) for both elements and attributes. XML Schema’s structures recommendation (http://www.w3.org/TR/xmlschema-1), where its elements and attributes are specified, is long and very complex, even somewhat obfuscated, and mournfully so. It is widely used because of the W3C imprimatur, though other schema languages seem more popular in certain circles. Take RELAX NG [Hack #72], for example.
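For comparison with time.dtd, here is a minimal sketch of how the same structure might be expressed in XML Schema. The datatype choices (xs:integer for the numeric elements, xs:boolean for signal) are assumptions made for illustration; the DTD itself says nothing about them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="time">
  <xs:complexType>
   <!-- Child elements must appear once each, in this order -->
   <xs:sequence>
    <xs:element name="hour" type="xs:integer"/>
    <xs:element name="minute" type="xs:integer"/>
    <xs:element name="second" type="xs:integer"/>
    <xs:element name="meridiem" type="xs:string"/>
    <!-- An empty element that carries one required attribute -->
    <xs:element name="atomic">
     <xs:complexType>
      <xs:attribute name="signal" type="xs:boolean" use="required"/>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="timezone" type="xs:string" use="required"/>
  </xs:complexType>
 </xs:element>
</xs:schema>
```

Unlike the DTD, this sketch can reject an hour element whose content is not an integer, which shows the kind of element-level typing that DTDs lack.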
RELAX NG (http://www.relaxng.org and http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=relax-ng) is also a grammar-based schema language, and was developed by James Clark and Murata Makoto at OASIS (http://www.oasis-open.org). It is a remarkably intuitive language that is easy to grasp yet has sound mathematical underpinnings, which makes it very popular with users and developers alike. It has great expressive power; for example, you can do things like validate interleaved elements that can appear in any order. It’s modular, too: it implements XML Schema’s datatypes, for instance, though you can implement your own datatypes if you like. RELAX NG has recently become an ISO standard as part 2 of the Document Schema Definition Language or DSDL (ISO/IEC 19757-2:2003 Information technology—Document Schema Definition Language (DSDL)—Part 2: Regular-grammar-based validation—RELAX NG). Search for DSDL at http://www.iso.ch.
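To give a feel for the syntax, here is a minimal sketch of a RELAX NG schema (in its XML syntax) for the time document from Example 1-7. It borrows XML Schema’s datatypes through the datatypeLibrary attribute; the specific types chosen (integer, boolean) are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<element name="time" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 <!-- An attribute pattern with no content model allows any text value -->
 <attribute name="timezone"/>
 <element name="hour"><data type="integer"/></element>
 <element name="minute"><data type="integer"/></element>
 <element name="second"><data type="integer"/></element>
 <element name="meridiem"><text/></element>
 <!-- An element holding only an attribute is implicitly empty -->
 <element name="atomic">
  <attribute name="signal"><data type="boolean"/></attribute>
 </element>
</element>
```

As written, the child elements must appear in the order shown. Wrapping the five element patterns in an interleave element would allow them to appear in any order.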
Unlike its grammar-based cousins, Schematron [Hack #77] (http://www.schematron.com) is an assertion-based language that works well with other schema languages. As its creator Rick Jelliffe has said, it’s the feather duster that reaches into the corners of documents that other languages can’t reach. Assertions are expressed as paths, and reference implementations for Schematron are written in XSLT, a natural language for analyzing paths. Along with RELAX NG, Schematron is being standardized as part of ISO’s DSDL.
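The following is a minimal sketch of a Schematron schema for the time document from Example 1-7, written with the ISO Schematron namespace. The two rules it checks (an hour from 1 to 12, and a meridiem of a.m. or p.m.) are assumptions chosen for illustration; they are the sort of value constraints a DTD cannot express.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
 <pattern>
  <!-- The context path selects the nodes that the assertions apply to -->
  <rule context="time">
   <assert test="hour &gt;= 1 and hour &lt;= 12">The hour must be between 1 and 12.</assert>
   <assert test="meridiem = 'a.m.' or meridiem = 'p.m.'">The meridiem must be a.m. or p.m.</assert>
  </rule>
 </pattern>
</schema>
```

Each test attribute holds an XPath expression; when it evaluates to false for a matching time element, the text of the assert element is reported.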
There are other schema languages for XML, but DTDs, XML Schema, RELAX NG, and Schematron are the most popular.
Namespaces in XML provide a way to disambiguate names in XML documents, thus helping avoid the collision of names when two or more vocabularies are combined in a document. The following document, namespace.xml (Example 1-9), shows a default namespace declaration, which declares a namespace for the document element time and all of its children using the special attribute xmlns and the URI value http://www.wyeast.net/time.
Example 1-9. namespace.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- a time instant -->
<time timezone="PST" xmlns="http://www.wyeast.net/time">
<hour>11</hour>
<minute>59</minute>
<second>59</second>
<meridiem>p.m.</meridiem>
<atomic signal="true"/>
</time>
You can also use a prefix with a namespace instead of a default namespace declaration. This is shown in prefix.xml (Example 1-10), which associates the prefix tz with namespace URI http://www.wyeast.net/time using the xmlns:tz attribute. Any child element of tz:time that doesn’t use the prefix will not be in the http://www.wyeast.net/time namespace.
Example 1-10. prefix.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- a time instant -->
<tz:time timezone="PST" xmlns:tz="http://www.wyeast.net/time">
<tz:hour>11</tz:hour>
<tz:minute>59</tz:minute>
<tz:second>59</tz:second>
<tz:meridiem>p.m.</tz:meridiem>
<tz:atomic tz:signal="true"/>
</tz:time>
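To see how namespaces help avoid name collisions, consider this sketch of a document that combines two vocabularies, each with its own time element. The schedule vocabulary and its namespace URI (http://example.com/ns/schedule) are hypothetical, invented only for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The default namespace is a hypothetical URI used only for illustration -->
<schedule xmlns="http://example.com/ns/schedule"
          xmlns:tz="http://www.wyeast.net/time">
 <!-- This time element belongs to the schedule vocabulary -->
 <time>lunch</time>
 <!-- This one belongs to the http://www.wyeast.net/time vocabulary -->
 <tz:time timezone="PST">
  <tz:hour>11</tz:hour>
  <tz:minute>59</tz:minute>
  <tz:second>59</tz:second>
  <tz:meridiem>p.m.</tz:meridiem>
  <tz:atomic tz:signal="true"/>
 </tz:time>
</schedule>
```

A namespace-aware processor treats the two time elements as different names because each is paired with a different namespace URI, even though their local names are identical.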
[Hack #59] goes into more depth about namespaces in XML.
XML Schema primer: http://www.w3.org/TR/xmlschema-0/
RELAX NG tutorial: http://www.oasis-open.org/committees/relax-ng/tutorial.html
Eddie Robertsson’s introduction to Schematron: http://www.xml.com/pub/a/2003/11/12/schematron.html
A presentation on namespaces by Simon St.Laurent: http://simonstl.com/articles/namespaces/