Choose Tools for Creating an XML Vocabulary

XML provides the syntax necessary to create your own vocabulary or dialect of XML. Here are a few things you need to know about namespaces and schemas.

One of the best things about XML is that you can create your own tags—a vocabulary or dialect—if you want. To create a vocabulary, you should understand a couple of things about schemas and namespaces. You can use XML without schemas or namespaces, but sometimes you want to use one, the other, or both. This hack explains when you’ll want to use schemas and namespaces and when you’ll want to avoid them.

Well-Formedness, Validation, and Schemas

XML documents must be well-formed. This means that they must adhere to the syntax defined in the XML specification (http://www.w3.org/TR/REC-xml/). This syntax mandates such things as matching case in tag names, matching quotes around attribute values, restrictions on what Unicode characters may be used, and so on.
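As a rough illustration of what a well-formedness check catches, the sketch below feeds three small strings to the parser in Python's standard library. The helper name is_well_formed and the sample strings are made up for this example; any conforming XML parser should reject the same two broken inputs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample inputs, loosely based on the time element used in this hack.
GOOD = '<time timezone="PST"><hour>11</hour></time>'
BAD_CASE = '<time timezone="PST"><hour>11</Hour></time>'      # end tag differs in case
BAD_QUOTES = "<time timezone=\"PST'><hour>11</hour></time>"   # mismatched quotes

print(is_well_formed(GOOD), is_well_formed(BAD_CASE), is_well_formed(BAD_QUOTES))
# True False False
```

Note that this only tests syntax. A document can pass this check and still fail validation against a schema.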

An XML document may also be valid. This means that such a document must conform to the restrictions laid out in an associated schema. Basically, a schema declares or defines what elements and attributes are allowed in a valid instance, including in what order the elements may appear. Governing document layout with schemas can greatly increase the reliability, consistency, and accuracy of exchanged documents.

DTD

The native schema language of XML is the document type definition or DTD [Hack #68], which is part of the XML specification and which XML inherited, in simplified form, from SGML. The document valid.xml in Example 1-7 uses a document type declaration (the line beginning with <!DOCTYPE) to associate a DTD with itself.

Example 1-7. valid.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time SYSTEM "time.dtd">
  
<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>

The document type declaration states that the document element for valid.xml is the time element and that it is an instance of the DTD time.dtd. SYSTEM indicates that the DTD will be found at the location given by the filename that follows, which is resolved relative to the location of valid.xml (here, the same directory). The simple DTD time.dtd is shown in Example 1-8.

Example 1-8. time.dtd

<!ELEMENT time (hour,minute,second,meridiem,atomic)>
<!ATTLIST time timezone CDATA #REQUIRED>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT minute (#PCDATA)>
<!ELEMENT second (#PCDATA)>
<!ELEMENT meridiem (#PCDATA)>
<!ELEMENT atomic EMPTY>
<!ATTLIST atomic signal CDATA #REQUIRED>

The DTD is not written in XML syntax: it has its own structural rules. This DTD uses what are called markup declarations for elements and attributes to spell out how the elements and attributes should appear in an instance. For example, the element declaration on line 1 indicates that the time element will contain only child elements, and that exactly one occurrence of each of these child elements will appear in the exact order hour, minute, second, meridiem, and atomic.

The element declaration on line 3 tells the XML processor that the hour element will contain parsed character data or text (the same goes for the minute, second, and meridiem elements declared on lines 4, 5, and 6). The atomic element is declared empty on line 7 (no content).

Two attributes are declared on lines 2 and 8. These attribute-list declarations, probably so called because you can list more than one attribute at a time, first name the element that is linked to the attribute (time with timezone, atomic with signal), then the attribute itself, and then the kind of value allowed for the attribute (CDATA is text, basically). Finally, a token is given that indicates that the attribute is required and must appear on the element (#REQUIRED).
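To make the rules in time.dtd concrete, here is a hand-written sketch that checks the same constraints in Python. It is not a DTD validator, and the standard-library parser used here does not validate against DTDs. The names CHILD_ORDER, REQUIRED_ATTRS, and check_time are invented for this example, and the constraints are copied by hand from Example 1-8.

```python
import xml.etree.ElementTree as ET

# Constraints transcribed by hand from time.dtd (Example 1-8).
CHILD_ORDER = ["hour", "minute", "second", "meridiem", "atomic"]
REQUIRED_ATTRS = {"time": ["timezone"], "atomic": ["signal"]}


def check_time(text):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(text)
    problems = []
    if root.tag != "time":
        problems.append(f"document element is {root.tag!r}, expected 'time'")
    # The content model requires exactly these children, once each, in this order.
    children = [child.tag for child in root]
    if children != CHILD_ORDER:
        problems.append(f"children {children} do not match {CHILD_ORDER}")
    # #REQUIRED attributes must be present on their elements.
    for elem in root.iter():
        for attr in REQUIRED_ATTRS.get(elem.tag, []):
            if attr not in elem.attrib:
                problems.append(f"<{elem.tag}> is missing required attribute {attr!r}")
    # EMPTY means no child elements and no text at all.
    atomic = root.find("atomic")
    if atomic is not None and (len(atomic) or atomic.text):
        problems.append("<atomic> must be empty")
    return problems


VALID = (
    '<time timezone="PST"><hour>11</hour><minute>59</minute>'
    '<second>59</second><meridiem>p.m.</meridiem><atomic signal="true"/></time>'
)
print(check_time(VALID))  # prints []
```

A real validating parser derives these checks from the DTD itself, so nothing has to be kept in sync by hand. The sketch only shows the kind of question a validator asks.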

Other schema languages

XML Schema [Hack #69] was developed by the W3C, reaching recommendation status in May 2001 (http://www.w3.org/XML/Schema). Written in XML, it is a grammar-based schema language that aims to provide more expressive power than DTDs, which it succeeds in doing to a degree. One of the most popular features of XML Schema is its extensive set of datatypes (http://www.w3.org/TR/xmlschema-2/). DTDs offer fewer than 10 types, and for attributes only, but XML Schema provides a broad range of standard types (string, date, boolean, integer, and byte, to name a few) for both elements and attributes. XML Schema’s structures recommendation (http://www.w3.org/TR/xmlschema-1), where its elements and attributes are specified, is long and very complex, even somewhat obfuscated, and mournfully so. It is widely used because of the W3C imprimatur, though other schema languages seem more popular in certain circles. Take RELAX NG [Hack #72], for example.

RELAX NG (http://www.relaxng.org and http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=relax-ng) is also a grammar-based schema language, and was developed by James Clark and Murata Makoto at OASIS (http://www.oasis-open.org). It is a remarkably intuitive language that is easy to grasp yet has sound mathematical underpinnings, which makes it very popular with users and developers alike. It has great expressive power; for example, you can do things like validate interleaved elements that can appear in any order. It’s modular, too: it implements XML Schema’s datatypes, for instance, though you can implement your own datatypes if you like. RELAX NG has recently become an ISO standard as part 2 of the Document Schema Definition Language or DSDL (ISO/IEC 19757-2:2003 Information technology—Document Schema Definition Language (DSDL)—Part 2: Regular-grammar-based validation—RELAX NG). Search for DSDL at http://www.iso.ch.

Unlike its grammar-based cousins, Schematron [Hack #77] (http://www.schematron.com) is an assertion-based language that works well with other schema languages. As its creator Rick Jelliffe has said, it’s the feather duster that reaches into the corners of documents that other languages can’t reach. Assertions are expressed as paths, and reference implementations for Schematron are written in XSLT, a natural language for analyzing paths. Along with RELAX NG, Schematron is being standardized as part of ISO’s DSDL.
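The assertion-based idea can be sketched without Schematron itself. The Python below is not Schematron and does not use XSLT. It pairs a path that selects context elements with a test and a message, which is roughly the shape of an assertion. The rule set, the RULES and run_assertions names, and the specific value ranges are assumptions made up for this example, and the code assumes the document has already passed a structural check such as the DTD in Example 1-8.

```python
import xml.etree.ElementTree as ET

# Each rule: (path selecting context elements, test on that element, message).
# These particular rules are illustrative assumptions, not part of any real schema.
RULES = [
    (".", lambda t: 1 <= int(t.findtext("hour")) <= 12,
     "hour must be 1-12 when a meridiem is given"),
    (".", lambda t: 0 <= int(t.findtext("minute")) <= 59,
     "minute must be 0-59"),
    ("atomic", lambda a: a.get("signal") in ("true", "false"),
     "signal must be 'true' or 'false'"),
]


def run_assertions(text):
    """Return the message of every assertion that fails for the document."""
    root = ET.fromstring(text)
    failures = []
    for path, test, message in RULES:
        for context in root.findall(path):
            if not test(context):
                failures.append(message)
    return failures


DOC = (
    '<time timezone="PST"><hour>13</hour><minute>59</minute>'
    '<second>59</second><meridiem>p.m.</meridiem><atomic signal="true"/></time>'
)
print(run_assertions(DOC))
```

Constraints like these relate values to each other or to a range, which a DTD content model cannot express. That is the kind of corner an assertion language is meant to reach.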

There are other schema languages for XML, but DTDs, XML Schema, RELAX NG, and Schematron are the most popular.

Namespaces

Namespaces in XML provide a way to disambiguate names in XML documents, thus helping avoid the collision of names when two or more vocabularies are combined in a document. The following document, namespace.xml (Example 1-9), shows a default namespace declaration, which declares a namespace for the document element time and all of its children using the special attribute xmlns and the URI value http://www.wyeast.net/time.

Example 1-9. namespace.xml

<?xml version="1.0" encoding="UTF-8"?>
    
<!-- a time instant -->
<time timezone="PST" xmlns="http://www.wyeast.net/time">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>

You can also use a prefix with a namespace instead of a default namespace declaration. This is shown in prefix.xml (Example 1-10), which associates the prefix tz with namespace URI http://www.wyeast.net/time using the xmlns:tz attribute. Any child element of tz:time that doesn’t use the prefix will not be in the http://www.wyeast.net/time namespace.

Example 1-10. prefix.xml

<?xml version="1.0" encoding="UTF-8"?>
    
<!-- a time instant -->
<tz:time timezone="PST" xmlns:tz="http://www.wyeast.net/time">
 <tz:hour>11</tz:hour>
 <tz:minute>59</tz:minute>
 <tz:second>59</tz:second>
 <tz:meridiem>p.m.</tz:meridiem>
 <tz:atomic tz:signal="true"/>
</tz:time>
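One way to see what a namespace-aware parser makes of these two styles is to print the names it reports. The sketch below uses Python's standard library, which writes a namespaced name as {URI}local. The two documents are shortened variants of Examples 1-9 and 1-10, and the unprefixed minute element in the second one is added here on purpose to show a child that falls outside the namespace.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wyeast.net/time"

# Shortened variant of namespace.xml: default namespace declaration.
default_doc = f'<time timezone="PST" xmlns="{NS}"><hour>11</hour></time>'

# Shortened variant of prefix.xml, with one deliberately unprefixed child.
prefixed_doc = (
    f'<tz:time timezone="PST" xmlns:tz="{NS}">'
    '<tz:hour>11</tz:hour><minute>59</minute>'
    '<tz:atomic tz:signal="true"/></tz:time>'
)

d = ET.fromstring(default_doc)
p = ET.fromstring(prefixed_doc)

# Both styles give the document element the same expanded name.
print(d.tag == p.tag)

# With a default declaration, unprefixed children inherit the namespace.
print(d[0].tag)

# With a prefix declaration, an unprefixed child is in no namespace.
print([child.tag for child in p])

# Unprefixed attributes are in no namespace; prefixed ones are.
print(p.attrib, p[2].attrib)
```

The prefix itself carries no meaning. Only the URI it is bound to matters, which is why the two document elements compare as equal.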

[Hack #59] goes into more depth about namespaces in XML.
