RELAX NG

By Eric van der Vlist

Table of Contents

Chapter 1: What RELAX NG Offers
RELAX NG emerged from many years of XML development in an attempt to solve a variety of common problems raised in the creation and sharing of XML vocabularies. RELAX NG is not the only option for solving many of these problems, but the way in which it addresses them makes it an excellent candidate for many kinds of XML vocabulary development and processing.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Diversity
I have heard people jest that XML stood for Excellent Marketing Language and I often felt that, unfortunately, this had become a very accurate definition. Nevertheless, the official meaning of XML is Extensible Markup Language, which remains slightly more accurate.
XML is extensible in the sense that it lets you define your own sets of elements and attributes which can be used to express virtually any hierarchical structure. The extensibility of XML has been widely used; some would even say overused. I've long since lost count of the different sets of XML elements and attributes (let's call them XML vocabularies) used by different people for different applications. Applications need to be able to tell whether documents conform to their expectations, which creates a need for validation tools capable of representing and testing each of these vocabularies.
Keeping Documents Independent of Applications
In the XML world, XML documents can live their own lives independently of programs: they can be edited, read, displayed, and transformed using generic tools independent of any particular application. It's also vitally important that they can be validated independently of any application. This validation requirement presents a serious challenge. The diversity of XML vocabularies is virtually infinite. We certainly don't want to limit XML's extensibility because of the tools used to validate XML documents. But that brings us to the next problem: there is diversity in what we can call validation.
This application independence raises some difficult issues in XML design and usage. Some people have focused on the surface parallels between XML document structures and object hierarchies. They say that XML is in the same paradigm for data as object orientation and that XML is a perfect serialization format for object systems. While that assessment is not completely without basis, XML reintroduces a clean separation between data and processing. This is the complete opposite of the basic object-oriented principle of encapsulating both data and behavior into objects.
Validation Has Many Aspects
Validation can be about checking the structure of XML documents. It can be about checking the content of each text node and attribute independently of each other (datatype checking). It can be about checking constraints on relationships between nodes. It can be about checking constraints between nodes and external information such as lookup tables or links. It can be about checking business rules. Taken liberally, it can be almost anything else, even spell checking.
All of these aspects are important for improving the level of quality of XML-based information systems. I recently heard two presentations about two independent projects in very different domains. Both came out with this alarming ratio: one out of ten real-world XML documents contains errors. With such a high proportion, validation is not only useful but indispensable! The word "alarming" is not overstating the case—imagine a banking system where 10% of the transactions contain errors. Calling validation important, therefore, is an understatement.
The Best Way to Validate XML Document Structures
RELAX NG won't solve all issues by itself. It isn't designed to solve every conceivable validation problem. RELAX NG is, however, designed to be the best tool to solve two key pieces of the problem: validating the structure of XML documents and providing a connection to datatype libraries that validate the content of text nodes and attributes. It's also designed to be used as a part of the ISO DSDL framework, which deals with the larger issues surrounding validation. (DSDL is described in Appendix A.)
This tight focus makes RELAX NG very different from its main rival, W3C XML Schema. One of the reasons for the complexity of W3C XML Schema is that it includes many features that have been kept out of RELAX NG. W3C XML Schema cares not only about validating the structure of XML documents, but also about validating the content of text nodes and attributes and checking the integrity between keys and references. More importantly, W3C XML Schema addresses many issues beyond validation. It attempts to be a modeling language that can classify the elements and attributes of XML documents, identify their semantics, use these semantics as extensible object-like models, and perform automatic binding between XML documents and objects. All these goals are admirable, but too many of them are stuffed into a single technology.
During the development of RELAX NG, XML structure validation remained the focus. No compromises were made in deference to other features. The result is that RELAX NG appears to be the logical successor of XML DTDs and the best tool available to validate the structure of XML documents. RELAX NG's expressive power is such that virtually any XML vocabulary may be described with RELAX NG. That isn't true of W3C XML Schema, nor of DTDs. Perhaps most important for people who have to write schemas, RELAX NG is also very simple: because it does less, the syntax is intuitive. It has been kept simple. It isn't cluttered with complex limitations that take too much time to learn and remember.
RELAX NG's Diverse Applications
RELAX NG's tight focus doesn't mean that RELAX NG is a niche language meant to be limited to its original goal. RELAX NG may well follow the path of XSLT (also developed by James Clark). While XSLT's development was focused strictly on document transformation for formatting, it has become the Swiss Army knife of XML developers. XSLT use has gone well beyond its expected boundaries, in large part because it solves key problems effectively.
The same will likely happen with RELAX NG.
Recently, I had to write a converter for a flat, non-XML format into XML. The structure of the resulting document was described by a non-RELAX NG schema. After various hacks to map the 400 different bits of information of this flat structure into elements and attributes, I found that the easiest way to map them to XML was by using a RELAX NG schema.
I transformed the schema of the destination XML vocabulary into a simple RELAX NG schema. A Python program then walked through that structure, parsing the flat document and dispatching the information items to where they belonged. This was made easy by the uncluttered simplicity of the syntax of RELAX NG. The process would have taken much more time with any other schema language.
Another example is taken from RELAX NG itself. As you will discover in Chapter 4, a non-XML compact syntax is available for RELAX NG. This syntax is defined using an EBNF (Extended Backus-Naur Form) grammar. Knowing James Clark, I was sure he had generated it from XML. When I wrote the reference guide for this syntax (Chapter 18), I asked him to send me the source of this grammar as XML. I was expecting a format like the DocBook EBNF module, but instead, of course, he sent a RELAX NG schema! The syntax of RELAX NG is flexible enough to describe the productions of an EBNF grammar. Chapter 18 was generated using this schema. It's a summary that doesn't completely respect the semantics and restrictions of RELAX NG, but RELAX NG is still a useful way to describe this EBNF.
RELAX NG as a Pivot Format
These last two examples are a little bit extreme, but nevertheless RELAX NG appears to be the perfect pivot format for tasks related to XML schema work in any schema language, providing a useful common ground that developers can use to convert material between various schema forms. Kohsuke Kawaguchi's work on the Sun Multi-Schema Validator (MSV) takes advantage of this capability. Kawaguchi explained that the grammar-based schema languages supported by MSV (DTDs, RELAX NG, RELAX, and W3C XML Schema) were all translated into a common data model by the validator. The validation algorithm relied on this single data model. That data model is simply RELAX NG. This clearly demonstrates that the expressive power of RELAX NG is so useful and flexible that 99% of the constraints that can be described with other schema languages can be described with RELAX NG.
RELAX NG's advantages can also be a major drawback: if RELAX NG has so much more expressive power than other languages, it could mean that a schema written with RELAX NG would be impossible to translate.
Fortunately, this issue is more theoretical than practical. Although there are situations in which RELAX NG can't be translated into W3C XML Schema, they aren't likely to happen often in real-life schemas. If you can imagine a situation in which it would happen in real life, you can always balance your need to express such a schema in RELAX NG against your need to be able to publish a W3C XML Schema schema. I am confident that most RELAX NG schemas can be translated into other schema languages—even automatically. James Clark has developed Trang, a magic tool that takes a RELAX NG schema and converts it into W3C XML Schema or a DTD (http://www.thaiopensource.com/relaxng/trang.html).
RELAX NG's structures support both creation by hand and auto-generation. RELAX NG can support the growing number of applications that generate their schemas from logical models using high levels of abstraction rather than creating them from scratch. Whether your design tool is UML, a simple spreadsheet (as in the OASIS UBL project), or a set of sample documents (as with my Examplotron), it's easier to derive a RELAX NG schema than to derive a schema using any other schema language.
Why Use Other Schema Languages?
There are tools to convert to and from other schema languages; however, RELAX NG is easier to write, easier to generate, and easier for applications to use. As far as validation is concerned, I see no good reason to use another tool. Even if your tools support only XML 1.0 DTDs or W3C XML Schema, you can automatically generate those formats from RELAX NG.
RELAX NG is still a little bit behind, however, in datatype assignment and data binding. Datatype assignment appears to be increasingly important for a whole set of applications, including many new features of the XPath 2.0, XSLT 2.0, and XQuery 1.0 family of future W3C recommendations. Because datatype assignment was out of scope for RELAX NG during its development, RELAX NG is very permissive about nondeterministic schemas. This permissiveness can lead to unpredictable type assignment during processing. This is something worth keeping in mind when writing RELAX NG schemas that will later be transformed into W3C XML Schema schemas. I will explain this subject in detail in Chapter 16.
Chapter 2: Simple Foundations Are Beautiful
RELAX NG is built using a set of simple pieces. Before proceeding into the details of how RELAX NG assembles these pieces, it's worth exploring what these pieces are and what they'll contribute.
Documents and Infosets
RELAX NG is an XML-based technology. RELAX NG schemas are commonly stored in XML documents (called schema documents) and used to validate other XML documents (called instance documents). While RELAX NG works with and uses XML documents, RELAX NG processors operate at a slightly higher level of abstraction, called an infoset, rather than processing the actual text of the XML document, which is called lexical processing.
An infoset is a logical view of the XML document, rather than the document as stored in a text file. Most XML processors read (or generate) XML syntax but work internally on a representation that omits a lot of details. To take a brief example, from a lexical perspective, which looks at the actual contents of an XML document, <book id='b0836217462' available="true"/> is an empty tag containing two attributes named id and available. The value of id is delimited with single quotes, while the value of available is delimited with double quotes. Yet, from an infoset perspective, this isn't an empty tag with particular syntax; the kind of quotation marks doesn't matter. It's a book element with an attribute named id and a value of b0836217462, as well as an attribute named available with a value of true. Elements, attributes, and text are often referred to as nodes in this perspective, like nodes in an object tree.
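The difference between the lexical view and the infoset view can be sketched with Python's standard xml.etree.ElementTree parser. This is only an illustration, not part of RELAX NG: ElementTree exposes its own tree model rather than the W3C Infoset, and the helper name infoset_view is hypothetical. Two spellings of the book element that differ in quoting, attribute order, and empty-tag syntax yield the same parsed view.

```python
import xml.etree.ElementTree as ET

# Two lexically different spellings of the same book element.
lexical_a = """<book id='b0836217462' available="true"/>"""
lexical_b = """<book available="true" id="b0836217462"></book>"""


def infoset_view(xml_text):
    """Return what the parser keeps: name, attributes, text, child count.

    Hypothetical helper for illustration; quote style, attribute order,
    and empty-tag syntax are not part of the returned view.
    """
    element = ET.fromstring(xml_text)
    return (element.tag, dict(element.attrib), element.text or "", len(element))


view_a = infoset_view(lexical_a)
view_b = infoset_view(lexical_b)
same_view = view_a == view_b  # the lexical differences have disappeared
```

A schema processor working at this level never sees which quotation marks were used, which is why a schema cannot constrain them.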
There are a variety of different models for XML documents—specifications such as the Simple API for XML (SAX), the Document Object Model (DOM), and XPath all have slightly different takes on what an infoset is. As a first step toward coordinating these perspectives, the W3C created a Recommendation: the XML Information Set (Infoset), which is available at http://www.w3.org/TR/xml-infoset/. The XML Infoset defines an abstract model of XML documents that uses a hierarchical structure described in terms generic and neutral enough to be acceptable for use with a diverse range of specifications.
Different Types of Schema Languages
While the different schema languages all operate on infoset views of documents, they have chosen different ways of defining constraints:
  • Constraints may be expressed as rules. In Schematron, for instance, a schema is a set of rules like "the element named book must have an attribute named id and this attribute's content must match this specific rule...."
  • Constraints may be expressed as a thorough description of each element and attribute like DTDs and W3C XML Schema: "it's an element named book, and it has two attributes named id and available, which look like this...."
  • Constraints may be expressed as patterns. Patterns are used to match the structures of permissible elements, attributes, and text nodes, much as the regular expressions used in programming can be used to match characters in text. I will cover this third way of defining constraints in detail in this book because this is the method that RELAX NG uses.
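The analogy between patterns and regular expressions can be made concrete with a rough Python sketch. It is not how a RELAX NG processor works: it flattens the child element names of a book element into a string and matches that string with an ordinary regular expression. The content model used here (one isbn, one title, one or more author, zero or more character) and the names BOOK_CHILDREN, child_names, and children_match are assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content model: isbn, title, one or more author, zero or more character.
BOOK_CHILDREN = re.compile(r"isbn title (author )+(character )*")


def child_names(xml_text):
    """Flatten the child element names into a space-terminated string."""
    return "".join(child.tag + " " for child in ET.fromstring(xml_text))


def children_match(xml_text):
    """True when the sequence of child names matches the assumed model."""
    return BOOK_CHILDREN.fullmatch(child_names(xml_text)) is not None


in_order = "<book><isbn/><title/><author/><character/><character/></book>"
out_of_order = "<book><title/><isbn/><author/></book>"

in_order_ok = children_match(in_order)
out_of_order_ok = children_match(out_of_order)
```

A real pattern goes further than this sketch: it also describes attributes, text, and the content of each child, which a flat regular expression over names cannot do.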
The first XML schema language was the Document Type Definition (DTD), which was part of XML 1.0. DTDs provide more than just schema validation features—they include the definition of internal and external entities—but their schema features focus on describing elements. Every element and attribute used by the document type defined by the DTD must be described. Each element must have a content model, identifying which child elements or text nodes are allowed, as well as a list of permissible attributes, if any attributes are allowed. To avoid redundant declarations, DTD developers may use parameter entities, which describe larger pieces of content models and work like a kind of macro processing.
W3C XML Schema extends this foundation and defines several kinds of components, including elements, attributes, datatypes, groups of elements, and groups of attributes. (Datatypes are containers for various kinds of content, from text to integers to dates.) The approach is still very focused on elements and attributes, which are clearly differentiated.
A Simple Example
Let's take a look at an example. Figure 2-1 shows the book element with its two attributes and four different subelements:
Figure 2-1: A complete example of the book element
With a DTD and, to a lesser extent, with W3C XML Schema, you are stuck defining lists of attributes and elements you can't mix or combine. W3C XML Schema has introduced the concept of types, abstract descriptions that have no direct counterpart in the contents of XML documents. Types provide descriptions of the contents of elements or attributes, but types still can't be freely combined together. This means that you can split the description of elements into blocks such as those shown in Figure 2-2, but you can mix the blocks only in a limited number of ways.
Figure 2-2: The blocks of the book element, seen from a W3C XML Schema perspective
RELAX NG patterns, however, can freely mix different types of nodes (elements, text and attributes). Figure 2-3 shows how, if you want to, you can use RELAX NG to split the definition of the book element into a first pattern composed of the attributes id, title, and author and the element character, and then a second pattern composed of the available attribute and the other character elements.
Figure 2-3: An alternate approach to the document structure, made possible with RELAX NG
The flexibility just demonstrated isn't only useful for combining complex patterns. It also maintains the simplicity desired by RELAX NG schema designers who don't need or want to learn a long list of design limitations that must be checked when they write and combine their schemas.
This generic concept of patterns is powerful enough to replace the specialized containers of DTDs and W3C XML Schema. RELAX NG has no need for (and no notion of) specially reusable components. Elements, attributes, and types are all embedded in patterns. These patterns are the reusable building blocks of RELAX NG. They can be named, reused, and even redefined at will, combined through operators to group them or to provide alternatives among them.
A Strong Mathematical Background
This pattern-based approach is both new and old. It's new in the sense that the idea of patterns has been applied to XML in RELAX and now in RELAX NG. It's old because it is the adaptation of techniques and theories developed around regular expressions in the 1960s. The name "RELAX," which stands for REgular LAnguage for XML, suggests this related nature. ("NG" stands for New Generation.) RELAX NG relies on both the strong mathematical theory underlying regular expressions and on additional work done by Murata Makoto, which adapts the mathematical concept of "hedges" to XML.
When I asked Murata Makoto, one of the fathers of RELAX NG, my first questions, he kindly pointed me to the details of his work. I was shocked to see that I had forgotten all the mathematics I had learned at school. I couldn't understand a word of it. Fortunately, I can assure you that you won't need to understand hedges or any of the other math behind RELAX NG. Nevertheless, it's very comforting to know that the schema language you are using has an elegant mathematical background. It ensures that the design will work, and work well. While the math behind it is difficult, the results it produces are surprisingly intuitive.
In keeping with its mathematical foundation, RELAX NG patterns are defined as logical operations performed on sets of XML structures. This gives the specification a formalism that removes any possibility of ambiguous interpretation. The lack of ambiguity is incredibly helpful for ensuring the interoperability of different implementations of RELAX NG.
The strong mathematical background of RELAX NG didn't mean that everything needed to be reinvented for RELAX NG implementers. On the contrary, the derivative algorithm used by James Clark in his Jing RELAX NG processor was inspired by work done in 1964 on the derivation of regular expressions. It recursively removes the nodes found in the instance documents from the patterns: the document is valid if the patterns left after the last node are all optional.
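The idea of derivatives can be sketched in a few lines of Python. This is a minimal illustration of the classic derivative technique for regular expressions, applied to a flat sequence of element names. It is not Jing's implementation, and it ignores attributes, text, nesting, and interleaving. All class and function names (Token, Group, Choice, ZeroOrMore, nullable, derivative, matches) are hypothetical.

```python
from dataclasses import dataclass


class Pattern:
    """Base class for the toy patterns below."""


@dataclass(frozen=True)
class Empty(Pattern):        # matches the empty sequence
    pass


@dataclass(frozen=True)
class NotAllowed(Pattern):   # matches nothing at all
    pass


@dataclass(frozen=True)
class Token(Pattern):        # matches exactly one node with this name
    name: str


@dataclass(frozen=True)
class Group(Pattern):        # first, then second
    first: Pattern
    second: Pattern


@dataclass(frozen=True)
class Choice(Pattern):       # left or right
    left: Pattern
    right: Pattern


@dataclass(frozen=True)
class ZeroOrMore(Pattern):   # item repeated any number of times
    item: Pattern


def nullable(p):
    """True when everything left in the pattern is optional."""
    if isinstance(p, (Empty, ZeroOrMore)):
        return True
    if isinstance(p, Group):
        return nullable(p.first) and nullable(p.second)
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    return False  # Token and NotAllowed


def derivative(p, name):
    """Return the pattern that remains after consuming one node."""
    if isinstance(p, Token):
        return Empty() if p.name == name else NotAllowed()
    if isinstance(p, Choice):
        return Choice(derivative(p.left, name), derivative(p.right, name))
    if isinstance(p, Group):
        rest = Group(derivative(p.first, name), p.second)
        if nullable(p.first):  # the first part may be skipped entirely
            return Choice(rest, derivative(p.second, name))
        return rest
    if isinstance(p, ZeroOrMore):
        return Group(derivative(p.item, name), p)
    return NotAllowed()  # Empty and NotAllowed cannot consume a node


def matches(p, names):
    """Consume the nodes one by one, then check what is left."""
    for name in names:
        p = derivative(p, name)
    return nullable(p)


def one_or_more(p):
    return Group(p, ZeroOrMore(p))


# Assumed content model: isbn, title, one or more author, zero or more character.
BOOK = Group(Token("isbn"),
             Group(Token("title"),
                   Group(one_or_more(Token("author")),
                         ZeroOrMore(Token("character")))))

complete = matches(BOOK, ["isbn", "title", "author", "character"])
missing_author = matches(BOOK, ["isbn", "title"])
```

Here complete is True because only the optional character repetition remains after the last node, while missing_author is False because a mandatory author is still pending. A production implementation would also simplify the intermediate patterns so that they do not grow with each step.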
Patterns, and Only Patterns
In science, strong theories tend to be simple, yet have almost infinite potential for complexity in application. RELAX NG is, because of its simplicity, one of those theories that is easy to explain, easy to implement, and generic and flexible enough to meet the most stringent requirements.
I'll present the RELAX NG patterns throughout this book, but I'd like to make a brief introduction here. In RELAX NG, there are three basic patterns that match the three types of XML nodes:
  • Text nodes
  • Elements
  • Attributes
These basic patterns can be combined into ordered or nonordered groups and used in choices defining alternatives among several patterns. The cardinality of a pattern (i.e., the number of times it can appear in an instance document) can also be controlled. Text nodes can also be constrained as data, which can be limited to particular datatypes and possibly be split into list items. Lastly, a whole set of features supports the creation of reusable libraries of patterns. Similar to patterns, name classes define sets of elements and attributes that can be used to open a schema and control where elements and attributes with unknown names may be included in the instance documents.
Some of these features have been defined to facilitate the work of writing RELAX NG schemas and go beyond the basic (sometimes called "atomic") patterns. To avoid complicating the basic model with these convenience features, the RELAX NG specification describes a simplification algorithm. This algorithm is used internally by RELAX NG processors to transform a full schema into a simpler form with fewer and simpler patterns. This algorithm is presented in Chapter 15.
RELAX NG doesn't pay attention to XML processing instructions and comments.
Chapter 3: First Schema
Throughout the book, we will work with variations of a document that describes a library. For a first project, we will map this document to the RELAX NG constructs that make up your first RELAX NG schema.
Getting Started
Example 3-1 shows the instance document used throughout the book as a foundation for RELAX NG experimentation and development.
Example 3-1. Sample instance document
<?xml version="1.0"?>
 <library>
  <book id="b0836217462" available="true">
   <isbn>0836217462</isbn>
   <title xml:lang="en">Being a Dog Is a Full-Time Job</title>
   <author id="CMS">
    <name>Charles M Schulz</name>
    <born>1922-11-26</born>
    <died>2000-02-12</died>
   </author>
   <character id="PP">
    <name>Peppermint Patty</name>
    <born>1966-08-22</born>
    <qualification>bold, brash and tomboyish</qualification>
   </character>
   <character id="Snoopy">
    <name>Snoopy</name>
    <born>1950-10-04</born>
    <qualification>extroverted beagle</qualification>
   </character>
   <character id="Schroeder">
    <name>Schroeder</name>
    <born>1951-05-30</born>
    <qualification>brought classical music to the Peanuts strip</qualification>
   </character>
   <character id="Lucy">
    <name>Lucy</name>
    <born>1952-03-03</born>
    <qualification>bossy, crabby and selfish</qualification>
   </character>
  </book>
 </library>
First Patterns
In plain English, the document shown in Example 3-1 can be described as having:
  • One library element composed of:
    • One or more book elements having:
      • An id attribute and an available attribute
      • An isbn element composed of text
      • A title element with an xml:lang attribute and a text node
      • One or more author elements with:
        • An id attribute
        • A name element
        • An optional born element
        • An optional died element
      • Zero or more character elements with:
        • An id attribute
        • A name element
        • An optional born element
        • A qualification element
The good news—and what makes RELAX NG so easy to learn—is that in its simplest form, RELAX NG is pretty much a way to formalize the previous statements with simple matching rules. Terms described in the plain English description have matching terms in the RELAX NG Schema document that look a lot like XML:
Complete Schema
You now have all the patterns needed to write a full schema that expresses what we've discussed about this example:
 <?xml version = '1.0' encoding = 'utf-8' ?>
 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library">
  <oneOrMore>
   <element name="book">
    <attribute name="id"/>
    <attribute name="available"/>
    <element name="isbn">
     <text/>
    </element>
    <element name="title">
     <attribute name="xml:lang"/>
     <text/>
    </element>
    <oneOrMore>
     <element name="author">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <optional>
       <element name="born">
        <text/>
       </element>
      </optional>
      <optional>
       <element name="died">
        <text/>
       </element>
      </optional>
     </element>
    </oneOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <optional>
       <element name="born">
        <text/>
       </element>
      </optional>
      <element name="qualification">
       <text/>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>
RELAX NG directly supports four kinds of occurrence constraints on nodes: a node may appear exactly once (the default), optionally, zero or more times, or one or more times. These are the most common cases in document design. If an application needs a finer level of control, that can be achieved by combining these four basic occurrence constraints. If, for instance, each book's description should have between two and six character elements, you can write the definition as two mandatory character elements followed by four optional ones:
<!-- 1 -->
<element name="character">
 <attribute name="id"/>
 <element name="name">
  <text/>
 </element>
 <optional>
  <element name="born">
   <text/>
  </element>
 </optional>
 <element name="qualification">
  <text/>
 </element>
</element>
<!-- 2 -->
<element name="character">
 .../...
</element>
<!-- 3 -->
<optional>
 <element name="character">
  .../...
 </element>
</optional>
<!-- 4 -->
<optional>
 <element name="character">
  .../...
 </element>
</optional>
<!-- 5 -->
<optional>
 <element name="character">
  .../...
 </element>
</optional>
<!-- 6 -->
<optional>
 <element name="character">
  .../...
 </element>
</optional>
Chapter 4: Introducing the Compact Syntax
Although the schema shown in Chapter 3 is simple, its XML representation is rather verbose. This is neither surprising nor uncommon for XML vocabularies. In fact it conforms to the basic principles of XML; the W3C Recommendation's design goals state that "XML documents should be human-legible and reasonably clear" and that "terseness in XML markup is of minimal importance." Our schema is a good example of a "human-legible and reasonably clear" document that's definitely not terse!
The principal goal of RELAX NG's XML syntax is to provide a serialization of RELAX NG schemas that can be processed by computers using standard XML toolkits. To make it easier for people to read and write RELAX NG schemas, however, James Clark introduced a second, more concise syntax that is strictly equivalent to the XML syntax: the compact syntax.
RELAX NG processors can support this compact syntax, but they aren't required to do so. If a RELAX NG processor doesn't support the compact syntax, you can translate the XML syntax to and from the compact syntax using existing translators. Because these two forms are strictly equivalent, there's no loss of information during translation. Even comments and annotations (presented in Chapter 13) are preserved in the process.
Syntactical details of XML, such as entity references or processing instructions, are lost when the XML syntax is translated into the compact syntax, but this is a limitation of the XML processing architecture rather than a limitation of RELAX NG itself.
You'll see that the compact syntax is built on a mix of concepts borrowed from the definition of structures in programming languages, notations from XML DTDs, and RELAX NG patterns. Element and attribute patterns look like Java declarations, with their curly brackets preceded by a reserved word (element or attribute) and the name of the element or attribute. Optional, one or more, and zero or more elements or attributes are represented by DTD qualifier suffixes (? for optional, + for one or more, and * for zero or more).
First Compact Patterns
Let's explore how the patterns described in the previous chapter translate into the compact syntax.
text is the simplest pattern in the XML syntax and is the simplest in the compact syntax as well. The text pattern is just:
text
In this definition, the word text identifies the text pattern.
Of course, because both syntaxes are equivalent, all that's been said about text in RELAX NG's XML syntax also applies to text in the compact syntax.
For the compact syntax, the attribute pattern borrows Java's curly brackets:
 attribute id { text }
In this definition, the first word, attribute, identifies the attribute pattern; the second one, id, is the name of the attribute. The curly brackets, {...}, delimit the definition of the content of the attribute.
Because empty curly brackets ({}) look odd and might suggest empty attributes rather than attributes containing a text value, the convention of the XML syntax that makes a text pattern the implicit content for attributes is abandoned in the compact syntax. The content of attributes must be explicitly defined when you're using the compact syntax. In other words, in the compact syntax, the following:
<attribute name="id"/>
translates into:
 attribute id { text }
while this:
attribute id { }
translates into a syntax error.
The compact syntax is position-sensitive, and words such as text and attribute are reserved words only when they appear in the first position. This is very convenient when you need to define attributes (or elements) whose names are the same as reserved words. For instance, you can define attributes named text or even attribute without any special precaution:
attribute text { text } 
attribute attribute { text }
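The same holds for elements. As a minimal sketch, using hypothetical element and attribute names chosen only to exercise the rule (they aren't part of the library vocabulary), the following should also be accepted, because each reserved word that follows the leading element or attribute is read as a name:

```rnc
element element {
  attribute attribute { text },
  element text { text }
}
```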
Full Schema
Now we have all the components needed to convert the full RELAX NG schema from Chapter 3 into its compact syntax form; it's shown in Example 4-1.
Example 4-1. Compact syntax of full RELAX NG schema
element library {
   element book {
     attribute id { text },
     attribute available { text },
     element isbn { text },
     element title {
       attribute xml:lang { text },
       text
     },
     element author {
       attribute id { text },
       element name { text },
       element born { text }?,
       element died { text }?
     }+,
     element character {
       attribute id { text },
       element name { text },
       element born { text }?,
       element qualification { text }
     }*
   }+
}
In the following chapters, I give both the XML and the compact syntax for each example. You'll have plenty of opportunities to get familiar with both.
Don't get confused by the similarities in name between the simple form of a RELAX NG schema, described in Chapter 15, and the compact syntax. These two notions work at different levels: the simple form is the result of simplifications performed internally by RELAX NG processors on the data model of the schema; the compact syntax is a different way to represent or serialize a full RELAX NG document. The data models that result from the parsing of a full RELAX NG schema are thus the same whether the schema is written using the XML or the compact syntax and are simplified into the same simple schema.
XML or Compact?
Figure 4-1 presents both syntaxes side by side. There are two things you'll immediately notice. The compact syntax is much more, well, compact. The XML syntax is, just as you'd expect, XML. It works well with generic XML tools (here a web browser), while the compact syntax isn't XML and must be used with other tools (here the text editor vim with a plug-in that highlights RELAX NG's compact syntax).
Figure 4-1: Comparing the RELAX NG XML syntax with its smaller compact syntax counterpart
These two statements summarize why both syntaxes are needed. The compact syntax is nice to work with, and you'll probably find it more pleasant to use to edit your schemas and to document your vocabularies. On the other hand, the XML syntax is wonderful if you want to generate RELAX NG schemas, as in Chapter 14, or to generate anything out of your RELAX NG schemas using the XML tools covered in Chapter 13. The ability to translate from one syntax to the other without information loss guarantees that you can use either while having access to both.
Chapter 5: Flattening the First Schema
If you look at the structure of the Russian doll-style schema, you'll see that it follows the structure of the instance document it applies to, as shown in Figure 3-1. Writing the first schema has pretty much been limited to inserting text, element, or attribute elements into the schema each time a text node, element, or attribute was encountered in the instance document. This method of creating schemas can be seen as a serialization of the XML infoset (i.e., of the structure available in the document) and could, therefore, be easily automated.
Automated serialization is the principle behind Examplotron, a program described in Chapter 14.
There are a couple of drawbacks to modeling documents with the Russian doll-style schemas, however. First, they aren't modular and therefore become difficult to read and maintain when documents are large or complex. Second, they can't represent recursive (self-referencing) models. (Lists that may themselves contain lists are a common case of this model.)
The lack of modularity can be seen in a document as simple as the first schema, shown in Example 3-1. There's a name element that uses the same model within both the character and author elements.
Figure 5-1 shows how, in the first schema, you need to give the definition of what name means in each context:
Figure 5-1: Two different definitions of name in the same schema
You might think that the extra text won't make a difference, but that's not completely true. The additional verbosity here is innocuous because the definition of the name element is simple. The principle is the same when the definition is complex, however: it requires redundancy, and redundancy makes maintenance of the schema more error-prone. If I need to update the definition of the name element, I'll need to update it as many times as it appears, and each copy gives me more room for mistakes. Common sense applies the same rules to XML schema languages as to any programming language. Limiting repetitive work makes developers happy!
Defining Named Patterns
RELAX NG uses named patterns to address both modularity and recursion. Named patterns are reusable patterns that can be referenced by their name.
In the XML syntax, named patterns are defined using define elements. To define a named pattern that contains the title element, write:
<define name="title-element">
 <element name="title">
  <text/>
 </element>
</define>
The compact syntax uses a construction similar to an assignment in a programming language. The same definition would be written in the compact syntax as:
title-element = element title { text }
You're not limited to embedding a single element or attribute definition in a named pattern. Note that the group shown in Figure 5-2 (an id attribute, a name element, and an optional born element) is present in the same order and with the same definition in both the author and the character elements.
Figure 5-2: Groups of identical attributes on different element types
To define a named pattern for this group, write:
<define name="common-content">
 <attribute name="id"/>
 <element name="name">
  <text/>
 </element>
 <optional>
  <element name="born">
   <text/>
  </element>
 </optional>
</define>
or:
common-content =
  attribute id { text },
  element name { text },
  element born { text }?
Referencing Named Patterns
Defining a named pattern is easy, as shown in the earlier example, but referencing a named pattern rather than defining it again is even simpler.
Using the XML syntax, references to named patterns defined elsewhere in the schema are made with a ref element. For instance, to define the author element, use a reference to the name-element pattern:
 <element name="author">
  <attribute name="id"/>
  <ref name="name-element"/>
  <optional>
   <element name="born">
    <text/>
   </element>
  </optional>
  <optional>
   <element name="died">
    <text/>
   </element>
  </optional>
 </element>
To reference a named pattern in the compact syntax, just use its name directly:
element author {
  attribute id { text },
  name-element,
  element born { text }?,
  element died { text }?
}
The same approach can reference the common-content named pattern:
 <element name="author">
  <ref name="common-content"/>
  <optional>
   <element name="died">
    <text/>
   </element>
  </optional>
 </element>
or:
element author {
  common-content,
  element died { text }?
}
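The character element shares the same group, so it could reference common-content as well. A sketch in the compact syntax, assuming the common-content pattern defined earlier and the character content from the first schema:

```rnc
element character {
  common-content,
  element qualification { text }
}
```

Only the qualification element remains to be spelled out, since character has no died element.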
The grammar and start Elements
In the Russian doll style, the definition of the root element (in this case, the library element) is used as a container for the whole schema. When you define named patterns, you need a container that holds both the named pattern definitions and the definition of the root element. This combination of the root element's definition and the definitions of all the patterns that may be used within it is what RELAX NG calls a grammar, and it is written with the grammar element. When you use a grammar element, RELAX NG requires you to declare the root element or elements explicitly, using a start element. An incomplete skeleton of a schema defining a pattern named name-element would thus be:
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
 <start>
  <element name="library">
   .../...
  </element>
 </start>
 <define name="name-element">
  .../...
 </define>
</grammar>
or, using the compact syntax:
grammar {
  name-element = .../...
  start =
    element library {
      .../...
    }
}
In the compact syntax, the grammar pattern is implicit. You can use it, but it isn't required.
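Because the grammar pattern is implicit, a complete compact schema can consist of nothing but a start definition and named patterns. The following minimal sketch also illustrates the recursive case mentioned at the beginning of this chapter. The list and item elements and the pattern names are hypothetical and aren't part of the library vocabulary:

```rnc
# The root of any valid document is a list element.
start = list-element

# A list holds one or more items.
list-element = element list { item-element+ }

# An item holds text, optionally followed by a nested list;
# the reference back to list-element makes the model recursive.
item-element = element item { text, list-element? }
```

A Russian doll-style schema can't express this model, because the nesting depth of the lists isn't known in advance.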
Assembling the Parts
You have seen the different bits and pieces needed to define and reference patterns. It's time to put them all together and create a complete schema. The first exercise is to define a DTD-like RELAX NG schema that defines each element in its own named pattern.
The full schema might look like this: