RELAX NG
RELAX NG is a powerful schema validation language that builds on earlier work including RELAX and TREX. Like W3C Schema, it uses XML syntax and supports namespaces and data typing. It goes further by integrating attributes into content models, which greatly simplifies the structure of the schema. It offers superior handling of unordered content and supports context-sensitive content models.
In general, it just seems easier to write schemas in RELAX NG than in W3C Schema. The syntax is very clear, with elements like zeroOrMore for specifying optional repeating content. Declarations can contain other declarations, leading to a more natural representation of a document’s structure.
Consider the simple schema in Example 4-7 which models a document type for logging work activity. It’s easy to read this schema and understand the structure of a typical document.
<element name="worklog" xmlns="http://relaxng.org/ns/structure/1.0" xmlns:ann="http://relaxng.org/ns/compatibility/annotations/1.0"> <ann:documentation>A document for logging work activity, broken down into days, and further into tasks.</ann:documentation> <zeroOrMore> <element name="day"> <attribute name="date"> <text/> </attribute> <zeroOrMore> <element name="task"> <element name="description"> <text/> </element> <element name="time-start"> <text/> </element> <element name="time-end"> <text/> </element> </element> </zeroOrMore> </element> </zeroOrMore> </element>
The same thing would look like this as a DTD:
<!ELEMENT worklog (day*)> <!ELEMENT day (task*)> <!ELEMENT task (description, time-start, time-end)> <!ELEMENT description (#PCDATA)> <!ELEMENT time-start (#PCDATA)> <!ELEMENT time-end (#PCDATA)> <!ATTLIST day date CDATA #REQUIRED>
Although the DTD is more compact, it relies on a special syntax that is decidedly not XML-ish. RELAX NG accomplishes the same thing with more readability.
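To make the structure concrete, here is a sketch of an instance document that the worklog schema in Example 4-7 should accept. Only the element and attribute names come from the schema; the dates, times, and descriptions are invented for illustration.

```xml
<!-- Hypothetical instance document; all values are invented. -->
<worklog>
  <day date="2003-06-02">
    <task>
      <description>Review schema examples</description>
      <time-start>09:00</time-start>
      <time-end>10:30</time-end>
    </task>
    <task>
      <description>Write validation notes</description>
      <time-start>10:45</time-start>
      <time-end>12:00</time-end>
    </task>
  </day>
  <!-- A day with no tasks is also allowed, because task sits inside zeroOrMore. -->
  <day date="2003-06-03"/>
</worklog>
```

Note that the schema would reject a day without a date attribute, since the attribute is not wrapped in an optional.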
Tip
RELAX NG also offers a compact syntax that looks somewhat like a DTD but offers all the features of RELAX NG. For a brief introduction, see http://www.xml.com/pub/a/2002/06/19/rng-compact.html. James Clark’s Trang program, available at http://www.thaiopensource.com/relaxng/trang.html, makes it easy to convert between RELAX NG, RELAX NG Compact Syntax, and DTDs, as well as create W3C XML Schema from any of these formats.
The basic component of a RELAX NG schema is a pattern. A pattern denotes any construct that describes the order and types of structure and content. It can be an element declaration, an attribute declaration, character data, or any combination. Elements in the schema are used to group, order, and parameterize these patterns.
Note that any element or attribute in a namespace other than the RELAX NG namespace (http://relaxng.org/ns/structure/1.0) is simply ignored by the parser. That gives us a mechanism for putting in comments or annotations, which explains why I created the ann namespace in the previous example.
Elements
The element construct is used both to declare an element and to establish where the element can appear (when placed inside another element declaration). For example, the following schema declares three elements, report, title, and body, and specifies that the first element contains the other two in the exact order and number that they appear:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"> <text/> </element> <element name="body"> <text/> </element> </element>
Whitespace between these elements is allowed, as it would be for a DTD. The text element, which is always empty, restricts the content of the inner elements to character content.
Repetition
To allow for repeating children, RELAX NG provides two modifier elements, zeroOrMore and oneOrMore. They function like DTD’s star (*) and plus (+) operators, respectively. In this example, the body element has been modified to allow an arbitrary number of para elements:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"> <text/> </element> <element name="body"> <zeroOrMore> <element name="para"> <text/> </element> </zeroOrMore> </element> </element>
Choices
The question mark (?) operator in DTDs means that an element is optional (zero or one in number). In RELAX NG, you can achieve that effect with the optional modifier. For example, this schema allows you to insert an optional authorname element after the title:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"> <text/> </element> <optional> <element name="authorname"> <text/> </element> </optional> <element name="body"> <text/> </element> </element>
It is also useful to offer a choice of elements. Corresponding to DTD’s vertical bar (|) operator is the modifier choice. Here, we require either an authorname or a source element after the title:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"> <text/> </element> <choice> <element name="authorname"> <text/> </element> <element name="source"> <text/> </element> </choice> <element name="body"> <text/> </element> </element>
This declaration combines choice with zeroOrMore to create a container that can have mixed content (text plus elements, in any order):
<element name="paragraph" xmlns="http://relaxng.org/ns/structure/1.0"> <zeroOrMore> <choice> <text/> <element name="emphasis"> <text/> </element> </choice> </zeroOrMore> </element>
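As a quick illustration, a paragraph like the following should satisfy the mixed-content declaration above. The wording is invented; only the paragraph and emphasis names come from the schema.

```xml
<!-- Hypothetical instance: text and emphasis elements freely intermixed. -->
<paragraph>Schemas are <emphasis>very</emphasis> useful, and they are <emphasis>not</emphasis> hard to write.</paragraph>
```

A paragraph containing only text, only emphasis elements, or nothing at all would also match, because the choice is wrapped in zeroOrMore.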
Grouping
For a required sequence of children, you can use the group modifier, which functions much like parentheses in DTDs. For example, here the (now required) authorname is either plain text or a sequence of elements:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"> <text/> </element> <element name="authorname"> <choice> <text/> <group> <element name="first"><text/></element> <element name="last"><text/></element> </group> </choice> </element> <element name="body"> <text/> </element> </element>
The group container is necessary because without it the first and last elements would be part of the choice and become mutually exclusive.
DTDs provide no way to require a group of elements in which order is not significant but contents are required. RELAX NG provides a container called interleave which does just that. It requires all the children to be present, but in any order. In the following example, title can come before authorname, or it can come after:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <interleave> <element name="title"> <text/> </element> <element name="authorname"> <text/> </element> </interleave> <element name="body"> <text/> </element> </element>
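To see what interleave buys us, here are two hypothetical instance documents that the schema above should both accept. The text values are invented; the two documents differ only in the order of title and authorname.

```xml
<!-- Two separate instance documents, shown together for comparison. -->

<!-- Document 1: title before authorname -->
<report>
  <title>Field Notes</title>
  <authorname>A. Writer</authorname>
  <body>Observations go here.</body>
</report>

<!-- Document 2: authorname before title -->
<report>
  <authorname>A. Writer</authorname>
  <title>Field Notes</title>
  <body>Observations go here.</body>
</report>
```

In both cases body must still come last, because it sits outside the interleave container.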
Nonelement content descriptors
The text content descriptor is only one of several options for describing non-element content. Here’s the full assortment:
Name | Content |
empty | No content at all |
text | Any string |
value | A predetermined value |
data | Text following a specific pattern (datatype) |
list | A sequence of values |
The empty marker precludes any content. With this declaration, the element bookmark is not allowed to appear in any form other than as an empty element:
<element name="bookmark"> <empty/> </element>
RELAX NG provides the value descriptor for matching a string of characters. For example, here is an enumeration of values for a size element:
<element name="size"> <choice> <value>small</value> <value>medium</value> <value>large</value> </choice> </element>
By default, value normalizes the string, removing extra space characters. The example element below would be accepted by the previous declaration:
<size> small </size>
If you want to turn off normalization and require exact string matching, you need to add a type="string" attribute. The following declaration would reject the above element’s content because of its extra space:
<element name="size"> <choice> <value type="string">small</value> <value type="string">medium</value> <value type="string">large</value> </choice> </element>
The most interesting content descriptor is data. This is the vehicle for using datatypes in RELAX NG. Its type attribute contains the name of a type defined in a datatype library. (Don’t worry about what that means yet; we’ll get to it in a moment.) The content of the element declared here is set to be an integer value:
<element name="font-size"> <data type="integer"/> </element>
One downside to using data is that it can’t be mixed with elements in content, unlike text.
The list descriptor contains a sequence of space-separated tokens. A token is a special type of string consisting only of nonspace characters. Token lists are a convenient way to represent sets of discrete data. Here, one is used to encode a set of numbers:
<element name="vector"> <list> <oneOrMore> <data type="float"/> </oneOrMore> </list> </element>
Here is an acceptable vector:
<vector>44.034 19.0 -65.33333</vector>
Note how the oneOrMore descriptor works just as well with text as it does with elements. It’s yet another example of how succinct and flexible RELAX NG is.
Data Typing
Although RELAX NG supports datatyping, the specification only includes two built-in types: string and token. To use other kinds of datatypes, you need to import them from another specification. You do this by setting a datatypeLibrary attribute like so:
<element name="font-size"> <data type="integer" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"/> </element>
This will associate the datatype definitions from the W3C Schema specification with your schema. The datatypes you can use depend on the implementation of your RELAX NG validating parser.
It isn’t so convenient to put the datatypeLibrary attribute in every data element. The good news is it can be inherited from any ancestor in the schema. Here, we declare it once in an element declaration, and all the data descriptors inside call from that library:
<element name="rectangle" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <element name="width"> <data type="double"/> </element> <element name="height"> <data type="double"/> </element> </element>
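For illustration, an instance like the following should validate against the rectangle schema above, assuming the validator supports the W3C Schema datatype library. The measurements are invented.

```xml
<!-- Hypothetical instance; both values fall in the lexical space of the double type. -->
<rectangle>
  <width>12.5</width>
  <height>3.0E2</height>
</rectangle>
```

By contrast, content such as `<width>wide</width>` should be rejected, because "wide" is not a valid double.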
String and token
Both string and token match arbitrary strings of legal XML character data. The difference is that token normalizes whitespace and string keeps whitespace as is. The value descriptor uses token by default, which is why the earlier size example accepted extra spaces until type="string" was added.
Parameters
Some datatypes allow you to specify a parameter to further restrict the pattern. This is expressed with a param element as a child of the data element. For example, the element below restricts its content to a string of no more than eight characters:
<element name="username" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <data type="string"> <param name="maxLength">8</param> </data> </element>
Attributes
Attributes are declared much the same way as elements. In this example, we add a date attribute to the report element:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <attribute name="date"> <text/> </attribute> <element name="title"> <text/> </element> <element name="body"> <text/> </element> </element>
Unlike elements, the order of attributes is not significant. In this next example, the attributes for any alert element can appear in any order, even though we are not using an interleave element. Because an attribute declaration holds a single pattern, the allowed values are wrapped in choice:
<element name="alert" xmlns="http://relaxng.org/ns/structure/1.0"> <attribute name="priority"> <choice> <value>emergency</value> <value>important</value> <value>warning</value> <value>notification</value> </choice> </attribute> <attribute name="icon"> <choice> <value>bomb</value> <value>exclamation-mark</value> <value>frown-face</value> </choice> </attribute> <element name="body"> <text/> </element> </element>
Element and attribute declarations are interchangeable. Here, we use a choice element to provide two cases: one with an icon attribute and another with an icon element. The two are mutually exclusive:
<element name="alert" xmlns="http://relaxng.org/ns/structure/1.0"> <choice> <attribute name="icon"> <choice> <value>bomb</value> <value>exclamation-mark</value> <value>frown-face</value> </choice> </attribute> <element name="icon"> <choice> <value>bomb</value> <value>exclamation-mark</value> <value>frown-face</value> </choice> </element> </choice> <element name="body"> <text/> </element> </element>
Another difference from element declarations is that attribute declarations have a shorthand form: if you leave out the content pattern, it defaults to text. So, for example, this declaration:
<element name="emphasis"> <attribute name="style"/> </element>
...is equivalent to this:
<element name="emphasis"> <attribute name="style"> <text/> </attribute> </element>
This interchangeability between element and attribute declarations makes the schema language much simpler and more elegant.
Namespaces
RELAX NG is fully namespace aware. You can use a namespace-qualified name in any name attribute, declaring its prefix with an xmlns attribute:
<element name="poem" xmlns="http://relaxng.org/ns/structure/1.0" xmlns:foo="http://www.mystuff.com/commentary"> <optional> <attribute name="xml:space"> <choice> <value>default</value> <value>preserve</value> </choice> </attribute> </optional> <zeroOrMore> <choice> <text/> <element name="foo:comment"><text/></element> </choice> </zeroOrMore> </element>
Add the attribute ns to any element or attribute declaration to set an implicit namespace context. For example, this declaration:
<element name="vegetable" ns="http://www.broccoli.net" xmlns="http://relaxng.org/ns/structure/1.0"> <empty/> </element>
would match either of these:
<food:vegetable xmlns:food="http://www.broccoli.net"/> <vegetable xmlns="http://www.broccoli.net"/>
...but fail to match these:
<vegetable/> <food:vegetable xmlns:food="http://www.uglifruit.org"/>
The namespace setting is inherited, allowing you to set it once at a high level. Here, the inner element declarations for title and body implicitly require the namespace http://howtowrite.info:
<element name="report" ns="http://howtowrite.info" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"><text/></element> <element name="body"><text/></element> </element>
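As a sketch of what this inheritance means in practice, either of the following hypothetical instances should match the report declaration above. The text content and the prefix h are invented for illustration.

```xml
<!-- Two separate instance documents, shown together for comparison. -->

<!-- Document 1: a default namespace covers report, title, and body alike -->
<report xmlns="http://howtowrite.info">
  <title>On Brevity</title>
  <body>Keep it short.</body>
</report>

<!-- Document 2: the same names written with an explicit prefix -->
<h:report xmlns:h="http://howtowrite.info">
  <h:title>On Brevity</h:title>
  <h:body>Keep it short.</h:body>
</h:report>
```

A document in which title or body falls outside the http://howtowrite.info namespace should fail, since the inner declarations inherit ns from report.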
Name Classes
A name class is any pattern that substitutes for a set of element or attribute types. We’ve already seen one, choice, which matches an enumerated set of elements and attributes. Even more permissive is the name class anyName, which allows any element or attribute type to have the described content model.
For example, this pattern matches any well-formed document:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <ref name="all-elements"/> </start> <define name="all-elements"> <element> <anyName/> <!-- use in place of the "name" attribute --> <zeroOrMore> <choice> <ref name="all-elements"/> <text/> <attribute><anyName/></attribute> </choice> </zeroOrMore> </element> </define> </grammar>
The anyName appears inside the element instead of a name attribute. The zeroOrMore is required here because each name class element matches exactly one object.
The nsName class matches any element or attribute in a namespace specified by an ns attribute. For example:
<element xmlns="http://relaxng.org/ns/structure/1.0"> <nsName ns="http://fakesite.org" /> <empty/> </element>
This will set any element in the namespace http://fakesite.org to be an empty element.
If you leave out the ns attribute, nsName will inherit the namespace from the nearest ancestor that defines one. So this will also work:
<element ns="http://fakesite.org" xmlns="http://relaxng.org/ns/structure/1.0"> <nsName /> <empty /> </element>
If you don’t want to let everything through, trim down the set using except. Use it as a child to anyName or nsName to list classes of elements or attributes you don’t want to allow. Here, only elements not in the current namespace are declared empty:
<element ns="http://fakesite.org" xmlns="http://relaxng.org/ns/structure/1.0"> <anyName> <except> <nsName /> </except> </anyName> <empty /> </element>
The only place you cannot use a name class is as the child of a define element. This is wrong:
<define name="too-ambiguous"> <anyName/> </define>
We’ll discuss define elements in the next section.
Tip
As this book was going to press, James Clark announced the Namespace Routing Language (NRL), which provides enormous flexibility for describing how content in different namespaces should be validated and processed. See http://www.thaiopensource.com/relaxng/nrl.html for more information and an implementation.
Named Patterns
The patterns we have seen so far are monolithic. All the declarations are nested inside one big one. This is fine for simple documents, but as complexity builds, it can be hard to manage. Named patterns allow you to move declarations outside of the main pattern, breaking up the schema into discrete parts that are more easily handled. It also allows for reusing patterns that recur in many places.
A schema that uses named patterns follows this layout:
<grammar> <start> main pattern </start> <define name="identifier"> pattern </define> more pattern definitions </grammar>
The outermost grammar element encloses both the main pattern and a set of named pattern definitions. It contains exactly one start element with the primary pattern, and any number of define elements, each defining a named pattern. Named patterns are imported into a pattern using a ref element. For example:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="report"> <ref name="head"/> <ref name="body"/> </element> </start> <define name="head"> <element name="title"> <text/> </element> <element name="authorname"> <text/> </element> </define> <define name="body"> <zeroOrMore> <element name="paragraph"> <text/> </element> </zeroOrMore> </define> </grammar>
The start element must contain exactly one pattern. However, a define may contain any number of children, since its contents will be copied into another pattern.
You can write a grammar to fit the style of DTDs, with one definition per element:[6]
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="report"> <ref name="title"/> <ref name="authorname"/> <zeroOrMore> <ref name="paragraph"/> </zeroOrMore> </element> </start> <define name="title"> <element name="title"> <text/> </element> </define> <define name="authorname"> <element name="authorname"> <text/> </element> </define> <define name="paragraph"> <element name="paragraph"> <text/> </element> </define> </grammar>
Recursive definitions
Recursive definitions are allowed, as long as the ref is enclosed inside an element. This pattern describes a section element that can contain subsections arbitrarily deep:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="report"> <element name="title"><text/></element> <zeroOrMore> <ref name="paragraph"/> </zeroOrMore> <zeroOrMore> <ref name="section"/> </zeroOrMore> </element> </start> <define name="paragraph"> <element name="paragraph"> <text/> </element> </define> <define name="section"> <element name="section"> <zeroOrMore> <ref name="paragraph"/> </zeroOrMore> <zeroOrMore> <ref name="section"/> </zeroOrMore> </element> </define> </grammar>
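To see the recursion at work, consider this hypothetical instance, which the grammar above should accept. The text is invented; the nesting depth of three is arbitrary, since the section pattern refers to itself.

```xml
<!-- Hypothetical instance with sections nested three levels deep. -->
<report>
  <title>Annual Findings</title>
  <paragraph>Introductory text.</paragraph>
  <section>
    <paragraph>First-level section text.</paragraph>
    <section>
      <paragraph>Second-level section text.</paragraph>
      <!-- An empty section is allowed: both of its children are zeroOrMore. -->
      <section/>
    </section>
  </section>
</report>
```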
Failing to put the ref inside an element in a recursive definition would set up a logical infinite loop. So this is illegal:
<define name="foo"> <choice> <ref name="bar"/> <ref name="foo"/> </choice> </define>
The order of definitions for named patterns doesn’t matter. As long as every referenced pattern has a definition within the same grammar, everything will be kosher.
Aggregate definitions
Multiple pattern definitions with the same name are illegal unless you use the combine attribute. This tells the processor to merge the definitions into one, grouped with either a choice or an interleave container. The value of this attribute describes how to combine the parts. For example:
<define name="block.class" combine="choice"> <element name="title"> <text/> </element> </define> <define name="block.class" combine="choice"> <element name="para"> <text/> </element> </define>
...which is equivalent to this:
<define name="block.class" xmlns="http://relaxng.org/ns/structure/1.0"> <choice> <element name="title"> <text/> </element> <element name="para"> <text/> </element> </choice> </define>
The usefulness of aggregate definitions becomes more clear when used with patterns in other files.
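The example above uses combine="choice". As a minimal sketch of the other option, here is a grammar that merges two definitions with combine="interleave". The names para, common.attributes, id, and class are hypothetical and chosen only for illustration.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="para">
      <ref name="common.attributes"/>
      <text/>
    </element>
  </start>
  <!-- Each definition contributes one optional attribute. -->
  <define name="common.attributes" combine="interleave">
    <optional>
      <attribute name="id"/>
    </optional>
  </define>
  <define name="common.attributes" combine="interleave">
    <optional>
      <attribute name="class"/>
    </optional>
  </define>
</grammar>
```

The merged common.attributes pattern should behave as if both optional attributes were wrapped in a single interleave, so a para may carry id, class, both, or neither.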
Modularity
Good housekeeping of schemas often requires putting pieces in different files. Not only will it make parts smaller and easier to manage, but it allows them to be shared between schemas.
External references
The pattern externalRef functions like ref and uses the attribute href to locate the file containing a grammar. externalRef references the whole grammar, not a named pattern inside it.
Suppose we have a file section.rng containing this pattern:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <ref name="section"/> </start> <define name="section"> <element name="section"> <zeroOrMore> <ref name="paragraph"/> </zeroOrMore> <zeroOrMore> <ref name="section"/> </zeroOrMore> </element> </define> <define name="paragraph"> <text/> </define> </grammar>
We can link it to a pattern in another file like this:
<element name="report" xmlns="http://relaxng.org/ns/structure/1.0"> <element name="title"><text/></element> <oneOrMore> <externalRef href="section.rng"/> </oneOrMore> </element>
Nested grammars
One consequence of external referencing is that grammars effectively contain other grammars. To prevent name clashes, each grammar has its own scope for named patterns. The named patterns in a parent are not automatically available to its child grammars. Instead, ref will only reference a definition from inside the current grammar.
To get around that limitation, you can use parentRef. It functions like ref but looks for definitions in the grammar one level up. For example, consider this case where two grammars reference each other. I am defining one element, para, as a paragraph that can include footnotes. The footnote element contains some number of paras. They are stored in files para.rng and footnote.rng, respectively, and shown in Examples 4-8 and 4-9.
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="para"> <zeroOrMore> <choice> <ref name="para.content"/> <externalRef href="footnote.rng"/> </choice> </zeroOrMore> </element> </start> <define name="para.content"> <text/> </define> </grammar>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="footnote"> <oneOrMore> <parentRef name="para.content"/> </oneOrMore> </element> </start> </grammar>
The footnote pattern relies on its parent grammar to define para.content, the pattern for the content of a para.
Merging grammars
You can merge grammars from external sources by using include as a child of grammar. Like externalRef, include uses an href attribute to source in the definitions. However, it actually incorporates them in the same context, unlike externalRef which keeps scopes for named patterns separate.
One use for include is to augment an existing definition with more patterns. Suppose, for example, this pattern is located in block.rng:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <ref name="block.class"/> </start> <define name="block.class"> <choice> <element name="title"> <text/> </element> <element name="para"> <text/> </element> </choice> </define> </grammar>
I can add more items to this class by including it like so:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="block.rng"> <start> <oneOrMore> <element name="section"> <ref name="block.class"/> </element> </oneOrMore> </start> </include> <define name="block.class" combine="choice"> <element name="poem"> <text/> </element> </define> </grammar>
The combine attribute is necessary to tell the processor how to incorporate the new definition with the previous one imported from block.rng. Note that for multiply defined patterns of the same name, one is allowed to leave out the combine attribute, as is the case in the file block.rng. The new start sits inside include so that it replaces the start imported from block.rng instead of clashing with it; overriding is covered next.
Overriding imported definitions
You can override some definitions that you import by including new ones inside the include element. Say we have a file report.rng defined like this:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="report"> <ref name="head"/> <ref name="body"/> </element> </start> <define name="head"> <element name="title"><text/></element> </define> <define name="body"> <element name="section"> <oneOrMore> <element name="para"><text/></element> </oneOrMore> </element> </define> </grammar>
We wish to import this grammar, but adjust it slightly. Instead of just a title, we want to allow a subtitle as well. Rather than rewrite the whole grammar, we can just redefine head:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="report.rng"> <define name="head"> <element name="title"><text/></element> <optional> <element name="subtitle"><text/></element> </optional> </define> </include> </grammar>
The start pattern comes along with the rest of report.rng, so it does not need to be repeated. This is a good way to customize a schema to suit your own particular taste.
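For illustration, a document like the following should validate against the customized grammar but not against the original report.rng, which has no subtitle element. The text values are invented.

```xml
<!-- Hypothetical instance using the redefined head pattern. -->
<report>
  <title>Quarterly Results</title>
  <subtitle>Third quarter</subtitle>
  <section>
    <para>Revenue figures go here.</para>
    <para>Commentary goes here.</para>
  </section>
</report>
```

Because subtitle is wrapped in optional, documents written for the original grammar should remain valid under the customized one.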
CensusML Example
In case you are curious, let’s go back to the CensusML example from Section 4.3 and try to do it as a RELAX NG schema. The result is Example 4-10.
<element name="census-record" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <attribute name="taker"> <data type="integer"> <param name="minInclusive">1</param> <param name="maxInclusive">9999</param> </data> </attribute> <element name="date"> <data type="date"/> </element> <element name="address"> <interleave> <element name="street"><text/></element> <element name="city"><text/></element> <element name="county"><text/></element> <element name="postalcode"> <data type="string"> <param name="pattern">[0-9][0-9][0-9][A-Z][A-Z][A-Z]</param> </data> </element> </interleave> </element> <oneOrMore> <element name="person"> <interleave> <attribute name="employed"> <choice> <value>fulltime</value> <value>parttime</value> <value>none</value> </choice> </attribute> <attribute name="pid"> <data type="integer"> <param name="minInclusive">1</param> <param name="maxInclusive">999999</param> </data> </attribute> <element name="age"> <data type="integer"> <param name="minInclusive">0</param> <param name="maxInclusive">200</param> </data> </element> <element name="gender"> <choice> <value>male</value> <value>female</value> </choice> </element> <element name="name"> <interleave> <element name="first"><text/></element> <element name="last"><text/></element> <optional> <choice> <element name="junior"><empty/></element> <element name="senior"><empty/></element> </choice> </optional> </interleave> </element> </interleave> </element> </oneOrMore> </element>
This schema certainly looks a lot cleaner than the W3C Schema version. Enumerations and complex types are much more clear. The grouping structures are very easy to read. Personally, I think RELAX NG is just more intuitive all around.
[6] This is how DTDs can be mapped directly into RELAX NG schema. This kind of backward compatibility is important, since most people are still using DTDs. So this is a good way to upgrade to RELAX NG.