Learning XML, 2nd Edition

W3C XML Schema

DTDs are chiefly directed toward describing how elements are arranged in a document. They say very little about the content in the document, other than whether an element can contain character data. Although attributes can be declared to be of different types (e.g. ID, IDREF, enumerated), there is no way to constrain the type of data in an element.

Returning to the example in Section 4.2.3, we can see how this limitation can be a serious problem. Suppose that a census taker submitted the document in Example 4-5.

Example 4-5. A bad CensusML document

<census-record taker="9170">
  <date><month>?</month><day>110</day><year>03</year></date>
  <address>
    <city>Munchkinland</city>
    <street></street>
    <county></county>
    <country>Here, silly</country>
    <postalcode></postalcode>
  </address>
  <person employed="fulltime" pid="?">
    <name>
      <last>Burgle</last>
      <first>Brad</first>
    </name>
    <age>2131234</age>
    <gender>yes</gender>
  </person>
</census-record>

There are a lot of things wrong with this document. The date is in the wrong format. Several important fields were left empty. The stated age is an impossibly large number. The gender, which ought to be “male” or “female,” contains something else. The personal identification number has a bad value. And yet, to our infinite dismay, the DTD would pick up none of these problems.

It isn’t hard to write a program that would check the data types, but that’s a low-level operation, prone to bugs and requiring technical ability. It’s also getting away from the point of DTDs, which is to create a kind of metadocument, a formal description of a markup language. Programming languages aren’t portable and don’t work well as a way of conveying syntactic and semantic details. So we have to conclude that DTDs don’t go far enough in describing a markup language.

To make matters worse, the things the DTD does reject as bad markup are often trivial. For example, it would complain that the contents of date and name are not in the specific order required by their element declarations. This seems unnecessarily picayune, but it’s actually very difficult to write a content model that allows its children to appear in any order.

To illustrate the problem, let’s try to make the date more flexible, so that it accepts children in any order. The best I can think of is to write the declaration like this:

<!ELEMENT date (
        (year,  ((month, day)  | (day, month)))
      | (month, ((year, day)   | (day, year)))
      | (day,   ((month, year) | (year, month)))
)>

Pretty ugly, isn’t it? And that’s only with three child elements. This is another serious drawback of DTDs.

Perhaps the most damaging limitation of DTDs is the lockdown of namespace. Any element in a document has to have a corresponding declaration in the DTD. No exceptions. This is fundamentally at odds with XML namespaces, which allow you to import vocabularies from anywhere. Granted, there are good reasons to want to limit the kinds of elements used: more efficient validation and preventing illegal elements from appearing. But there’s no way to turn this feature off if you don’t want it.^[5]

To address problems like these, a new kind of validation system was invented: the schema. Like DTDs, schemas contain rules that must all be satisfied for a document to be considered valid. Unlike DTDs, however, schemas are not built into the XML specification. They are an add-on technology that you can use, provided you have access to parsers that support them.

There are several competing kinds of schema. The one that is sanctioned by the W3C is called XML Schema. Another proposal, called RELAX NG, adds capabilities not found in XML Schema, such as more flexible handling of unordered content. Yet another popular alternative is Schematron. We’ll focus on the W3C variety in this section and visit alternatives in later sections.

XML Schemas are themselves XML documents. That’s a nice convenience, allowing you to check well-formedness and validity when you make modifications to a schema. A schema is more verbose than a DTD, but still pretty readable and vastly more flexible.

From the census example, here is how you would define the county element:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="county" type="xs:string"/>
</xs:schema>

The xs:element element acts like an !ELEMENT declaration in a DTD. Its name attribute declares a name (“county”), and its type attribute defines a content model by referring to a data type identifier. Instead of using a compact string of symbols to define a content model, schemas define content models in a separate place, then refer to them inside the element definition. This is kind of like using parameter entities, but as we will see it’s more flexible in schemas.
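
As a minimal sketch of how a document might point a validating parser at this schema, assume the schema above has been saved under the hypothetical filename county.xsd. The xsi:noNamespaceSchemaLocation attribute, which belongs to the XML Schema instance namespace, is a hint that a processor may use to locate the schema for elements that are in no namespace. The content "Oz" is made up for illustration.

```xml
<?xml version="1.0"?>
<!-- county.xsd is a hypothetical filename for the schema shown above -->
<county xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="county.xsd">Oz</county>
```

The attribute is only a hint. Many validators also let you supply the schema separately, in which case the document needs no reference to it at all.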

xs:string refers to a simple type of element, one that is built in to the Schema specification. In this case, it’s just a string of character data, about as simple as you can get. An alternative to xs:string is xs:token. It also contains a string, but normalizes the space (strips out leading and trailing space and collapses extra space characters) for you. Table 4-1 lists other simple types which are commonly used in schemas. There are many more types in W3C XML Schema Part 2: Datatypes, but this core covers most frequent needs and will get you started.

Table 4-1. Simple types commonly used in schemas

Type	Usage
`xs:string`	Contains any text.
`xs:token`	Contains text with whitespace normalized (leading and trailing space stripped, runs of whitespace collapsed to a single space).
`xs:QName`	Contains a namespace-qualified name.
`xs:decimal`	Contains a decimal number of arbitrary precision. (Processors must support a minimum of 18 digits.) “3.252333”, “-1.01”, and “+20” are all acceptable values.
`xs:integer`	Contains an integer number, like “0”, “35”, or “-1433322”.
`xs:float`	Contains a 32-bit IEEE 754 floating point number.
`xs:ID, xs:IDREF, xs:IDREFS`	Behave the same as the ID, IDREF, IDREFS in DTDs.
`xs:boolean`	Contains a true or false value, expressed as “true” or “false” or “1” or “0”.
`xs:time`	Contains a time in ISO 8601 format (HH:MM:SS-Timezone), like 21:55:00-06:00.
`xs:date`	Contains a date in ISO 8601 format (CCYY-MM-DD), like 2004-12-30.
`xs:dateTime`	Contains a date/time combination in ISO 8601 format (CCYY-MM-DDTHH:MM:SS-Timezone), like 2004-12-30T21:55:00-06:00.
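
To make the table more concrete, here is a small sketch that declares elements with a few of these types. The element names (interviewed, visit-date, income, district) are hypothetical and not part of CensusML. The comments describe sample content that each declaration would accept.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- accepts "true", "false", "1", or "0" -->
  <xs:element name="interviewed" type="xs:boolean"/>

  <!-- accepts an ISO 8601 date such as 2004-12-30 -->
  <xs:element name="visit-date" type="xs:date"/>

  <!-- accepts a decimal such as 41250.75, but not "about 40000" -->
  <xs:element name="income" type="xs:decimal"/>

  <!-- content like "  North   Ward " is normalized to "North Ward" -->
  <xs:element name="district" type="xs:token"/>
</xs:schema>
```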

Most elements are not simple, however. They can contain elements, attributes, and character data with specialized formats. So schemas also contain complex element definitions. Here’s how you could define the date element:

<xs:element name="date">
  <xs:complexType>
    <xs:all>
      <xs:element ref="year"/>
      <xs:element ref="month"/>
      <xs:element ref="day"/>
    </xs:all>
  </xs:complexType>
</xs:element>
<xs:element name="year" type="xs:integer"/>
<xs:element name="month" type="xs:integer"/>
<xs:element name="day" type="xs:integer"/>

The date element is a complex type because it has special requirements that you must explicitly define. In this case, the type is a group of three elements (in any order), referred to by name using the ref attribute. These referred elements are defined at the bottom to be of type integer.
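
For illustration, assuming the declarations above, both of the following date elements would be accepted, since xs:all places no constraint on the order of its children. The values are made up.

```xml
<!-- children in year, month, day order -->
<date><year>2003</year><month>4</month><day>10</day></date>

<!-- the same children in a different order are equally valid -->
<date><day>10</day><month>4</month><year>2003</year></date>
```

Each child must still appear exactly once, because minOccurs and maxOccurs both default to 1.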

It is possible to refine the date even further. Although the schema will guarantee that each of the subfields year, month, and day is an integer value, it will allow some values we don’t want. For example, -125724 is a valid integer, but we wouldn’t want that to be used for month.

The way to control the range of a data type is to use facets. A facet is an additional parameter added to a type definition. You can create a new data type for the <month> element like this:

<xs:simpleType name="monthNum">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="12"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="month" type="monthNum"/>

Here, we created a named type called monthNum. Named types are not bound to any particular element, so they are useful if you’ll be using the same type over and over. The type definition contains an xs:restriction element, which derives a more specific type from the loose xs:integer. Inside are two facets, minInclusive and maxInclusive, setting the lower and upper bounds respectively. Any element set to the type monthNum will be checked to ensure its value is a number that falls inside that range.
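
The same technique extends to the other date fields. As a sketch, the following defines a dayNum type limiting day to the range 1 through 31. The name dayNum is an assumption modeled on monthNum and does not appear in the original example.

```xml
<!-- dayNum is a hypothetical type name, modeled on monthNum -->
<xs:simpleType name="dayNum">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="31"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="day" type="dayNum"/>
```

Range facets apply to one value at a time, so this type cannot tell that day 31 is impossible in a 30-day month.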

Besides setting ranges, facets can create fixed values, constrain the length of strings, and match patterns with regular expressions. For example, say you want the postal code to be any string that contains three digits followed by three letters, as in the census example:

<postalcode>885JKL</postalcode>

A pattern to match this is [0-9][0-9][0-9][A-Z][A-Z][A-Z]. Even better: [0-9]{3}[A-Z]{3}. Here is how the schema element might look:

<xs:element name="postalcode" type="pcode"/>
<xs:simpleType name="pcode">
  <xs:restriction base="xs:token">
    <xs:pattern value="[0-9]{3}[A-Z]{3}"/>
  </xs:restriction>
</xs:simpleType>
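
Length constraints work along the same lines. The sketch below uses the minLength and maxLength facets so that an empty street element, like the one in Example 4-5, would be rejected. The type name streetType and the upper limit of 100 characters are assumptions chosen for illustration.

```xml
<xs:element name="street" type="streetType"/>
<!-- streetType and the limit of 100 are assumptions for illustration -->
<xs:simpleType name="streetType">
  <xs:restriction base="xs:token">
    <xs:minLength value="1"/>
    <xs:maxLength value="100"/>
  </xs:restriction>
</xs:simpleType>
```

Because the base type is xs:token, content consisting only of whitespace is collapsed to an empty string before the length is checked, so it fails the minLength test as well.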

Another way to define a type is by enumeration, defining a set of allowed values. The gender element, for example, may only contain two values: female or male. Here’s a gender type:

<xs:simpleType name="genderType">
  <xs:restriction base="xs:token">
    <xs:enumeration value="female"/>
    <xs:enumeration value="male"/>
  </xs:restriction>
</xs:simpleType>
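
As a sketch, this type could be attached to an element in the same way monthNum was attached to month. The comments show how the value from Example 4-5 would fare.

```xml
<xs:element name="gender" type="genderType"/>

<!-- accepted: the content matches one of the enumerated values -->
<!--   <gender>female</gender>                                   -->
<!-- rejected: "yes" (from Example 4-5) is not in the list       -->
<!--   <gender>yes</gender>                                      -->
```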

Now, let me show you how I would write a schema for the CensusML document type. Example 4-6 shows my attempt.

Example 4-6. A schema for CensusML

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- document element -->
  <xs:element name="census-record">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="date"/>
        <xs:element ref="address"/>
        <xs:element ref="person" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="taker"/>
    </xs:complexType>
  </xs:element>

  <!-- Number identifying the census taker (1-9999) -->
  <xs:attribute name="taker">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="9999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- structure containing date information -->
  <!-- this is a simplification over the previous definition using -->
  <!-- three subelements. -->
  <xs:element name="date" type="xs:date"/>

  <!-- structure containing address information -->
  <xs:element name="address">
    <xs:complexType>
      <xs:all>
        <xs:element ref="street"/>
        <xs:element ref="city"/>
        <xs:element ref="county"/>
        <xs:element ref="country"/>
        <xs:element ref="postalcode"/>
      </xs:all>
    </xs:complexType>
  </xs:element>

  <xs:element name="street" type="xs:string"/>
  <xs:element name="city" type="xs:string"/>
  <xs:element name="county" type="xs:string"/>
  <xs:element name="country" type="xs:string"/>

  <!-- postalcode element: uses format 123ABC -->
  <xs:element name="postalcode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{3}[A-Z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- structure containing data for one resident of the household -->
  <xs:element name="person">
    <xs:complexType>
      <xs:all>
        <xs:element ref="name"/>
        <xs:element ref="age"/>
        <xs:element ref="gender"/>
      </xs:all>
      <xs:attribute ref="employed"/>
      <xs:attribute ref="pid"/>
    </xs:complexType>
  </xs:element>

  <!-- Employment status: fulltime, parttime, or none -->
  <xs:attribute name="employed">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="fulltime"/>
        <xs:enumeration value="parttime"/>
        <xs:enumeration value="none"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- Number identifying the person (1-999999) -->
  <xs:attribute name="pid">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="999999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- Age (0-200) -->
  <xs:element name="age">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="200"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- Enumerated type: male or female -->
  <xs:element name="gender">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="female"/>
        <xs:enumeration value="male"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

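  <!-- structure containing the name; first and last may appear in
       either order, followed by an optional junior OR senior element.
       (xs:all cannot be combined with another group, so the two
       orders are spelled out as a choice of sequences.) -->
  <xs:element name="name">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:sequence>
            <xs:element ref="first"/>
            <xs:element ref="last"/>
          </xs:sequence>
          <xs:sequence>
            <xs:element ref="last"/>
            <xs:element ref="first"/>
          </xs:sequence>
        </xs:choice>
        <xs:choice minOccurs="0">
          <xs:element ref="junior"/>
          <xs:element ref="senior"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="first" type="xs:string"/>
  <xs:element name="last" type="xs:string"/>
  <xs:element name="junior" type="emptyElem"/>
  <xs:element name="senior" type="emptyElem"/>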

  <!-- Defining a type of element that is empty -->
  <xs:complexType name="emptyElem"/>

</xs:schema>

Some notes:

Since XML Schema supports a variety of date formats for character data, it makes sense to replace the cumbersome date container and its three child elements with one that takes only text content. This simplifies the schema and supporting software for the census application.
I used the maxOccurs attribute to allow an unlimited number of person elements. Without it, maxOccurs defaults to 1 and the schema would allow no more than one such element.
A choice element is the opposite of all. Instead of requiring all the elements to be present, it will allow only one of the choices to appear. In this case, I wanted at most one of <junior/> or <senior/> to appear.
I set the minOccurs attribute in the choice to zero to make it optional. You can choose to use <junior/> or <senior/>, but you don’t have to.
Curiously, there is no type for empty elements. That’s why I had to define one, emptyElem, for the elements junior and senior.
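
For comparison with Example 4-5, here is a sketch of a record that the schema in Example 4-6 is meant to accept. All of the values are invented for illustration, and the date uses the xs:date format described in Table 4-1.

```xml
<census-record taker="9170">
  <date>2003-04-10</date>
  <address>
    <street>12 Yellow Brick Road</street>
    <city>Munchkinland</city>
    <county>Oz</county>
    <country>Oz</country>
    <postalcode>885JKL</postalcode>
  </address>
  <person employed="fulltime" pid="4417">
    <name>
      <first>Brad</first>
      <last>Burgle</last>
    </name>
    <age>34</age>
    <gender>male</gender>
  </person>
</census-record>
```

Note that elements typed as plain strings, such as street and county, would still be accepted if they were empty. Ruling that out would take an additional facet such as minLength.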

Tip

This is only scratching the surface of XML Schema, which offers an enormous variety of features, including type extension and restriction, lists, unions, namespace features, and much more. For more information on XML Schema, see Eric van der Vlist’s XML Schema (O’Reilly & Associates, 2002).

^[5]There is a complex parameter entity hack for creating DTDs that can cope with namespaces. Although the W3C has used it for both XHTML and SVG modularization, it’s both fragile and a huge readability problem.

Learning XML, 2nd Edition by Erik T. Ray