W3C XML Schema
DTDs are chiefly directed toward describing how elements are arranged in a document. They say very little about the content in the document, other than whether an element can contain character data. Although attributes can be declared to be of different types (e.g. ID, IDREF, enumerated), there is no way to constrain the type of data in an element.
Returning to the example in Section 4.2.3, we can see how this limitation can be a serious problem. Suppose that a census taker submitted the document in Example 4-5.
<census-record taker="9170">
  <date><month>?</month><day>110</day><year>03</year></date>
  <address>
    <city>Munchkinland</city>
    <street></street>
    <county></county>
    <country>Here, silly</country>
    <postalcode></postalcode>
  </address>
  <person employed="fulltime" pid="?">
    <name>
      <last>Burgle</last>
      <first>Brad</first>
    </name>
    <age>2131234</age>
    <gender>yes</gender>
  </person>
</census-record>
There are a lot of things wrong with this document. The date is in the wrong format. Several important fields were left empty. The stated age is an impossibly large number. The gender, which ought to be “male” or “female,” contains something else. The personal identification number has a bad value. And yet, to our infinite dismay, the DTD would pick up none of these problems.
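To see why the validator stays silent, here is a minimal sketch of how such fields are typically declared. The declarations are assumptions made for illustration, not the actual DTD from Section 4.2.3. Both elements are declared as #PCDATA, which is the most a DTD can say about element content, so any text at all passes.

```xml
<?xml version="1.0"?>
<!-- Hypothetical internal subset: age and gender accept any character data -->
<!DOCTYPE person [
  <!ELEMENT person (age, gender)>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT gender (#PCDATA)>
]>
<person>
  <age>2131234</age>
  <gender>yes</gender>
</person>
```

A validating parser should report this document as valid, because the DTD only checks that age and gender appear in the declared order and contain text.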
It isn’t hard to write a program that would check the data types, but that’s a low-level operation, prone to bugs and requiring technical ability. It’s also getting away from the point of DTDs, which is to create a kind of metadocument, a formal description of a markup language. Programming languages aren’t portable and don’t work well as a way of conveying syntactic and semantic details. So we have to conclude that DTDs don’t go far enough in describing a markup language.
To make matters worse, what the DTD will reject as bad markup is often trivial. For example, the contents of date and name are not in the specific order required by their element declarations. This seems unnecessarily picayune, but it’s actually very difficult to write a content model that allows its children to appear in any order.
To illustrate the problem, let’s try to make the date element more flexible, so that it accepts children in any order. The best I can think of is to write the declaration like this:
<!ELEMENT date (
    (year,  ((month, day)  | (day, month)))
  | (month, ((year, day)   | (day, year)))
  | (day,   ((month, year) | (year, month)))
)>
Pretty ugly, isn’t it? And that’s only with three child elements. This is another serious drawback of DTDs.
Perhaps the most damaging limitation of DTDs is the lockdown of namespace. Any element in a document has to have a corresponding declaration in the DTD. No exceptions. This is fundamentally at odds with XML namespaces, which allow you to import vocabularies from anywhere. Granted, there are good reasons to want to limit the kinds of elements used: more efficient validation and preventing illegal elements from appearing. But there’s no way to turn this feature off if you don’t want it.[5]
To address problems like these, a new kind of validation system was invented: the schema. Like DTDs, schemas contain rules that must all be satisfied for a document to be considered valid. Unlike DTDs, however, schemas are not built into the XML specification. They are an add-on technology that you can use, provided you have access to parsers that support them.
There are several competing kinds of schema. The one that is sanctioned by the W3C is called XML Schema. Another proposal, called RELAX NG, adds capabilities not found in XML Schema, such as unordered content models that can be combined freely with sequences and choices. Yet another popular alternative is Schematron. We’ll focus on the W3C variety in this section and visit alternatives in later sections.
XML Schemas are themselves XML documents. That’s a nice convenience, allowing you to check well-formedness and validity when you make modifications to a schema. A schema is more verbose than a DTD, but still pretty readable and vastly more flexible.
From the census example, here is how you would define the county element:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="county" type="xs:string"/>
</xs:schema>
The xs:element element acts like an <!ELEMENT> declaration in a DTD. Its name attribute declares a name (“county”), and its type attribute defines a content model by referring to a data type identifier. Instead of using a compact string of symbols to define a content model, schemas define content models in a separate place, then refer to them inside the element definition. This is kind of like using parameter entities, but as we will see it’s more flexible in schemas.
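One common way to tie a document to a schema that has no target namespace is the xsi:noNamespaceSchemaLocation attribute. The sketch below assumes the schema above has been saved as census.xsd (a hypothetical file name) alongside the document, and the county name is invented sample data.

```xml
<?xml version="1.0"?>
<!-- census.xsd is an assumed file name for the schema shown above -->
<county xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="census.xsd">Emerald</county>
```

The attribute is only a hint. Many processors also let you supply the schema location through their own options instead.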
xs:string refers to a simple type of element, one that is built in to the Schema specification. In this case, it’s just a string of character data, about as simple as you can get. An alternative to xs:string is xs:token. It also contains a string, but normalizes the space (strips out leading and trailing space and collapses extra space characters) for you. Table 4-1 lists other simple types which are commonly used in schemas. There are many more types in W3C XML Schema Part 2: Datatypes, but this core covers most frequent needs and will get you started.
Type | Usage |
xs:string | Contains any text. |
xs:token | Contains textual tokens separated by whitespace. |
xs:QName | Contains a namespace-qualified name. |
xs:decimal | Contains a decimal number of arbitrary precision. (Processors must support a minimum of 18 digits.) “3.252333”, “-1.01”, and “+20” are all acceptable values. |
xs:integer | Contains an integer number, like “0”, “35”, or “-1433322”. |
xs:float | Contains a 32-bit IEEE 754 floating point number. |
xs:ID, xs:IDREF, xs:IDREFS | Behave the same as the ID, IDREF, IDREFS in DTDs. |
xs:boolean | Contains a true or false value, expressed as “true” or “false” or “1” or “0”. |
xs:time | Contains a time in ISO 8601 format (HH:MM:SS-Timezone), like 21:55:00-06:00. |
xs:date | Contains a date in ISO 8601 format (CCYY-MM-DD), like 2004-12-30. |
xs:dateTime | Contains a date/time combination in ISO 8601 format (CCYY-MM-DDTHH:MM:SS-Timezone), like 2004-12-30T21:55:00-06:00. |
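As a rough illustration of how a few of these built-in types behave, the sketch below declares four elements. Apart from city, the element names (surveyed-on, household-income, owns-home) are hypothetical and do not come from the census example. The sample values in the comments follow from the lexical rules in the table.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- xs:token: "  Emerald   City " is treated as "Emerald City" -->
  <xs:element name="city" type="xs:token"/>
  <!-- xs:date: accepts 2004-12-30, rejects 12/30/2004 -->
  <xs:element name="surveyed-on" type="xs:date"/>
  <!-- xs:decimal: accepts 41250.75, rejects "about 40k" -->
  <xs:element name="household-income" type="xs:decimal"/>
  <!-- xs:boolean: accepts true, false, 1, or 0 -->
  <xs:element name="owns-home" type="xs:boolean"/>
</xs:schema>
```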
Most elements are not simple, however. They can contain elements, attributes, and character data with specialized formats. So schemas also contain complex element definitions. Here’s how you could define the date element:
<xs:element name="date">
  <xs:complexType>
    <xs:all>
      <xs:element ref="year"/>
      <xs:element ref="month"/>
      <xs:element ref="day"/>
    </xs:all>
  </xs:complexType>
</xs:element>
<xs:element name="year" type="xs:integer"/>
<xs:element name="month" type="xs:integer"/>
<xs:element name="day" type="xs:integer"/>
The date element is a complex type because it has special requirements that you must explicitly define. In this case, the type is a group of three elements (in any order), referred to by name using the ref attribute. These referred elements are defined at the bottom to be of type integer.
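To make the effect of xs:all concrete, here are two separate document fragments that this definition should accept. The values are invented sample data.

```xml
<!-- children in year, month, day order -->
<date><year>2004</year><month>12</month><day>30</day></date>

<!-- the same children in a different order, also allowed by xs:all -->
<date><day>30</day><year>2004</year><month>12</month></date>
```

Compare this with the six-way DTD content model earlier in the section, which was needed to express the same rule.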
It is possible to refine the date even further. Although the schema will guarantee that each of the subfields year, month, and day is an integer value, it will allow some values we don’t want. For example, -125724 is a valid integer, but we wouldn’t want that to be used for month.
The way to control the range of a data type is to use facets. A facet is an additional parameter added to a type definition. You can create a new data type for the <month> element like this:
<xs:simpleType name="monthNum">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="12"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="month" type="monthNum"/>
Here, we created a named type called monthNum. Named types are not bound to any particular element, so they are useful if you’ll be using the same type over and over. The type definition contains an xs:restriction element, which derives a more specific type from the loose xs:integer. Inside are two facets, minInclusive and maxInclusive, setting the lower and upper bounds respectively. Any element set to the type monthNum will be checked to ensure its value is a number that falls inside that range.
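The same technique carries over to the other date fields. As a sketch, a type for the day element might look like this; the name dayNum is made up for this illustration.

```xml
<!-- dayNum is a hypothetical type name, modeled on monthNum -->
<xs:simpleType name="dayNum">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="31"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="day" type="dayNum"/>
```

Range facets check each field on its own, so this would reject the day value 110 from the bad census record but would still accept an impossible combination such as month 2 with day 31.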
Besides setting ranges, facets can create fixed values, constrain the length of strings, and match patterns with regular expressions. For example, say you want the postal code to be any string that contains three digits followed by three letters, as in the census example:
<postalcode>885JKL</postalcode>
A pattern to match this is [0-9][0-9][0-9][A-Z][A-Z][A-Z]. Even better: [0-9]{3}[A-Z]{3}. Here is how the schema element might look:
<xs:element name="postalcode" type="pcode"/>
<xs:simpleType name="pcode">
  <xs:restriction base="xs:token">
    <xs:pattern value="[0-9]{3}[A-Z]{3}"/>
  </xs:restriction>
</xs:simpleType>
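Length facets work the same way as patterns. The sketch below is one possible way to reject the empty fields seen in the bad census record. The type name nonEmptyText and the upper limit of 100 characters are assumptions chosen for illustration.

```xml
<!-- nonEmptyText is a hypothetical type; the limit of 100 is arbitrary -->
<xs:simpleType name="nonEmptyText">
  <xs:restriction base="xs:token">
    <xs:minLength value="1"/>
    <xs:maxLength value="100"/>
  </xs:restriction>
</xs:simpleType>
<xs:element name="street" type="nonEmptyText"/>
```

Because the base type is xs:token, whitespace is collapsed before the length is measured, so a street element containing only spaces is rejected along with a completely empty one.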
Another way to define a type is by enumeration, defining a set of allowed values. The gender element, for example, may only contain two values: female or male. Here’s a gender type:
<xs:simpleType name="genderType">
  <xs:restriction base="xs:token">
    <xs:enumeration value="female"/>
    <xs:enumeration value="male"/>
  </xs:restriction>
</xs:simpleType>
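A brief sketch of how this type might be put to use: the first line is an element declaration that refers to genderType, and the fragments after it show how sample content would be judged.

```xml
<!-- schema: bind the gender element to the enumerated type -->
<xs:element name="gender" type="genderType"/>

<!-- instance: accepted, matches an enumerated value -->
<gender>female</gender>

<!-- instance: rejected, not in the list -->
<gender>yes</gender>

<!-- instance: rejected, enumeration values are case-sensitive -->
<gender>Female</gender>
```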
Now, let me show you how I would write a schema for the CensusML document type. Example 4-6 shows my attempt.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- document element -->
  <xs:element name="census-record">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="date"/>
        <xs:element ref="address"/>
        <xs:element ref="person" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="taker"/>
    </xs:complexType>
  </xs:element>

  <!-- Number identifying the census taker (1-9999) -->
  <xs:attribute name="taker">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="9999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- structure containing date information -->
  <!-- this is a simplification over the previous definition using -->
  <!-- three subelements. -->
  <xs:element name="date" type="xs:date"/>

  <!-- structure containing address information -->
  <xs:element name="address">
    <xs:complexType>
      <xs:all>
        <xs:element ref="street"/>
        <xs:element ref="city"/>
        <xs:element ref="county"/>
        <xs:element ref="country"/>
        <xs:element ref="postalcode"/>
      </xs:all>
    </xs:complexType>
  </xs:element>

  <xs:element name="street" type="xs:string"/>
  <xs:element name="city" type="xs:string"/>
  <xs:element name="county" type="xs:string"/>
  <xs:element name="country" type="xs:string"/>

  <!-- postalcode element: uses format 123ABC -->
  <xs:element name="postalcode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{3}[A-Z]{3}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- structure containing data for one resident of the household -->
  <xs:element name="person">
    <xs:complexType>
      <xs:all>
        <xs:element ref="name"/>
        <xs:element ref="age"/>
        <xs:element ref="gender"/>
      </xs:all>
      <xs:attribute ref="employed"/>
      <xs:attribute ref="pid"/>
    </xs:complexType>
  </xs:element>

  <!-- Employment status: fulltime, parttime, or none -->
  <xs:attribute name="employed">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="fulltime"/>
        <xs:enumeration value="parttime"/>
        <xs:enumeration value="none"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- Number identifying the person (1-999999) -->
  <xs:attribute name="pid">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="999999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- Age (0-200) -->
  <xs:element name="age">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="200"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- Enumerated type: male or female -->
  <xs:element name="gender">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="female"/>
        <xs:enumeration value="male"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

  <!-- structure containing the name; note the choice element that
       allows an optional junior OR senior element. first and last use
       a sequence here because a complex type may hold only one model
       group, so xs:all cannot sit beside xs:choice -->
  <xs:element name="name">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="first"/>
        <xs:element ref="last"/>
        <xs:choice minOccurs="0">
          <xs:element ref="junior"/>
          <xs:element ref="senior"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- first and last are referenced above, so they need declarations -->
  <xs:element name="first" type="xs:string"/>
  <xs:element name="last" type="xs:string"/>

  <xs:element name="junior" type="emptyElem"/>
  <xs:element name="senior" type="emptyElem"/>

  <!-- Defining a type of element that is empty -->
  <xs:complexType name="emptyElem"/>

</xs:schema>
Some notes:
- Since XML Schema supports a variety of date formats for character data, it makes sense to replace the cumbersome date container and its three child elements with one that takes only text content. This simplifies the schema and supporting software for the census application.
- I used an attribute maxOccurs to allow an unlimited number of person elements. Without it, the schema would allow no more than one such element.
- A choice element is the opposite of all. Instead of requiring all the elements to be present, it will allow only one of the choices to appear. In this case, I wanted at most one of <junior/> or <senior/> to appear.
- I set the minOccurs attribute in the choice to zero to make it optional. You can choose to use <junior/> or <senior/>, but you don’t have to.
- Curiously, there is no type for empty elements. That’s why I had to define one, emptyElem, for the elements junior and senior.
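For comparison with the faulty record in Example 4-5, here is a sketch of a record that the schema in Example 4-6 is meant to accept. All of the values are invented sample data. The date uses the CCYY-MM-DD form required by the built-in date type, every address field is present, and the attributes and remaining fields fall inside their declared ranges and enumerations.

```xml
<census-record taker="9170">
  <date>2003-10-11</date>
  <address>
    <street>12 Yellow Brick Road</street>
    <city>Munchkinland</city>
    <county>Quadling</county>
    <country>Oz</country>
    <postalcode>885JKL</postalcode>
  </address>
  <person employed="fulltime" pid="4216">
    <name>
      <first>Brad</first>
      <last>Burgle</last>
    </name>
    <age>42</age>
    <gender>male</gender>
  </person>
</census-record>
```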
Tip
This is only scratching the surface of XML Schema, which offers an enormous variety of features, including type extension and restriction, lists, unions, namespace features, and much more. For more information on XML Schema, see Eric van der Vlist’s XML Schema (O’Reilly & Associates, 2002).
[5] There is a complex parameter entity hack for creating DTDs that can cope with namespaces. Although the W3C has used it for both XHTML and SVG modularization, it’s both fragile and a huge readability problem.