XML Schema: The W3C's Object-Oriented Descriptions for XML
By Eric van der Vlist
June 2002
Pages: 396

Table of Contents

Chapter 1: Schema Uses and Development
XML, the Extensible Markup Language, lets developers create their own formats for storing and sharing information. Using that freedom, developers have created documents representing an incredible range of information, and XML can ease many different information-sharing problems. A key part of this process is formal declaration and documentation of those formats, providing a foundation on which software developers can build software.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Schemas Do for XML
An XML schema language is a formalization of the constraints, expressed as rules or a model of structure, that apply to a class of XML documents. In many ways, schemas serve as design tools, establishing a framework on which implementations can be built. Since formalization is a necessary ground for software designers, formalizing the constraints and structures of XML instance documents can lead to very diverse applications. Although new applications for schemas are being invented every day, most of them can be classified as validation, documentation, query, binding, or editing.
Validation is the most common use for schemas in the XML world. There are many reasons and opportunities to validate an XML document: when we receive one, before importing data into a legacy system, when we have produced or hand-edited one, to test the output of an application, etc. In all these cases, a schema helps to accomplish a substantial part of the job. Different kinds of schemas perform different kinds of validation, and some especially complex rules may be better expressed in procedural code rather than in a descriptive schema, but validation is generally the initial purpose of a schema, and often the primary purpose as well.
Validation can be considered a "firewall" against the diversity of XML. We need such firewalls principally in two situations: to serve as actual firewalls when we receive documents from the external world (as is commonly the case with Web Services and other XML communications), and to provide check points when we design processes as pipelines of transformations. By validating documents against schemas, you can ensure that the documents' contents conform to your expected set of rules, simplifying the code needed to process them.
Validation of documents can substantially reduce the risk of processing XML documents received from sources beyond your control. It doesn't remove either the need to follow the administration rules of your chosen communication protocol or the need to write robust applications, but it's a useful additional layer of tests that fits between the communications interface and your internal code.
W3C XML Schema
XML 1.0 included a set of tools for defining XML document structures, called Document Type Definitions (DTDs). DTDs provide a set of tools for defining which element and attribute structures are permitted in a document, as well as mechanisms for providing default values for attributes, defining reusable content (entities), and some kinds of metadata information (notations). While DTDs are widely supported and used, many XML developers quickly outgrew the capabilities DTDs provide. An alternative schema proposal, XML-Data, was even submitted to the W3C before XML 1.0 was a Recommendation.
The World Wide Web Consortium (W3C), keeper of the XML specification, sought to build a new language for describing XML documents. It needed to provide more precision in describing document structures and their contents, to support XML namespaces, and to use an XML vocabulary to describe XML vocabularies. The W3C's XML Schema Working Group spent two years developing two normative Recommendations, XML Schema Part 1: Structures, and XML Schema Part 2: Datatypes, along with a nonnormative Recommendation, XML Schema Part 0: Primer.
W3C XML Schema is designed to support all of these applications (validation, documentation, query, binding, and editing). An initial set of requirements, formally described in the XML Schema Requirements Note (http://www.w3.org/TR/NOTE-xml-schema-req), listed a wide variety of usage scenarios for schemas as well as the design principles that guided its creation.
In the rest of this book, we explore the details of W3C XML Schema and its many capabilities, focusing on how to apply it to specific XML document situations.
Chapter 2: Our First Schema
Starting with a simple example (a limited number of elements and attributes, and no namespaces), we will see how a first schema can be derived directly from the document structure, using a catalog of the elements in the document much as we would when writing a DTD for it.
The Instance Document
The instance document, which we use in the first part of this book, is a simple library file describing a book, its author, and its characters:
<?xml version="1.0"?> 
<library>
  <book id="b0836217462" available="true">
    <isbn>
      0836217462
    </isbn>
    <title lang="en">
      Being a Dog Is a Full-Time Job
    </title>
    <author id="CMS">
      <name>
        Charles M Schulz
      </name>
      <born>
        1922-11-26
      </born>
      <dead>
        2000-02-12
      </dead>
    </author>
    <character id="PP">
      <name>
        Peppermint Patty
      </name>
      <born>
        1966-08-22
      </born>
      <qualification>
        bold, brash and tomboyish
      </qualification>
    </character>
    <character id="Snoopy">
      <name>
        Snoopy
      </name>
      <born>
        1950-10-04
      </born>
      <qualification>
        extroverted beagle
      </qualification>
    </character>
    <character id="Schroeder">
      <name>
        Schroeder
      </name>
      <born>
        1951-05-30
      </born>
      <qualification>
        brought classical music to the Peanuts strip
      </qualification>
    </character>
    <character id="Lucy">
      <name>
        Lucy
      </name>
      <born>
        1952-03-03
      </born>
      <qualification>
        bossy, crabby and selfish
      </qualification>
    </character>
  </book>
</library>
Our First Schema
We will see, in the course of this book, that there are many different styles for writing a schema, and there are even more approaches to deriving a schema from an instance document. For our first schema, we will adopt a style that is familiar to those of you who have already worked with DTDs. We'll start by creating a classified list of the elements and attributes found in the instance document.
The elements existing in our instance document are author, book, born, character, dead, isbn, library, name, qualification, and title, and the attributes are available, id, and lang.
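The catalog above can be reproduced mechanically. The following is a minimal sketch in Python (not part of the book's toolset), assuming an abridged copy of the library document with a single author and a single character; the function name catalog is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the library document (one author, one character).
SAMPLE = """<?xml version="1.0"?>
<library>
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
  </book>
</library>"""


def catalog(xml_text):
    """Return the sorted element names and attribute names used in a document."""
    root = ET.fromstring(xml_text)
    elements, attributes = set(), set()
    for node in root.iter():            # the root and every descendant element
        elements.add(node.tag)
        attributes.update(node.attrib)  # adds the attribute names only
    return sorted(elements), sorted(attributes)


elements, attributes = catalog(SAMPLE)
print(elements)
print(attributes)
```

The two sorted lists match the catalog given in the text: ten element names and three attribute names.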
We will build our first schema by defining each element in turn under our schema document element (named, unsurprisingly, schema), which belongs to the W3C XML Schema namespace (http://www.w3.org/2001/XMLSchema) and is usually given the prefix "xs".
Before we start, we need to classify the elements and, for this exercise, give some key definitions for understanding how W3C XML Schema does this classification. (We will see these definitions in more detail in the chapters about simple and complex types.)
The content model characterizes the types of child elements and text nodes that can be included in an element (without paying any attention to the attributes).
The content model is said to be "empty" when neither child elements nor text nodes are expected, "simple" when only text nodes are accepted, "complex" when only child elements are expected, and "mixed" when both text nodes and child elements can be present. Note that to determine the content model, we pay attention only to the element and text nodes and ignore any attribute, comment, or processing instruction that could be included. For instance, an element with some attributes, a comment, and a couple of processing instructions would have an "empty" content model if it has no text or element children.
Elements such as name, born, and title have simple content models:
.../...
        
  <title lang="en">
    Being a Dog Is a Full-Time Job
  </title>
.../...
        
  <name>
    Charles M Schulz
  </name>
        
  <born>
    1922-11-26
  </born>
.../...
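The four content models can be told apart by looking only at element and text children. Here is a small Python sketch of that classification, under two assumptions: whitespace-only text (indentation) is treated as absent, and the note and marker elements are hypothetical additions used to show the "mixed" and "empty" cases.

```python
import xml.etree.ElementTree as ET


def content_model(element):
    """Classify an element as 'empty', 'simple', 'complex', or 'mixed'.

    Attributes are never consulted. Comments and processing instructions
    are dropped by the default ElementTree parser, so they are ignored too.
    """
    has_children = len(element) > 0
    # Text can sit before the first child (text) or after any child (tail).
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "complex"
    if has_text:
        return "simple"
    return "empty"


doc = ET.fromstring(
    '<book id="b1">'
    '<title lang="en">Being a Dog Is a Full-Time Job</title>'
    '<author id="CMS"><name>Charles M Schulz</name></author>'
    '<note>by <name>Schulz</name>, of course</note>'               # hypothetical
    '<marker id="m1"><!-- no text or element children --></marker>'  # hypothetical
    '</book>'
)

models = {child.tag: content_model(child) for child in doc}
print(models)
```

The title element carries an attribute and the marker element carries an attribute and a comment, yet neither affects the result: only text and element children count.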
First Findings
Even in this very simple schema, we have learned a lot about what W3C XML Schema has to offer.
In this example, we defined simple components (elements and attributes in this case, but we will see in the next chapters how to define other kinds of components) that we used to build more complex components. This is one of the key principles that have guided the editors of W3C XML Schema. These editors have borrowed many concepts of object-oriented design to develop complex components.
If we draw a parallel between datatypes and classes, the elements and attributes can be compared to objects. Each of the component definitions that we included in our first schema is similar to an object. Referencing one of these components to build a new element is similar to creating a new object by cloning the already defined component.
In the next chapters, we will see how we can also create the components "in place" (where they are needed) as well as create datatypes from which we can derive elements and attributes the same way we can instantiate a class to create an object.
Note also that W3C XML Schema is pursuing two different levels of validation in this first example: we have defined both rules about the structure of the document and rules about the content of leaf nodes of the document. The W3C Recommendation makes a clear distinction between these two levels by publishing the recommendation in two parts (Part 1: Structures and Part 2: Datatypes), which are relatively independent.
There is also a big difference between simple types, which are about datatyping and constraining the content of leaf nodes in the tree structure of an XML document, and complex types, which are about defining the structure of a document.
Finally, note the flatness of this schema: each component (element or attribute) is defined directly under the
Chapter 3: Giving Some Depth to Our First Schema
Our first schema was very flat, and all its components were defined at the top level. Our second attempt will give it more depth and show how local components may be defined.
Working From the Structure of the Instance Document
For this second schema, we follow a style opposite from the one we used in Chapter 2, and we define all the elements and attributes locally where they appear in the document.
Following the document structure, we will start by defining our document element library. This element was defined in the earlier schema as:
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="book" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
In our new schema, we will keep the same construct and the same structure, but we will replace the reference to the book element with the actual definition of this element:
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="book" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="isbn"/>
            <xs:element ref="title"/> 
            <xs:element ref="author" minOccurs="0"
              maxOccurs="unbounded"/> 
            <xs:element ref="character" minOccurs="0"
              maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute ref="id"/>
          <xs:attribute ref="available"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Because the definition of the book element is contained inside the definition of the library element, other definitions of book elements could be done at other locations in the schema without any risk of confusion—except maybe by human readers.
If all the elements and attributes still referenced in this schema are defined as global, this piece of schema is valid and accurately describes our document. The only differences between the first schema and this intermediary step are that the definition of the book element cannot be reused elsewhere, and that the book element can no longer be a document element.
We can also reiterate the same operation and perform the definitions of all the elements and all the attributes locally:
New Lessons
Although this schema describes the same document as the one in Chapter 2, it illustrates very different aspects of W3C XML Schema.
We have sacrificed the modularity of our first schema to gain the depth and structure of the second one, although the next chapters present two features, xs:complexType and xs:group, that help restore the balance. This trade-off is a general tendency in W3C XML Schema.
In practice, you will probably want to keep a balance between these two opposite styles and allow a certain level of depth under several global elements.
There are two cases, however, in which these two styles are not equivalent. The first is when elements with the same name need to be defined with different contents at different locations. In this case, local element definitions should be used (at least at all locations except one), since global elements are identified by their names.
In our example, the element name appears both within author and character with the same datatype. We may want to define the element name with different content models in author and character, as in this instance document:
<?xml version="1.0"?>
<library>
  <book id="b0836217462" available="true">
    <isbn>
      0836217462
    </isbn>
    <title lang="en">
      Being a Dog Is a Full-Time Job
    </title>
    <author id="CMS">
      <name>
        <first>
          Charles
        </first>
        <middle>
          M.
        </middle>
        <last>
          Schulz
        </last>
      </name>
      <born>
        1922-11-26
      </born>
      <dead>
        2000-02-12
      </dead>
    </author>
    <character id="Snoopy">
      <name>
        Snoopy
      </name>
      <born>
        1950-10-04
      </born>
      <qualification>
        extroverted beagle
      </qualification>
    </character>
  </book>
</library>
Since we can define only one global element named name, we need to define at least one of the name elements locally under its parent.
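To see which element names cannot be covered by a single global definition, one can record, for each name, the kind of content found under each parent. The Python sketch below is an illustration under simplifying assumptions: it uses an abridged copy of the document above, and it distinguishes only "complex" (has element children) from "simple" (has none). The function name shapes_by_parent is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged copy of the instance document shown above.
SAMPLE = """<library>
  <book id="b0836217462" available="true">
    <author id="CMS">
      <name><first>Charles</first><middle>M.</middle><last>Schulz</last></name>
    </author>
    <character id="Snoopy">
      <name>Snoopy</name>
    </character>
  </book>
</library>"""


def shapes_by_parent(root):
    """Map element name -> {parent name: 'complex' or 'simple'}."""
    shapes = defaultdict(dict)
    for parent in root.iter():
        for child in parent:
            kind = "complex" if len(child) else "simple"
            shapes[child.tag][parent.tag] = kind
    return shapes


shapes = shapes_by_parent(ET.fromstring(SAMPLE))

# Names whose content differs by location: one global definition cannot
# describe them, so at least one occurrence needs a local definition.
needs_local = sorted(
    name for name, by_parent in shapes.items()
    if len(set(by_parent.values())) > 1
)
print(dict(shapes["name"]))
print(needs_local)
```

Only name shows up in the result, because it holds child elements under author and plain text under character.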
Chapter 4: Using Predefined Simple Datatypes
W3C XML Schema provides an extensive set of predefined datatypes. W3C XML Schema derives many of these predefined datatypes from a smaller set of "primitive" datatypes that have a specific meaning and semantic and cannot be derived from other types. We will see how we can use these types to define our own datatypes by derivation to meet more specific needs in the next chapter.
Figure 4-1 provides a map of predefined datatypes and the relationships between them.
Figure 4-1: W3C XML Schema type hierarchy
Lexical and Value Spaces
W3C XML Schema introduced a decoupling between the data, as it can be read from the instance documents (the "lexical space"), and the value, as interpreted according to the datatype (the "value space").
Before we can enter into the definition of these two spaces, we must examine the processing model and the transformations endured by a value written in an XML document before it is validated. Element and attribute content proceeds through the following steps during processing:
Serialization space
The series of bytes that is actually stored in a document (either as the value of an attribute or as a text node) may be seen as belonging to a first space, which we may call the "serialization space."
Parsed space
The XML 1.0 Recommendation makes it clear that the serialization space is not directly meaningful to applications, and a first transformation is performed on the value by conforming XML parsers before the value reaches an application: characters are converted into Unicode, and ends of lines (for text nodes and attributes) and whitespaces (only for attributes) are normalized. The result of this transformation is what reaches the applications—including schema processors—and belongs to what we may call the "parsed space."
Lexical space
Before doing any validation, W3C XML Schema performs a second round of whitespace processing on this value reported by the XML parser. This depends on the value's datatype and may either ignore, normalize, or collapse the whitespaces. The value after this whitespace processing belongs to the "lexical space" defined in the W3C XML Schema Recommendation.
Value space
W3C XML Schema considers an item from the lexical space to be a representation of an abstract value whose meaning or semantic is defined by its datatype. That value can be a piece of text, but also a number, a date, or a qualified name. The ensemble of abstract values is defined as the "value space."
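The difference between the lexical space and the value space can be illustrated with a short Python sketch. It is only an analogy: Python's Decimal and date types stand in for the value spaces of a decimal and a date datatype, and the helper name to_lexical is hypothetical.

```python
import re
from datetime import date
from decimal import Decimal


def to_lexical(parsed):
    """Collapse whitespace in a parsed value to obtain its lexical form."""
    return re.sub(r"[ \t\n\r]+", " ", parsed).strip(" ")


# Two parsed values, as an XML parser might report them.
parsed_values = ["\n      1.50\n    ", "1.5"]

lexical = [to_lexical(v) for v in parsed_values]  # two distinct strings
values = [Decimal(v) for v in lexical]            # one and the same number

print(lexical, lexical[0] == lexical[1])
print(values, values[0] == values[1])

# The same chain for a date taken from the library document.
born = date.fromisoformat(to_lexical("\n        1922-11-26\n      "))
print(born.year, born.month, born.day)
```

The two lexical items differ as strings but represent the same abstract value, which is why comparisons between typed values are made in the value space.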
Whitespace Processing
The handling of whitespace characters (tabs, linefeeds, carriage returns, and spaces, which are often used only to "pretty print" XML documents) has always been very controversial. W3C XML Schema has imposed a two-step generic algorithm, which is applied to most of the predefined datatypes (actually, to all of them except two, xs:string and xs:normalizedString).
Whitespace replacement
This is the first step of whitespace processing applied to the parsed value. During whitespace replacement, all occurrences of #x9 (tab), #xA (linefeed), and #xD (carriage return) are replaced with a space (#x20). The number of characters is not changed by this step, which is applied to all the predefined datatypes except xs:string, for which no whitespace replacement is performed on the parsed value.
Whitespace collapse
The second step removes the leading and trailing spaces, and replaces all contiguous occurrences of spaces by a single space character. This is applied to all the predefined datatypes except xs:string, for which no whitespace replacement is performed on the parsed value, and xs:normalizedString, in which whitespaces are only normalized (replaced).
This notion of "normalized string" does not match the XPath function normalize-space( ), which corresponds with what W3C XML Schema calls whitespace collapsing. It is also different from the DOM normalize() method, which is a merge of adjacent text objects.
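The two steps are simple enough to sketch in a few lines of Python. The function names below are hypothetical, and the sketch covers only the string manipulation, not the selection of the step by datatype.

```python
import re


def replace_whitespace(value):
    """Step 1: turn each tab, linefeed, and carriage return into a space."""
    return value.translate({0x9: " ", 0xA: " ", 0xD: " "})


def collapse_whitespace(value):
    """Step 2: after replacement, trim the ends and squeeze runs of spaces."""
    replaced = replace_whitespace(value)
    return re.sub(" +", " ", replaced).strip(" ")


parsed = "\n  Being a Dog\tIs a  Full-Time Job\n"

replaced = replace_whitespace(parsed)
collapsed = collapse_whitespace(parsed)

print(repr(replaced))   # same length as the parsed value
print(repr(collapsed))  # single spaces, no leading or trailing space
```

In terms of the datatypes above, xs:string would keep the parsed value untouched, xs:normalizedString would stop after replace_whitespace, and every other predefined datatype would go through collapse_whitespace.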
String Datatypes
This section discusses datatypes derived from the xs:string primitive datatype as well as other datatypes that have a similar behavior (namely, xs:hexBinary, xs:base64Binary, xs:anyURI, xs:QName, and xs:NOTATION). These types are not expected to carry any quantifiable value (W3C XML Schema doesn't even expect to be able to sort them) and their value space is identical to their lexical space except when explicitly described otherwise. One should note that even though they are grouped in this section because they have a similar behavior, these primitive datatypes are considered quite different by the Recommendation.
The datatypes covered in this section are shown in Figure 4-2.
Figure 4-2: Strings and similar datatypes
The two exceptions in whitespace processing (xs:string and xs:normalizedString) are string datatypes. One of the main differences between these types is the applied whitespace processing. To stress this difference, we will classify these types by their whitespace processing.
xs:string
This string datatype is the only predefined datatype for which no whitespace replacement is performed. As we will see in the next chapter, the whitespace replacement performed on user-defined datatypes derived from this type can be defined without restriction. On the other hand, a user datatype cannot be defined as having no whitespace replacement if it is derived from any predefined datatype other than xs:string.
Numeric Datatypes
The numeric datatypes are built on top of four primitive datatypes: xs:decimal for all the decimal types (including the integer datatypes, considered decimals without a fractional part), xs:float and xs:double for single and double precision floats, and xs:boolean for Booleans. Whitespaces are collapsed for all these datatypes.
The datatypes covered in this section are shown in Figure 4-3.
Figure 4-3: Numeric datatypes
All decimal types are derived from the xs:decimal primitive type and constitute a set of predefined types that address the most common usages.
xs:decimal
This datatype represents the decimal numbers. The number of digits can be arbitrarily long (the datatype doesn't impose any restriction), but obviously, since an XML document has an arbitrary but finite length, the number of digits of the lexical representation of an xs:decimal value needs to be finite. Although the number of digits is not limited, we will see in the next chapter how the author of a schema can derive user-defined datatypes with a limited number of digits if needed.
Leading and trailing zeros are not significant and may be trimmed. The decimal separator is always a dot ("."); a leading sign ("+" or "-") may be specified and any characters other than the 10 digits (including whitespaces) are forbidden. Scientific notation ("E+2") is also forbidden and has been reserved to the float datatypes only.
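These lexical rules can be approximated with a regular expression. The Python sketch below is an illustration rather than a reference implementation: the pattern is an assumption (it also accepts forms such as ".5" with no integer digits), the function names are hypothetical, and the value is expected to have been whitespace-collapsed beforehand.

```python
import re
from decimal import Decimal

# Optional sign, ASCII digits, a dot as the only separator, no exponent.
DECIMAL_LEXICAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_decimal(lexical):
    """Check a whitespace-collapsed string against the sketch pattern."""
    return DECIMAL_LEXICAL.fullmatch(lexical) is not None


def parse_decimal(lexical):
    """Map a lexical representation to a value, or raise ValueError."""
    if not is_decimal(lexical):
        raise ValueError("not a decimal lexical form: %r" % lexical)
    return Decimal(lexical)


for candidate in ["-1.23", "+100000.00", "210", "1E+2", "1 000", "3,14"]:
    print(candidate, is_decimal(candidate))

# Leading and trailing zeros do not change the value.
print(parse_decimal("+100000.00") == parse_decimal("0100000"))
```

Scientific notation, embedded spaces, and a comma as separator are all rejected, while representations that differ only by insignificant zeros map to the same value.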
Valid values for xs:decimal
Date and Time Datatypes
The datatypes covered in this section are shown in Figure 4-4.
Figure 4-4: Date and time datatypes
The W3C Recommendation, "XML Schema Part 2: Datatypes," provides new confirmation of how difficult it is to fix time.
The support for date and time datatypes relies entirely on a subset of the ISO 8601 standard, which is the only format supported by W3C XML Schema. The purpose of ISO 8601 is to eliminate the risk of confusion between the various date and time formats used in different countries. In other words, W3C XML Schema does not support these local date and time formats, and imposes the usage of ISO 8601 for any datatype that has the semantic of a date or time. While this is a good thing for interchange formats, this is more questionable when XML is used to define user interfaces, since we will see that ISO 8601 is not very user friendly. The variations using the names of the months or different orders between year, month, and day are not the only victims of this decision: ISO 8601 imposes the usage of the Gregorian (Christian) calendar to the exclusion of calendars used by other cultures or religions.
ISO 8601 describes several formats to define dates, times, periods, and recurring dates, with different levels of precision and indetermination. After many discussions, W3C XML Schema selected a subset of these formats and created a primitive datatype for each format that is supported.
The indeterminacy allowed in some of these formats adds a lot of difficulty, especially when comparisons or arithmetic are involved. For instance, it is possible to define a point in time without specifying the time zone, which is then considered undetermined. This undetermined time zone is identical all over the document (and between the schema and the instance documents) and it's not an issue to compare two datetimes without a time zone. The problem arises when you need to compare two points in time, one with a time zone and the other without. The result of this comparison will be undetermined if these values are too close, since one of them may be between -13 hours and +12 hours of Coordinated Universal Time (UTC). Thus, the support of these datetime datatypes introduces a notion of "partial order relation."
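A partial order can be sketched in Python as a comparison that is allowed to answer "undetermined." This is an illustration only: the function name compare is hypothetical, and the width of the uncertainty window (14 hours either way by default) is an assumption exposed as a parameter so that a different bound can be supplied.

```python
from datetime import datetime, timedelta, timezone


def compare(a, b, window=timedelta(hours=14)):
    """Return -1, 0, or 1 when the order is known, None when undetermined.

    A datetime without a time zone is treated as a point that may lie
    anywhere within `window` of its face value read as UTC (assumption).
    """
    a_naive = a.tzinfo is None
    b_naive = b.tzinfo is None
    if a_naive == b_naive:
        # Both with or both without a time zone: an ordinary total order.
        return (a > b) - (a < b)
    if a_naive:
        # Swap so that the first argument is the one with a time zone.
        swapped = compare(b, a, window)
        return None if swapped is None else -swapped
    a_utc = a.astimezone(timezone.utc).replace(tzinfo=None)
    if a_utc < b - window:
        return -1
    if a_utc > b + window:
        return 1
    return None  # too close to decide


with_zone = datetime.fromisoformat("2002-06-01T12:00:00+00:00")
near = datetime.fromisoformat("2002-06-01T15:00:00")  # no time zone
far = datetime.fromisoformat("2002-06-03T12:00:00")   # no time zone

print(compare(with_zone, near))
print(compare(with_zone, far))
print(compare(near, far))
```

The first comparison is undetermined because the two values are only three hours apart; the other two are decidable.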
List Types
These datatypes are lists of whitespace-separated items. The type of these items (called the item type) is defined during the derivation process (which we will see in the next chapter), and list datatypes can be derived from any simple type. Three predefined datatypes are lists (xs:NMTOKENS, xs:IDREFS, and xs:ENTITIES). For all the list datatypes, the items must be separated by one or more whitespace characters.
xs:NMTOKENS
This is a whitespace-separated list of xs:NMTOKEN. Each item of the list must be in the lexical space of xs:NMTOKEN.
xs:IDREFS
This is a whitespace-separated list of xs:IDREF. Each item of the list must be in the lexical space of xs:IDREF and must reference an existing xs:ID in the same document.
xs:ENTITIES
This is a whitespace-separated list of xs:ENTITY. Each item of the list must be in the lexical space of xs:ENTITY and must match an unparsed entity defined in a DTD.
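As a minimal sketch of how two of these list datatypes might be used, the fragment below declares attributes of type xs:IDREFS and xs:NMTOKENS and shows matching instance elements. The attribute names seeAlso and keywords and the identifiers are hypothetical and are not part of the library example.

```xml
<!-- Schema fragment: hypothetical attribute declarations -->
<xs:attribute name="id" type="xs:ID"/>
<xs:attribute name="seeAlso" type="xs:IDREFS"/>
<xs:attribute name="keywords" type="xs:NMTOKENS"/>

<!-- Instance fragment: each item of seeAlso must match an id in the same document -->
<book id="b1" seeAlso="b2 b3" keywords="comics peanuts 1950s"/>
<book id="b2"/>
<book id="b3"/>
```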
What About anySimpleType?
We have now covered all the predefined datatypes except one, which is an atypical type: anySimpleType. This datatype is a kind of wildcard: as expected, it accepts any value and adds no constraint on the lexical space.
anySimpleType has two other characteristics that make it unique among simple types: user simple types cannot be derived from it, and its properties and canonical form are not defined in the Recommendation! These characteristics make it a type that should be avoided, except when the rules of a derivation (which we will see in the next chapter) require its usage.
Back to Our Library
If we look back with a critical eye at our library, we see we used the following simple datatypes:
<xs:element name="name" type="xs:string"/>
<xs:element name="qualification" type="xs:string"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
<xs:element name="isbn" type="xs:string"/>
<xs:attribute name="id" type="xs:ID"/>
<xs:attribute name="available" type="xs:boolean"/>
<xs:attribute name="lang" type="xs:language"/>
We are lucky that the elements born and dead are ISO 8601 dates. The ISBN number is composed of numeric digits and a final character, which can be either a digit or the letter "X", and is therefore represented as a string. We also did a good job with the datatypes for the id, available, and lang attributes, but the choice of xs:string for the elements name and qualification is more controversial. They appear in the instance document as:
<name>
  Charles M Schulz
</name>
                      .../...

<qualification>
  bold, brash and tomboyish
</qualification>
This formatting suggests that whitespace is probably not significant and should be collapsed. This can be done by choosing the datatype xs:token instead of xs:string; the same applies to the title element, whose simple content is derived from xs:string and would be better derived from xs:token. This change has no impact on validation with our schema, but the document is more precisely described, and future derivations are more easily built on xs:token than on xs:string. The other datatype that could have been chosen better is isbn, which can be represented as xs:NMTOKEN. The new schema would then be:
<?xml version="1.0"?> 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:token"/>
  <xs:element name="qualification" type="xs:token"/>
  <xs:element name="born" type="xs:date"/>
  <xs:element name="dead" type="xs:date"/>
  <xs:element name="isbn" type="xs:NMTOKEN"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="xs:language"/>
  <xs:element name="title">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:token">
          <xs:attribute ref="lang"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="author">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="dead" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="isbn"/>
        <xs:element ref="title"/>
        <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/> 
        <xs:element ref="character" minOccurs="0"
          maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
      <xs:attribute ref="available"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="character">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="qualification"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
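To make the effect of these choices concrete, here is a sketch of an instance document that should be valid against the schema above. The identifiers and values are illustrative assumptions. Because name and qualification are now xs:token, the leading, trailing, and repeated whitespace in their content is collapsed before the value is handed to an application.

```xml
<?xml version="1.0"?>
<!-- Illustrative instance; ids and values are assumptions -->
<library>
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>
        Charles M Schulz
      </name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>
        bold, brash and tomboyish
      </qualification>
    </character>
  </book>
</library>
```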

Chapter 5: Creating Simple Datatypes
So far, we have used only predefined datatypes. In this chapter, we will see how to create new simple types, taking advantage of the different derivation mechanisms and facets of derivation by restriction.
W3C XML Schema has defined three independent and complementary mechanisms for defining our own custom datatypes, using existing datatypes as starting points. The process of building new user datatypes on existing predefined datatypes or on other user datatypes is called "derivation."
The three derivation methods are derivation by restriction (where constraints are added on a datatype without changing its original semantic or meaning), derivation by list (where new datatypes are defined as being lists of values belonging to a datatype and take the semantic of list datatypes), and derivation by union (where new datatypes are defined as allowing values from a set of other datatypes and lose most of their semantic).
As with xs:complexType definitions (which we saw in our Russian doll design), xs:simpleType definitions can be either named or anonymous. Despite this similarity, simple and complex types are very different. A simple type is a restriction on the value of an element or an attribute (i.e., a constraint on the content of a set of documents), while a complex type is a definition of a content model (i.e., a constraint on the markup). This is why the derivation methods for simple and complex types are very different, even though W3C XML Schema uses the same element name (xs:restriction) for both. This is a common source of confusion.
These derivation methods are flexible and powerful. However, the fact that W3C XML Schema needs many different primitive datatypes can be seen as proof that they are not sufficient to create a new primitive datatype. The reason is that the derivation methods act only on the value space or on the lexical space (as defined in Chapter 4); they cannot modify the relations between these two spaces, nor create new value or lexical spaces. This subject has been debated by the W3C XML Schema Working Group, which has not reached an agreement on ways to define an abstract datatype system that would allow definition of several lexical representations. The most obvious consequence of this decision is that, despite the protestation from the W3C I18N Working Group,
Derivation By Restriction
Restriction is probably the most commonly used and natural derivation method. Datatypes are created by restriction by adding new constraints to the possible values. W3C XML Schema itself uses derivation by restriction to define most of the derived predefined datatypes, such as xs:positiveInteger, which is a derivation by restriction of xs:integer. The restrictions can be defined along different aspects or axes that W3C XML Schema calls "facets."
A derivation by restriction is done using an xs:restriction element, and each facet is defined using a specific element embedded in the xs:restriction element. The datatype on which the restriction is applied is called the base datatype; it can be referenced through a base attribute or defined inline within the xs:restriction element:
<xs:simpleType name="myInteger">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="-2"/>
    <xs:maxExclusive value="5"/>
  </xs:restriction>
</xs:simpleType>
It can also be defined in two steps using an embedded anonymous xs:simpleType definition:
<xs:simpleType name="myInteger">
  <xs:restriction>
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:maxExclusive value="5"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:minInclusive value="-2"/>
  </xs:restriction>
</xs:simpleType>
The xs:minInclusive and xs:maxExclusive elements are two facets that can be applied to an integer datatype. As can be guessed from their names, they specify the minimum inclusive (i.e., that can be reached) and maximum exclusive (i.e., that is not allowed) values. We will introduce the list of facets in the next section. Depending on the facet, each acts directly either on the value space or on the lexical space of the datatype, and the same facet may have different effects depending on the datatype on which it is applied.
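As a quick check of what myInteger accepts, the sketch below declares a hypothetical score element of that type and lists a few candidate values. The element name is an assumption made for illustration only.

```xml
<!-- Hypothetical element declaration using the myInteger type defined above -->
<xs:element name="score" type="myInteger"/>

<!-- Candidate instance values -->
<score>-2</score>   <!-- valid: the minimum is inclusive -->
<score>4</score>    <!-- valid: largest integer below the exclusive maximum -->
<score>5</score>    <!-- invalid: the maximum is exclusive -->
<score>-3</score>   <!-- invalid: below the minimum -->
<score>4.5</score>  <!-- invalid: not in the lexical space of xs:integer -->
```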
Derivation By List
Derivation by list is the mechanism by which a list datatype can be derived from an atomic datatype. All the items in the list need to have the same datatype.
List datatypes are special cases in which a structure is defined within the content of a single attribute or element. This practice is usually discouraged, since applications do not have access to the atomic values through the current XML APIs, XPath expressions, or the Infoset. This situation might change in the future, since these datatypes should be adopted by XPath 2.0, which will likely provide some kind of mechanism to access the items within these lists.
This feature appears to have been introduced to maintain compatibility with SGML and XML DTD IDREFS, but W3C XML Schema has been cautious and doesn't allow definition of the list separator or complex lists with complex types or heterogeneous members. Among the constructs that can be seen in some XML vocabularies and cannot be described by XML Schema (except by using regular expressions as a partial workaround) are comma-separated lists of values, and lists with heterogeneous members, such as values with units:
<commaSeparated>
  1, 2, 25
</commaSeparated>
  
<valueWithUnit>
  10 em
</valueWithUnit>
Whitespace-separated lists and split XML elements or attributes are preferred:
<commaSeparated>
  1 2 25
</commaSeparated>
  
<valueWithUnit unit="em">
  10
</valueWithUnit>
              
<valueWithUnit>
  10em
</valueWithUnit>
IDREFS, ENTITIES, and NMTOKENS are predefined list datatypes that are derived from atomic types using this method.
As we have seen with these three datatypes, all the list datatypes that can be defined must be whitespace-separated. No other separator is accepted.
With this restriction, defining a list is very simple, and W3C XML Schema has defined two syntaxes. Both use an xs:list element, which either refers to an existing type or embeds a type definition (these two syntaxes cannot be mixed).
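The two syntaxes can be sketched as follows. The type names integerList and smallIntegerList are hypothetical; the first definition refers to an existing type through the itemType attribute, and the second embeds an anonymous item type.

```xml
<!-- By reference: items are plain xs:integer values -->
<xs:simpleType name="integerList">
  <xs:list itemType="xs:integer"/>
</xs:simpleType>

<!-- By embedding: the item type is defined inline -->
<xs:simpleType name="smallIntegerList">
  <xs:list>
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:maxInclusive value="100"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:list>
</xs:simpleType>
```

An element of type smallIntegerList would accept content such as "1 2 25" but reject "1 2 250", since every item must belong to the embedded item type.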
Derivation By Union
Derivation by union allows defining datatypes by merging the lexical spaces of several predefined or user datatypes.
As we've seen with the derivation by list, W3C XML Schema has defined two syntaxes, both using an xs:union element, allowing a definition by reference to existing types or by embedding type definitions (these two syntaxes can be mixed). The definition of a union datatype by reference to existing types is done through a memberTypes attribute containing a whitespace-separated list of datatypes:
<xs:simpleType name="integerOrDate">
  <xs:union memberTypes="xs:integer xs:date"/>
</xs:simpleType>
The definition of a union datatype can also be done by embedding one or more <xs:simpleType> elements:
<xs:simpleType name="myIntegerUnion">
  <xs:union>
    <xs:simpleType>
      <xs:restriction base="xs:integer"/>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:NMTOKEN">
        <xs:enumeration value="undefined"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>
Both styles can be mixed and the previous example can be written as:
<xs:simpleType name="myIntegerUnion">
  <xs:union memberTypes="xs:integer">
    <xs:simpleType>
      <xs:restriction base="xs:NMTOKEN">
        <xs:enumeration value="undefined"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>
The resulting datatype is a merge that, as a whole, has lost the semantic meaning, and the facets, of the member types. In the earlier example, we couldn't constrain the myIntegerUnion type to be either less than 100 or undefined except by defining a pattern. To achieve this, we can first derive by restriction from a built-in type a type whose values are less than 100, and then perform the union that allows the value "undefined". The only two facets that can be applied to a union datatype are xs:pattern and xs:enumeration.
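A minimal sketch of this "restrict first, then union" approach could look like the following. The type name integerBelow100OrUndefined is hypothetical.

```xml
<xs:simpleType name="integerBelow100OrUndefined">
  <xs:union>
    <!-- First member: the integer facet is applied before the union -->
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:maxExclusive value="100"/>
      </xs:restriction>
    </xs:simpleType>
    <!-- Second member: the single token "undefined" -->
    <xs:simpleType>
      <xs:restriction base="xs:NMTOKEN">
        <xs:enumeration value="undefined"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>
```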
Some Oddities of Simple Types
While simple types are structurally simple, they still have some complications worth watching for.
The order of the different derivation methods (restriction, list, or union) is significant.
We have already seen that derivation by list and union lose the semantic meaning of the types and their facets, which are replaced by a common set of facets with their own meaning (xs:length, xs:maxLength, xs:minLength, xs:pattern, xs:enumeration, and xs:whiteSpace for derivation by list, and xs:pattern and xs:enumeration for derivation by union).
This means that all the restrictions on the atomic or member types must be done before the derivation by list or union (as we have seen in the corresponding sections for the facets) and that a new restriction can then be performed using the common set of facets.
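The ordering can be sketched in three steps: restrict the item type, derive the list, then restrict the list with the list facets. The type names percentList and threePercents are hypothetical.

```xml
<!-- Steps 1 and 2: constrain the items, then derive the list -->
<xs:simpleType name="percentList">
  <xs:list>
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="100"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:list>
</xs:simpleType>

<!-- Step 3: restrict the list itself; xs:length now counts list items -->
<xs:simpleType name="threePercents">
  <xs:restriction base="percentList">
    <xs:length value="3"/>
  </xs:restriction>
</xs:simpleType>
```

Here "10 20 70" would be accepted, while "10 20" (wrong number of items) and "10 20 170" (item out of range) would be rejected.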
The order between derivation by list and derivation by union depends on the result to achieve, as a list of unions is different from a union of lists, as one might expect:
<xs:simpleType name="ListOfUnions">
  <xs:list>
    <xs:simpleType>
      <xs:union memberTypes="xs:date xs:integer"/>
    </xs:simpleType>
  </xs:list>
</xs:simpleType>

<xs:simpleType name="UnionOfLists">
  <xs:union>
    <xs:simpleType>
      <xs:list itemType="xs:date"/>
    </xs:simpleType>
    <xs:simpleType>
      <xs:list itemType="xs:integer"/>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>
These two datatypes match the following:
<UnionOfLists>
  2001-01-01 2001-01-02
</UnionOfLists>

<UnionOfLists>
  1 2 3
</UnionOfLists>

<ListOfUnions>
  2001-01-01 2001-01-02
</ListOfUnions>

<ListOfUnions>
  1 2 3
</ListOfUnions>

<ListOfUnions>
  2001-01-01 1 2
</ListOfUnions>
But they don't match:
<UnionOfLists>
  2001-01-01 1 2
</UnionOfLists>
Back to Our Library
Let's see how we can improve our schema by adding constraints on our datatypes with what we have learned in this chapter:
<xs:element name="name" type="xs:token"/>
<xs:element name="qualification" type="xs:token"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
<xs:element name="isbn" type="xs:NMTOKEN"/>
<xs:attribute name="id" type="xs:ID"/>
<xs:attribute name="available" type="xs:boolean"/>
<xs:attribute name="lang" type="xs:language"/>
First, we may want to limit the size of our strings, for instance if they must be stored in fixed-length columns in an RDBMS. Here, we will consider that the name needs to fit in a string of 32 characters, and the title and qualification need to fit in strings of 255 characters. We create two simple datatypes for this:
<xs:simpleType name="string255">
  <xs:restriction base="xs:token">
    <xs:maxLength value="255"/>
  </xs:restriction>
</xs:simpleType>
      
<xs:simpleType name="string32">
  <xs:restriction base="xs:token">
    <xs:maxLength value="32"/>
  </xs:restriction>
</xs:simpleType>
Then, we may want to add some constraints on the ISBN number. The best we can do without using patterns (we will see how to do this in the next chapter) is to limit the number of characters to 10 using xs:length. This facet counts characters and acts on the value space. It therefore does not eliminate instances such as ABCDEFGHIJ, but this is probably the best we can do for the moment:
<xs:simpleType name="isbn">
  <xs:restriction base="xs:NMTOKEN">
    <xs:length value="10"/>
  </xs:restriction>
</xs:simpleType>
We may finally want to limit the languages in which the title may be written. If our library only has titles in English and Spanish, we can add the following restriction:
<xs:simpleType name="supportedLanguages">
  <xs:restriction base="xs:language">
    <xs:enumeration value="en"/>
    <xs:enumeration value="es"/>
  </xs:restriction>
</xs:simpleType>
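One way these new datatypes might be plugged back into the library schema is sketched below. This is an assumption about how the declarations would be updated, not the author's listing; only the affected declarations are shown.

```xml
<!-- Declarations updated to use the user-defined types -->
<xs:element name="name" type="string32"/>
<xs:element name="qualification" type="string255"/>
<xs:element name="isbn" type="isbn"/>
<xs:attribute name="lang" type="supportedLanguages"/>

<!-- title keeps its lang attribute but now extends string255 -->
<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="string255">
        <xs:attribute ref="lang"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>
```

Giving the isbn element and the isbn type the same name is allowed, since element and type names live in separate symbol spaces.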
Chapter 6: Using Regular Expressions to Specify Simple Datatypes
Among the different facets available to restrict the lexical space of simple datatypes, the most flexible (and also the one that we will often use as a last resort when all the other facets are unable to express the restriction on a user-defined datatype) is based on regular expressions.
The Swiss Army Knife
Patterns (and regular expressions in general) are like a Swiss army knife when constraining simple datatypes. They are highly flexible, can compensate for many of the limitations of the other facets, and are often used to define user datatypes on various formats such as ISBN numbers, telephone numbers, or custom date formats. However, like a Swiss army knife, patterns have their own limitations.
Multirange datatypes (such as integers between -1 and 5 or 10 and 15) can be defined as a union of datatypes meeting the different ranges (in this case, we could perform a union between a datatype accepting integers between -1 and 5 and a second datatype accepting integers between 10 and 15); however, after the union, the resulting datatype loses its semantic of integer and cannot be constrained using integer facets any longer. Using patterns to define multirange datatypes is therefore an option: although less readable than using a union, it preserves the semantic of the base type.
Cutting a tree with a Swiss army knife is long, tiring, and dangerous. Writing regular expressions may also become long, tiring, and dangerous when the number of combinations grows. One should try to keep them as simple as possible.
A Swiss army knife cannot change lead into gold, and no facet can change the primary type of a simple datatype. A string datatype restricted to match a custom date format will still retain the properties of a string and will never acquire the facets of a datetime datatype. This means that there is no effective way to express localized date formats.
In their simplest form, patterns may be used as enumerations applied to the lexical space rather than to the value space.
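The multirange case mentioned above can be sketched with a pattern as follows. The type names are hypothetical. Since the pattern acts on the lexical space, this sketch also rejects otherwise legal integer forms such as "+3" or "05", which is a trade-off to keep in mind.

```xml
<!-- Integers from -1 to 5 or from 10 to 15, still an xs:integer -->
<xs:simpleType name="multiRangeInteger">
  <xs:restriction base="xs:integer">
    <xs:pattern value="-1|[0-5]|1[0-5]"/>
  </xs:restriction>
</xs:simpleType>

<!-- Because the integer semantic is preserved, integer facets still apply -->
<xs:simpleType name="multiRangeIntegerUpTo12">
  <xs:restriction base="multiRangeInteger">
    <xs:maxInclusive value="12"/>
  </xs:restriction>
</xs:simpleType>
```

The second definition would not be possible on a union of the two ranges, which only accepts the xs:pattern and xs:enumeration facets.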