Chapter 4. XML Schema in XForms

“Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information on it.”

Samuel Johnson

Forms and datatypes always seem to be mentioned together. It’s natural to think of data entry in terms of specific types, such as date or phone number. Despite a feint in the opposite direction taken by earlier drafts, XForms incorporates the datatypes defined in W3C XML Schema. This chapter discusses these datatypes and the general framework for defining custom ones.

Wide Open (Value) Spaces

In describing a datatype, XML Schema distinguishes between a lexical space, or the data as it appears in XML, and a value space, or the data as it exists on an abstract level. In practice, many datatypes have a one-to-one mapping between the lexical space and the value space, so the distinction can seem a little academic. It is important, however, when there are equivalent representations for some value. For instance, the boolean datatype can represent true as either 1 or true (and false as either 0 or false). Even though each value has more than one lexical representation, 1 and true both map to the underlying concept of trueness, and 0 and false both map to falseness. This matters when comparing values, because the value space is used as the basis for comparison.
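As a small illustration of the boolean case, the instance data below uses two different lexical forms. The element names are invented for this sketch, and it assumes both children are declared with the xs:boolean datatype.

```xml
<!-- Hypothetical instance data; element names are illustrative only.
     Assuming both children are typed as xs:boolean, "true" and "1"
     are different lexical forms of the same value. -->
<preferences>
  <newsletter>true</newsletter>
  <reminders>1</reminders>
</preferences>
```

Under that assumption, a comparison carried out in the value space would treat the two elements as holding the same value.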

Many observers have pointed out that the lexical representations of some XML Schema datatypes aren’t very user friendly. As an example, the duration of a day and an hour is P1DT1H. From the perspective of the person filling out a form, this is complete gibberish. To work around this, XForms gives responsibility to individual form controls to present data to the user in a manner that’s convenient to the intended audience. Thus, XForms introduces (but doesn’t specifically name) a third space, the user space. For the benefit of users, this might not be a straightforward mapping—the form control can have great latitude in rearranging things, such as a graphical calendar control to enter durations and dates.


XML Schema uses a divide-and-conquer technique to define datatypes. Each datatype can be broken down into a number of facets, each of which constrains some particular part of the allowed value space for that datatype. (One important exception is the pattern facet, which works on the lexical space.)

It’s possible to take an existing datatype and trim it down to exactly meet your needs. This is called derivation by restriction, and entails changing one or more facets in the datatype. For example, the following XML Schema fragment limits the length of a string to 50 characters:

<xs:simpleType name="myString50">
  <xs:restriction base="xs:string">
    <xs:maxLength value="50"/>
  </xs:restriction>
</xs:simpleType>

This creates a new datatype named myString50, which can then be used in a form to limit the number of characters that can be entered. Other facets can similarly be restricted, as shown in the examples later in this chapter. The list of facets is as follows.


enumeration
Specifies a list of possible values.

fractionDigits
Specifies a number of digits after the decimal point.

length
Specifies an exact length in characters, or bytes for binary datatypes.

maxExclusive
Specifies a maximum value that cannot be reached.

maxInclusive
Specifies a maximum value that can be reached.

maxLength
Specifies a maximum number of characters, or bytes for binary datatypes.

minExclusive
Specifies a minimum value that cannot be reached.

minInclusive
Specifies a minimum value that can be reached.

minLength
Specifies a minimum number of characters, or bytes for binary datatypes.

pattern
Specifies a regular expression against the lexical space.

totalDigits
Specifies the total number of significant digits.

whiteSpace
Specifies how to handle whitespace.
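The following sketch shows how a few of these facets might be combined in derivations by restriction. The type names (shirtSize, rating, price) and the chosen limits are assumptions made up for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- enumeration: only the listed values are allowed -->
  <xs:simpleType name="shirtSize">
    <xs:restriction base="xs:string">
      <xs:enumeration value="S"/>
      <xs:enumeration value="M"/>
      <xs:enumeration value="L"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- minInclusive and maxInclusive: both bounds can be reached (1 to 10) -->
  <xs:simpleType name="rating">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="10"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- totalDigits and fractionDigits: at most 6 digits overall, at most 2
       after the decimal point; minExclusive: the value must be above 0 -->
  <xs:simpleType name="price">
    <xs:restriction base="xs:decimal">
      <xs:totalDigits value="6"/>
      <xs:fractionDigits value="2"/>
      <xs:minExclusive value="0"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

In a form, a type such as rating would typically be attached to instance data through the type property of an XForms bind element.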

Another kind of derivation is by list. This simply takes a simple datatype and produces a whitespace-separated list datatype. XForms includes a ready-made list datatype called listItems. Another variation is derivation by union, which can combine the value spaces of two separate datatypes. One final variation on derivation is by extension, which is used only in complexTypes, which are discussed later in this chapter.
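Derivation by list and by union can be sketched as follows. The type names integerList, unknownToken, and dateOrUnknown are hypothetical, chosen only for this example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Derivation by list: whitespace-separated integers, such as "3 5 8" -->
  <xs:simpleType name="integerList">
    <xs:list itemType="xs:integer"/>
  </xs:simpleType>

  <!-- Helper type that allows only the literal string "unknown" -->
  <xs:simpleType name="unknownToken">
    <xs:restriction base="xs:string">
      <xs:enumeration value="unknown"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Derivation by union: either a date or the string "unknown" -->
  <xs:simpleType name="dateOrUnknown">
    <xs:union memberTypes="xs:date unknownToken"/>
  </xs:simpleType>

</xs:schema>
```

A union like this can be handy in forms where a field is normally a date but must also accept an explicit placeholder.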

Regular expressions

One of the most useful facet-based restrictions in forms is pattern, which takes a regular expression in a syntax adjusted for Unicode compatibility. Entire books have been written on regular expressions, so this section covers only the basics. For further information, a good source is Chapter 6 of Eric van der Vlist’s XML Schema (O’Reilly).

When a regular expression contains letters or digits, the characters must appear in the entered data, as shown in Table 4-1.

Table 4-1. Simple regular expressions

Regular expression    Matches    Doesn’t match
hi                    “hi”       Any string other than “hi”

Oftentimes, you might know the format of a string but not the exact contents. For instance, a telephone number might be of the format 123-4567. To handle this, you can use escape sequences, which represent certain character types. Regular expressions support the escape sequences shown in Table 4-2.

Table 4-2. Escape sequences (case matters)

Escape sequence              Matches
. (dot)                      Any Unicode character except newline
\w                           Any word character
\W                           Any non-word character
\d                           Any digit character
\D                           Any non-digit character
\s                           Any whitespace character
\S                           Any non-whitespace character
[abc]                        Any character in the list abc
[a-z]                        Any character between a and z in Unicode order
[^abc]                       Any character not in the list abc
\p{UnicodeCharacterClass}    Any character that is part of UnicodeCharacterClass


The escape sequence \d matches more than just 0-9. It also matches many other characters that Unicode considers digits, such as U+0A66 (Gurmukhi Digit Zero). Although longer, a pattern of [0-9] will cause fewer surprises in some cases.
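As a minimal sketch of the difference, the hypothetical type below describes a telephone number in the 123-4567 format using [0-9] rather than \d. The type name phone7 is an assumption for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="phone7">
    <xs:restriction base="xs:string">
      <!-- [0-9] accepts only the ASCII digits; writing \d in each position
           would also accept digits from other scripts, such as U+0A66 -->
      <xs:pattern value="[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

Whether the wider \d or the narrower [0-9] is appropriate depends on the audience the form is meant to serve.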

Regular expressions can also make use of the character classes shown in Table 4-3.

Table 4-3. Character classes

[Table contents not recoverable. Surviving entries: the column heading “Doesn’t match”, the sample string “A 3”, and \p{Lu} (Unicode upper case letters).]

Using these, more complicated patterns are possible.
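For instance, a Unicode character class can be combined with a bracketed range. The sketch below assumes a hypothetical part-number format of one uppercase letter followed by four digits; the type name partNumber is invented for the example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="partNumber">
    <xs:restriction base="xs:string">
      <!-- \p{Lu} is any Unicode uppercase letter, so "X1234" is accepted
           and so is "É1234"; replace it with [A-Z] to allow only A to Z -->
      <xs:pattern value="\p{Lu}[0-9][0-9][0-9][0-9]"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```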

Since it quickly becomes tedious to repeat an escape sequence (e.g., representing a telephone number with \d\d\d-\d\d\d\d), regular expressions allow for partial matches, sequences, and repeat counts, as shown in Table 4-4.

Table 4-4. Quantifiers

Quantifier           Meaning
? (question mark)    Repeat zero times or once (optional)
+                    Repeat one or more times
*                    Repeat zero or more times
{n}                  Repeat exactly n times
{n,m}                Repeat between n and m times
{n,}                 Repeat n or more times

Using quantifiers, more complicated types of expressions are possible. Also, parentheses can be used for grouping, and the vertical bar (|) to express two possible branches, either one of which can satisfy the expression, as shown in Table 4-5.

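Table 4-5. Regular expressions with quantifiers

[Table contents only partly recoverable. Surviving entries: the column heading “Doesn’t match” and the sample strings “c c”, “90210” or “90210-1241”, “123” or “dbca”, and “1234” or “cba”.]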
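Quantifiers, grouping, and branches can be sketched in a pair of hypothetical types. The names usZipCode and yesNo are assumptions, and the first pattern is simply one expression that accepts strings such as “90210” and “90210-1241”, not necessarily the one the table used.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- [0-9]{5} requires exactly five digits; the parenthesized group
       followed by ? makes the hyphen and four more digits optional -->
  <xs:simpleType name="usZipCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- The vertical bar gives two branches: the whole value must be
       either "yes" or "no" -->
  <xs:simpleType name="yesNo">
    <xs:restriction base="xs:string">
      <xs:pattern value="yes|no"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

An XML Schema pattern must match the entire value, so no anchors are needed at the start or end of the expression.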

The final thing to remember is that characters with a special meaning in regular expressions need to be escaped when used literally. These characters, in their escaped form, are \\; \|; \.; \-; \^; \?; \*; \+; \{; \}; \(; \); \[; and \].
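A short sketch of escaping in practice follows. The type names and formats are hypothetical: the first expects a value such as “(555) 123-4567” and the second a value such as “19.99”.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- \( and \) are literal parentheses; unescaped, they would start
       and end a group -->
  <xs:simpleType name="phoneWithAreaCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="\([0-9]{3}\) [0-9]{3}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- \. is a literal period; an unescaped dot would match almost
       any character -->
  <xs:simpleType name="amount">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]+\.[0-9]{2}"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```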

Table 4-6 provides a few ready-to-use regular expressions, suitable for copy-and-paste.

Table 4-6. Regular expressions: complete examples




Matches an international phone number, such as “+12 1234 123456”


Matches a 7-digit phone number, such as “123-4567”


Matches a 7-digit phone number, with an optional 2-6 digit extension


Matches a US Social Security number


Simplistic email address check


Matches part numbers formatted like “X1234”


Matches one or more space-separated uppercase words
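As a rough sketch, patterns along the following lines would fit several of the descriptions in Table 4-6. These are illustrative expressions written for this example, not the table's original entries, and the type names and the " x" extension separator are assumptions.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- 7-digit phone number such as "123-4567", with an optional
       2 to 6 digit extension written as " x89" (separator is assumed) -->
  <xs:simpleType name="phoneWithExtension">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{3}-[0-9]{4}( x[0-9]{2,6})?"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- US Social Security number in the form 123-45-6789 -->
  <xs:simpleType name="ssn">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Part number formatted like "X1234" -->
  <xs:simpleType name="partNo">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z][0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- One or more uppercase words separated by single spaces -->
  <xs:simpleType name="upperWords">
    <xs:restriction base="xs:string">
      <xs:pattern value="\p{Lu}+( \p{Lu}+)*"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

Each of these checks format only; for example, the ssn pattern does not verify that a number has actually been issued.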
