Chapter 4. XML Schema in XForms

“Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information on it.”

Samuel Johnson

Forms and datatypes always seem to be mentioned together. It’s natural to think of data entry in terms of specific types, such as date or phone number. Despite a feint in the opposite direction taken by earlier drafts, XForms incorporates the datatypes defined in W3C XML Schema. This chapter discusses these datatypes and presents the general framework for defining custom datatypes.

Wide Open (Value) Spaces

In describing a datatype, XML Schema distinguishes between a lexical space, or the data as it appears in XML, and a value space, or the data as it exists on an abstract level. In practice, many datatypes have a one-to-one mapping between the lexical space and the value space, so the distinction can seem a little academic. It is important, however, when there are equivalent representations for some value. For instance, the boolean datatype can represent true as either 1 or true (and false as either 0 or false). Even though there are multiple possible representations, each pair maps to a single underlying value, true or false, respectively. This is important when comparing values; the value space is used as the basis for comparison.
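As a small illustration, consider the following instance data. The element names are hypothetical, and the sketch assumes that both elements are declared as xs:boolean. The two elements differ in the lexical space but are equal in the value space:

```xml
<!-- Hypothetical instance data; the element names are illustrative only. -->
<settings>
  <!-- Different lexical forms, same xs:boolean value (true) -->
  <subscribe>true</subscribe>
  <notify>1</notify>
</settings>
```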

Many observers have pointed out that the lexical representations of some XML Schema datatypes aren’t very user friendly. As an example, the duration of a day and an hour is P1DT1H. From the perspective of the person filling out a form, this is complete gibberish. To work around this, XForms gives responsibility to individual form controls to present data to the user in a manner that’s convenient to the intended audience. Thus, XForms introduces (but doesn’t specifically name) a third space, the user space. For the benefit of users, this might not be a straightforward mapping—the form control can have great latitude in rearranging things, such as a graphical calendar control to enter durations and dates.

Derivation

XML Schema uses a divide-and-conquer technique to define datatypes. Each datatype can be broken down into a number of facets, each of which constrains some particular part of the allowed value space for that datatype. (One important exception is the pattern facet, which works on the lexical space.)

It’s possible to take an existing datatype and trim it down to exactly meet your needs. This is called derivation by restriction, and entails changing one or more facets in the datatype. For example, the following XML Schema fragment limits the length of a string to 50 characters:

<xs:simpleType name="myString50">
  <xs:restriction base="xs:string">
    <xs:maxLength value="50"/>
  </xs:restriction>
</xs:simpleType>

This creates a new datatype named myString50, which can then be used in a form to limit the number of characters that can be entered. Other facets can similarly be restricted, as shown in the examples later in this chapter. The list of facets is as follows.
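One way to attach such a datatype to form data is through the type attribute of an XForms bind element. The following is only a sketch: the file name my-types.xsd, the namespace http://example.com/types (assumed to be the target namespace of the schema that defines myString50), and the instance element names are all assumptions made for illustration.

```xml
<!-- Hypothetical form fragment: the schema file name, the "my" namespace URI,
     and the instance element names are illustrative assumptions. -->
<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms"
              xmlns:my="http://example.com/types"
              schema="my-types.xsd">
  <xforms:instance>
    <order xmlns="">
      <comment/>
    </order>
  </xforms:instance>
  <!-- Associate the comment node with the restricted string type -->
  <xforms:bind nodeset="/order/comment" type="my:myString50"/>
</xforms:model>
```

With a binding like this, a comment longer than 50 characters would fail validation against myString50.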

enumeration

Specifies a list of possible values.

fractionDigits

Specifies the maximum number of digits after the decimal point.

length

Specifies an exact length in characters, or bytes for binary datatypes.

maxExclusive

Specifies a maximum value that cannot be reached.

maxInclusive

Specifies a maximum value that can be reached.

maxLength

Specifies a maximum number of characters, or bytes for binary datatypes.

minExclusive

Specifies a minimum value that can’t be reached.

minInclusive

Specifies a minimum value that can be reached.

minLength

Specifies a minimum number of characters, or bytes for binary datatypes.

pattern

Specifies a regular expression against the lexical space.

totalDigits

Specifies the maximum total number of significant digits.

whiteSpace

Specifies how to handle whitespace.
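As a brief sketch of how some of these facets are applied, the following schema restricts a number to a range and a string to a fixed set of values. The type names myQuantity and mySize are hypothetical, chosen only for this illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a whole-number quantity from 1 to 100, inclusive -->
  <xs:simpleType name="myQuantity">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical type: one of a fixed set of sizes -->
  <xs:simpleType name="mySize">
    <xs:restriction base="xs:string">
      <xs:enumeration value="small"/>
      <xs:enumeration value="medium"/>
      <xs:enumeration value="large"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```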

Another kind of derivation is by list, which takes a simple datatype and produces a whitespace-separated list of values of that datatype. XForms includes a ready-made list datatype called listItems. Another variation is derivation by union, which combines the value spaces of two or more datatypes. The final variation is derivation by extension, which is used only with complex types (complexType), discussed later in this chapter.
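Derivation by list and by union might look like the following sketch. The type names myIntegerList and myDateOrDateTime are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Derivation by list: whitespace-separated integers, such as "3 5 8" -->
  <xs:simpleType name="myIntegerList">
    <xs:list itemType="xs:integer"/>
  </xs:simpleType>

  <!-- Derivation by union: accepts either a date or a date with a time -->
  <xs:simpleType name="myDateOrDateTime">
    <xs:union memberTypes="xs:date xs:dateTime"/>
  </xs:simpleType>
</xs:schema>
```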

Regular expressions

One of the most useful facets for forms is pattern, which takes a regular expression in a syntax adjusted for Unicode compatibility. Entire books have been written on regular expressions, so this section covers only the basics. For further information, a good source is Chapter 6 of Eric van der Vlist’s XML Schema (O’Reilly).

When a regular expression contains letters or digits, those characters must appear in the entered data, as shown in Table 4-1. Note that a pattern must match the entire value, not just part of it.

Table 4-1. Simple regular expressions

Expression    Matches    Doesn’t match
“hi”          “hi”       Any string other than “hi”

Oftentimes, you might know the format of a string but not the exact contents. For instance, a telephone number might be of the format 123-4567. To handle this, you can use escape sequences, which represent certain character types. Regular expressions support the escape sequences shown in Table 4-2.

Table 4-2. Escape sequences (case matters)

Sequence                Represents
. (dot)                 Any character except newline or carriage return
\w                      Any word character
\W                      Any non-word character
\d                      Any digit character
\D                      Any non-digit character
\s                      Any whitespace character
\S                      Any non-whitespace character
[abc]                   Any character in the list abc
[a-z]                   Any character between a and z in Unicode order
[^abc]                  Any character not in the list abc
\p{UnicodeCharClass}    Any character that is part of UnicodeCharClass

Warning

The escape sequence \d matches more than just 0-9. It also matches many other characters considered numeric by Unicode, such as U+0A66 (Gurmukhi Digit Zero). While longer, specifying a pattern of [0-9] will give less surprise in some cases.
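For instance, a datatype that accepts only ASCII digits could be sketched as follows. The type name myPhone is hypothetical, and the 123-4567 format is the one used as an example earlier in this section.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a phone number like 123-4567, ASCII digits only -->
  <xs:simpleType name="myPhone">
    <xs:restriction base="xs:string">
      <!-- [0-9] is used instead of \d to exclude non-ASCII Unicode digits -->
      <xs:pattern value="[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```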

Regular expressions can also make use of the character classes shown in Table 4-3.

Table 4-3. Character classes

Expression                                Matches    Doesn’t match
\d                                        “3”        “X”
\w\w\w                                    “abc”      “a-c”
.\s.                                      “A 3”      “A3”
[abc]                                     “a”        “d”
[^a-f]                                    “X”        “f”
\p{Lu} (Unicode uppercase letters)        “A”        “a”

Since it quickly becomes tedious to repeat an escape sequence (e.g., representing a telephone number with \d\d\d-\d\d\d\d), regular expressions allow for optional parts and repeat counts, using the quantifiers shown in Table 4-4.

Table 4-4. Quantifiers

Quantifier           Represents
? (question mark)    Repeat zero or one time (optional)
+                    Repeat one or more times
*                    Repeat zero or more times
{n}                  Repeat exactly n times
{n,m}                Repeat at least n and at most m times
{n,}                 Repeat n or more times

Using quantifiers, more complicated types of expressions are possible. Also, parentheses can be used for grouping, and the vertical bar (|) to express two possible branches, either one of which can satisfy the expression, as shown in Table 4-5.

Table 4-5. Regular expressions with quantifiers

Expression              Matches                    Doesn’t match
\w\d?\w                 “b1b”                      “bbb”
\w\d*                   “a123”                     “ab12”
\w\s+\w                 “c c”                      “cc”
\d{4,5}                 “31415”                    “314159”
[bcd]{3,}               “bbbb”                     “ab”
0x[0-9A-F]{4}           “0xBEEF”                   “0x0A”
[0-9]{5}(-[0-9]{4})?    “90210” or “90210-1241”    “90210-”
\d{3}|[a-z]{4}          “123” or “dbca”            “1234” or “cba”
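Any of these expressions can be dropped into a pattern facet. As a sketch, the ZIP code expression from Table 4-5 might be packaged as a datatype like this (the type name myZipCode is hypothetical):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: five digits with an optional four-digit extension -->
  <xs:simpleType name="myZipCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```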

The final thing to remember is that characters otherwise used for something else need to be escaped when used literally. These characters, in their escaped form, are \\; \|; \.; \-; \^; \?; \*; \+; \{; \}; \(; \); \[; and \].
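As an illustration of escaping, the following sketch defines a two-part version number in which the period must appear literally. The type name myVersion is hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a version number such as "1.0" or "12.34" -->
  <xs:simpleType name="myVersion">
    <xs:restriction base="xs:string">
      <!-- \. is a literal period; an unescaped . would accept almost any character -->
      <xs:pattern value="[0-9]+\.[0-9]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```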

Table 4-6 provides a few ready-to-use regular expressions, suitable for copy-and-paste.

Table 4-6. Regular expressions: complete examples

Expression                Description
\+\d{2}\s\d{4}\s\d{6}     Matches an international phone number, such as “+12 1234 123456”
\d{3}-\d{4}               Matches a 7-digit phone number, such as “123-4567”
\d{3}-\d{4}(x\d{2,6})?    Matches a 7-digit phone number, with an optional 2-6 digit extension
\d{3}-\d{2}-\d{4}         Matches a US Social Security number
\w+@\w+\.\w+              Simplistic email address check
X\d{4}                    Matches part numbers formatted like “X1234”
\p{Lu}+(\s+\p{Lu}+)*      Matches one or more space-separated uppercase words
