“Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information on it.”
Forms and datatypes always seem to be mentioned together. It’s natural to think of data entry in terms of specific types, such as date or phone number. Despite a feint in the opposite direction taken by earlier drafts, XForms incorporates the datatypes defined in W3C XML Schema. This chapter discusses these datatypes, and describes the general framework for describing and defining custom datatypes.
In describing a datatype, XML Schema distinguishes between a lexical space, or the data as it appears in XML, and a value space, or the data as it exists on an abstract level. In practice, many datatypes have a one-to-one mapping between the lexical space and the value space, so the distinction can seem a little academic. It matters, however, when a single value has several equivalent representations. For instance, the boolean datatype can represent true as either 1 or true, and false as either 0 or false. Even though each value has two possible representations, both map to the same underlying concept of trueness or falseness, respectively. This is important when comparing values: the value space is used as the basis for comparison, so 1 and true compare as equal.
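As a small illustration of the distinction, the fragment below declares an element of type boolean and then shows two instance fragments that differ in the lexical space but denote the same value. The element name subscribed is hypothetical and chosen only for this sketch.

```xml
<!-- Schema fragment: a hypothetical element typed as xs:boolean -->
<xs:element name="subscribed" type="xs:boolean"/>

<!-- Instance fragments: different lexical forms, same value (true) -->
<subscribed>true</subscribed>
<subscribed>1</subscribed>
```

A processor that compares these two in the value space treats them as equal, even though a plain string comparison would not.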
Many observers have pointed out that the lexical representations of some XML Schema datatypes aren’t very user friendly. As an example, the duration of a day and an hour is P1DT1H. From the perspective of the person filling out a form, this is complete gibberish. To work around this, XForms gives individual form controls the responsibility for presenting data to the user in a manner that’s convenient to the intended audience. Thus, XForms introduces (but doesn’t specifically name) a third space, the user space. For the benefit of users, this might not be a straightforward mapping; the form control has great latitude in rearranging things, for example by offering a graphical calendar control to enter durations and dates.
XML Schema uses a divide-and-conquer technique to define datatypes. Each datatype can be broken down into a number of facets, each of which constrains some particular part of the allowed value space for that datatype. (One important exception is the pattern facet, which works on the lexical space.)
It’s possible to take an existing datatype and trim it down to exactly meet your needs. This is called derivation by restriction, and entails changing one or more facets in the datatype. For example, the following XML Schema fragment limits the length of a string to 50 characters:
<xs:simpleType name="myString50">
  <xs:restriction base="xs:string">
    <xs:maxLength value="50"/>
  </xs:restriction>
</xs:simpleType>
This creates a new datatype named myString50, which can then be used in a form to limit the number of characters that can be entered. Other facets can similarly be restricted, as shown in the examples later in this chapter. The list of facets is as follows.
- enumeration
Specifies a list of possible values.
- fractionDigits
Specifies the maximum number of digits after the decimal point.
- length
Specifies an exact length in characters, or bytes for binary datatypes.
- maxExclusive
Specifies a maximum value that cannot be reached.
- maxInclusive
Specifies a maximum value that can be reached.
- maxLength
Specifies a maximum number of characters, or bytes for binary datatypes.
- minExclusive
Specifies a minimum value that can’t be reached.
- minInclusive
Specifies a minimum value that can be reached.
- minLength
Specifies a minimum number of characters, or bytes for binary datatypes.
- pattern
Specifies a regular expression against the lexical space.
- totalDigits
Specifies the total number of significant digits.
- whiteSpace
Specifies how to handle whitespace.
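As a sketch of how several of these facets combine in a single derivation by restriction, the fragments below define a bounded decimal and a fixed set of string values. The type names percentage and shirtSize are hypothetical, and the xs prefix is assumed to be bound to the XML Schema namespace, as in the earlier fragment.

```xml
<!-- A decimal from 0 to 100 inclusive, with at most two fraction digits -->
<xs:simpleType name="percentage">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="100"/>
    <xs:fractionDigits value="2"/>
  </xs:restriction>
</xs:simpleType>

<!-- A string limited to an explicit list of values -->
<xs:simpleType name="shirtSize">
  <xs:restriction base="xs:string">
    <xs:enumeration value="S"/>
    <xs:enumeration value="M"/>
    <xs:enumeration value="L"/>
  </xs:restriction>
</xs:simpleType>
```

With these definitions, 99.5 is an acceptable percentage while 100.01 and 12.345 are not, and XL is not an acceptable shirtSize.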
Another kind of derivation is by list. This simply takes a simple datatype and produces a whitespace-separated list datatype. XForms includes a ready-made list datatype called listItems. Another variation is derivation by union, which can combine the value spaces of two separate datatypes. One final variation on derivation is by extension, which is used only in complexTypes, which are discussed later in this chapter.
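For comparison with derivation by restriction, a minimal sketch of the list and union forms might look like the following. The type names integerList and integerOrDate are hypothetical.

```xml
<!-- Derivation by list: whitespace-separated integers, such as "1 2 3" -->
<xs:simpleType name="integerList">
  <xs:list itemType="xs:integer"/>
</xs:simpleType>

<!-- Derivation by union: accepts either an integer or a date -->
<xs:simpleType name="integerOrDate">
  <xs:union memberTypes="xs:integer xs:date"/>
</xs:simpleType>
```

A value is valid against the union if it is valid against at least one of the member types, so both 42 and 2003-08-01 would be accepted by integerOrDate.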
One of the most useful facet-based restrictions in forms is pattern, which takes a regular expression syntax, adjusted for Unicode compatibility. Entire books have been written on regular expressions, so this section covers only the basics. For further information, a good source is Chapter 6 of Eric van der Vlist’s XML Schema (O’Reilly).
When a regular expression contains letters or digits, the characters must appear in the entered data, as shown in Table 4-1.
Table 4-1. Simple regular expressions
Expression | Matches | Doesn’t match
---|---|---
“hi” | “hi” | Any string other than “hi”
Oftentimes, you might know the format of a string but not the exact contents. For instance, a telephone number might be of the format 123-4567. To handle this, you can use escape sequences, which represent certain character types. Regular expressions support the escape sequences shown in Table 4-2.
Table 4-2. Escape sequences (case matters)
Sequence | Represents
---|---
. | Any Unicode character except newline
\w | Any word character
\W | Any non-word character
\d | Any digit character
\D | Any non-digit character
\s | Any whitespace character
\S | Any non-whitespace character
 | Any character in the list
 | Any character between
 | Any character not in the list
 | Any character that is part of
Warning
The escape sequence \d matches more than just 0-9. It also matches many other characters considered numeric by Unicode, such as U+0A66 (Gurmukhi Digit Zero). Although longer, a pattern of [0-9] is less surprising in such cases.
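To tie this back to the telephone number format mentioned earlier, a pattern facet restricted to ASCII digits could be sketched as follows. The type name phone7 is hypothetical, and the digit class is simply repeated once per digit.

```xml
<!-- Matches strings such as "123-4567", using ASCII digits only -->
<xs:simpleType name="phone7">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"/>
  </xs:restriction>
</xs:simpleType>
```

A pattern in XML Schema must match the entire value, so a string such as 123-45678 is rejected even though it begins with a valid number.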
Regular expressions can also make use of the character classes shown in Table 4-3.
Table 4-3. Character classes
Expression |
Matches |
Doesn’t match |
---|---|---|
|
“3” |
“X” |
|
“abc” |
“ab1” |
|
“A 3” |
“A3” |
|
“a” |
“d” |
|
“X” |
“f” |
|
“A” |
“a” |
Using these, more complicated patterns are possible:
Since it quickly becomes tedious to repeat an escape sequence (e.g., representing a telephone number with \d\d\d-\d\d\d\d), regular expressions allow for partial matches, sequences, and repeat counts, as shown in Table 4-4.
Table 4-4. Quantifiers
Quantifier | Represents
---|---
? | Repeat zero or once (optional)
+ | Repeat one or more
* | Repeat zero or more
 | Repeat exactly
 | Repeat between
 | Repeat
Using quantifiers, more complicated types of expressions are possible. Also, parentheses can be used for grouping, and the vertical bar (|) to express two possible branches, either one of which can satisfy the expression, as shown in Table 4-5.
Table 4-5. Regular expressions with quantifiers
Expression |
Matches |
Doesn’t match |
---|---|---|
|
“b1b” |
“bbb” |
|
“a123” |
“1234” |
|
“c c” |
“cc” |
|
“31415” |
“314159” |
|
“bbbb” |
“ab” |
|
“0xBEEF” |
“0x0A” |
|
“90210” or “90210-1241” |
“90210-” |
|
“123” or “dbca” |
“1234” or “cba” |
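As an illustration of quantifiers, grouping, and branches working together, the fragments below sketch two patterns that behave like the last two rows of the table. The patterns and the type names zipCode and digitsOrLetters are assumptions made for this example and are not necessarily the expressions the table originally listed.

```xml
<!-- Five digits, optionally followed by a hyphen and four more digits.
     Accepts "90210" and "90210-1241"; rejects "90210-". -->
<xs:simpleType name="zipCode">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
  </xs:restriction>
</xs:simpleType>

<!-- Either exactly three digits or exactly four letters from a to d.
     Accepts "123" and "dbca"; rejects "1234" and "cba". -->
<xs:simpleType name="digitsOrLetters">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9]{3}|[a-d]{4}"/>
  </xs:restriction>
</xs:simpleType>
```

In the first pattern, the parentheses make the hyphen and the four digits optional as a unit, which is why a trailing hyphen on its own does not match.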
The final thing to remember is that characters otherwise used for something else need to be escaped when used literally. These characters, in their escaped form, are \\; \|; \.; \-; \^; \?; \*; \+; \{; \}; \(; \); \[; and \].
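A short sketch of escaping in practice: the two hypothetical types below use a literal period and literal parentheses, each escaped so that it is not read as a metacharacter.

```xml
<!-- A version number such as "1.2"; the period is escaped so that it
     matches only a literal period rather than any character -->
<xs:simpleType name="versionNumber">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9]+\.[0-9]+"/>
  </xs:restriction>
</xs:simpleType>

<!-- A phone number such as "(555) 123-4567"; the parentheses are
     escaped so that they are not treated as grouping -->
<xs:simpleType name="phoneWithAreaCode">
  <xs:restriction base="xs:string">
    <xs:pattern value="\([0-9]{3}\) [0-9]{3}-[0-9]{4}"/>
  </xs:restriction>
</xs:simpleType>
```

Without the backslash in the first pattern, a value such as 1x2 would also be accepted, since an unescaped period stands for almost any character.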
Table 4-6 provides a few ready-to-use regular expressions, suitable for copy-and-paste.
Table 4-6. Regular expressions: complete examples
Expression |
Description |
---|---|
|
Matches an international phone number, such as “+12 1234 123456” |
|
Matches a 7-digit phone number, such as “123-4567” |
|
Matches a 7-digit phone number, with an optional 2-6 digit extension |
|
Matches a US Social Security number |
|
Simplistic email address check |
|
Matches part numbers formatted like “X1234” |
|
Matches one or more space-separated uppercase words |
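One way such an expression might be put to work in a form is sketched below: a pattern-restricted type is defined in a schema inside the XForms model, and a bind attaches it to an instance node. The target namespace http://example.com/types, the my prefix, the type name phone7, the instance structure, and the pattern itself are all assumptions made for this example.

```xml
<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms"
              xmlns:xs="http://www.w3.org/2001/XMLSchema"
              xmlns:my="http://example.com/types">
  <!-- Inline schema defining a 7-digit phone number type -->
  <xs:schema targetNamespace="http://example.com/types">
    <xs:simpleType name="phone7">
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{3}-[0-9]{4}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:schema>

  <!-- Instance data that the form collects -->
  <xforms:instance>
    <contact xmlns="">
      <phone/>
    </contact>
  </xforms:instance>

  <!-- Associate the phone node with the custom datatype -->
  <xforms:bind nodeset="/contact/phone" type="my:phone7"/>
</xforms:model>
```

With this binding in place, the phone node is considered valid only when its content matches the pattern, so an empty or malformed entry would be flagged by the processor.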