
for example, [\u3040-\u309F] might denote the set of characters from U+3040 to
U+309F.
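As a minimal illustration of such a range, the sketch below uses Python's standard re module, which accepts \uXXXX escapes inside a character class. The helper name hiragana_only is hypothetical and exists only for this example.

```python
import re

# The class [\u3040-\u309F] covers the code points U+3040..U+309F (the Hiragana block).
HIRAGANA = re.compile(r"[\u3040-\u309F]")

def hiragana_only(text):
    """Return only the characters of text that fall in U+3040..U+309F."""
    return "".join(HIRAGANA.findall(text))
```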
Specifying sets of characters by properties
Some notation is needed for denoting sets of characters by properties. At least the
following properties must be supported: General Category, Script, Alphabetic,
Uppercase, Lowercase, Whitespace, Noncharacter Code Point, and Default Ignorable
Code Point. The specific syntax may vary, but the recommendation is that
both abbreviated names and longer, more descriptive names of properties and their
values be recognized. Moreover, implementations should apply loose matching of
property names, ignoring case distinctions, whitespace, hyphens, and underscores.
Thus, assuming that the specific syntax is of the form \p{name=value} (to
denote characters for which a particular property has the specified value), then
\p{General_Category=Letter} and \p{gc=L} should both be accepted. The properties
Script and General Category may have the property name omitted. Thus, simply
\p{letter} or \p{L} should work, too.
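The loose matching described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the alias tables are hypothetical and deliberately tiny, the function names are invented for this example, and only General Category is handled, because the standard unicodedata module does not expose the Script property.

```python
import unicodedata

# Hypothetical, deliberately tiny alias tables. A real implementation would
# load the complete lists of property and value aliases from the Unicode data files.
PROPERTY_ALIASES = {"gc": "gc", "generalcategory": "gc"}
GC_VALUE_ALIASES = {"l": "L", "letter": "L", "lu": "Lu", "uppercaseletter": "Lu"}

def loose(name):
    """Fold case and drop whitespace, hyphens, and underscores."""
    return "".join(ch for ch in name.lower() if not (ch.isspace() or ch in "-_"))

def parse_property(expr):
    """Turn 'name=value' or a bare 'value' into a canonical (property, value) pair."""
    name, sep, value = expr.partition("=")
    if not sep:
        # Property name omitted: this sketch assumes General Category.
        name, value = "gc", name
    return PROPERTY_ALIASES[loose(name)], GC_VALUE_ALIASES[loose(value)]

def matches(ch, expr):
    """Test whether character ch has the General Category named by expr."""
    _prop, value = parse_property(expr)
    # unicodedata.category returns two-letter codes such as 'Lu', so a
    # one-letter value such as 'L' is matched by prefix.
    return unicodedata.category(ch).startswith(value)
```

With these tables, parse_property("General_Category=Letter"), parse_property("gc=L"), and parse_property("letter") all yield the same pair, which is the behavior the recommendation asks for.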
Set subtraction and intersection
A notation is required for specifying the set difference and set intersection of two
sets of characters. The operators could be “-” for difference and “&” for intersection.
Thus, [\p{Letter} - Qq] could mean any letter but “Q” or “q,” and [\p{Latin} &
[\u41 - \u2AF]] could mean Latin letters in the range U+0041 to U+02AF.
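The two operations correspond directly to difference and intersection on ordinary sets, as the following Python sketch shows. It rests on two assumptions that are not part of the text above: the universe is limited to code points below U+3100 to keep the example fast, and \p{Latin} is approximated by character names that begin with "LATIN", because the standard unicodedata module has no Script property. That approximation misses a few Latin-script characters.

```python
import unicodedata

# Assumption: a small universe for speed; a real engine covers U+0000..U+10FFFF.
UNIVERSE = [chr(cp) for cp in range(0x3100)]

# \p{Letter}: every General Category value that starts with 'L'.
letters = {ch for ch in UNIVERSE if unicodedata.category(ch).startswith("L")}

# Rough stand-in for \p{Latin}, based on the character name (an approximation).
latin = {ch for ch in UNIVERSE if unicodedata.name(ch, "").startswith("LATIN ")}

# [\p{Letter} - Qq]: any letter except "Q" or "q".
letters_but_q = letters - set("Qq")

# [\p{Latin} & [\u41 - \u2AF]]: Latin characters in the range U+0041..U+02AF.
latin_in_range = latin & {chr(cp) for cp in range(0x41, 0x2AF + 1)}
```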
Word analysis
An ...