RELAX NG

Common Patterns

Now that we have covered the syntax of pattern facets, let’s look at some common ones that you can use or adapt in your schemas, or simply treat as examples.

String Datatypes

Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.

Unicode blocks

Unicode is one of XML’s greatest assets. However, few applications can process and display every character of the Unicode set correctly, and fewer users still can read them all! If you need to check that the characters in your string datatypes belong to one (or more) Unicode blocks, you can use pattern facets such as these:

<define name="BasicLatinToken">
  <data type="token">
    <param name="pattern">\p{IsBasicLatin}*</param>
  </data>
</define>

<define name="Latin-1Token">
  <data type="token">
    <param name="pattern">[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*</param>
  </data>
</define>

or, in the compact syntax:

BasicLatinToken = xsd:token {pattern = "\p{IsBasicLatin}*"}

Latin-1Token = xsd:token {pattern = "[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"}

Note that such pattern facets don’t impose a character encoding on the document itself. For instance, the Latin-1Token datatype validates instance documents encoded in UTF-8, UTF-16, ISO-8859-1, or another encoding, provided that the characters used in the string belong to the two Unicode blocks BasicLatin and Latin-1Supplement. In other words, even the lexical space reflects some processing done by the parser, below the level ...
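The independence from the document encoding can be illustrated with a short sketch. The Python below is not RELAX NG validation: it only approximates the Latin-1Token pattern by a code-point range, on the assumption that BasicLatin covers U+0000 to U+007F and Latin-1Supplement covers U+0080 to U+00FF. The helper name `in_latin1_blocks` and the sample value are hypothetical, chosen for illustration.

```python
# Rough stand-in for [\p{IsBasicLatin}\p{IsLatin-1Supplement}]*
# (assumption: the two blocks together span U+0000..U+00FF).

def in_latin1_blocks(text: str) -> bool:
    """Return True if every character lies in BasicLatin or Latin-1Supplement."""
    return all(ord(ch) <= 0xFF for ch in text)

# Hypothetical sample value containing characters from both blocks.
value = "café señor"

# The same characters serialized in three encodings give three different byte strings...
encoded = {enc: value.encode(enc) for enc in ("utf-8", "utf-16", "iso-8859-1")}

# ...but decoding, as a parser would, yields the same characters each time.
# The check runs on those characters, never on the raw bytes.
decoded = {enc: raw.decode(enc) for enc, raw in encoded.items()}

results = {enc: in_latin1_blocks(text) for enc, text in decoded.items()}
```

In this sketch every entry of `results` is true even though the three byte sequences differ, while a character outside the two blocks, such as `Ω` (U+03A9), makes the check fail whatever the encoding.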


RELAX NG by Eric van der Vlist
