
To illustrate the nature of boundary rules, which are not yet widely implemented along
these lines in existing software, we will consider word boundaries. Described
informally and somewhat loosely, the principles are:
• Treat consecutive alphabetic characters as belonging to the same word. This
applies to characters for which the Alphabetic property has the value “yes” (True),
except characters belonging to the Thai, Lao, or Hiragana writing systems, as well as
to the no-break space character (somewhat surprisingly).
• Treat digits and other numeric characters as comparable to alphabetic characters
(e.g., treat “3A” as one word).
• Do not break a numeric string at a character that has a LineBreak property value
of IS = Infix, numeric (except for “:”). For example, treat “1.000,00” as one word.
• Treat connector punctuation such as “_” (with General Category value of Pc =
Connector, punctuation) as comparable to alphabetic characters (e.g., treat
“foo_bar” as one word).
• Treat a grapheme cluster as if it were one character.
• Regard the following as part of a word when they appear between alphabetic
characters: apostrophe ' (U+0027), right single quotation mark ’ (U+2019), middle dot
· (U+00B7), hyphenation point ‧ (U+2027), colon : (U+003A), and Hebrew
punctuation gershayim ״ (U+05F4).
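The principles above can be sketched in a few lines of Python. This is only a rough approximation under stated assumptions, not an implementation of the actual Unicode rules: Python's standard unicodedata module exposes neither the Alphabetic property nor the LineBreak property, so the sketch substitutes General Category letters (L*) for "alphabetic", General Category numbers (N*) for "numeric", and a hand-picked set (full stop and comma) for the infix numeric characters. The script exceptions (Thai, Lao, Hiragana), the no-break space, and grapheme clusters are ignored. The names MID_LETTER, MID_NUM, and words are hypothetical, chosen for this illustration.

```python
import unicodedata

# Characters kept inside a word when they stand between two letters
# (taken from the last principle in the list).
MID_LETTER = {"\u0027", "\u2019", "\u00B7", "\u2027", "\u003A", "\u05F4"}

# Assumption: a tiny stand-in for the infix numeric characters; the real
# set is defined by the LineBreak property, which unicodedata lacks.
MID_NUM = {".", ","}


def is_alpha(ch):
    # Approximation: General Category L* instead of the Alphabetic property.
    return bool(ch) and unicodedata.category(ch)[0] == "L"


def is_numeric(ch):
    # Approximation: General Category N* (digits and other numerics).
    return bool(ch) and unicodedata.category(ch)[0] == "N"


def is_wordlike(ch):
    # Letters, numerics, and connector punctuation (Pc) such as "_".
    return is_alpha(ch) or is_numeric(ch) or unicodedata.category(ch) == "Pc"


def words(text):
    """Split text into words using the simplified principles."""
    result = []
    current = ""
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_wordlike(ch):
            current += ch
        elif ch in MID_LETTER and is_alpha(prev) and is_alpha(nxt):
            current += ch  # e.g. the apostrophe in "can't"
        elif ch in MID_NUM and is_numeric(prev) and is_numeric(nxt):
            current += ch  # e.g. the separators in "1.000,00"
        else:
            if current:
                result.append(current)
            current = ""
    if current:
        result.append(current)
    return result


print(words("foo_bar can't 3A 1.000,00 end."))
```

Because the sketch looks only one character to each side, it treats the final full stop in the sample as a boundary (a letter precedes it and nothing follows), while the full stop and comma inside “1.000,00” stay within the word.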
For example, the principle mentioned last in the list works well for some strings that
need to