POSIX-Style Character Classes
Unlike Perl’s other character class shortcuts, the legacy POSIX-style character-class
notation, [:CLASS:], is
available for use only when constructing other
character classes, that is, inside an additional pair of square brackets.
For example, /[.,[:alpha:][:digit:]]/
will search for one character that is either a literal dot (because it’s
in a bracketed character class), a comma, an alphabetic character, or a
digit. All may be used as character properties of the same name; for
example, [.,\p{alpha}\p{digit}].
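As a minimal sketch of the two spellings side by side, the following runs both patterns over a made-up sample string (the string and the variable names are assumptions chosen only for illustration) and collects every character each one matches:

```perl
use strict;
use warnings;

# Hypothetical sample input: a letter, a digit, a comma, a dot, a semicolon.
my $text = "a1,.;";

# POSIX classes nested inside an enclosing bracketed character class.
my @posix = $text =~ /[.,[:alpha:][:digit:]]/g;

# The same set of characters written with properties of the same name.
my @prop = $text =~ /[.,\p{alpha}\p{digit}]/g;

print "@posix\n";   # prints "a 1 , ."
print "@prop\n";    # prints "a 1 , ."
```

The semicolon is skipped by both patterns because it is not a dot, a comma, a letter, or a digit.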
Except for “punct”, explained
immediately below, the POSIX character class names can be used as
properties with \p{} or \P{} with the same meanings. This has two
advantages: the properties are easier to type because you don’t need to
surround them with extra brackets; and, perhaps more importantly, as
properties their definitions are no longer affected by charset
modifiers, so they always match as Unicode. In contrast, when written in
the [[:...:]] notation, the POSIX classes are affected by modifier flags.
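The effect of the charset modifiers can be sketched with a single non-ASCII letter. This example assumes Perl 5.14 or later (where the /a and /u modifiers are available) and uses U+00E9, LATIN SMALL LETTER E WITH ACUTE, as an arbitrary test character; the variable names are illustrative only:

```perl
use v5.14;
use strict;
use warnings;

# U+00E9 (e with acute accent): alphabetic in Unicode, but outside ASCII.
my $e_acute = "\x{E9}";

# The POSIX class follows the charset modifier in effect.
my $posix_a = $e_acute =~ /\A[[:alpha:]]\z/a ? 1 : 0;   # /a: ASCII-only rules
my $posix_u = $e_acute =~ /\A[[:alpha:]]\z/u ? 1 : 0;   # /u: Unicode rules

# The property form matches as Unicode even under /a.
my $prop_a  = $e_acute =~ /\A\p{alpha}\z/a   ? 1 : 0;

print "[[:alpha:]] under /a: $posix_a\n";   # prints 0
print "[[:alpha:]] under /u: $posix_u\n";   # prints 1
print "\\p{alpha} under /a:   $prop_a\n";    # prints 1
```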
The \p{punct} property differs
from the [[:punct:]] POSIX class in
that \p{punct} never matches
nonpunctuation, but [[:punct:]] (and
\p{POSIX_Punct} and \p{X_POSIX_Punct}) will. This is because
Unicode splits what POSIX considers punctuation into two categories:
Punctuation and Symbols. Unlike \p{punct}, the others just mentioned will also
match the characters shown in Table 5-14.
Table 5-14. ASCII symbols that count as punctuation
| Glyph | Code | Category | Script |
|---|---|---|---|
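The difference between \p{punct} and [[:punct:]] can be checked with a short test. This is only a sketch: it takes the dollar sign as one example of an ASCII character that Unicode classes as a Symbol rather than Punctuation, uses the comma as a control, and writes the POSIX-punctuation property as \p{PosixPunct}, the spelling listed in perluniprops. The %result hash is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;

# '$' is a Unicode Symbol (currency); ',' is Unicode Punctuation.
my %result;
for my $char ('$', ',') {
    $result{$char} = {
        posix_class => ($char =~ /\A[[:punct:]]\z/    ? "yes" : "no"),
        p_punct     => ($char =~ /\A\p{punct}\z/      ? "yes" : "no"),
        p_posix     => ($char =~ /\A\p{PosixPunct}\z/ ? "yes" : "no"),
    };
    printf "%s  [[:punct:]]=%-3s  \\p{punct}=%-3s  \\p{PosixPunct}=%s\n",
        $char, @{ $result{$char} }{qw(posix_class p_punct p_posix)};
}
# For '$': [[:punct:]] and \p{PosixPunct} match, \p{punct} does not.
# For ',': all three match.
```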