Perl 6 “regular expressions” are so far beyond the formal definition of regular expressions that we decided it was time for a more meaningful name.[19] We now call them “rules.” Perl 6 rules bring the full power of recursive descent parsing to the core of Perl, but are comfortably useful even if you don’t know anything about recursive descent parsing. A grammar is a collection of rules, in the same way that a class is a collection of methods.
A rule is just a pattern for matching text. Rules can match right where they’re defined, or they can be stored up to match later. Rules can be named or anonymous. They may be defined with variations on the familiar `/.../` syntax, or using subroutine-like syntax with the keyword `rule`. Table 4-2 shows the basic syntax for defining rules.
Table 4-2. Rules

Syntax | Meaning |
---|---|
`m/.../` | Match a pattern (immediate execution). |
`s/.../.../` | Perform a substitution (immediate execution). |
`rx/.../` | Define an anonymous rule (deferred execution). |
`/.../` | Immediately match or define an anonymous rule, depending on the context. |
`rule { ... }` | Define an anonymous rule. |
`rule name { ... }` | Define a named rule. |
`?...?` and `(...)` are no longer valid replacements for the `/.../` delimiters, but you can use other standard quoting characters as replacement delimiters. The unary context-forcing operators, `+`, `?`, and `~`, interact with the bare `/.../`. `+/.../` immediately matches and returns a count of matches. `?/.../` immediately matches and returns a boolean value of success or failure. `~/.../` immediately matches and returns the matched string value. The results are the ordinary behavior of `/.../` in numeric, boolean, and string contexts. The bare `/.../` also matches immediately in void context, or when it’s an argument of the smart match operator (`~~`). In all other contexts, it constructs an anonymous rule.
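The three context views of a single match can be approximated outside Perl 6. The following is a minimal sketch in Python, whose `re` module is not a rule engine and has no context-forcing operators; the helper name `match_in_contexts` is hypothetical and simply returns all three views at once.

```python
import re

def match_in_contexts(pattern: str, text: str):
    """Return the (numeric, boolean, string) views of one pattern match."""
    matches = re.findall(pattern, text)          # all non-overlapping matches
    first = re.search(pattern, text)             # first match, or None
    count = len(matches)                         # roughly what +/.../ reports
    success = first is not None                  # roughly what ?/.../ reports
    matched = first.group(0) if first else ""    # roughly what ~/.../ reports
    return count, success, matched

count, success, matched = match_in_contexts(r"\d+", "42 towels, 2 heads")
print(count, success, matched)
```

The point of the Perl 6 design is that one bare pattern yields whichever of these values the surrounding context asks for, so no helper like this is needed there.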
Every pattern is built out of a series of metacharacters, metasymbols, bracketing symbols, escape sequences, and assertions of various types. These are the basic vocabulary of pattern matching. The most basic set of metacharacters is shown in Table 4-3.
Table 4-3. Metacharacters

Symbol | Meaning |
---|---|
`.` | Match any single character, including a newline. |
`^` | Match the beginning of a string. |
`$` | Match the end of a string. |
`^^` | Match the beginning of a line. |
`$$` | Match the end of a line. |
`\|` | Separate alternate patterns. |
`\` | Escape a metacharacter to get a literal character, or escape a literal character to get a metacharacter. |
`#` | Mark a comment (to the end of the line). |
`:=` | Bind the result of a match to a hypothetical variable. |
`(...)` | Group patterns and capture the result. |
`[...]` | Group patterns without capturing. |
`{...}` | Execute a closure (Perl 6 code) within a rule. |
`<...>` | Assertion delimiters. |
By default, rules ignore literal whitespace within the pattern. You can put the `#` comment marker at the end of any line. Just make sure you don’t comment out the symbol that terminates the rule. Closures within bare `{...}` are always a successful zero-width match, unless they explicitly call the `fail` function.
Assertions, marked with `<...>` delimiters, handle a variety of constructs, including character classes and user-defined quantifiers. The built-in quantifiers are shown in Table 4-4.
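Table 4-4. Quantifiers

Maximal | Minimal | Meaning |
---|---|---|
`*` | `*?` | Match 0 or more times. |
`+` | `+?` | Match 1 or more times. |
`?` | `??` | Match 0 or 1 times. |
`<n>` | `<n>?` | Match exactly n times. |
`<n..m>` | `<n..m>?` | Match at least n and no more than m times. |
`<n...>` | `<n...>?` | Match at least n times. |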
`<n..m>` is the range quantifier, so it uses the range operator “..”. `<n...>` is shorthand for `<n..Inf>` and matches as many times as possible.
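The difference between maximal and minimal matching, and between exact, bounded, and open-ended counts, is easy to try out in any backtracking engine. This sketch uses Python's `re` module as a stand-in, which spells counted quantifiers with braces (`{2}`, `{2,3}`, `{2,}`) rather than the Perl 6 angle-bracket forms; the variable names are only illustrative.

```python
import re

text = "aaaa"

maximal = re.match(r"a+", text).group(0)          # takes as much as it can
minimal = re.match(r"a+?", text).group(0)         # takes as little as it can
exactly_two = re.match(r"a{2}", text).group(0)    # exact count
ranged_max = re.match(r"a{2,3}", text).group(0)   # bounded range, maximal
ranged_min = re.match(r"a{2,3}?", text).group(0)  # bounded range, minimal
open_ended = re.match(r"a{2,}", text).group(0)    # at least two, no upper limit

print(maximal, minimal, exactly_two, ranged_max, ranged_min, open_ended)
```

Adding `?` after any quantifier flips it from maximal to minimal, which is the same convention the Perl 6 quantifiers follow.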
Table 4-5 shows the escape sequences for special characters. With all the escape sequences that use brackets, `(...)`, `{...}`, and `<...>` work in place of `[...]`. An ordinary variable now interpolates as a literal string, so `\Q` is rarely needed.
Table 4-5. Escape sequences

Escape | Meaning |
---|---|
`\0[...]` | Match a character given in octal (brackets optional). |
`\b` | Match a word boundary. |
`\B` | Match when not on a word boundary. |
`\c[...]` | Match a named character or control character. |
`\C[...]` | Match any character except the bracketed named or control character. |
`\d` | Match a digit. |
`\D` | Match a non-digit. |
`\e` | Match an escape character. |
`\E` | Match anything but an escape character. |
`\f` | Match the form feed character. |
`\F` | Match anything but a form feed. |
`\n` | Match a newline. |
`\N` | Match anything but a newline. |
`\h` | Match horizontal whitespace. |
`\H` | Match anything but horizontal whitespace. |
`\L[...]` | Everything within the brackets is lowercase. |
`\r` | Match a return. |
`\R` | Match anything but a return. |
`\s` | Match any whitespace character. |
`\S` | Match anything but whitespace. |
`\t` | Match a tab. |
`\T` | Match anything but a tab. |
`\U[...]` | Everything within the brackets is uppercase. |
`\v` | Match vertical whitespace. |
`\V` | Match anything but vertical whitespace. |
`\w` | Match a word character (Unicode alphanumeric plus “_”). |
`\W` | Match anything but a word character. |
`\x[...]` | Match a character given in hexadecimal (brackets optional). |
`\X[...]` | Match anything but the character given in hexadecimal (brackets optional). |
`\Q[...]` | All metacharacters within the brackets match as literal characters. |
Modifiers alter the meaning of the pattern syntax. The standard position for modifiers is at the beginning of the rule, right after the `m`, `s`, or `rx`, or after the name in a named rule. Modifiers cannot attach to the outside of a bare `/.../`. For example:

    m:i/marvin/  # case insensitive
    rule names :i { marvin | ford | arthur }
The single-character modifiers can be grouped, but the others must be separated by a colon:

    m:iwe/ zaphod /                      # Ok
    m:ignorecase:words:each/ zaphod /    # Ok
    m:ignorecasewordseach / zaphod /     # Not Ok
Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:

    m/:w I saw [:i zaphod] /  # only 'zaphod' is case insensitive
Really, it’s only the repetition modifiers that can’t be lexically scoped, because they alter the return value of the entire rule. Table 4-6 shows the current list of modifiers.
Table 4-6. Modifiers

Short | Long | Meaning |
---|---|---|
`:i` | `:ignorecase` | Case-insensitive match. |
`:I` | | Case-sensitive match (on by default). |
`:c` | `:cont` | Continue where the previous match on the string left off. |
`:w` | `:words` | Literal whitespace in the pattern matches as `\s+` or `\s*`. |
`:W` | | Turn off intelligent whitespace matching (return to default). |
`:Nx` | | Match the pattern N times. |
`:Nth` | | Match the Nth occurrence of a pattern. |
`:once` | | Match the pattern only once. |
`:e` | `:each` | Match the pattern as many times as possible, but only possibilities that don’t overlap. |
`:any` | | Match every possible occurrence of a pattern, even overlapping possibilities. |
`:u0` | | `.` is a byte. |
`:u1` | | `.` is a Unicode codepoint. |
`:u2` | | `.` is a Unicode grapheme. |
`:u3` | | `.` is language dependent. |
`:p5` | `:perl5` | The pattern uses Perl 5 regex syntax. |
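A modifier scoped to a single group, as described above for internal modifiers, has a rough counterpart in Python (3.6 and later), where `(?i:...)` limits case-insensitivity to one group. This is a sketch by analogy only: Python has no `:w` modifier, so the whitespace is written out as `\s+`, and the sample strings are made up.

```python
import re

# Case-insensitivity applies only inside the (?i:...) group.
scoped = re.compile(r"I\s+saw\s+(?i:zaphod)")

samples = (
    "I saw ZAPHOD",   # matches: only the name ignores case
    "i SAW zaphod",   # fails: 'I saw' is still case-sensitive
)
hits = [bool(scoped.search(s)) for s in samples]
print(hits)
```

As in the Perl 6 version, the alteration ends where the group ends, so the rest of the pattern keeps its default behavior.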
`:w` makes patterns sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit `\s+` when it separates two identifiers and `\s*` everywhere else.
The `:Nth` modifier also has the alternate forms `:Nst`, `:Nnd`, and `:Nrd` for cases where it’s more natural to write `:1st`, `:2nd`, `:3rd` than it is to write `:1th`, `:2th`, `:3th`. Either way is valid, so pick the one that’s most comfortable for you.
There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That’s why the “beginning of string” and “end of string” metasymbols now have “beginning of line” and “end of line” counterparts.
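The distinction between string anchors and line anchors can be seen in a small experiment. This sketch uses Python's `re` module, which, unlike Perl 6 rules, still relies on a multiline flag: `\A` and `\Z` stand in for the string anchors and `(?m)^` and `(?m)$` for the line anchors. The sample text is invented for illustration.

```python
import re

text = "first line\nsecond line"

string_start = re.findall(r"\A\w+", text)    # anchored to the whole string
line_starts = re.findall(r"(?m)^\w+", text)  # anchored to every line
string_end = re.findall(r"\w+\Z", text)      # anchored to the whole string
line_ends = re.findall(r"(?m)\w+$", text)    # anchored to every line

print(string_start, line_starts, string_end, line_ends)
```

Giving each meaning its own metacharacter, as Perl 6 does, lets both kinds of anchor appear in one pattern without switching modes.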
Assertions hold many different constructs with many different purposes. In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Table 4-7 shows the syntax for assertions.
Table 4-7. Assertions

Syntax | Meaning |
---|---|
`<...>` | Generic assertion delimiter. |
`<name>` | Match a named rule or character class. |
`<!...>` | Negate any assertion. |
`<[...]>` | Match an enumerated character class. |
`<-...>` | Complement a character class (named or enumerated). |
`<"...">` | Match a literal string (interpolated at match time). |
`<'...'>` | Match a literal string (not interpolated). |
`<(...)>` | Boolean assertion. Execute a closure and match if it returns a true result. |
`<$scalar>` | Match an anonymous rule. |
`<@array>` | Match a series of anonymous rules as alternates. |
`<%hash>` | Match a key from the hash, then its value (which is an anonymous rule). |
`<&sub()>` | Match an anonymous rule returned by a sub. |
`<{code}>` | Match an anonymous rule returned by a closure. |
`<.>` | Match any logical grapheme, including combining character sequences. |

`<(...)>` is similar to `{...}`, in that it allows you to include straight Perl code within a rule. The difference is that `<(...)>` evaluates the return value of the closure in boolean context. The match succeeds if the return is true and fails if the return is false.
A bare scalar within a pattern interpolates as a literal string, an array matches as a series of alternate literal strings, and by default a hash matches a word (`\w+`) and tries to find that word as one of its keys.[20] You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules.[21]
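The three default interpolation behaviors can be imitated by hand in a language that lacks them. The following Python sketch is only an approximation under stated assumptions: the helper names `literal`, `alternates`, and `match_hash_key` are hypothetical, and the hash helper follows the description above by matching a word and then looking it up as a key.

```python
import re

def literal(scalar: str) -> str:
    """Interpolate a scalar as a literal string (metacharacters escaped)."""
    return re.escape(scalar)

def alternates(items) -> str:
    """Interpolate a list as a series of alternate literal strings."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"

def match_hash_key(mapping, text: str):
    """Match a word at the start of text and accept it only if it is a key."""
    word = re.match(r"\w+", text)
    if word and word.group(0) in mapping:
        return word.group(0)
    return None

print(bool(re.search(literal("a.b"), "axb")))            # the dot is literal here
print(bool(re.fullmatch(alternates(["ford", "arthur"]), "arthur")))
print(match_hash_key({"zaphod": 1}, "zaphod beeblebrox"))
```

In Perl 6 rules none of this scaffolding is written out, because a bare `$scalar`, `@array`, or `%hash` in the pattern already behaves this way.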
A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn’t fully defined yet, but Table 4-8 shows a few you’re likely to see.
Table 4-8. Built-in rules

Rule | Meaning |
---|---|
`<alpha>` | Match a Unicode alphabetic character. |
`<digit>` | Match a digit. |
`<sp>` | Match a single space character (the same as `\s`). |
`<ws>` | Match any whitespace (the same as `\s+`). |
`<null>` | Match the null string. |
`<prior>` | Match the same thing as the previous match. |
`<before ...>` | Lookahead. Assert that you’re before a pattern. |
`<after ...>` | Lookbehind. Assert that you’re after a pattern. |
`<prop ...>` | Match any character with the named property. |
`<replace(...)>` | Replace everything matched so far in the rule or subrule with the given string (under consideration). |
The null pattern `//` is no longer valid syntax for rules. The built-in rule `<null>` matches a zero-width string (so it’s always true) and `<prior>` matches whatever the most recent successful rule matched.
Backtracking is triggered whenever part of the pattern fails to match. You can also explicitly trigger backtracking by calling the `fail` function within a closure. Table 4-9 shows some metacharacters and built-in rules relevant to backtracking.
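Table 4-9. Backtracking controls

Operator | Meaning |
---|---|
`:` | Don’t retry the previous atom, fail to the next earlier atom. |
`::` | Don’t backtrack over this point, fail out of the closest enclosing group (`(...)`, `[...]`, or the rule delimiters). |
`:::` | Don’t backtrack over this point, fail out of the current rule or subrule. |
`<commit>` | Don’t backtrack over this point, fail out of the entire match (even from within a subrule). |
`<cut>` | Like `<commit>`, but also cuts the string matched. The current matching position at this point becomes the new beginning of the string. |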
Hypothetical variables are a powerful way of building up data structures from within a match. An ordinary capture with `(...)` stores the result of the capture in `$1`, `$2`, etc. The values stored in these variables will be kept if the match is successful, but thrown away if the match fails (hence the term “hypothetical”). The numbered capture variables are accessible outside the match, but only within the immediate surrounding lexical scope:

    "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;
    print $1; # prints Zaphod
You can also capture into any user-defined variable with the binding operator `:=`. These variables must already be defined in the lexical scope surrounding the rule:

    my $person;
    "Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
    print $person; # prints Zaphod
Repeated matches can be captured into an array:

    my @words;
    "feefifofum" ~~ / @words := (f<-[f]>+)* /;
    # ("fee", "fi", "fo", "fum")
Pairs of repeated matches can be captured into a hash:

    my %customers;
    $records ~~ m:w/ %customers := [ <id> = <name> \n]* /;
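The scalar, array, and hash captures above can be approximated in Python, with the caveat that Python returns captured values from the match rather than binding them into existing variables, and has no notion of a hypothetical variable. In this sketch the `records` string and its `id = name` layout are invented sample data.

```python
import re

# Scalar: a named group plays the role of binding a capture to $person.
person = re.match(r"(?P<person>\w+)", "Zaphod's just this guy.").group("person")

# Array: every repetition of the pattern is collected into a list.
words = re.findall(r"f[^f]+", "feefifofum")

# Hash: pairs of captures become keys and values.
records = "42 = Zaphod\n7 = Ford\n"   # hypothetical sample data
customers = dict(re.findall(r"(\d+)\s*=\s*(\w+)\n", records))

print(person, words, customers)
```

The character class `[^f]` is Python's spelling of the complemented class written `<-[f]>` in the rule above.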
If you don’t need the captured value outside the rule, use a `$?` variable instead. These are lexically scoped to the rule:

    "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;
A match of a named rule stores the result in a `$?` variable with the same name as the rule. These variables are also accessible only within the rule:

    "Zaphod saw Zaphod" ~~ m:w/ <name> \w+ $?name /;
Sometimes you don’t want just one rule; you need a whole collection of rules, especially for complex text parsing. Rules live in a grammar, like methods live in a class. In fact, grammars are classes; they’re just classes that inherit from the universal `Rule` class. This means that grammars can inherit from other grammars, and that they define a namespace for their rules.

    grammar Hitchhikers {
        rule name :w {
              Zaphod :: [Beeblebrox]?
            | Ford :: [Prefect]?
            | Arthur :: [Dent]?
        }
        rule id :w { \d<10> }
    }
Any rule in the current grammar or in one of its parents can be called directly, but rules from other grammars need to have their package specified:

    if $newsrelease ~~ /<Hitchhikers.name>/ {
        send_alert($1);
    }
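The idea that a grammar is a class whose rules are inherited and namespaced can be mocked up in Python. This is a loose sketch, not how Perl 6 implements grammars: the `Grammar` base class, its `rule` and `match` helpers, and the `Vogons` subclass are all hypothetical, the rules are plain regex strings, and the `::` backtracking control has no equivalent here and is left out.

```python
import re

class Grammar:
    """Hypothetical base class: each rule is a class attribute holding a pattern."""

    @classmethod
    def rule(cls, name: str):
        # getattr follows inheritance, so rules defined in a parent are found too.
        return re.compile(getattr(cls, name), re.VERBOSE)

    @classmethod
    def match(cls, name: str, text: str):
        return cls.rule(name).search(text)


class Hitchhikers(Grammar):
    name = r"Zaphod (?:\s+ Beeblebrox)? | Ford (?:\s+ Prefect)? | Arthur (?:\s+ Dent)?"
    id = r"\d{10}"


class Vogons(Hitchhikers):
    # Inherits 'id' unchanged and overrides 'name'.
    name = r"Prostetnic \s+ Vogon \s+ Jeltz"


print(Hitchhikers.match("name", "Ford Prefect arrived").group(0))
print(Vogons.match("id", "ref 0123456789").group(0))
```

Qualifying a rule with its class, as in `Hitchhikers.match("name", ...)`, plays the part of the package-qualified rule call shown above.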
[19] Regular expressions describe regular languages, and consist of three primitives and a limited set of operations (three or so, depending on the formulation). So, even Perl 5 “regular expressions” weren’t formal regular expressions.
[20] The effect is much as if it matched the keys as a series of alternates, but you’re guaranteed to match the longest possible key, instead of just the first one it hits in random order.
[21] This is the old Perl 5 behavior of a variable interpolating as a regex, but with a kick.