Custom Regex Boundaries
Both \b (word boundary) and \B (non-word-boundary) rely on your current definition of \w,
which means that they change right along with \w if you switch to ASCII semantics with the /a or /aa modifier.
If those aren’t quite the kind of boundaries you’re looking for, you can always write your own
boundary assertions based on arbitrary edge conditions, like script boundaries. Here is the
definition of \b:
(?(?<= \w)      # if there is a word character to the left
    (?!  \w)    # then there must be no word character to the right
  | (?=  \w)    # else there must be a word character to the right
)

And here is the definition of \B:
(?(?<= \w)      # if there is a word character to the left
    (?=  \w)    # then there must be a word character to the right
  | (?!  \w)    # else there must be no word character to the right
)

Now that you know exactly how word boundaries and
nonboundaries work, you can craft your own boundaries by swapping in
your own condition wherever you see \w in the patterns above. You just need to
be careful to specify a fixed-width condition so that it can be used
in a lookbehind, which rules out variable-width constructs like \X or \R. The easiest
way to do that is to use a property or other character class. For
example, you could use \p{Greek}
for characters in the Greek script, but it is best to add Inherited so that you
don’t miss the combining characters: use [\p{Greek}\p{Inherited}] instead.
For example, this might provide regex subroutines suitable for that kind ...
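
As a minimal sketch of the substitution described above (not the book’s own listing), the following Perl builds both assertions from any fixed-width character class. The helper names make_boundaries and positions and the sample strings are hypothetical, introduced only for illustration. The first check swaps \w back in to confirm that the template agrees with the built-in \b; the second uses the Greek-plus-Inherited class.

```perl
use strict;
use warnings;

# Hypothetical helper: given the source text of a fixed-width character
# class, return a pair of compiled assertions modeled on the \b and \B
# definitions above, with the class swapped in for \w.
sub make_boundaries {
    my ($class) = @_;
    my $boundary = qr{
        (?(?<= $class )     # if there is a class character to the left
            (?!  $class )   # then there must be none to the right
          | (?=  $class )   # else there must be one to the right
        )
    }x;
    my $nonboundary = qr{
        (?(?<= $class )     # if there is a class character to the left
            (?=  $class )   # then there must be one to the right
          | (?!  $class )   # else there must be none to the right
        )
    }x;
    return ($boundary, $nonboundary);
}

# Hypothetical helper: list every offset (0 .. length) in $text at which
# the zero-width assertion succeeds.
sub positions {
    my ($text, $assertion) = @_;
    my @hits;
    push @hits, pos($text) while $text =~ /$assertion/g;
    return @hits;
}

# Sanity check: with \w swapped back in, the template should agree
# with the built-in \b at every offset.
my ($word_b, $word_nb) = make_boundaries('\w');
my $ascii   = 'ab, cd!';
my @custom  = positions($ascii, $word_b);
my @builtin = positions($ascii, qr/\b/);
print "custom:  @custom\n";     # custom:  0 2 4 6
print "builtin: @builtin\n";    # builtin: 0 2 4 6

# Greek script boundaries. The sample word is spelled with a separate
# combining acute accent (U+0301) so that the class has to cover a
# combining character as well as the Greek letters.
my ($greek_b, $greek_nb) = make_boundaries('[\p{Greek}\p{Inherited}]');
my $mixed = "see \x{3BB}\x{3BF}\x{301}\x{3B3}\x{3BF}\x{3C2} now";
my @greek_edges = positions($mixed, $greek_b);
print "greek edges: @greek_edges\n";    # greek edges: 4 10
```

Because the class is interpolated as plain text into a pattern compiled with /x, it should not contain whitespace or a # character; a compiled qr// object or an escaped class would be a safer input in real code.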