More Goodies
One thing to always be aware of is that, by default, the Perl shortcuts like
\w, \s, and even \d match many Unicode characters based on
particular character properties. These are described in Table 5-11, and are intended to match the formal
definitions given in Annex C: Compatibility Properties from Unicode
Technical Standard #18, “Unicode Regular Expressions”, version 13,
from August 2008.
If you are used to matching (\d+) to grab a whole number and use it like
a number, that will not always work correctly with Unicode data. As of
Unicode v6.0, 420 codepoints are matched by \d. If you don’t want that, you may specify
/\d/a or /(?a:\d)/, or you may use the more
particular property, \p{POSIX_Digit}.
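
The difference between these forms is easiest to see side by side. The following is a minimal sketch, assuming Perl 5.14 or later (needed for the /a modifier); the variable names are illustrative only and do not come from the text.

```perl
use v5.14;
use utf8;

# Devanagari digits four-five-six-seven, and their ASCII counterparts.
my $devanagari = "४५६७";
my $ascii      = "4567";

# Default Unicode rules: \d matches decimal digits from any script.
my $default_hit = $devanagari =~ /\A\d+\z/ ? 1 : 0;

# With /a, \d is restricted to the ASCII digits 0-9.
my $ascii_flag_hit = $devanagari =~ /\A\d+\z/a ? 1 : 0;

# \p{POSIX_Digit} matches only ASCII digits, whatever flags are in effect.
my $posix_hit = $devanagari =~ /\A\p{POSIX_Digit}+\z/ ? 1 : 0;

# The scoped form (?a:...) still matches ordinary ASCII digits.
my $scoped_hit = $ascii =~ /\A(?a:\d+)\z/ ? 1 : 0;

say "default \\d on Devanagari:    $default_hit";     # 1
say "\\d with /a on Devanagari:    $ascii_flag_hit";  # 0
say "POSIX_Digit on Devanagari:   $posix_hit";        # 0
say "(?a:\\d+) on ASCII digits:    $scoped_hit";      # 1
```

Which form to choose is largely a matter of scope: /a changes \w, \s, and \d for the whole pattern, (?a:...) limits the change to one group, and \p{POSIX_Digit} narrows only the one class you name.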
However, if you mean to match any run of decimal digits in any
one script and need to use that match as a number in Perl, the
Unicode::UCD module’s num
function will help you do that.
    use v5.14;
    use utf8;
    use Unicode::UCD qw(num);
    my $num;
    if ("४५६७" =~ /(\d+)/) {
        $num = num($1);
        printf "Your number is %d\n", $num;
        # Your number is 4567
    }

Although regexes can ask whether a character has some property,
they can’t tell you what properties the character has (at least, not
without testing all of them). And sometimes you really do want to know
that. For example, suppose you want to know what Script a codepoint
has been assigned, or what its General Category is. To do that, you
use the same Unicode::UCD module again. Here is a program to print out useful properties you can use in pattern ...
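
Separately from the program mentioned above, a single lookup can be sketched with the charinfo function from Unicode::UCD. This is only an illustration, not the book's program: the choice of code point and the variable names are assumptions made for the example.

```perl
use v5.14;
use utf8;
use Unicode::UCD qw(charinfo);

# charinfo() takes a code point number and returns a hash reference
# describing it (or undef if the code point is unassigned).
my $cp   = ord("४");    # U+096A, chosen here purely for illustration
my $info = charinfo($cp);

# Pull out the fields discussed in the text.
my $name     = $info->{name};      # character name
my $script   = $info->{script};    # Script property
my $category = $info->{category};  # General Category (short form)

printf "U+%04X %s: script=%s, category=%s\n",
    $cp, $name, $script, $category;
```

The same hash also carries other fields, such as block and numeric, so one call is usually enough when you need several properties of a single code point.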