Honoring Locale Settings in Regular Expressions
Problem
You want case conversion to follow the rules of a locale other than the
default, or you want \w to match letters with diacritics, as in
José or déjà vu.
For example, let’s say you’re given half a gigabyte of
text written in German and told to index it. You want to extract
words (with \w+) and convert them to lowercase
(with lc or \L), but the normal
versions of \w and lc neither
match accented letters nor change their case.
Solution
Perl’s regular-expression and text-manipulation routines have
hooks into the POSIX locale settings. Under the use locale pragma,
accented characters are taken care of, assuming a reasonable LC_CTYPE
specification and system support for the same.
use locale;
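As a minimal sketch of the pragma in use, the following script asks for a German locale, then extracts and lowercases words from a Latin-1 string. The locale name de_DE.ISO8859-1 is an assumption for illustration; the names available differ from system to system (locale -a lists them on most Unix systems), so the script checks the return value of setlocale rather than assuming success.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use locale;                 # make \w, lc, uc, etc. consult LC_CTYPE
use POSIX qw(locale_h);     # imports setlocale() and the LC_* constants

# Hypothetical locale name; substitute one that your system provides.
my $wanted = "de_DE.ISO8859-1";
my $active = setlocale(LC_CTYPE, $wanted);   # undef if unavailable

my $text  = "Sch\xF6ne Gr\xFC\xDFe";         # ISO 8859-1 bytes
my @words = ();

if (defined $active) {
    # Under the German locale, the accented bytes count as letters.
    @words = map { lc } $text =~ /(\w+)/g;
    print "locale $active: @words\n";
} else {
    warn "locale $wanted is not available on this system\n";
}
```

If the locale is not installed, setlocale returns undef and the script only warns; nothing is matched under the wrong rules.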
Discussion
By default, \w+ and case-mapping functions operate
on upper- and lowercase letters, digits, and underscores. This works
only for the simplest of English words, failing even on many common
imports. The use locale directive lets you redefine what a
“word character” means.
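The default behavior described above can be seen in a short sketch that deliberately omits use locale. It assumes an ASCII-based platform and a plain byte string (no use utf8 and no unicode_strings feature), so the accented bytes are not treated as letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Deliberately no "use locale": these are Perl's default byte-string rules.

my $name  = "andreas k\xF6nig";     # \xF6 is o-umlaut in ISO 8859-1
my @words = $name =~ /(\w+)/g;      # \xF6 is not a word character here
my $lower = lc "K\xD6NIG";          # \xD6 (O-umlaut) is left unchanged

print scalar(@words), " words\n";   # the surname is split at the umlaut
printf "%vd\n", $lower;             # byte values of the lowercased string
```

Under those assumptions the surname splits at the umlaut, giving the three words andreas, k, and nig, and lc leaves the byte \xD6 as it was.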
Example 6-10 shows the difference in output between the English (“en”) locale and the German (“de”) one.
Example 6-10. localeg
#!/usr/bin/perl -w
# localeg - demonstrate locale effects
use locale;
use POSIX 'locale_h';

$name = "andreas k\xF6nig";
@locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

setlocale(LC_CTYPE, $locale{English})
    or die "Invalid locale $locale{English}";
@english_names = ();
while ($name =~ /\b(\w+)\b/g) {
    push(@english_names, ...