Honoring Locale Settings in Regular Expressions
Problem
You want to convert case under a different locale, or you want to make \w match letters with diacritics, such as José or déjà vu.
For example, let’s say you’re given half a gigabyte of text written in German and told to index it. You want to extract words (with \w+) and convert them to lowercase (with lc or \L), but the normal versions of \w and lc neither match the German words nor change the case of accented letters.
Solution
Perl’s regular-expression and text-manipulation routines have hooks to the POSIX locale setting. If you use the use locale pragma, accented characters are taken care of, assuming a reasonable LC_CTYPE specification and system support for the same.
use locale;
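As a minimal sketch of the indexing task from the Problem section, the helper below extracts \w+ runs and lowercases them under a chosen LC_CTYPE locale. The name words_in_locale and the locale name de_DE.ISO-8859-1 are assumptions for illustration; locale names vary between systems, so the helper returns an empty list when the requested locale is not installed.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);    # imports setlocale() and the LC_* constants

# Hypothetical helper: return the lowercased \w+ runs found in $text
# while LC_CTYPE is set to $locale_name. Returns an empty list if the
# system does not provide that locale.
sub words_in_locale {
    my ($locale_name, $text) = @_;
    my $previous = setlocale(LC_CTYPE);    # remember the current setting
    defined setlocale(LC_CTYPE, $locale_name) or return;
    my @words;
    {
        use locale;    # \w and lc consult LC_CTYPE inside this block
        while ($text =~ /(\w+)/g) {
            push @words, lc $1;
        }
    }
    setlocale(LC_CTYPE, $previous) if defined $previous;    # restore
    return @words;
}

my $name = "Andreas K\xF6nig";    # \xF6 is o-umlaut in ISO 8859-1

# The "C" locale exists everywhere; it normally treats \xF6 as a non-word byte.
my @c_words = words_in_locale("C", $name);
print "C locale: @c_words\n";

# Assumed locale name; adjust to whatever `locale -a` reports on your system.
my @de_words = words_in_locale("de_DE.ISO-8859-1", $name);
if (@de_words) {
    printf "German locale: %d word(s)\n", scalar @de_words;
} else {
    print "German locale not available on this system\n";
}
```

Restoring the previous LC_CTYPE value keeps the helper from changing locale-dependent behavior elsewhere in the program.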
Discussion
By default, \w+ and case-mapping functions operate on ASCII upper- and lowercase letters, digits, and underscores. This works only for the simplest of English words, failing even on many common imports. The use locale directive lets you redefine what a “word character” means.
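The default behavior described above can be seen without touching the locale at all. The sketch below assumes a byte string in ISO 8859-1 and a script that enables neither use locale nor the unicode_strings feature; default_word_runs is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the \w+ runs in a byte string using Perl's
# default rules (no "use locale", no unicode_strings feature).
sub default_word_runs {
    my ($text) = @_;
    return $text =~ /(\w+)/g;
}

# "deja vu" with its accents, encoded as ISO 8859-1 bytes.
my $phrase = "d\xE9j\xE0 vu";

# The accented bytes are not word characters by default, so the first
# word is broken into pieces.
my @runs = default_word_runs($phrase);
print scalar(@runs), " runs: @runs\n";    # prints "3 runs: d j vu"

# lc leaves an accented capital (\xC9, E-acute) unchanged by default.
print lc("\xC9") eq "\xC9" ? "lc left \\xC9 alone\n" : "lc changed \\xC9\n";
```

Both results change once use locale is in effect and LC_CTYPE names a locale that defines these bytes as letters.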
Example 6-10 shows the difference in output between selecting the English (“en”) locale and the German (“de”) one.
Example 6-10. localeg
#!/usr/bin/perl -w
# localeg - demonstrate locale effects
use locale;
use POSIX 'locale_h';

$name = "andreas k\xF6nig";
@locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

setlocale(LC_CTYPE, $locale{English})
    or die "Invalid locale $locale{English}";
@english_names = ();
while ($name =~ /\b(\w+)\b/g) {
    push(@english_names, ...