Honoring Locale Settings in Regular Expressions

Problem

You want to change case correctly when working in a different locale, or you want \w to match letters with diacritics, as in José or déjà vu.

For example, let’s say you’re given half a gigabyte of text written in German and told to index it. You want to extract words (with \w+) and convert them to lowercase (with lc or \L), but the normal versions of \w and lc neither match whole German words nor change the case of accented letters.

Solution

Perl’s regular-expression and text-manipulation routines have hooks into the POSIX locale settings. Under the use locale pragma, accented characters are handled correctly, assuming a reasonable LC_CTYPE specification and system support for it.

use locale;
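
The pragma only tells Perl to consult the locale; the locale itself still has to be selected and has to exist on the system. The following is a minimal sketch of that step, not a definitive recipe. The helper first_available_locale is hypothetical, and the German locale names are guesses at common spellings that vary from system to system; only "C" is expected to be present everywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use locale;
use POSIX qw(locale_h);    # imports setlocale() and the LC_* constants

# Hypothetical helper: return the name reported by setlocale() for the
# first candidate it accepts for LC_CTYPE, or undef if none is accepted.
sub first_available_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        my $got = setlocale(LC_CTYPE, $name);
        return $got if defined $got;
    }
    return;
}

# Assumed spellings of an ISO 8859-1 German locale, with "C" as a fallback.
my $active = first_available_locale(qw(de_DE.ISO8859-1 de_DE.iso88591 C));
print "LC_CTYPE is now: $active\n";
```

If the helper falls back to "C", \w and lc behave as they do without the pragma, so the printed name is worth checking before indexing any text.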

Discussion

By default, \w+ and the case-mapping functions recognize only ASCII upper- and lowercase letters, digits, and underscores. This works only for the simplest of English words, failing even on many common loanwords. The use locale directive lets you redefine what a “word character” means.
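
The default behavior described above can be seen in a short sketch that deliberately omits use locale. It assumes the text is a byte string in ISO 8859-1, where \xF6 is ö and \xD6 is Ö, and that no Unicode-related pragmas are in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Deliberately no "use locale", "use utf8", or unicode_strings feature here.

my $name = "andreas k\xF6nig";    # \xF6 is o-umlaut in ISO 8859-1

# Without a locale, \xF6 is not a word character, so the surname splits.
my @words;
while ($name =~ /\b(\w+)\b/g) {
    push @words, ucfirst $1;
}
print "Words: @words\n";          # prints "Words: Andreas K Nig"

# Case mapping likewise ignores bytes outside the ASCII range.
my $upper = "\xD6";               # capital O-umlaut in ISO 8859-1
my $lower = lc $upper;
printf "lc changed the byte: %s\n", ($lower eq $upper ? "no" : "yes");
```

The surname comes back as two fragments because the accented byte acts as a word boundary, which is the failure the pragma is meant to address.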

In Example 6-10, you can see the difference in output between selecting the English (“en”) locale and the German (“de”) one.

Example 6-10. localeg

#!/usr/bin/perl -w
# localeg - demonstrate locale effects

use locale;
use POSIX 'locale_h';

$name = "andreas k\xF6nig";
@locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

setlocale(LC_CTYPE, $locale{English})
  or die "Invalid locale $locale{English}";
@english_names = ();
while ($name =~ /\b(\w+)\b/g) {
    push(@english_names, ...
