Perl Cookbook

by Tom Christiansen, Nathan Torkington
August 1998
Intermediate to advanced
800 pages
English
O'Reilly Media, Inc.

Honoring Locale Settings in Regular Expressions

Problem

You want to translate case when in a different locale, or you want to make \w match letters with diacritics, such as José or déjà vu.

For example, let’s say you’re given half a gigabyte of text written in German and told to index it. You want to extract words (with \w+) and convert them to lowercase (with lc or \L), but the normal versions of \w and lc neither match the accented letters in German words nor change their case.

Solution

Perl’s regular-expression and text-manipulation routines have hooks into the POSIX locale settings. With the use locale pragma in effect, accented characters are handled correctly, assuming a reasonable LC_CTYPE specification and system support for the same.

use locale;
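
As a minimal sketch of the pragma in use, the following lowercases a string containing an uppercase O-umlaut byte (\xD6 in ISO 8859-1). The candidate locale names are assumptions, since spellings differ from system to system; the script tries each in turn and falls back to whatever locale is already in effect if none is installed.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);    # exports setlocale() and the LC_* constants

# Assumption: locale names differ between systems, so try a few spellings.
my @candidates = qw(de_DE.ISO8859-1 de_DE.ISO_8859-1 de_DE.iso88591);
my $chosen;
for my $candidate (@candidates) {
    if (defined setlocale(LC_CTYPE, $candidate)) {
        $chosen = $candidate;
        last;
    }
}

my $lowered;
{
    use locale;            # lexically scoped: affects only this block
    $lowered = lc "K\xD6NIG";
}

# Print the result as hex bytes so the output does not depend on the terminal.
printf "locale:  %s\n",
    defined $chosen ? $chosen : "(none of the candidates is installed)";
printf "lc gave: %s\n",
    join " ", map { sprintf "%02X", ord($_) } split //, $lowered;
```

If one of the German ISO 8859-1 locales is available, the second byte should come back as F6 (lowercase o-umlaut); otherwise it stays D6.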

Discussion

By default, \w+ and the case-mapping functions operate only on the ASCII upper- and lowercase letters, digits, and underscores. This works only for the simplest of English words, failing even on many common imports. The use locale directive lets you redefine what a “word character” means.
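
The default behavior is easy to see in a short sketch. It assumes the sample strings are ISO 8859-1 byte strings (no use utf8, no unicode_strings feature), and the visible helper is a hypothetical convenience for printing non-ASCII bytes safely.

```perl
use strict;
use warnings;

# Assumption: $name holds ISO 8859-1 bytes, so \xF6 stands for o-umlaut.
my $name = "andreas k\xF6nig";

# No "use locale" here, so for this byte string \w matches only ASCII
# letters, digits, and underscore; the name breaks at the accented byte.
my @words = ($name =~ /(\w+)/g);

# lc likewise leaves the uppercase O-umlaut byte (\xD6) untouched.
my $lowered = lc "K\xD6NIG";

# Hypothetical helper: show non-ASCII bytes as \xHH.
sub visible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/ge;
    return $s;
}

print join("|", map { visible($_) } @words), "\n";   # andreas|k|nig
print visible($lowered), "\n";                       # k\xD6nig
```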

In Example 6-10, you can see the difference in output between the English (“en”) locale and the German (“de”) one.

Example 6-10. localeg

#!/usr/bin/perl -w
# localeg - demonstrate locale effects

use locale;
use POSIX 'locale_h';

$name = "andreas k\xF6nig";
@locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

setlocale(LC_CTYPE, $locale{English})
  or die "Invalid locale $locale{English}";
@english_names = ();
while ($name =~ /\b(\w+)\b/g) {
        push(@english_names, ...
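
The listing above is cut off, so the following is a separate, self-contained sketch of the same idea rather than a continuation of it. It extracts \w+ runs from one string under two different LC_CTYPE settings. The words_under and visible helpers and the German locale name are assumptions made for illustration; a locale that is not installed is reported instead of causing a failure.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);    # exports setlocale() and the LC_* constants

# Hypothetical helper: return an array ref of the \w+ runs found in $text
# under the named LC_CTYPE locale, or undef if that locale is unavailable.
sub words_under {
    my ($locale_name, $text) = @_;
    my $previous = setlocale(LC_CTYPE);          # query the current setting
    return undef unless defined(setlocale(LC_CTYPE, $locale_name));
    use locale;                                  # applies to the rest of this sub
    my @words = ($text =~ /(\w+)/g);
    setlocale(LC_CTYPE, $previous) if defined $previous;
    return \@words;
}

# Hypothetical helper: show non-ASCII bytes as \xHH.
sub visible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/ge;
    return $s;
}

my $name = "andreas k\xF6nig";
# Assumption: "de_DE.ISO8859-1" is one common spelling; yours may differ.
for my $locale_name ("C", "de_DE.ISO8859-1") {
    my $words = words_under($locale_name, $name);
    print "$locale_name: ",
        defined $words ? join("|", map { visible($_) } @$words)
                       : "(locale not available)",
        "\n";
}
```

Under the C locale the name splits at the accented byte, giving andreas|k|nig. Where the German locale is installed, the second word should stay in one piece.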

Publisher Resources

ISBN: 1565922433