Skip to Content
Perl Cookbook
book

Perl Cookbook

by Tom Christiansen, Nathan Torkington
August 1998
Intermediate to advanced
800 pages
39h 20m
English
O'Reilly Media, Inc.
Content preview from Perl Cookbook

Matching Multiple-Byte Characters

Problem

You need to perform regular-expression searches against multiple-byte characters.

A character encoding is a set mapping from characters and symbols to digital representations. ASCII is an encoding where each character is represented as exactly one byte, but complex writing systems, such as those for Chinese, Japanese, and Korean, have so many characters that their encodings need to use multiple bytes to represent characters.

Perl works on the principle that each byte represents a single character, which works well in ASCII but makes regular expression matches on strings containing multiple-byte characters tricky, to say the least. The regular expression engine does not understand the character boundaries in your string of bytes, and so can return “matches” from the middle of one character to the middle of another.

Solution

Exploit the encoding by tailoring the pattern to the sequences of bytes that constitute characters. The basic approach is to build a pattern that matches a single (multiple byte) character in the encoding, and then use that “any character” pattern in larger patterns.

Discussion

As an example, we’ll examine one of the encodings for Japanese, called EUC-JP, and then show how we use this in solving a number of multiple-byte encoding issues. EUC-JP can represent thousands of characters, but it’s basically a superset of ASCII. Bytes with values ranging from to 127 (0x00 to 0x7F) are almost exactly their ASCII counterparts, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Perl Cookbook, 2nd Edition

Perl Cookbook, 2nd Edition

Tom Christiansen, Nathan Torkington
Perl One-Liners

Perl One-Liners

Peteris Krumins
Perl Best Practices

Perl Best Practices

Damian Conway
Perl in a Nutshell, 2nd Edition

Perl in a Nutshell, 2nd Edition

Nathan Patwardhan, Ellen Siever, Stephen Spainhour

Publisher Resources

ISBN: 1565922433Supplemental ContentCatalog PageErrata