Matching Multiple-Byte Characters

Problem

You need to perform regular-expression searches against multiple-byte characters.

A character encoding is a set of mappings from characters and symbols to digital representations. ASCII is an encoding where each character is represented as exactly one byte, but complex writing systems, such as those for Chinese, Japanese, and Korean, have so many characters that their encodings need to use multiple bytes to represent characters.

Perl works on the principle that each byte represents a single character, which works well in ASCII but makes regular expression matches on strings containing multiple-byte characters tricky, to say the least. The regular expression engine does not understand the character boundaries in your string of bytes, and so can return “matches” from the middle of one character to the middle of another.
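To make the false-match problem concrete, here is a minimal sketch that assumes the text is held as raw EUC-JP bytes. The byte values are illustrative: the text is two hiragana characters, and the search target is a different two-byte character whose bytes happen to equal the second byte of the first character followed by the first byte of the second.

```perl
use strict;
use warnings;

# Two hiragana characters as raw EUC-JP bytes (assumed encoding):
# "a" is 0xA4 0xA2 and "i" is 0xA4 0xA4.
my $text = "\xA4\xA2\xA4\xA4";

# A different two-byte character, encoded as 0xA2 0xA4.
# It does not occur in $text as a character.
my $needle = "\xA2\xA4";

# A byte-oriented match still succeeds, because the needle's bytes
# straddle the boundary between the two real characters.
my $false_match = $text =~ /\Q$needle\E/ ? 1 : 0;
my $offset      = $false_match ? $-[0] : -1;

print "false match at byte offset $offset\n" if $false_match;
# prints "false match at byte offset 1"
```

Offset 1 lies in the middle of the first character, which is exactly the kind of "match" the engine should not report.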

Solution

Exploit the encoding by tailoring the pattern to the sequences of bytes that constitute characters. The basic approach is to build a pattern that matches a single (multiple byte) character in the encoding, and then use that “any character” pattern in larger patterns.
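The approach can be sketched as follows, taking EUC-JP as the example encoding. The byte ranges below are an assumption about the EUC-JP layout and should be checked against the encoding's specification; `$eucjp_char` and `find_char_aligned` are hypothetical names introduced for illustration. The idea is to anchor the search at the start of the string and consume whole characters before trying the target, so a match can begin only on a character boundary.

```perl
use strict;
use warnings;

# "Any single character" in EUC-JP (assumed byte layout).
my $eucjp_char = qr{
      [\x00-\x7F]                    # one byte: ASCII range
    | \x8E [\xA1-\xDF]               # two bytes: half-width katakana
    | \x8F [\xA1-\xFE] [\xA1-\xFE]   # three bytes: JIS X 0212
    | [\xA1-\xFE] [\xA1-\xFE]        # two bytes: JIS X 0208
}x;

# Hypothetical helper: return the byte offset of $needle in $text,
# considering only matches that start on a character boundary,
# or -1 if there is no such match.
sub find_char_aligned {
    my ($text, $needle) = @_;
    return $text =~ /\A (?: $eucjp_char )*? (\Q$needle\E) /x ? $-[1] : -1;
}

my $text = "\xA4\xA2\xA4\xA4";   # two hiragana characters: "a", "i"

print find_char_aligned($text, "\xA2\xA4"), "\n";  # straddles a boundary: -1
print find_char_aligned($text, "\xA4\xA4"), "\n";  # the second character: 2
```

The non-greedy `*?` steps forward one whole character at a time, so the target is only ever tested at positions where a character begins.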

Discussion

As an example, we’ll examine one of the encodings for Japanese, called EUC-JP, and then show how we use this in solving a number of multiple-byte encoding issues. EUC-JP can represent thousands of characters, but it’s basically a superset of ASCII. Bytes with values ranging from 0 to 127 (0x00 to 0x7F) are almost exactly their ASCII counterparts, ...
