
Chapter 9: Information Processing Techniques
Table 9-15. Character searching example: Shift-JIS

                 Characters   Shift-JIS representation
Search string    剣           8C 95
Correct          剣道         8C 95 93 B9
Incorrect        白血病       94 92 8C 8C 95 61
Note how the example of an incorrect match spans two characters, specifically the second
byte of one character and the first byte of the next one. The incorrect match was made by
treating every byte as a single character. Clearly, the one-byte-equals-one-character barrier
or mind set must be overcome in order to handle CJKV text properly. This is a crucial issue
for those who are writing multiple-byte–capable search engines, a topic covered later
in this chapter in the section entitled “Search Engines.”
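The difference between the two kinds of match can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular search engine: the helper names (is_sjis_lead, naive_find, sjis_find) are made up for this example, the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are the usual Shift-JIS first-byte ranges, and the sample strings 剣, 剣道, and 白血病 are encoded with Python's built-in shift_jis codec.

```python
def is_sjis_lead(b: int) -> bool:
    """Return True if b can be the first byte of a two-byte Shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def naive_find(haystack: bytes, needle: bytes) -> int:
    """Byte-wise search: treats every byte as a character (the incorrect approach)."""
    return haystack.find(needle)


def sjis_find(haystack: bytes, needle: bytes) -> int:
    """Search that only tests positions where a Shift-JIS character begins."""
    i = 0
    while i <= len(haystack) - len(needle):
        if haystack.startswith(needle, i):
            return i
        # Step over a whole character: two bytes after a lead byte, else one.
        i += 2 if is_sjis_lead(haystack[i]) else 1
    return -1


needle = "剣".encode("shift_jis")      # 8C 95
right = "剣道".encode("shift_jis")     # 8C 95 93 B9
wrong = "白血病".encode("shift_jis")   # 94 92 8C 8C 95 61

print(naive_find(right, needle), sjis_find(right, needle))  # 0 0
print(naive_find(wrong, needle), sjis_find(wrong, needle))  # 3 -1
```

The byte-wise search reports a hit at offset 3 of the second string, which is the second byte of one character followed by the first byte of the next. The character-aware search advances one whole character at a time and so never tests that offset.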
Line Breaking
Many text-processing programs allow users to break long lines into shorter ones, usually
by specifying a maximum number of columns per line. As you can expect, breaking a line
between the bytes of a two-byte character can result in a loss of information and end up
corrupting surrounding characters.
Let’s look at what may happen when ISO-2022-JP, Shift-JIS, and EUC-JP strings are broken
into two lines. In the example string given in Tables 9-16 through 9-18, a line break
is inserted between the two bytes that represent the katakana character サ (sa). Note how
that character is apparently lost, and how some characters after the line break become
scrambled. Some Japanese ...