
Byte Versus Character Handling
Table 9-12 illustrates what happens with the same character string when it is
EUC-JP encoded.
Table 9-12. Character deletion example: EUC-JP
Text representation / EUC-JP representation
Original string: 4 1 1
Correct
    Delete: 4 1
    Add character: 4 1 5 5
Incorrect
    Delete: 4 1
    Add character: 4 1 C5 B5
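The difference between the correct and incorrect rows can be reproduced with a short sketch. It assumes the sample string "かな漢字" (an illustrative choice, not necessarily the string used in the table) and relies on Python's built-in `euc_jp` codec. The helper names `delete_last_char`, `delete_last_byte`, and `is_valid` are hypothetical.

```python
def delete_last_char(data: bytes, encoding: str = "euc_jp") -> bytes:
    """Character-aware delete: decode, drop one character, re-encode."""
    return data.decode(encoding)[:-1].encode(encoding)


def delete_last_byte(data: bytes) -> bytes:
    """Naive delete: drop one byte, ignoring character boundaries."""
    return data[:-1]


def is_valid(data: bytes, encoding: str = "euc_jp") -> bool:
    """Report whether the byte string decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumed sample text: four characters, two bytes each in EUC-JP.
original = "かな漢字".encode("euc_jp")

correct = delete_last_char(original)   # removes both bytes of the last character
broken = delete_last_byte(original)    # leaves a stray first byte behind

print(len(original), len(correct), len(broken))
print(is_valid(correct), is_valid(broken))

# Typing a new character after the stray byte does not repair the damage.
broken_plus = broken + "字".encode("euc_jp")
print(is_valid(broken_plus))
```

Deleting by byte leaves an odd number of bytes in the buffer, so every byte that follows the stray one is paired with the wrong neighbor.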
This problem is fixed by keeping track of the characters at the insertion point, specifically whether
they are represented by one or more bytes. If a byte happens to be the second byte of a
two-byte character, both bytes must be deleted with a single keystroke. In the case of
three-byte characters (for example, characters from EUC-JP code set 3, which holds JIS X 0212-1990
characters), three bytes must be deleted. Extreme examples include EUC-TW and GB
18030 encodings, both of which have a four-byte representation. That's a lot of bytes,
meaning that a lot can go wrong if you're not careful. Also, anything outside of the BMP,
when dealing with the UTF-8 encoding form, is four bytes.
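One way to track character boundaries in an EUC-JP buffer is sketched below. It is a minimal illustration, not a complete editor routine. It assumes the buffer is well-formed EUC-JP, and the function names `eucjp_char_len`, `char_span_before`, and `backspace` are hypothetical. Because the trailing bytes of EUC-JP characters fall in the same range as first bytes, the sketch scans forward from the start of the buffer instead of guessing backward from the cursor.

```python
from typing import Tuple


def eucjp_char_len(lead: int) -> int:
    """Return the byte length of an EUC-JP character from its first byte."""
    if lead == 0x8F:                       # SS3: code set 3 (JIS X 0212-1990)
        return 3
    if lead == 0x8E or 0xA1 <= lead <= 0xFE:
        return 2                           # SS2 (code set 2) or code set 1
    return 1                               # code set 0 (ASCII or JIS-Roman)


def char_span_before(buf: bytes, cursor: int) -> Tuple[int, int]:
    """Return (start, end) offsets of the character that ends at, or straddles, cursor."""
    pos = 0
    start = 0
    while pos < cursor:
        start = pos
        pos += eucjp_char_len(buf[pos])
    return start, min(pos, len(buf))


def backspace(buf: bytes, cursor: int) -> Tuple[bytes, int]:
    """Delete one whole character before the cursor; return (new buffer, new cursor)."""
    start, end = char_span_before(buf, cursor)
    return buf[:start] + buf[end:], start


# One single-byte character followed by two two-byte characters.
buf = "aかな".encode("euc_jp")
new_buf, new_cursor = backspace(buf, len(buf))
print(new_buf.decode("euc_jp"), new_cursor)

# A cursor that sits inside a three-byte character still removes all three bytes.
print(backspace(b"a\x8f\xb0\xa1", 3))
```

The forward scan costs time proportional to the cursor offset. A real editor would more likely cache boundaries or store text as characters internally, but the scan keeps the sketch short.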
Character Insertion
Inserting characters is problematic only when the insertion point (that is, the cursor) is
between the two bytes that represent a two-byte character. This then splits the two-byte
character and results in data loss. This section, as you may have expected, relates ...