16.11. Reading or Writing Unicode Characters
Problem
You want to read Unicode-encoded characters from a file, database, or form; or, you want to write Unicode-encoded characters.
Solution
Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:
print utf8_encode('Kurt Gödel is swell.');
Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:
print utf8_decode("Kurt G\xc3\xb6del is swell.");
Discussion
A single byte can hold 256 different values. ASCII standardizes only the codes from 0 through 127: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most Western European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.
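To make the difference between the two encodings concrete, the following sketch dumps the bytes of the same word in ISO-8859-1 and in UTF-8. It uses mb_convert_encoding( ) rather than utf8_encode( ) and utf8_decode( ), which are deprecated as of PHP 8.2; this assumes the mbstring extension is available. The variable names are illustrative only.

```php
<?php
// "ö" as a single ISO-8859-1 byte (0xF6), written as an escape so the
// result does not depend on the encoding of this source file.
$latin1 = "G\xF6del";

// Convert ISO-8859-1 to UTF-8 (requires the mbstring extension).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n"; // 47f664656c   (ö is the one byte f6)
echo bin2hex($utf8), "\n";   // 47c3b664656c (ö is the two bytes c3 b6)

// Converting back restores the original single-byte string.
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump($roundTrip === $latin1); // bool(true)
```

The ASCII letters keep the same byte values in both encodings; only the character above code 127 changes.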
This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.
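As a rough illustration of that space cost, the sketch below prints how many bytes UTF-8 uses for a few sample code points. The helper utf8_byte_length( ) is hypothetical, written for this example from the standard UTF-8 ranges rather than taken from PHP itself; mb_chr( ) assumes PHP 7.2 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: number of bytes UTF-8 needs for a code point,
// based on the standard UTF-8 range boundaries.
function utf8_byte_length(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII range
    }
    if ($codePoint < 0x800) {
        return 2;
    }
    if ($codePoint < 0x10000) {
        return 3;
    }
    return 4;
}

// Sample code points: A, ö, the euro sign, and an emoji.
$samples = [0x41, 0xF6, 0x20AC, 0x1F600];

foreach ($samples as $cp) {
    $char = mb_chr($cp, 'UTF-8'); // build the UTF-8 string for this code point
    // strlen() counts bytes, not characters, so it shows the encoded size.
    printf("U+%04X  %d byte(s)  %s\n", $cp, strlen($char), bin2hex($char));
}
```

Because strlen( ) counts bytes, it should agree with utf8_byte_length( ) for each sample; mb_strlen( ) with a 'UTF-8' argument would instead report one character in every case.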
Table 16-2. UTF-8 byte representation
| Character code range | Bytes used | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| | 1 | | | | |