Unicode I/O
Internally, Unicode strings are represented as sequences of 16-bit (UCS-2) or 32-bit (UCS-4) integer character values, depending on how Python is built. As in 8-bit strings, all characters are the same size, and most common string operations are simply extended to handle strings with a larger range of character values. However, whenever Unicode strings are converted to a stream of bytes, a number of issues arise. First, to preserve compatibility with existing software, it may be desirable to convert Unicode to an 8-bit representation compatible with software that expects to receive ASCII or other 8-bit data. Second, the use of 16-bit or 32-bit characters introduces problems related to byte ordering. For the Unicode character U+HHLL ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access