Chapter 4. Unicode Text Versus Bytes
Humans use text. Computers speak bytes.
Esther Nam and Travis Fischer, “Character Encoding and Unicode in Python”1
Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes. Implicit conversion of byte sequences to Unicode text is a thing of the past. This chapter deals with Unicode strings, binary sequences, and the encodings used to convert between them.
Depending on the kind of work you do with Python,
you may think that understanding Unicode is not important.
That’s unlikely, but anyway there is no escaping the str versus byte divide.
As a bonus, you’ll find that the specialized binary sequence types provide
features that the “all-purpose” Python 2 str type did not have.
In this chapter, we will visit the following topics:
-
Characters, code points, and byte representations
-
Unique features of binary sequences:
bytes,bytearray, andmemoryview -
Encodings for full Unicode and legacy character sets
-
Avoiding and dealing with encoding errors
-
Best practices when handling text files
-
The default encoding trap and standard I/O issues
-
Safe Unicode text comparisons with normalization
-
Utility functions for normalization, case folding, and brute-force diacritic removal
-
Proper sorting of Unicode text with
localeand the pyuca library -
Character metadata in the Unicode database
-
Dual-mode APIs that handle
strandbytes
What’s New in This Chapter
Support for Unicode in Python 3 has been comprehensive ...