Chapter 8: Strings and Unicode

One of the more common sources of pain when writing Python applications is handling string data, particularly when strings contain characters outside the basic Latin alphabet.

One of the first standards developed for representing string data is known as ASCII, which stands for American Standard Code for Information Interchange. ASCII defines a dictionary for representing common characters such as “A” through “Z” (in both upper- and lowercase), the digits “0” through “9,” and a few common symbols (such as period, question mark, and so on).
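The character-to-number table that ASCII defines can be inspected directly with Python's built-in ord() and chr() functions. The following is a minimal sketch, assuming Python 3; the variable names are illustrative only.

```python
# Look up the numeric ASCII code for a handful of characters (Python 3).
codes = {ch: ord(ch) for ch in "AZaz09.?"}

# chr() is the inverse of ord(): it turns a code back into its character.
round_trip = "".join(chr(code) for code in codes.values())

# ASCII is a 7-bit code, so every value is below 128.
all_ascii = all(code < 128 for code in codes.values())

print(codes)
print(round_trip, all_ascii)
```

Each uppercase letter, lowercase letter, digit, and symbol has its own distinct code, which is why "A" and "a" are different characters as far as the computer is concerned.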

However, ASCII relies on the assumption that each character maps to a single byte, and it runs into trouble because the world's writing systems contain far more characters than the 256 values a single byte can hold. As a result, a standard known as Unicode is now used to represent text.
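To make the single-byte limitation concrete, the sketch below counts how many bytes a few characters occupy once encoded. It assumes Python 3 and uses UTF-8, one common encoding of Unicode, purely as an illustration; the sample characters are arbitrary choices.

```python
# Sample characters written as escapes: "A", e with acute accent,
# the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Each entry: (character, Unicode code point, number of bytes in UTF-8).
sizes = [(ch, ord(ch), len(ch.encode("utf-8"))) for ch in samples]

for ch, code_point, n_bytes in sizes:
    print(f"U+{code_point:04X} takes {n_bytes} byte(s) in UTF-8")
```

Only the first character fits in one byte; the others need two, three, and four bytes respectively.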

In Python, there are two different kinds of string data: text strings and byte strings. It is possible to convert one type to the other. It is important to understand which kind of data you are dealing with, and to keep the two kinds straight consistently.
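As a rough illustration of the two kinds of string data and the conversion between them, here is a short sketch. It assumes Python 3, where text strings are str and byte strings are bytes, and it assumes UTF-8 as the encoding; the variable names are made up for the example.

```python
text = "caf\u00e9"                 # text string: four characters
data = text.encode("utf-8")        # byte string: five bytes
restored = data.decode("utf-8")    # back to a text string

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Python 3 refuses to mix the two kinds implicitly.
try:
    text + data
except TypeError as exc:
    mix_error = type(exc).__name__
    print("cannot concatenate str and bytes:", mix_error)
```

The lengths differ because the accented character is one character in the text string but two bytes in the byte string, which is one reason it pays to know which kind of data a variable holds.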

In this chapter, you learn about the difference between text strings and byte strings, and how the types are implemented in both Python 2 and Python 3. You also learn how to deal with common problems that can pop up when you're working with string data within Python programs.

Text String Versus Byte String

Data is consistently stored in bytes. Character sets such as ASCII and Unicode are responsible for using byte data ...
