book

Python Cookbook

by Alex Martelli, David Ascher

July 2002

Intermediate to advanced

608 pages

15h 46m

English

O'Reilly Media, Inc.

Read now

Unlock full access

The Design of the Book

Content preview from Python Cookbook

Converting Between Unicode and Plain Strings

Credit: David Ascher, Paul Prescod

Problem

You need to deal with data that doesn’t fit in the ASCII character set.

Solution

Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:

# Convert Unicode to plain Python string: "encode"
unicodestring = u"Hello world"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")

# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")

assert plainstring1==plainstring2==plainstring3==plainstring4

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode—what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don’t need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold up to 256 values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. ...