Credit: Holger Krekel
Python has a first-class unicode type that you can use in place of
the plain bytestring str type. It's easy, once you accept the need to
explicitly convert between a bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
german_ae is a unicode string representing the German lowercase a
with umlaut (i.e., diaeresis) character "ä". It has been constructed
by interpreting the bytestring '\xc3\xa4' according to the specified
UTF-8 encoding. There are many encodings, but UTF-8 is often used
because it is universal (UTF-8 can encode any Unicode string) and yet
fully compatible with the 7-bit ASCII set (any ASCII bytestring is a
correct UTF-8-encoded string).
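The two properties claimed for UTF-8 can be checked directly. The following is only a sketch, written so that it runs on a current Python 3 interpreter, where the bytestring type is called bytes and the Unicode type is called str (an assumption that differs from the unicode/str naming used in this recipe). All variable names are illustrative.

```python
# Sketch for Python 3: bytes is the bytestring type, str holds Unicode text.
utf8_bytes = b'\xc3\xa4'                  # two bytes as stored or transmitted
german_ae = utf8_bytes.decode('utf-8')    # one Unicode character, U+00E4

# ASCII compatibility: pure ASCII bytes decode identically under both codecs.
ascii_bytes = b'plain ASCII text'
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')

# Universality: UTF-8 can encode a character that Latin-1 cannot represent.
euro = u'\u20ac'                          # the euro sign
euro_utf8 = euro.encode('utf-8')
try:
    euro.encode('latin-1')
    latin1_can_encode_euro = True
except UnicodeEncodeError:
    latin1_can_encode_euro = False
```

The euro sign is used here only because it lies outside Latin-1; any character missing from a given legacy encoding would show the same contrast.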
Once you cross this barrier, life is easy! You can manipulate
this Unicode string in practically the same way as a plain str:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
para is a Unicode string, because operations between a unicode
string and a bytestring always result in a unicode string, unless
they fail and raise an exception:
>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position 0: ordinal not in range(128)
The byte '0xc3' is not a valid character in the 7-bit ASCII
encoding, and Python refuses to guess an encoding. So, being explicit
about encodings is the crucial point for successfully using Unicode
strings with Python.
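The failure shown above comes from an implicit ASCII decoding of the bytestring. As a hedged sketch, the snippet below performs that same decoding step explicitly, so the error can be observed, and then does the explicit conversion that makes the concatenation succeed. It is written for Python 3, which is an assumption: there, mixing bytes and str is rejected outright rather than converted implicitly, so the ASCII decoding has to be spelled out to reproduce the message.

```python
german_ae = b'\xc3\xa4'.decode('utf-8')
bytestring = b'\xc3\xa4'          # some non-ASCII bytestring

# Spell out the ASCII decoding that the failing += performed implicitly.
try:
    bytestring.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Being explicit about the encoding makes the same operation succeed.
combined = german_ae + bytestring.decode('utf-8')
```

Here message holds the same "'ascii' codec can't decode byte 0xc3" text as the traceback above, and combined is a two-character Unicode string.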
Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of Unicode that Python provides.
The most important issue is to fully accept the distinction
between a bytestring and a unicode string. As exemplified in this
recipe's solution, you often need to explicitly construct a unicode
string by providing a bytestring and an encoding. Without an encoding,
a bytestring is basically meaningless, unless you happen to be lucky
and can just assume that the bytestring is text in ASCII.
The most common problem with using Unicode in Python arises when
you are doing some text manipulation where only some of your strings
are unicode objects and others are bytestrings. Python makes a shallow
attempt to implicitly convert your bytestrings to Unicode. It usually
assumes an ASCII encoding, though, which gives you UnicodeDecodeError
exceptions if you actually have non-ASCII bytes somewhere.
UnicodeDecodeError tells you that you mixed Unicode and bytestrings in
such a way that Python cannot (doesn't even try to) guess the text
your bytestring might represent.
Developers from many big Python projects have come up with
simple rules of thumb to prevent such runtime
UnicodeDecodeErrors, and the rules may be
summarized into one sentence: always do the conversion at IO barriers.
To express this same concept a bit more extensively:
Whenever your program receives text data "from the outside"
(from the network, from a file, from user input, etc.), construct
unicode objects immediately.
Find out the appropriate encoding, for example, from an HTTP
header, or look for an appropriate convention to determine the
encoding to use.
Whenever your program sends text data "to the outside" (to
the network, to some file, to the user, etc.), determine the
correct encoding, and convert your text to a bytestring with that
encoding. (Otherwise, Python attempts to convert Unicode to an
ASCII bytestring, likely producing UnicodeEncodeErrors, which are
just the converse of the UnicodeDecodeErrors described above.)
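The two rules can be sketched as a small function that decodes as soon as bytes arrive, works only on Unicode text in between, and encodes just before bytes leave. This is an illustration under stated assumptions, not a prescribed implementation: convert_stream and its parameters are hypothetical names, io.BytesIO objects stand in for a real file or socket, the encodings are supplied by the caller (in practice they would come from an HTTP header or a convention, as described above), and the code targets Python 3.

```python
import io

def convert_stream(raw_in, raw_out, in_encoding, out_encoding):
    """Decode on the way in, work on text, encode on the way out."""
    # Input barrier: turn the received bytes into Unicode text immediately.
    text = raw_in.read().decode(in_encoding)
    # Inside the program, only Unicode text is manipulated.
    processed = text.upper()
    # Output barrier: choose an encoding explicitly before sending bytes.
    raw_out.write(processed.encode(out_encoding))
    return processed

source = io.BytesIO(b'sch\xc3\xb6n')      # UTF-8 bytes from "the outside"
sink = io.BytesIO()                       # stands in for a file or socket
processed = convert_stream(source, sink, 'utf-8', 'latin-1')
```

After the call, sink.getvalue() holds the uppercased text as Latin-1 bytes, while processed remains a Unicode string.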
With these two rules, you will solve most Unicode problems. If
you still get errors of either kind, look for where you forgot to
properly construct a unicode object, forgot to properly convert back
to an encoded bytestring, or ended up using an inappropriate encoding
due to some mistake. (It is quite possible that such encoding mistakes
are due to the user, or some other program that is interacting with
yours, not following the proper encoding rules or conventions.)
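An inappropriate encoding can show up in two different ways, which the following sketch tries to make visible: sometimes the decoding raises an exception, and sometimes it silently produces the wrong characters. The snippet assumes Python 3, and the variable names are illustrative only.

```python
# UTF-8 bytes decoded as Latin-1: no exception, but the text is wrong.
utf8_bytes = b'\xc3\xa4'
garbled = utf8_bytes.decode('latin-1')    # two characters instead of one

# Latin-1 bytes decoded as UTF-8: the mistake is reported as an error.
latin1_bytes = b'\xe4'
try:
    latin1_bytes.decode('utf-8')
    utf8_decode_failed = False
except UnicodeDecodeError:
    utf8_decode_failed = True
```

The silent case is the harder one to track down, since nothing fails until someone looks at the resulting text.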
In order to convert a Unicode string back to an encoded bytestring, you usually do something like:
>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'
bytestring is a German ae character in the 'latin1' encoding.
Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in
UTF-8) represent the same German character, but in different
encodings.
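To make the comparison concrete, here is a minimal sketch that encodes the same character with both encodings and decodes each result back. It assumes Python 3 (where the Unicode type is str and the bytestring type is bytes), and the variable names are illustrative.

```python
german_ae = u'\xe4'                        # the Unicode character "ä"

as_latin1 = german_ae.encode('latin-1')    # one byte
as_utf8 = german_ae.encode('utf-8')        # two bytes

# Different bytes, same character once each is decoded with its own encoding.
round_trip_ok = (as_latin1.decode('latin-1')
                 == as_utf8.decode('utf-8')
                 == german_ae)

# Plain ASCII cannot represent this character at all.
try:
    german_ae.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```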
By now, you can probably imagine why Python refuses to guess
among the hundreds of possible encodings. It's a crucial design
choice, based on one of the Zen of Python
principles: "In the face of ambiguity, resist the temptation to
guess." At any interactive Python shell prompt, enter the statement
import this to read all of the
important principles that make up the Zen of Python.
Unicode is a huge topic, but a recommended book is
Unicode: A Primer, by Tony Graham (Hungry
Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/. Also recommended is a
short but complete article from Joel Spolsky, "The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About
Unicode and Character Sets (No Excuses!)", located at http://www.joelonsoftware.com/articles/Unicode.html.
See also the Library Reference and
Python in a Nutshell documentation about the
built-in str and unicode types and the codecs module;
also, Recipe 1.21 and Recipe 1.22.