Python Cookbook, Second Edition

By Alex Martelli, Anna Martelli Ravenscroft, David Ascher

Table of Contents

Chapter 1: Text
Credit: Fred L. Drake, Jr., PythonLabs
Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.
What is text? Sounds like an easy question, doesn't it? After all, we know it when we see it, don't we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.
Unfortunately, all data enters our applications as a sequence of bytes. There's no library function we can call that will tell us whether a particular sequence of bytes represents text, although we can create some useful heuristics that tell us whether data can safely (not necessarily correctly) be handled as text. Recipe 1.11 shows just such a heuristic.
Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codecs (coder-decoders), objects that embody knowledge about the many standard ways in which sequences of characters can be represented by sequences of bytes (also known as encodings and character sets). Note that Unicode strings do not serve double duty as sequences of bytes. Recipe 1.20, Recipe 1.21, and Recipe 1.22 illustrate the fundamentals of Unicode in Python.
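To make the role of codecs concrete, here is a minimal sketch. It assumes Python 3, where the roles described above are played by str (Unicode text) and bytes rather than by unicode and plain strings; the sample word is only an illustration.

```python
# Assumes Python 3: str holds Unicode characters, bytes holds raw bytes.
text = 'caf\u00e9'                      # 4 characters, the last one is e-acute

# Encoding turns characters into bytes; the result depends on the codec.
as_utf8 = text.encode('utf-8')          # e-acute takes two bytes in UTF-8
as_latin1 = text.encode('latin-1')      # and a single byte in ISO-8859-1

# Decoding with the matching codec recovers the original characters.
assert as_utf8.decode('utf-8') == text
assert as_latin1.decode('latin-1') == text

print(len(text), len(as_utf8), len(as_latin1))
```

The same four characters map to byte sequences of different lengths, which is why the bytes alone cannot tell us what text they represent.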
Okay, let's assume that our application knows from the context that it's looking at text. That's usually the best approach because that's where external input comes into play. We're looking at a file either because it has a well-known name and defined format (common in the "Unix" world) or because it has a well-known filename extension that indicates the format of the contents (common on Windows). But now we have a problem: we had to use the word
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Processing a String One Character at a Time
Credit: Luther Blissett
You want to process a string one character at a time.
You can build a list whose items are the string's characters (meaning that the items are strings, each of length one; Python doesn't have a special type for "characters" as distinct from strings). Just call the built-in list, with the string as its argument:
thelist = list(thestring)
You may not even need to build the list, since you can loop directly on the string with a for statement:
for c in thestring:
    do_something_with(c)
or in the for clause of a list comprehension:
results = [do_something_with(c) for c in thestring]
or, with exactly the same effects as this list comprehension, you can call a function on each character with the map built-in function:
results = map(do_something_with, thestring)
In Python, characters are just strings of length one. You can loop over a string to access each of its characters, one by one. You can use map for much the same purpose, as long as what you need to do with each character is call a function on it. Finally, you can call the built-in type list to obtain a list of the length-one substrings of the string (i.e., the string's characters). If what you want is a set whose elements are the string's characters, you can call sets.Set with the string as the argument (in Python 2.4, you can also call the built-in set in just the same way):
import sets
magic_chars = sets.Set('abracadabra')
poppins_chars = sets.Set('supercalifragilisticexpialidocious')
print ''.join(magic_chars & poppins_chars)   # set intersection
acrd
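As a hedged sketch of the same idea with the built-in set type (assuming Python 2.4 or later; the print call below is written for Python 3): because sets are unordered, the order of the joined characters is not guaranteed, so this version sorts them to get a repeatable result.

```python
# Assumes the built-in set type (Python 2.4+); print() as in Python 3.
magic_chars = set('abracadabra')
poppins_chars = set('supercalifragilisticexpialidocious')

# Sets are unordered, so sort the intersection for a repeatable result.
common = ''.join(sorted(magic_chars & poppins_chars))
print(common)
```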
            
The Library Reference section on sequences; Perl Cookbook Recipe 1.5.
Converting Between Characters and Numeric Codes
Credit: Luther Blissett
You need to turn a character into its numeric ASCII (ISO) or Unicode code, and vice versa.
That's what the built-in functions ord and chr are for:
>>> print ord('a')
97
>>> print chr(97)
a
            
The built-in function ord also accepts as its argument a Unicode string of length one, in which case it returns a Unicode code value, up to 65535 (or up to 1114111 on "wide" Unicode builds of Python). To make a Unicode string of length one from a numeric Unicode code value, use the built-in function unichr:
>>> print ord(u'\u2020')
8224
>>> print repr(unichr(8224))
u'\u2020'
            
It's a mundane task, to be sure, but it is sometimes useful to turn a character (which in Python just means a string of length one) into its ASCII or Unicode code, and vice versa. The built-in functions ord, chr, and unichr cover all the related needs. Note, in particular, the huge difference between chr(n) and str(n), which beginners sometimes confuse:
>>> print repr(chr(97))
'a'
>>> print repr(str(97))
'97'
            
chr takes as its argument a small integer and returns the corresponding single-character string according to ASCII, while str, called with any integer, returns the string that is the decimal representation of that integer.
To turn a string into a list of character value codes, use the built-in functions map and ord together, as follows:
>>> print map(ord, 'ciao')
[99, 105, 97, 111]
To build a string from a list of character codes, use ''.join, map and chr; for example:
>>> print ''.join(map(chr, range(97, 100)))
abc
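For readers on Python 3, the following sketch shows the equivalent conversions. It assumes Python 3, where chr covers the whole Unicode range and unichr no longer exists; the variable names are illustrative only.

```python
# Assumes Python 3: chr handles any Unicode code point, so unichr is not needed.
codes = [ord(c) for c in 'ciao']                  # characters -> code points
word = ''.join(chr(n) for n in range(97, 100))    # code points -> string
dagger = chr(8224)                                # same character as '\u2020'

print(codes, word, repr(dagger))
```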
            
Documentation for the built-in functions chr, ord, and unichr in the Library Reference and Python in a Nutshell.
Testing Whether an Object Is String-like
Credit: Luther Blissett
You need to test if an object, typically an argument to a function or method you're writing, is a string (or more precisely, whether the object is string-like).
A simple and fast way to check whether something is a string or Unicode object is to use the built-ins isinstance and basestring, as follows:
def isAString(anobj):
    return isinstance(anobj, basestring)
The first approach to solving this recipe's problem that comes to many programmers' minds is type-testing:
def isExactlyAString(anobj):
    return type(anobj) is type('')
However, this approach is pretty bad, as it willfully destroys one of Python's greatest strengths—smooth, signature-based polymorphism. This kind of test would reject Unicode objects, instances of user-coded subclasses of str, and instances of any user-coded type that is meant to be "string-like".
Using the isinstance built-in function, as recommended in this recipe's Solution, is much better. The built-in type basestring exists exactly to enable this approach. basestring is a common base class for the str and unicode types, and any string-like type that user code might define should also subclass basestring, just to make sure that such isinstance testing works as intended. basestring is essentially an "empty" type, just like object, so no cost is involved in subclassing it.
Unfortunately, the canonical isinstance checking fails to accept such clearly string-like objects as instances of the UserString class from Python Standard Library module UserString, since that class, alas, does not inherit from basestring. If you need to support such types, you can check directly whether an object behaves like a string—for example:
def isStringLike(anobj):
    try: anobj + ''
    except: return False
    else: return True
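The contrast between the isinstance test and the behavioral test can be sketched as follows. This assumes Python 3, where basestring is gone (str is the only built-in text type) and UserString lives in the collections module; the function names are illustrative, not part of the recipe.

```python
from collections import UserString   # Python 3 home of the UserString class

def is_a_string(anobj):
    # Python 3 has no basestring; str is the only built-in text type.
    return isinstance(anobj, str)

def is_string_like(anobj):
    # Duck-typing test: anything that can be concatenated with '' passes.
    try:
        anobj + ''
    except Exception:
        return False
    return True

wrapped = UserString('hello')
print(is_a_string(wrapped), is_string_like(wrapped))
```

The UserString instance fails the isinstance test but passes the behavioral one, which is exactly the gap the recipe describes.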
Aligning Strings
Credit: Luther Blissett
You want to align strings: left, right, or center.
That's what the ljust, rjust, and center methods of string objects are for. Each takes a single argument, the width of the string you want as a result, and returns a copy of the starting string with spaces added on either or both sides:
>>> print '|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|'
| hej                  |                  hej |         hej          |
            
Centering, left-justifying, or right-justifying text comes up surprisingly often—for example, when you want to print a simple report with centered page numbers in a monospaced font. Because of this, Python string objects supply this functionality through three of their many methods. In Python 2.3, the padding character is always a space. In Python 2.4, however, while space-padding is still the default, you may optionally call any of these methods with a second argument, a single character to be used for the padding:
>>> print 'hej'.center(20, '+')
++++++++hej+++++++++
            
The Library Reference section on string methods; Java Cookbook recipe 3.5.
Trimming Space from the Ends of a String
Credit: Luther Blissett
You need to work on a string without regard for any extra leading or trailing spaces a user may have typed.
That's what the lstrip, rstrip, and strip methods of string objects are for. Each takes no argument and returns a copy of the starting string, shorn of whitespace on either or both sides:
>>> x = '    hej   '
>>> print '|', x.lstrip( ), '|', x.rstrip( ), '|', x.strip( ), '|'
| hej    |     hej | hej |
            
Just as you may need to add space to either end of a string to align that string left, right, or center in a field of fixed width (as covered previously in Recipe 1.4), so may you need to remove all whitespace (blanks, tabs, newlines, etc.) from either or both ends. Because this need is frequent, Python string objects supply this functionality through three of their many methods. Optionally, you may call each of these methods with an argument, a string composed of all the characters you want to trim from either or both ends instead of trimming whitespace characters:
>>> x = 'xyxxyy hejyx  yyx'
>>> print '|'+x.strip('xy')+'|'
| hejyx  |
            
Note that in these cases the leading and trailing spaces have been left in the resulting string, as have the 'yx' that are followed by spaces: only all the occurrences of 'x' and 'y' at either end of the string have been removed from the resulting string.
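A small sketch may help underline that the argument is treated as a set of characters to trim, not as a prefix or suffix to match. It reuses the sample string from above; the variable names are illustrative.

```python
x = 'xyxxyy hejyx  yyx'

both = x.strip('xy')     # trims any run of 'x'/'y' characters from both ends
left = x.lstrip('xy')    # trims only the leading run
right = x.rstrip('xy')   # trims only the trailing run

# The argument is a set of characters, not a prefix or suffix to match:
# the order of the characters in it is irrelevant.
assert x.strip('yx') == both
```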
The Library Reference section on string methods; Recipe 1.4; Java Cookbook recipe 3.12.
Combining Strings
Credit: Luther Blissett
You have several small strings that you need to combine into one larger string.
To join a sequence of small strings into one large string, use the string method join. Say that pieces is a list whose items are strings, and you want one big string with all the items concatenated in order; then, you should code:
largeString = ''.join(pieces)
To put together pieces stored in a few variables, the string-formatting operator % can often be even handier:
largeString = '%s%s something %s yet more' % (small1, small2, small3)
In Python, the + operator concatenates strings and therefore offers seemingly obvious solutions for putting small strings together into a larger one. For example, when you have pieces stored in a few variables, it seems quite natural to code something like:
largeString = small1 + small2 + ' something ' + small3 + ' yet more'
And similarly, when you have a sequence of small strings named pieces, it seems quite natural to code something like:
largeString = ''
for piece in pieces:
    largeString += piece
Or, equivalently, but more fancifully and compactly:
import operator
largeString = reduce(operator.add, pieces, '')
However, it's very important to realize that none of these seemingly obvious solutions is good: the approaches shown in the "Solution" are vastly superior.
In Python, string objects are immutable. Therefore, any operation on a string, including string concatenation, produces a new string object, rather than modifying an existing one. Concatenating N strings thus involves building and then immediately throwing away each of N-1 intermediate results. Performance is therefore vastly better for operations that build no intermediate results, but rather produce the desired end result at once.
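One way to check this claim on your own machine is a rough timing sketch like the one below. It assumes Python 3 and the standard timeit module; the function names and the list size are arbitrary choices. Some CPython versions may optimize repeated += on strings, so the measured gap can vary a good deal between interpreters.

```python
import timeit

pieces = [str(i) for i in range(10000)]

def concat_with_plus(pieces):
    # In principle, builds a new intermediate string on every iteration.
    result = ''
    for piece in pieces:
        result += piece
    return result

def concat_with_join(pieces):
    # Builds the final string in one pass, with no intermediate results.
    return ''.join(pieces)

# Both approaches must produce the same string; only the cost differs.
assert concat_with_plus(pieces) == concat_with_join(pieces)

if __name__ == '__main__':
    for func in (concat_with_plus, concat_with_join):
        seconds = timeit.timeit(lambda: func(pieces), number=100)
        print(func.__name__, round(seconds, 4))
```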
Reversing a String by Words or Characters
Credit: Alex Martelli
You want to reverse the characters or words in a string.
Strings are immutable, so, to reverse one, we need to make a copy. The simplest approach for reversing is to take an extended slice with a "step" of -1, so that the slicing proceeds backwards:
revchars = astring[::-1]
To flip words, we need to make a list of words, reverse it, and join it back into a string with a space as the joiner:
revwords = astring.split( )     # string -> list of words
revwords.reverse( )             # reverse the list in place
revwords = ' '.join(revwords)  # list of strings -> string
or, if you prefer terse and compact "one-liners":
revwords = ' '.join(astring.split( )[::-1])
If you need to reverse by words while preserving untouched the intermediate whitespace, you can split by a regular expression:
import re
revwords = re.split(r'(\s+)', astring)         # separators too, since '(...)'
revwords.reverse( )        # reverse the list in place
revwords = ''.join(revwords)        # list of strings -> string
Note that the joiner must be the empty string in this case, because the whitespace separators are kept in the revwords list (by using re.split with a regular expression that includes a parenthesized group). Again, you could make a one-liner, if you wished:
revwords = ''.join(re.split(r'(\s+)', astring)[::-1])
but this is getting too dense and unreadable to be good Python code!
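To see how the three approaches differ, here is a short worked sketch on a sample string with an irregular run of whitespace. The sample string and variable names are illustrative only.

```python
import re

astring = 'one  two three'     # note the double space after 'one'

# Reverse character by character.
by_chars = astring[::-1]

# Reverse by words; runs of whitespace collapse to single spaces.
by_words = ' '.join(astring.split()[::-1])

# Reverse by words while keeping each whitespace run exactly as it was.
by_words_keep_space = ''.join(re.split(r'(\s+)', astring)[::-1])

print(repr(by_chars), repr(by_words), repr(by_words_keep_space))
```

Only the regular-expression version carries the double space over into the result.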
In Python 2.4, you may make the by-word one-liners more readable by using the new built-in function reversed instead of the less readable extended-slicing indicator [::-1]:
revwords = ' '.join(reversed(astring.split( )))
revwords = ''.join(reversed(re.split(r'(\s+)', astring)))
For the by-character case, though, astring[::-1] remains best, even in 2.4, because to use reversed
Checking Whether a String Contains a Set of Characters
Credit: Jürgen Hermann, Horst Hansen
You need to check for the occurrence of any of a set of characters in a string.
The simplest approach is clear, fast, and general (it works for any sequence, not just strings, and for any container on which you can test for membership, not just sets):
def containsAny(seq, aset):
    """ Check whether sequence seq contains ANY of the items in aset. """
    for c in seq:
        if c in aset: return True
    return False
You can gain a little speed by moving to a higher-level, more sophisticated approach, based on the itertools standard library module, essentially expressing the same approach in a different way:
import itertools
def containsAny(seq, aset):
    for item in itertools.ifilter(aset.__contains__, seq):
        return True
    return False
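In Python 3, itertools.ifilter no longer exists (the built-in filter is lazy there). A minimal sketch of the same short-circuiting idea, assuming Python 3 (the any built-in itself dates from Python 2.5), could look like this; the snake_case name is an illustrative choice.

```python
def contains_any(seq, aset):
    """Return True if any item of seq is a member of aset."""
    # any() stops consuming the generator at the first True value,
    # so this short-circuits just like the explicit loop.
    return any(item in aset for item in seq)

print(contains_any('hello', set('xyzo')))
```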
Most problems related to sets are best handled by using the set built-in type introduced in Python 2.4 (if you're using Python 2.3, you can use the equivalent sets.Set type from the Python Standard Library). However, there are exceptions. Here, for example, a pure set-based approach would be something like:
def containsAny(seq, aset):
    return bool(set(aset).intersection(seq))
However, with this approach, every item in seq inevitably has to be examined. The functions in this recipe's Solution, on the other hand, "short-circuit": they return as soon as they know the answer. They must still check every item in seq when the answer is False—we could never affirm that no item in seq is a member of aset without examining all the items, of course. But when the answer is True, we often learn about that very soon, namely as soon as we examine one item that is a member of aset. Whether this matters at all is very data-dependent, of course. It will make no practical difference when seq is short, or when the answer is typically
Simplifying Usage of Strings' translate Method
Credit: Chris Perkins, Raymond Hettinger
You often want to use the fast code in strings' translate method, but find it hard to remember in detail how that method and the function string.maketrans work, so you want a handy facade to simplify their use in typical cases.
The translate method of strings is quite powerful and flexible, as detailed in Recipe 1.10. However, exactly because of that power and flexibility, it may be a nice idea to front it with a "facade" that simplifies its typical use. A little factory function, returning a closure, can do wonders for this kind of task:
import string
def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    trans = string.maketrans(frm, to)
    if keep is not None:
        allchars = string.maketrans('', '')
        delete = allchars.translate(allchars, keep.translate(allchars, delete))
    def translate(s):
        return s.translate(trans, delete)
    return translate
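The code above relies on Python 2's string.maketrans and the two-argument form of translate. As a hedged sketch only, a Python 3 port might look like the following. It assumes Python 3's str.maketrans, which accepts the characters to delete as a third argument, and it replaces the "all characters" trick with explicit filtering, since listing every Unicode character is impractical. This is one possible adaptation, not the authors' code.

```python
import string

def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    # In Python 3, str.maketrans takes the characters to delete as a third argument.
    table = str.maketrans(frm, to, delete)
    if keep is None:
        return lambda s: s.translate(table)
    # There are too many Unicode characters to build "everything but keep",
    # so filter explicitly before translating.
    keep_set = set(keep) - set(delete)
    def translate(s):
        return ''.join(c for c in s if c in keep_set).translate(table)
    return translate

digits_only = translator(keep=string.digits)
no_digits = translator(delete=string.digits)
```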
I often find myself wanting to use strings' translate method for any one of a few purposes, but each time I have to stop and think about the details (see Recipe 1.10 for more information about those details). So, I wrote myself a class (later remade into the factory closure presented in this recipe's Solution) to encapsulate various possibilities behind a simpler-to-use facade. Now, when I want a function that keeps only characters from a given set, I can easily build and use that function:
>>> digits_only = translator(keep=string.digits)
>>> digits_only('Chris Perkins : 224-7992')
'2247992'
            
It's similarly simple when I want to remove a set of characters:
>>> no_digits = translator(delete=string.digits)
>>> no_digits('Chris Perkins : 224-7992')
'Chris Perkins : -'
            
and when I want to replace a set of characters with a single character:
Filtering a String for a Set of Characters
Credit: Jürgen Hermann, Nick Perkins, Peter Cogolo
Given a set of characters to keep, you need to build a filtering function that, applied to any string s, returns a copy of s that contains only characters in the set.
The translate method of string objects is fast and handy for all tasks of this ilk. However, to call translate effectively to solve this recipe's task, we must do some advance preparation. The first argument to translate is a translation table: in this recipe, we do not want to do any translation, so we must prepare a first argument that specifies "no translation". The second argument to translate specifies which characters we want to delete: since the task here says that we're given, instead, a set of characters to keep (i.e., to not delete), we must prepare a second argument that gives the set complement—deleting all characters we must not keep. A closure is the best way to do this advance preparation just once, obtaining a fast filtering function tailored to our exact needs:
import string
# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')
def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that `keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
# emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
# emits: 
Checking Whether a String Is Text or Binary
Credit: Andrew Dalke
Python can use a plain string to hold either text or arbitrary bytes, and you need to determine (heuristically, of course: there can be no precise algorithm for this) which of the two cases holds for a certain string.
We can use the same heuristic criteria as Perl does, deeming a string binary if it contains any nulls or if more than 30% of its characters have the high bit set (i.e., codes greater than 127) or are strange control codes. We have to code this ourselves, but this also means we easily get to tweak the heuristics for special application needs:
from __future__ import division           # ensure / does NOT truncate
import string
text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
_null_trans = string.maketrans("", "")
def istext(s, text_characters=text_characters, threshold=0.30):
    # if s contains any null, it's not text:
    if "\0" in s:
        return False
    # an "empty" string is "text" (arbitrary but reasonable choice):
    if not s:
        return True
    # Get the substring of s made up of non-text characters
    t = s.translate(_null_trans, text_characters)
    # s is 'text' if at most 30% (the default threshold) of its characters are non-text ones:
    return len(t)/len(s) <= threshold
You can easily do minor customizations to the heuristics used by function istext by passing in specific values for the threshold, which defaults to 0.30 (30%), or for the string of those characters that are to be deemed "text" (which defaults to normal ASCII characters plus the four "normal" control characters, meaning ones that are often found in text). For example, if you expected Italian text encoded as ISO-8859-1, you could add the accented letters used in Italian, "àèéìòù", to the text_characters argument.
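In Python 3 the question applies to bytes objects rather than to plain strings. The following is a sketch of the same heuristic under that assumption; the name TEXT_BYTES is illustrative, and bytes.translate is called with None as the table so that it only deletes.

```python
# Assumes Python 3, where raw data is held in bytes objects.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"

def istext(data, text_bytes=TEXT_BYTES, threshold=0.30):
    # If data contains any null byte, it's not text.
    if b"\0" in data:
        return False
    # Empty data counts as text (arbitrary but reasonable choice).
    if not data:
        return True
    # Delete every "text" byte; what remains is the non-text part.
    nontext = data.translate(None, text_bytes)
    # data is text if at most `threshold` of its bytes are non-text.
    return len(nontext) / len(data) <= threshold

print(istext(b"plain ASCII text\n"), istext(b"abc\0def"))
```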
Often, what you need to check as being either binary or text is not a string, but a file. Again, we can use the same heuristics as Perl, checking just the first block of the file with the
Controlling Case
Credit: Luther Blissett
You need to convert a string from uppercase to lowercase, or vice versa.
That's what the upper and lower methods of string objects are for. Each takes no arguments and returns a copy of the string in which each letter has been changed to upper- or lowercase, respectively.
big = little.upper( )
little = big.lower( )
Characters that are not letters are copied unchanged.
s.capitalize is similar to s[:1].upper( )+s[1:].lower( ): the first character is changed to uppercase, and all others are changed to lowercase. s.title is again similar, but it capitalizes the first letter of each word (where a "word" is a sequence of letters) and uses lowercase for all other letters:
>>> print 'one tWo thrEe'.capitalize( )
One two three
>>> print 'one tWo thrEe'.title( )
One Two Three
            
Case manipulation of strings is a very frequent need. Because of this, several string methods let you produce case-altered copies of strings. Moreover, you can also check whether a string object is already in a given case form, with the methods isupper, islower, and istitle, which all return True if the string is not empty, contains at least one letter, and already meets the uppercase, lowercase, or titlecase constraints. There is no analogous iscapitalized method, and coding it is not trivial, if we want behavior that's strictly similar to strings' is... methods. Those methods all return False for an "empty" string, and the three case-checking ones also return False for strings that, while not empty, contain no letters at all.
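The boundary cases mentioned above are easy to check directly. This small sketch (written with Python 3's print; the sample strings are arbitrary) tabulates the three case-checking methods for an empty string, a string with no letters, and a few ordinary strings.

```python
samples = ['', '123', 'ABC 123', 'abc', 'One Two']

# Map each sample to its (isupper, islower, istitle) results.
results = {s: (s.isupper(), s.islower(), s.istitle()) for s in samples}

for s in samples:
    print(repr(s), results[s])
```

Both the empty string and '123' yield False for all three methods, since neither contains a letter.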
The simplest and clearest way to code iscapitalized is:
def iscapitalized(s):
    return s == s.capitalize( )
However, this version deviates from the boundary-case semantics of the analogous
Accessing Substrings
Credit: Alex Martelli
You want to access portions of a string. For example, you've read a fixed-width record and want to extract the record's fields.
Slicing is great, but it only does one field at a time:
afield = theline[3:8]
If you need to think in terms of field lengths, struct.unpack may be appropriate. For example:
import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
If you want to skip rather than get "all the rest", then just unpack the initial part of theline with the right length:
l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
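Here is a worked sketch of the struct.unpack approach on a made-up record. It assumes Python 3, where struct works on bytes rather than on plain strings; the record contents and the variable names first, tail, and fmt are hypothetical.

```python
import struct

# Hypothetical 29-byte record: a 5-byte field, 3 filler bytes,
# two 8-byte fields, then a variable-length tail.
theline = b'ABCDE' + b'...' + b'12345678' + b'abcdefgh' + b'rest!'

baseformat = '5s 3x 8s 8s'
# How many bytes remain after the fixed-width part?
numremain = len(theline) - struct.calcsize(baseformat)
# Complete the format with an 's' field covering the remainder, then unpack.
fmt = '%s %ds' % (baseformat, numremain)
first, s1, s2, tail = struct.unpack(fmt, theline)

print(first, s1, s2, tail)
```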
If you need to split at five-byte boundaries, you can easily code a list comprehension (LC) of slices:
fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]
Chopping a string into individual characters is of course easier:
chars = list(theline)
If you prefer to think of your data as being cut up at specific columns, slicing with LCs is generally handier:
cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
The call to zip in this LC returns a list of pairs of the form (cuts[k], cuts[k+1]), except that the first pair is (0, cuts[0]), and the last one is (cuts[len(cuts)-1], None). In other words, each pair gives the right (i, j) for slicing between each cut and the next, except that the first one is for the slice before the first cut, and the last one is for the slice from the last cut to the end of the string. The rest of the LC just uses these pairs to cut up the appropriate slices of
Changing the Indentation of a Multiline String
Credit: Tom Good
You have a string made up of multiple lines, and you need to build another string from it, adding or removing leading spaces on each line so that the indentation of each line is some absolute number of spaces.
The methods of string objects are quite handy, and let us write a simple function to perform this task:
def reindent(s, numSpaces):
    leading_space = numSpaces * ' '
    lines = [ leading_space + line.strip( )
              for line in s.splitlines( ) ]
    return '\n'.join(lines)
When working with text, it may be necessary to change the indentation level of a block. This recipe's code adds leading spaces to or removes them from each line of a multiline string so that the indentation level of each line matches some absolute number of spaces. For example:
>>> x = """  line one
...     line two
...  and line three
... """
>>> print x
  line one
    line two
 and line three
>>> print reindent(x, 4)
    line one
    line two
    and line three
Even if the lines in s are initially indented differently, this recipe makes their indentation homogeneous, which is sometimes what we want, and sometimes not. A frequent need is to adjust the amount of leading spaces in each line, so that the relative indentation of each line in the block is preserved. This is not difficult for either positive or negative values of the adjustment. However, negative values need a check to ensure that no nonspace characters are snipped from the start of the lines. Thus, we may as well split the functionality into two functions to perform the transformations, plus one to measure the number of leading spaces of each line and return the result as a list:
def addSpaces(s, numAdd):
    white = " "*numAdd
    return white + white.join(s.splitlines(True))
def numSpaces(s):
    return [len(line)-len(line.lstrip( )) for line in s.splitlines( )]
def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError, "removing more spaces than there are!"
    return '\n'.join([ line[numDel:] for line in s.splitlines( ) ])
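A short usage sketch may help show how the three helpers fit together. It restates them in Python 3 syntax (an assumption made only so that the example runs on a current interpreter; the recipe's own code targets Python 2), and the sample text in block is made up for illustration:

```python
def addSpaces(s, numAdd):
    white = " " * numAdd
    # splitlines(True) keeps the line endings, so join re-prefixes every line
    return white + white.join(s.splitlines(True))

def numSpaces(s):
    return [len(line) - len(line.lstrip()) for line in s.splitlines()]

def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError("removing more spaces than there are!")
    return '\n'.join([line[numDel:] for line in s.splitlines()])

block = "  if x:\n      y = 1\n  z = 2"   # hypothetical sample text
print(numSpaces(block))                    # [2, 6, 2]
shifted = addSpaces(block, 2)              # every line moves right by 2
print(numSpaces(shifted))                  # [4, 8, 4]
restored = delSpaces(shifted, 2)           # and back again
print(restored == block)                   # True
```

The relative indentation (the middle line sits four spaces deeper than its neighbors) survives both transformations, while delSpaces(block, 3) would raise ValueError because the least-indented line has only two leading spaces.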
Additional content appearing in this section has been removed.
Expanding and Compressing Tabs
Credit: Alex Martelli, David Ascher
You want to convert tabs in a string to the appropriate number of spaces, or vice versa.
Changing tabs to the appropriate number of spaces is a reasonably frequent task, easily accomplished with Python strings' expandtabs method. Because strings are immutable, the method returns a new string object, a modified copy of the original one. However, it's easy to rebind a string variable name from the original to the modified-copy value:
mystring = mystring.expandtabs( )
This doesn't change the string object to which mystring originally referred, but it does rebind the name mystring to a newly created string object, a modified copy of mystring in which tabs are expanded into runs of spaces. expandtabs, by default, uses a tab length of 8; you can pass expandtabs an integer argument to use as the tab length.
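To make the effect of the tab-length argument concrete, here is a minimal sketch in Python 3 syntax (an assumption for runnability; expandtabs behaves the same way in the Python 2 versions the recipe targets). The sample string is made up:

```python
text = "a\tbc\tdef"                  # hypothetical sample with two tabs

default_stops = text.expandtabs()    # tab stops every 8 columns
narrow_stops = text.expandtabs(4)    # tab stops every 4 columns

print(repr(default_stops))           # 'a       bc      def'
print(repr(narrow_stops))            # 'a   bc  def'
print(repr(text))                    # the original string is unchanged
```

Each tab is replaced by just enough spaces to reach the next tab stop, so the number of spaces depends on the column at which the tab occurs, not only on the tab length.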
Changing spaces into tabs is a rare and peculiar need. Compression, if that's what you're after, is far better performed in other ways, so Python doesn't offer a built-in way to "unexpand" spaces into tabs. We can, of course, write our own function for the purpose. String processing tends to be fastest in a split/process/rejoin approach, rather than with repeated overall string transformations:
def unexpand(astring, tablen=8):
    import re
    # split into alternating space and non-space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    # keep track of the total length of the string so far
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        if piece.isspace( ):
            # change each space sequence into tabs+spaces; a run too short
            # to reach a tab stop keeps its own length
            numblanks = min(thislen, lensofar % tablen)
            numtabs = (thislen-numblanks+tablen-1)//tablen
            pieces[i] = '\t'*numtabs + ' '*numblanks
    return ''.join(pieces)
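One way to check such a function is to confirm that expanding its output gives back the original text. The version below is a Python 3 rendering (an assumption made for runnability): it uses // for integer division and caps numblanks at the run length, so that a short run of spaces that does not reach a tab stop is left alone. The sample line is made up, and the sketch handles a single line only:

```python
import re

def unexpand(astring, tablen=8):
    # split into alternating space and non-space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        if piece.isspace():
            # blanks left over after the last tab stop, capped at the run length
            numblanks = min(thislen, lensofar % tablen)
            numtabs = (thislen - numblanks + tablen - 1) // tablen
            pieces[i] = '\t' * numtabs + ' ' * numblanks
    return ''.join(pieces)

# hypothetical sample: columns aligned on 8-column tab stops
sample = "name" + " " * 4 + "value" + " " * 11 + "comment here"
packed = unexpand(sample)

print(repr(packed))                   # 'name\tvalue\t\tcomment\there'
print(packed.expandtabs() == sample)  # True: the round trip is lossless
print(repr(unexpand("ab cd")))        # 'ab cd': too short to reach a tab stop
```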
Function
Additional content appearing in this section has been removed.
Interpolating Variables in a String
Credit: Scott David Daniels
You need a simple way to get a copy of a string where specially marked substrings are replaced with the results of looking up the substrings in a dictionary.
Here is a solution that works in Python 2.3 as well as in 2.4:
def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w): return d.get(w, w.join(marker*2))
    else:
        def lookup(w): return d[w]
    parts = format.split(marker)
    parts[1::2] = map(lookup, parts[1::2])
    return ''.join(parts)
if __name__ == '__main__':
    print expand('just "a" test', {'a': 'one'})
# emits: just one test
When the parameter safe is False, the default, every marked substring must be found in dictionary d, otherwise expand terminates with a KeyError exception. When parameter safe is explicitly passed as True, marked substrings that are not found in the dictionary are just left intact in the output string.
The code in the body of the expand function has some points of interest. It defines one of two different nested functions (with the name of lookup either way), depending on whether the expansion is required to be safe. Safe means no KeyError exception gets raised for marked strings not found in the dictionary. If not required to be safe (the default), lookup just indexes into dictionary d and raises an error if the substring is not found. But, if lookup is required to be "safe", it uses d's method get and supplies as the default the substring being looked up, with a marker on either side. In this way, by passing safe as True, you may choose to have unknown formatting markers come right through to the output rather than raising exceptions. marker+w+marker would be an obvious alternative to the chosen w.join(marker*2), but I've chosen the latter exactly to display a non-obvious but interesting way to construct such a quoted string.
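The difference between the two modes is easiest to see side by side. The following sketch restates expand in Python 3 syntax (an assumption for runnability; a list comprehension stands in for map) and exercises both the default and the safe behavior on made-up inputs:

```python
def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w):
            return d.get(w, w.join(marker * 2))
    else:
        def lookup(w):
            return d[w]
    parts = format.split(marker)
    # the odd-numbered parts are the marked substrings
    parts[1::2] = [lookup(w) for w in parts[1::2]]
    return ''.join(parts)

known = {'a': 'one'}
print(expand('just "a" test', known))                # just one test
print(expand('just "a" and "b"', known, safe=True))  # just one and "b"
print('b'.join('""'))                                # "b"

try:
    expand('just "b" test', known)                   # default: not safe
except KeyError as err:
    print('KeyError for', err)
```

Note that w.join(marker*2) rebuilds the quoted string correctly only when marker is a single character; with a longer marker, marker+w+marker would be the form to use.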
Additional content appearing in this section has been removed.
Interpolating Variables in a String in Python 2.4
Credit: John Nielsen, Lawrence Oluyede, Nick Coghlan
Using Python 2.4, you need a simple way to get a copy of a string where specially marked identifiers are replaced with the results of looking up the identifiers in a dictionary.
Python 2.4 offers the new string.Template class for this purpose. Here is a snippet of code showing how to use that class:
import string
# make a template from a string where some identifiers are marked with $
new_style = string.Template('this is $thing')
# use the substitute method of the template with a dictionary argument:
print new_style.substitute({'thing':5})      # emits: this is 5
print new_style.substitute({'thing':'test'}) # emits: this is test
# alternatively, you can pass keyword-arguments to 'substitute':
print new_style.substitute(thing=5)          # emits: this is 5
print new_style.substitute(thing='test')     # emits: this is test
In Python 2.3, a format string for identifier-substitution has to be expressed in a less simple format:
old_style = 'this is %(thing)s'
with the identifier in parentheses after a %, and an s right after the closed parenthesis. Then, you use the % operator, with the format string on the left of the operator, and a dictionary on the right:
print old_style % {'thing':5}      # emits: this is 5
print old_style % {'thing':'test'} # emits: this is test
Of course, this code keeps working in Python 2.4, too. However, the new string.Template class offers a simpler alternative.
When you build a string.Template instance, you may include a dollar sign ($) by doubling it, and you may have the interpolated identifier immediately followed by letters or digits by enclosing it in curly braces ({ }). Here is an example that requires both of these refinements:
form_letter = '''Dear $customer,
I hope you are having a great time.
If you do not find Room $room to your satisfaction,
let us know. Please accept this $$5 coupon.
            Sincerely,
            $manager
            ${name}Inn'''
letter_template = string.Template(form_letter)
print letter_template.substitute({'name':'Sleepy', 'customer':'Fred Smith',
                                  'manager':'Barney Mills', 'room':307,
                                 })
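A smaller sketch isolates the two refinements, the doubled dollar sign and the braces. It is written in Python 3 syntax (an assumption for runnability; string.Template works the same way from Python 2.4 on), and the template text and the identifier price are made up for illustration:

```python
import string

# '$$' yields a literal dollar sign; '${name}' lets letters follow the identifier
rate = string.Template('${name}Inn charges $$$price per night')
result = rate.substitute(name='Sleepy', price=80)
print(result)                        # SleepyInn charges $80 per night

# without the braces, the identifier is taken to be 'nameInn'
try:
    string.Template('$nameInn').substitute(name='Sleepy')
except KeyError as err:
    missing = err.args[0]
    print('no value for', missing)   # no value for nameInn
```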
Additional content appearing in this section has been removed.
Replacing Multiple Patterns in a Single Pass
Credit: Xavier Defrang, Alex Martelli
You need to perform several string substitutions on a string.
Sometimes regular expressions afford the fastest solution even in cases where their applicability is not obvious. The powerful sub method of re objects (from the re module in the standard library) makes regular expressions particularly good at performing string substitutions. Here is a function returning a modified copy of an input string, where each occurrence of any string that's a key in a given dictionary is replaced by the corresponding value in the dictionary:
import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)
This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let's say you have a dictionary-based mapping between strings. The keys are the set of strings you want to replace, and the corresponding values are the strings with which to replace them. You could perform the substitution by calling the string method replace for each key/value pair in the dictionary, thus processing and creating a new copy of the entire text several times, but it is clearly better and faster to do all the changes in a single pass, processing and creating a copy of the text only once. re.sub's callback facility makes this better approach quite easy.
First, we have to build a regular expression from the set of keys we want to match. Such a regular expression has a pattern of the form a1|a2|...|aN, made up of the N strings to be substituted, joined by vertical bars, and it can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we pass it a callback argument.
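Speed is not the only reason to prefer a single pass: when one replacement's output is itself a key, chained calls to replace give a different answer. The following sketch, in Python 3 syntax (an assumption for runnability) and with a made-up dictionary, shows both the generated pattern and the difference:

```python
import re

def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

swap = {'cat': 'dog', 'dog': 'cat'}          # hypothetical mapping
text = 'cat chases dog'

pattern = '|'.join(map(re.escape, swap))
print(pattern)                               # cat|dog

single_pass = multiple_replace(text, swap)
chained = text.replace('cat', 'dog').replace('dog', 'cat')
print(single_pass)                           # dog chases cat
print(chained)                               # cat chases cat
```

In the single-pass version, each match is looked up once in the original text, so text produced by an earlier substitution is never rescanned.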
Additional content appearing in this section has been removed.
Checking a String for Any of Multiple Endings
Credit: Michele Simionato
For a certain string s, you must check whether s has any of several endings; in other words, you need a handy, elegant equivalent of s.endswith(end1) or s.endswith(end2) or s.endswith(end3) and so on.
The itertools.imap function is just as handy for this task as for many of a similar nature:
import itertools
def anyTrue(predicate, sequence):
    return True in itertools.imap(predicate, sequence)
def endsWith(s, *endings):
    return anyTrue(s.endswith, endings)
A typical use for endsWith might be to print all names of image files in the current directory:
import os
for filename in os.listdir('.'):
    if endsWith(filename, '.jpg', '.jpeg', '.gif'):
        print filename
The same general idea shown in this recipe's Solution is easily applied to other tasks related to checking a string for any of several possibilities. The auxiliary function anyTrue is general and fast, and you can pass it as its first argument (the predicate) other bound methods, such as s.startswith or s.__contains__. Indeed, perhaps it would be better to do without the helper function endsWith; after all, directly coding
    if anyTrue(filename.endswith, (".jpg", ".gif", ".png")):
seems to be already readable enough.
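As a rough illustration of passing other bound methods, here is a sketch in Python 3 syntax. This is an assumption for runnability: itertools.imap no longer exists there, and the built-in map, which is lazy in Python 3, stands in for it. The filename and the candidate strings are made up:

```python
def anyTrue(predicate, sequence):
    # map is lazy in Python 3, so the scan stops at the first True
    return True in map(predicate, sequence)

filename = 'holiday_photo.jpeg'              # hypothetical file name

has_image_ending = anyTrue(filename.endswith, ('.jpg', '.jpeg', '.gif'))
has_temp_prefix = anyTrue(filename.startswith, ('tmp_', 'draft_'))
mentions_photo = anyTrue(filename.__contains__, ('photo', 'scan'))

print(has_image_ending)                      # True
print(has_temp_prefix)                       # False
print(mentions_photo)                        # True
```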
Additional content appearing in this section has been removed.
Handling International Text with Unicode
Credit: Holger Krekel
You need to deal with text strings that include non-ASCII characters.
Python has a first-class unicode type that you can use in place of the plain bytestring str type. It's easy, once you accept the need to explicitly convert between a bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
Here german_ae is a unicode string representing the German lowercase a with umlaut (i.e., diaeresis) character "ä". It has been constructed by interpreting the bytestring '\xc3\xa4' according to the specified UTF-8 encoding. There are many encodings, but UTF-8 is often used because it is universal (UTF-8 can encode any Unicode string) and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded string).
Once you cross this barrier, life is easy! You can manipulate this Unicode string in practically the same way as a plain str string:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
Note that para is a Unicode string, because operations between a unicode string and a bytestring always result in a unicode string—unless they fail and raise an exception:
>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
The byte '0xc3' is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using Unicode strings with Python.
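Being explicit means decoding the bytestring with a named encoding before combining it with Unicode text. The sketch below uses Python 3 syntax (an assumption for runnability: there, bytes is the bytestring type and str is the Unicode type, and raw.decode('utf-8') plays the role of unicode(raw, 'utf8')):

```python
raw = b'\xc3\xa4'                   # UTF-8 bytes for the character ä

german_ae = raw.decode('utf-8')     # explicit decoding: bytes -> text
sentence = 'This is a ' + german_ae
print(len(raw), len(german_ae))     # 2 1: two bytes, one character

# naming the wrong encoding reproduces the error shown above
try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```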
Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of Unicode that Python provides.
Additional content appearing in this section has been removed.
Converting Between Unicode and Plain Strings
Credit: David Ascher, Paul Prescod
You need to deal with textual data that doesn't necessarily fit in the ASCII character set.
Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:
unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode—what it is, how it works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial practical tips, and this recipe tries to offer more perspective.
You don't need to know everything about Unicode to be able to solve real-world problems with it, but a few basic tidbits of knowledge are indispensable. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as if they were the same thing. A byte can hold one of 256 different values, so these environments are limited to dealing with no more than 256 distinct characters. Unicode, on the other hand, has tens of thousands of characters, which means that a Unicode character cannot, in general, fit in a single byte; thus you need to make the distinction between characters and bytes.
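The distinction shows up as soon as a string contains a non-ASCII character: the number of characters stays fixed, while the number of bytes depends on the encoding. Here is a minimal sketch in Python 3 syntax (an assumption for runnability: there, str holds characters and bytes holds bytes), with a made-up sample word:

```python
text = 'fa\xe7ade'                  # 'façade': six characters

lengths = {}
for encoding in ('utf-8', 'iso-8859-1', 'utf-16'):
    encoded = text.encode(encoding)            # characters -> bytes
    assert encoded.decode(encoding) == text    # bytes -> characters
    lengths[encoding] = len(encoded)

print(len(text))                    # 6
print(lengths)                      # {'utf-8': 7, 'iso-8859-1': 6, 'utf-16': 14}

# ASCII has no byte for 'ç', so this encoding fails
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                     # False
```

The UTF-16 figure includes a two-byte byte-order mark that the codec writes at the start of the output.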
Standard Python strings are really bytestrings, and a Python character, being such a string of length 1, is really a byte. Other terms for an instance of the standard Python string type are
Additional content appearing in this section has been removed.