By Alex Martelli, Anna Martelli Ravenscroft, David Ascher
list, with the
string as its argument:
thelist = list(thestring)
for statement:
for c in thestring:
    do_something_with(c)
for clause of a list comprehension:
results = [do_something_with(c) for c in thestring]
map
built-in function:
results = map(do_something, thestring)
map for much the same purpose, as long as what you
need to do with each character is call a function on it. Finally, you
can call the built-in type list to obtain a list
of the length-one substrings of the string (i.e., the
string's characters). If what you want is a set
whose elements are the string's characters, you can
call sets.Set with the string as the argument (in
Python 2.4, you can also call the built-in set in
just the same way):
import sets
magic_chars = sets.Set('abracadabra')
poppins_chars = sets.Set('supercalifragilisticexpialidocious')
print ''.join(magic_chars & poppins_chars) # set intersection
acrd
ord and chr are for:
>>> print ord('a')
97
>>> print chr(97)
a
ord also accepts as its
argument a Unicode string of length one, in which case it returns a
Unicode code value, up to 65535 on a narrow build of Python. To make a
Unicode string of length one from a numeric Unicode code value, use the
built-in function unichr:
>>> print ord(u'\u2020')
8224
>>> print repr(unichr(8224))
u'\u2020'
ord, chr,
and unichr cover all the related needs. Note, in
particular, the huge difference between chr(n) and
str(n), which beginners sometimes confuse:
>>> print repr(chr(97))
'a'
>>> print repr(str(97))
'97'
chr takes as its argument a small integer and
returns the corresponding single-character string according to ASCII,
while str, called with any integer, returns the
string that is the decimal representation of that integer.
map and ord
together, as follows:
>>> print map(ord, 'ciao')
[99, 105, 97, 111]
''.join,
map and chr; for example:
>>> print ''.join(map(chr, range(97, 100)))
abc
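For comparison, here is a minimal sketch of the same conversions in Python 3, which is an assumption on my part since the text above targets Python 2.3 and 2.4. In Python 3, str is a Unicode type, chr accepts any Unicode code point (so unichr is gone), and map returns an iterator rather than a list:

```python
# Python 3 sketch (not the Python 2 dialect used in the recipe).
codes = list(map(ord, 'ciao'))                # characters -> code points
letters = ''.join(map(chr, range(97, 100)))   # code points -> string
dagger = chr(8224)                            # plays the role of unichr(8224)

print(codes)                # [99, 105, 97, 111]
print(letters)              # abc
print(ord(dagger) == 8224)  # True
```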
chr,
ord, and unichr in the
Library Reference and Python in a
Nutshell.
isinstance and
basestring, as
follows:
def isAString(anobj):
    return isinstance(anobj, basestring)
def isExactlyAString(anobj):
    return type(anobj) is type('')
str, and instances of any user-coded type that is
meant to be "string-like".
isinstance built-in function, as
recommended in this recipe's Solution, is much
better. The built-in type basestring exists
exactly to enable this approach. basestring is a
common base class for the str and
unicode types, and any string-like type that user
code might define should also subclass basestring,
just to make sure that such isinstance testing
works as intended. basestring is essentially an
"empty" type, just like
object, so no cost is involved in subclassing it.
isinstance checking
fails to accept such clearly string-like objects as instances of the
UserString class from Python Standard Library
module UserString, since that class, alas, does
not inherit from basestring.
If you need to support such types, you can check directly whether an
object behaves like a string, for example:
def isStringLike(anobj):
    try: anobj + ''
    except: return False
    else: return True
ljust, rjust, and
center methods of string objects are for. Each
takes a single argument, the width of the string you want as a
result, and returns a copy of the starting string with spaces added
on either or both sides:
>>> print '|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|'
| hej                  |                  hej |         hej          |
>>> print 'hej'.center(20, '+')
++++++++hej+++++++++
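A small sketch of how these methods combine to lay out a plain-text table. It is written for Python 3 (an assumption; the recipe itself uses Python 2), and the helper name format_row and the sample data are hypothetical:

```python
def format_row(name, count, width=10):
    """Left-align the name, right-align the count, separated by bars."""
    return '|' + name.ljust(width) + '|' + str(count).rjust(5) + '|'

for name, count in [('hej', 3), ('hello', 12)]:
    print(format_row(name, count))

# The fill character is passed as the optional second argument.
banner = 'hej'.center(9, '+')
print(banner)
```

Because each method returns a new string of exactly the requested width (or the original string, if it is already longer), the bars line up regardless of the length of each name.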
lstrip, rstrip, and
strip methods of string objects are for. Each
takes no argument and returns a copy of the starting string, shorn of
whitespace on either or both sides:
>>> x = ' hej '
>>> print '|', x.lstrip( ), '|', x.rstrip( ), '|', x.strip( ), '|'
| hej  |  hej | hej |
>>> x = 'xyxxyy hejyx yyx'
>>> print '|'+x.strip('xy')+'|'
| hejyx |
yx'
that are followed by spaces: only all the occurrences of
'x' and 'y' at
either end of the string have been removed from the resulting string.
join. Say that
pieces is a list whose items are strings, and you
want one big string with all the items concatenated in order; then,
you should code:
largeString = ''.join(pieces)
% can often be even
handier:
largeString = '%s%s something %s yet more' % (small1, small2, small3)
+ operator concatenates strings and therefore
offers seemingly obvious solutions for putting small strings together
into a larger one. For example, when you have pieces stored in a few
variables, it seems quite natural to code something like:
largeString = small1 + small2 + ' something ' + small3 + ' yet more'
largeString = ''
for piece in pieces:
    largeString += piece
import operator
largeString = reduce(operator.add, pieces, '')
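To see how these alternatives compare on your own interpreter, a rough timing sketch such as the following may help. It assumes Python 3 (the recipe uses Python 2, where reduce is a built-in rather than a functools function); the helper names and the list size are arbitrary choices of mine, and the absolute numbers will vary by machine and Python version:

```python
import functools
import operator
import timeit

pieces = [str(i) for i in range(2000)]

def with_join():
    return ''.join(pieces)

def with_loop():
    result = ''
    for piece in pieces:
        result += piece
    return result

def with_reduce():
    return functools.reduce(operator.add, pieces, '')

# All three build the same string; only the cost differs.
assert with_join() == with_loop() == with_reduce()

for func in (with_join, with_loop, with_reduce):
    seconds = timeit.timeit(func, number=100)
    print('%-12s %.4f s' % (func.__name__, seconds))
```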
revchars = astring[::-1]
revwords = astring.split( )         # string -> list of words
revwords.reverse( )                 # reverse the list in place
revwords = ' '.join(revwords)       # list of strings -> string
revwords = ' '.join(astring.split( )[::-1])
import re
revwords = re.split(r'(\s+)', astring)   # separators too, since '(...)'
revwords.reverse( )                      # reverse the list in place
revwords = ''.join(revwords)             # list of strings -> string
re.split with a regular expression
that includes a parenthesized group). Again, you could make a
one-liner, if you wished:
revwords = ''.join(re.split(r'(\s+)', astring)[::-1])
reversed instead
of the less readable extended-slicing indicator
[::-1]:
revwords = ' '.join(reversed(astring.split( )))
revwords = ''.join(reversed(re.split(r'(\s+)', astring)))
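Wrapped up as functions, the idioms above might look like the following sketch. The function names are mine, and the code is written so that it also runs on Python 3:

```python
import re

def reverse_chars(text):
    """Reverse the string character by character."""
    return text[::-1]

def reverse_words(text):
    """Reverse word order; each run of whitespace collapses to one space."""
    return ' '.join(reversed(text.split()))

def reverse_words_keep_spacing(text):
    """Reverse word order, keeping each whitespace run exactly as it was."""
    return ''.join(reversed(re.split(r'(\s+)', text)))

sample = 'one two  three'
print(reverse_chars(sample))
print(reverse_words(sample))
print(reverse_words_keep_spacing(sample))
```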
astring[::-1]
remains best, even in 2.4, because to use
reversed
def containsAny(seq, aset):
    """ Check whether sequence seq contains ANY of the items in aset. """
    for c in seq:
        if c in aset: return True
    return False
itertools
standard library module, essentially expressing the same approach in
a different way:
import itertools
def containsAny(seq, aset):
    for item in itertools.ifilter(aset.__contains__, seq):
        return True
    return False
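On Python 2.5 and later, the built-in any offers a third spelling of the same short-circuiting approach. This is a sketch rather than part of the recipe, and the name contains_any is mine:

```python
def contains_any(seq, aset):
    """Return True as soon as one item of seq is found in aset."""
    return any(item in aset for item in seq)

print(contains_any('hello', set('xyzo')))   # True: 'o' is in the set
print(contains_any('hello', set('xyz')))    # False: every item was examined
```

Because any stops consuming the generator at the first true value, seq can even be a lazy or very long iterable.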
set built-in type introduced in Python 2.4 (if
you're using Python 2.3, you can use the equivalent
sets.Set type from the Python Standard Library).
However, there are exceptions. Here, for example, a pure set-based
approach would be something like:
def containsAny(seq, aset):
    return bool(set(aset).intersection(seq))
False—we could never affirm that no item in
seq is a member of aset without
examining all the items, of course. But when the answer is
True, we often learn about that very soon, namely
as soon as we examine one item that is a member
of aset. Whether this matters at all is very
data-dependent, of course. It will make no practical difference when
seq is short, or when the answer is typically
translate method, but
find it hard to remember in detail how that method and the function
string.maketrans work, so you want a handy
facade to simplify their use in typical cases.
translate method of strings is quite powerful
and flexible, as detailed in Recipe 1.10. However, exactly because of
that power and flexibility, it may be a nice idea to front it with a
"facade" that simplifies its
typical use. A little factory function,
returning a closure, can do wonders for this kind of
task:
import string
def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    trans = string.maketrans(frm, to)
    if keep is not None:
        allchars = string.maketrans('', '')
        delete = allchars.translate(allchars, keep.translate(allchars, delete))
    def translate(s):
        return s.translate(trans, delete)
    return translate
translate method for any one of a few purposes,
but each time I have to stop and think about the details (see Recipe 1.10 for more information
about those details). So, I wrote myself a class (later remade into
the factory closure presented in this recipe's
Solution) to encapsulate various possibilities behind a
simpler-to-use facade. Now, when I want a function that keeps only
characters from a given set, I can easily build and use that
function:
>>> digits_only = translator(keep=string.digits)
>>> digits_only('Chris Perkins : 224-7992')
'2247992'
>>> no_digits = translator(delete=string.digits)
>>> no_digits('Chris Perkins : 224-7992')
'Chris Perkins : -'
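In Python 3 (an assumption; the recipe targets Python 2), string.maketrans no longer exists and str.translate takes a single table, so the facade needs a different implementation. One possible sketch, keeping the same signature, uses the three-argument form of str.maketrans for mapping and deletion, and a set for the keep case:

```python
import string

def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    # Three-argument str.maketrans maps frm -> to and deletes 'delete'.
    table = str.maketrans(frm, to, delete)
    if keep is None:
        return lambda s: s.translate(table)
    # Keep only the characters in 'keep' that are not also in 'delete'.
    kept = set(keep) - set(delete)
    def translate(s):
        return ''.join(c for c in s if c in kept).translate(table)
    return translate

digits_only = translator(keep=string.digits)
print(digits_only('Chris Perkins : 224-7992'))   # 2247992
no_digits = translator(delete=string.digits)
print(no_digits('Chris Perkins : 224-7992'))     # Chris Perkins : -
```

The keep case filters with a set rather than building a "delete everything else" table, since the set of all Unicode characters is too large to enumerate cheaply.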
translate method of string objects is fast and
handy for all tasks of this ilk. However, to call
translate effectively to solve this
recipe's task, we must do some advance preparation.
The first argument to translate is a translation
table: in this recipe, we do not want to do any translation, so we
must prepare a first argument that specifies "no
translation". The second argument to
translate specifies which characters we want to
delete: since the task here says that
we're given, instead, a set of characters to
keep (i.e., to not delete),
we must prepare a second argument that gives the set
complement—deleting all characters we must not
keep. A closure is the best way to do this advance preparation just
once, obtaining a fast filtering function tailored to our exact
needs:
import string
# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')
def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that `keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
    # emits:
from __future__ import division    # ensure / does NOT truncate
import string
text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
_null_trans = string.maketrans("", "")
def istext(s, text_characters=text_characters, threshold=0.30):
    # if s contains any null, it's not text:
    if "\0" in s:
        return False
    # an "empty" string is "text" (arbitrary but reasonable choice):
    if not s:
        return True
    # Get the substring of s made up of non-text characters
    t = s.translate(_null_trans, text_characters)
    # s is 'text' if at most 30% of its characters are non-text ones:
    return len(t)/len(s) <= threshold
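In Python 3 (an assumption; the recipe is Python 2), the same heuristic applies naturally to bytes objects, since that is what you get when reading a file in binary mode. A possible sketch, relying on bytes.translate accepting None as the table together with a set of bytes to delete:

```python
# Printable ASCII plus a few common control characters, as bytes.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"

def istext(data, text_bytes=TEXT_BYTES, threshold=0.30):
    # Any null byte means "not text".
    if b"\0" in data:
        return False
    # Empty data counts as text (same arbitrary choice as the recipe).
    if not data:
        return True
    # Delete all text bytes; what remains are the non-text bytes.
    nontext = data.translate(None, text_bytes)
    return len(nontext) / len(data) <= threshold

print(istext(b"plain old text\n"))
print(istext(bytes(range(128, 256))))
```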
àèéìòù
",
to the text_characters argument.
upper and lower methods of
string objects are for. Each takes no arguments and returns a copy of
the string in which each letter has been changed to upper- or
lowercase, respectively.
big = little.upper( )
little = big.lower( )
s.capitalize is similar to s[:1].upper( )+s[1:].lower( ):
the first character is changed to
uppercase, and all others are changed to lowercase.
s.title is again similar, but it capitalizes the
first letter of each word (where a
"word" is a sequence of letters)
and uses lowercase for all other letters:
>>> print 'one tWo thrEe'.capitalize( )
One two three
>>> print 'one tWo thrEe'.title( )
One Two Three
isupper, islower, and
istitle, which all return True
if the string is not empty, contains at least one letter, and already
meets the uppercase, lowercase, or titlecase constraints. There is no
analogous iscapitalized method, and coding it is
not trivial, if we want behavior that's strictly
similar to strings' is...
methods. Those methods all return False for an
"empty" string, and the three
case-checking ones also return False for strings
that, while not empty, contain no letters at
all.
def iscapitalized(s):
    return s == s.capitalize( )
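If you do want the stricter behavior described above (False for an empty string and for strings with no letters), one possible sketch is to add a check that the string contains at least one letter. It is written for Python 3 (an assumption) and uses any and str.isalpha, which is not necessarily how the original recipe does it; the name iscapitalized_strict is hypothetical:

```python
def iscapitalized_strict(s):
    """True only if s equals its capitalized form AND contains a letter."""
    return s == s.capitalize() and any(c.isalpha() for c in s)

for candidate in ('Hello world', 'hello', 'Hello World', '123', ''):
    print(repr(candidate), iscapitalized_strict(candidate))
```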
afield = theline[3:8]
struct.unpack may be appropriate. For example:
import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
all the
rest", then just unpack the initial part of
theline with the right length:
l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
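A worked instance of the struct approach, sketched for Python 3 (an assumption; there struct.unpack requires a bytes object rather than a text string). The sample record and the variable names lead, rest, and fmt are made up for illustration:

```python
import struct

# Hypothetical 28-byte record: 5-byte field, 3 filler bytes,
# two 8-byte fields, then a 4-byte tail.
theline = b"abcdeXXX12345678ABCDEFGHtail"

baseformat = "5s 3x 8s 8s"
numremain = len(theline) - struct.calcsize(baseformat)
fmt = "%s %ds" % (baseformat, numremain)
lead, s1, s2, rest = struct.unpack(fmt, theline)
print(lead, s1, s2, rest)

# Ignoring the tail: unpack only the fixed-size prefix.
lead, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
```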
fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]
chars = list(theline)
cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
zip in this LC returns a list of pairs
of the form (cuts[k], cuts[k+1]), except that the
first pair is (0, cuts[0]), and the last one is
(cuts[len(cuts)-1], None). In
other words, each pair gives the right (i, j) for
slicing between each cut and the next, except that the first one is
for the slice before the first cut, and the last one is for the slice
from the last cut to the end of the string. The rest of the LC just
uses these pairs to cut up the appropriate slices of
def reindent(s, numSpaces):
    leading_space = numSpaces * ' '
    lines = [ leading_space + line.strip( )
              for line in s.splitlines( ) ]
    return '\n'.join(lines)
>>> x = """ line one
... line two
... and line three
... """
>>> print x
 line one
line two
and line three
>>> print reindent(x, 4)
    line one
    line two
    and line three
def addSpaces(s, numAdd):
    white = " "*numAdd
    return white + white.join(s.splitlines(True))
def numSpaces(s):
    return [len(line)-len(line.lstrip( )) for line in s.splitlines( )]
def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError, "removing more spaces than there are!"
    return '\n'.join([ line[numDel:] for line in s.splitlines( ) ])
expandtabs method. Because strings are immutable,
the method returns a new string object, a modified copy of the
original one. However, it's easy to rebind a string
variable name from the original to the modified-copy value:
mystring = mystring.expandtabs( )
expandtabs, by default, uses a tab
length of 8; you can pass expandtabs an integer
argument to use as the tab length.
def unexpand(astring, tablen=8):
    import re
    # split into alternating space and non-space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    # keep track of the total length of the string so far
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        if piece.isspace( ):
            # change each space sequence into tabs+spaces; never emit more
            # blanks than the sequence had (it may not reach a tab stop)
            numblanks = min(lensofar % tablen, thislen)
            numtabs = (thislen-numblanks+tablen-1)/tablen
            pieces[i] = '\t'*numtabs + ' '*numblanks
    return ''.join(pieces)
def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w): return d.get(w, w.join(marker*2))
    else:
        def lookup(w): return d[w]
    parts = format.split(marker)
    parts[1::2] = map(lookup, parts[1::2])
    return ''.join(parts)
if __name__ == '__main__':
    print expand('just "a" test', {'a': 'one'})
    # emits: just one test
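A sketch of the same function adapted for Python 3 (an assumption; the recipe is Python 2), mainly to show the effect of the safe flag. The first parameter is renamed fmt here to avoid shadowing the built-in format, and a list comprehension replaces map so that the slice assignment receives a list:

```python
def expand(fmt, d, marker='"', safe=False):
    if safe:
        # Unknown names come back re-wrapped in the marker.
        def lookup(w): return d.get(w, w.join(marker * 2))
    else:
        def lookup(w): return d[w]
    # After the split, odd-indexed parts are the marked substrings.
    parts = fmt.split(marker)
    parts[1::2] = [lookup(w) for w in parts[1::2]]
    return ''.join(parts)

print(expand('just "a" test', {'a': 'one'}))
print(expand('just "b" test', {'a': 'one'}, safe=True))
try:
    expand('just "b" test', {'a': 'one'})
except KeyError as exc:
    print('KeyError:', exc)
```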
False,
the default, every marked substring must be found in dictionary
d, otherwise expand terminates with
a KeyError exception. When parameter
safe is explicitly passed as
True, marked substrings that are not found in the
dictionary are just left intact in the output string.
KeyError exception gets raised for marked strings
not found in the dictionary. If not required to be safe (the
default), lookup just indexes into dictionary
d and raises an error if the substring is not found.
But, if lookup is required to be
"safe", it uses
d's method get
and supplies as the default the substring being looked up, with a
marker on either side. In this way, by passing safe
as True, you may choose to have unknown formatting
markers come right through to the output rather than raising
exceptions. marker+w+marker would be an obvious
alternative to the chosen w.join(marker*2), but
I've chosen the latter exactly to display a
non-obvious but interesting way to construct such a quoted string.
string.Template class
for this purpose. Here is a snippet of code showing how to use that
class:
import string
# make a template from a string where some identifiers are marked with $
new_style = string.Template('this is $thing')
# use the substitute method of the template with a dictionary argument:
print new_style.substitute({'thing':5}) # emits: this is 5
print new_style.substitute({'thing':'test'}) # emits: this is test
# alternatively, you can pass keyword-arguments to 'substitute':
print new_style.substitute(thing=5) # emits: this is 5
print new_style.substitute(thing='test') # emits: this is test
old_style = 'this is %(thing)s'
%, and
an s right after the closed parenthesis. Then, you
use the % operator, with the format string on the
left of the operator, and a dictionary on the
right:
print old_style % {'thing':5} # emits: this is 5
print old_style % {'thing':'test'} # emits: this is test
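For comparison, a small Python 3 sketch (an assumption; the snippets above are Python 2) putting the two styles side by side. It also uses Template's safe_substitute method, which leaves unknown placeholders in place instead of raising KeyError; the sample strings are made up:

```python
import string

new_style = string.Template('this is $thing for $who')
old_style = 'this is %(thing)s for %(who)s'
data = {'thing': 'a test', 'who': 'you'}

print(new_style.substitute(data))
print(old_style % data)

# With a key missing, substitute raises KeyError; safe_substitute does not.
print(new_style.safe_substitute(thing='a test'))
```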
string.Template class offers a simpler
alternative.
string.Template instance, you may
include a dollar sign ($) by doubling it, and you
may have the interpolated identifier immediately followed by letters
or digits by enclosing it in curly braces ({ }).
Here is an example that requires both of these refinements:
form_letter = '''Dear $customer,
I hope you are having a great time.
If you do not find Room $room to your satisfaction,
let us know. Please accept this $$5 coupon.
Sincerely,
$manager
${name}Inn'''
letter_template = string.Template(form_letter)
print letter_template.substitute({'name':'Sleepy', 'customer':'Fred Smith',
                                  'manager':'Barney Mills', 'room':307,
                                 })
sub method of
re objects (from the re module
in the standard library) makes regular expressions particularly good
at performing string substitutions. Here is a function returning a
modified copy of an input string, where each occurrence of any string
that's a key in a given dictionary is replaced by
the corresponding value in the dictionary:
import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)
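One subtlety: regular-expression alternation tries the alternatives left to right, so if one key is a prefix of another, the shorter one may win depending on the order of the keys. A possible variation, sketched here for Python 3 (an assumption), sorts the keys longest first; the sorting step is my addition, not part of the recipe, and the sample data is made up:

```python
import re

def multiple_replace(text, adict):
    # Longest keys first, so 'ab' is tried before its prefix 'a'.
    keys = sorted(adict, key=len, reverse=True)
    rx = re.compile('|'.join(map(re.escape, keys)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

replacements = {'Larry Wall': 'Guido van Rossum',
                'creator': 'Benevolent Dictator for Life',
                'Perl': 'Python'}
print(multiple_replace('Larry Wall is the creator of Perl', replacements))
print(multiple_replace('ab a', {'a': '1', 'ab': '2'}))
```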
re module to perform
single-pass multiple-string substitution using a dictionary.
Let's say you have a dictionary-based mapping
between strings. The keys are the set of strings you want to replace,
and the corresponding values are the strings with which to replace
them. You could perform the substitution by calling the string method
replace for each key/value pair in the dictionary,
thus processing and creating a new copy of the entire text several
times, but it is clearly better and faster to do all the changes in a
single pass, processing and creating a copy of the text only once.
re.sub's callback facility makes
this better approach quite easy.
re.sub a replacement string, we pass it a callback
argument.
s.endswith(end1) or s.endswith(end2) or
s.endswith(end3) and so on.
itertools.imap function is just as handy for
this task as for many of a similar nature:
import itertools
def anyTrue(predicate, sequence):
    return True in itertools.imap(predicate, sequence)
def endsWith(s, *endings):
    return anyTrue(s.endswith, endings)
import os
for filename in os.listdir('.'):
    if endsWith(filename, '.jpg', '.jpeg', '.gif'):
        print filename
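On newer Pythons the same check can be written without itertools. The following sketch assumes Python 2.5 or later, where any is a built-in and str.endswith also accepts a tuple of suffixes; the name ends_with and the sample filenames are mine:

```python
def ends_with(s, *endings):
    """True if s ends with at least one of the given endings."""
    return any(s.endswith(ending) for ending in endings)

filenames = ['photo.jpeg', 'notes.txt', 'logo.gif']
images = [f for f in filenames if ends_with(f, '.jpg', '.jpeg', '.gif')]
print(images)

# str.endswith takes a tuple of suffixes directly:
print('photo.jpeg'.endswith(('.jpg', '.jpeg', '.gif')))
```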
s.startswith or s.__contains__. Indeed, perhaps it would be better to do
without the helper function endsWith; after
all, directly coding
if anyTrue(filename.endswith, (".jpg", ".gif", ".png")):
unicode type that you can
use in place of the plain bytestring str type.
It's easy, once you accept the need to explicitly
convert between a bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
unicode string representing the German lowercase a
with umlaut (i.e., diaeresis) character
"ä". It has been constructed from
interpreting the bytestring '\xc3\xa4' according
to the specified UTF-8 encoding. There are many encodings, but UTF-8
is often used because it is universal (UTF-8 can encode any Unicode
string) and yet fully compatible with the 7-bit ASCII set (any ASCII
bytestring is a correct UTF-8-encoded string).
str string:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
Unicode string, because operations between a
unicode string and a bytestring always result in a
unicode string, unless they fail and raise an
exception:
>>> bytestring = '\xc3\xa4' # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
0xc3' is not a valid character in the
7-bit ASCII encoding, and Python refuses to guess an encoding. So,
being explicit about encodings is the crucial point for successfully
using Unicode strings with Python.
unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
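In Python 3 (an assumption; everything above is Python 2), the two types are called str (Unicode text) and bytes, and the same round trip might be sketched as follows. Mixing the two types raises TypeError instead of attempting an implicit ASCII decode:

```python
text = "Hello world"

# Encode to bytes with several encodings, then decode each back.
encoded = {name: text.encode(name)
           for name in ("utf-8", "ascii", "ISO-8859-1", "utf-16")}
for name, data in encoded.items():
    assert isinstance(data, bytes)
    assert data.decode(name) == text

# Two UTF-8 bytes decode to a single character.
german_ae = b'\xc3\xa4'.decode('utf-8')
print(len(german_ae))

try:
    "This is a " + b'\xc3\xa4'
except TypeError as exc:
    print('TypeError:', exc)
```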