Credit: Andy McKay, Chris Perkins, Sami Hangaslammi
Dictionaries are a good concrete representation for sets in Python, so operations such as intersections are common. Say you have two dictionaries (but pretend that they each contain thousands of items):
some_dict = { 'zope':'zzz', 'python':'rocks' } another_dict = { 'python':'rocks', 'perl':'$' }
Here’s a bad way to find their intersection that is very slow:
intersect = [] for item in some_dict.keys( ): if item in another_dict.keys( ): intersect.append(item) print "Intersects:", intersect
And here’s a good way that is simple and fast:
intersect = [] for item in some_dict.keys( ): if another_dict.has_key(item): intersect.append(item) print "Intersects:", intersect
In Python 2.2, the following is elegant and even faster:
print "Intersects:", [k for k in some_dict if k in another_dict]
And here’s an alternate approach that wins hands down in speed, for Python 1.5.2 and later:
print "Intersects:", filter(another_dict.has_key, some_dict.keys())
The
keys
method produces a list of all the keys of a dictionary. It can be
pretty tempting to fall into the trap of just using
in
, with this list as the righthand side, to test
for membership. However, in the first example,
you’re looping through all of
some_dict
, then each time looping through all of
another_dict
. If some_dict
has
N1 items, and another_dict
has N2 items, your intersection operation will
have a compute time proportional to the product of
N1x
N2.
(O(N1x N2) is the common
computer-science notation to indicate this.)
By using the
has_key
method, you are not looping on another_dict
any
more, but rather checking the key in the
dictionary’s hash table. The processing time for
has_key
is basically independent of dictionary
size, so the second approach is
O(N1). The difference is
quite substantial for large dictionaries! If the two dictionaries are
very different in size, it becomes important to use the smaller one
in the role of some_dict
, while the larger one
takes on the role of another_dict
(i.e., loop on
the keys of the smaller dictionary, thus picking the smaller
N1).
Python 2.2 lets you iterate on a dictionary’s keys directly, with the statement:
for key in dict
You can test membership with the equally elegant:
if key in dict
rather than the equivalent but syntactically less nice
dict.has_key(key)
. Combining these two small but
nice innovations of Python 2.2 with the list-comprehension notation
introduced in Python 2.0, we end up with a very elegant approach,
which is at the same time concise, clear, and quite speedy.
However, the fastest approach is the one that uses
filter
with the bound method another_dict.has_key
on the
list some_dict.keys
. A typical intersection of two
500-item dictionaries with 50% overlap, on a typical cheap machine of
today (AMD Athlon 1.4GHz, DDR2100 RAM, Mandrake Linux 8.1), took 710
microseconds using has_key
, 450 microseconds using
the Python 2.2 technique, and 280 microseconds using the
filter
-based way. While these speed differences
are almost substantial, they pale in comparison with the timing of
the bad way, for which a typical intersection took 22,600
microseconds—30 times longer than the simple way and 80 times
longer than the filter
-based way!
Here’s the timing code, which shows a typical
example of how one goes about measuring relative speeds of equivalent
Python constructs:
import time def timeo(fun, n=1000): def void( ): pass start = time.clock( ) for i in range(n): void( ) stend = time.clock( ) overhead = stend - start start = time.clock( ) for i in range(n): fun( ) stend = time.clock( ) thetime = stend-start return fun._ _name_ _, thetime-overhead to500 = {} for i in range(500): to500[i] = 1 evens = {} for i in range(0, 1000, 2): evens[i] = 1 def simpleway( ): result = [] for k in to500.keys( ): if evens.has_key(k): result.append(k) return result def pyth22way( ): return [k for k in to500 if k in evens] def filterway( ): return filter(evens.has_key, to500.keys( )) def badsloway( ): result = [] for k in to500.keys( ): if k in evens.keys( ): result.append(k) return result for f in simpleway, pyth22way, filterway, badsloway: print "%s: %.2f"%timeo(f)
You can save this code into a .py
file and run
it (a few times, on an otherwise quiescent machine, of course) with
python -O
to check how the timings of the various
constructs compare on any specific machine in which
you’re interested. (Note that this script requires
Python 2.2 or later.) Timing different code snippets to find out how
their relative speeds compare is an important Python technique, since
intuition is a notoriously unreliable guide to such relative-speed
comparisons. For detailed and general instruction on how to time
things, see the introduction to Chapter 17.
When applicable without having to use a lambda
form or a specially written function, filter
,
map
,
and
reduce
often offer the fastest solution to any given problem. Of course, a
clever Pythonista cares about speed only for those very, very few
operations where speed really matters more than clarity, simplicity,
and elegance! But these built-ins are pretty elegant in their own
way, too.
We don’t have a separate recipe for the union of the
keys of two dictionaries, but that’s because the
task is even easier, thanks to a dictionary’s
update
method:
def union_keys(some_dict, another_dict): temp_dict = some_dict.copy( ) temp_dict.update(another_dict) return temp_dict.keys( )
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.