Credit: Alex Martelli
You need to perform frequent tests for
membership in a sequence. The
O(N) behavior of repeated
in
operators hurts performance, but you
can’t switch to using just a dictionary, as you also
need the sequence’s order.
Say you need to append items to a list only if they’re not already in the list. The simple, naive solution is excellent but may be slow:
def addUnique1(baseList, otherList): for item in otherList: if item not in baseList: baseList.append(item)
If otherList
is large, it may be faster to build
an auxiliary dictionary:
def addUnique2(baseList, otherList): auxDict = {} for item in baseList: auxDict[item] = None for item in otherList: if not auxDict.has_key(item): baseList.append(item) auxDict[item] = None
For a list on which you must often perform membership tests, it may
be best to wrap the list, together with its auxiliary dictionary,
into a class. You can then define a special _ _contains_ _
method to
speed the in
operator. The dictionary must be
carefully maintained to stay in sync with the sequence.
Here’s a version that does the syncing just in time,
when a membership test is required and the dictionary is out of sync,
and works with Python 2.1 or later:
from _ _future_ _ import nested_scopes import UserList try: list._ _getitem_ _ except: Base = UserList.UserList else: Base = list class FunkyList(Base): def _ _init_ _(self, initlist=None): Base._ _init_ _(self, initlist) self._dict_ok = 0 def _ _contains_ _(self, item): if not self._dict_ok: self._dict = {} for item in self: self._dict[item] = 1 self._dict_ok = 1 return self._dict.has_key(item) def _wrapMethod(methname): _method = getattr(Base, methname) def wrapper(self, *args): # Reset 'dictionary OK' flag, then delegate self._dict_ok = 0 return _method(self, *args) setattr(FunkyList, methname, wrapper) for meth in 'setitem delitem setslice delslice iadd'.split( ): _wrapMethod('_ _%s_ _'%meth) for meth in 'append insert pop remove extend'.split( ): _wrapMethod(meth) del _wrapMethod
Python’s in
operator is extremely handy, but
it’s O(N)
when applied to an N-item sequence. If a
sequence is subject to frequent in
tests, and the
items are hashable, an auxiliary dictionary at the
sequence’s side can provide a signficant performance
boost. A membership check (using the in
operator)
on a sequence of N items is
O(N); if
M such tests are performed, the overall time is
O(M x
N). Preparing an auxiliary dictionary whose keys
are the sequence’s items is also roughly
O(N), but the
M tests are roughly
O(M), so overall we have
roughly
O(N+M).
This is rather less than O(N
x M) and can thus offer a
very substantial performance boost when M and
N are large.
Even better overall performance can often be obtained by permanently placing the auxiliary dictionary alongside the sequence, encapsulating both into one object. However, in this case, the dictionary must be maintained as the sequence is modified, so that it stays in sync with the actual membership of the sequence.
The FunkyList
class in this recipe, for example,
extends list
(UserList
in
Python 2.1) and delegates every method to it. However, each method
that can modify list membership is wrapped in a closure that resets a
flag asserting that the auxiliary dictionary is in sync. The
in
operator calls the _ _contains_ _
method when it is applied to an instance that has such a
method. The _ _contains_ _
method rebuilds the
auxiliary dictionary, unless the flag is set, proving that the
rebuilding is unnecessary.
If our program needs to run only on Python 2.2 and later versions, we
can rewrite the _ _contains_ _
method in a much
better way:
def __contains__(self, item): if not self.dict_ok: self._dict = dict(zip(self,self)) self.dict_ok = 1 return item in self._dict
The built-in type dict
, new in Python 2.2, lets us
build the auxiliary dictionary faster and more concisely.
Furthermore, the ability to test for membership in a dictionary
directly with the in
operator, also new in Python
2.2, has similar advantages in speed, clarity, and conciseness.
Instead of building and installing the wrapping closures for all the
mutating methods of the list into the FunkyList
class with the auxiliary function _wrapMethod
, we
could simply write all the needed def
s for the
wrapper methods in the body of FunkyList
, with the
advantage of extending backward portability to Python versions even
older than 2.1. Indeed, this is how I tackled the problem in the
first version of this recipe that I posted to the online Python
cookbook. However, the current version of the recipe has the
important advantage of minimizing boilerplate (repetitious plumbing
code that is boring and voluminous and thus a likely home for bugs).
Python’s advanced abilities for introspection and
dynamic modification give you a choice: you can build method
wrappers, as this recipe does, in a smart and concise way, or you can
choose to use the boilerplate approach anyway, if you
don’t mind repetitious code and prefer to avoid what
some would call the “black magic”
of advanced introspection and dynamic modification of class objects.
Performance characteristics depend on the actual pattern of membership tests versus membership modifications, and some careful profiling may be required to find the right approach for a given use. This recipe, however, caters well to a rather common pattern of use, where sequence-modifying operations tend to happen in bunches, followed by a period in which no sequence modification is performed, but several membership tests may be performed.
Rebuilding the dictionary when needed is far simpler than
incrementally maintaining it at each sequence-modifying step.
Incremental maintenance requires careful analysis of what is being
removed and of what is inserted, particularly upon such operations as
slice assignment. If that strategy is desired, the values in the
dictionary should probably be a count
of the
number of occurrences of each key’s value in the
sequence. A list of the indexes in which the value is present is
another possibility, but that takes even more work to maintain.
Depending on usage patterns, the strategy of incremental maintenance
can be substantially faster or slower.
Of course, all of this is necessary only if the sequence itself is
needed (i.e., if the order of items in the sequence is significant).
Otherwise, keeping just the dictionary is obviously simpler and more
effective. Again, the dictionary can map values to
count
s, if you the need the data structure to be,
in mathematical terms, a bag rather than a set.
An important requisite for any of these membership-test optimizations
is that the values in the sequence must be hashable (otherwise, of
course, they cannot be keys in a dictionary). For example, a list of
tuples might be subjected to this recipe’s
treatment, but for a list of lists the recipe as it stands is not
applicable. You can sometimes use cPickle.dumps
to
create dictionary keys—or, for somewhat different application
needs, the object’s id
—but
neither workaround is always fully applicable. In the case of
cPickle.dumps
, even when it is applicable, the
overhead may negate some or most of the optimization.
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.