Performing Frequent Membership Tests on a Sequence

Credit: Alex Martelli

Problem

You need to perform frequent tests for membership in a sequence. The O(N) behavior of repeated in operators hurts performance, but you can’t switch to using just a dictionary, as you also need the sequence’s order.

Solution

Say you need to append items to a list only if they’re not already in the list. The simple, naive solution is excellent but may be slow:

def addUnique1(baseList, otherList):
    for item in otherList:
        if item not in baseList:
            baseList.append(item)

If otherList is large, it may be faster to build an auxiliary dictionary:

def addUnique2(baseList, otherList):
    auxDict = {}
    for item in baseList:
        auxDict[item] = None
    for item in otherList:
        if not auxDict.has_key(item):
            baseList.append(item)
            auxDict[item] = None

For a list on which you must often perform membership tests, it may be best to wrap the list, together with its auxiliary dictionary, into a class. You can then define a special _ _contains_ _ method to speed the in operator. The dictionary must be carefully maintained to stay in sync with the sequence. Here’s a version that does the syncing just in time, when a membership test is required and the dictionary is out of sync, and works with Python 2.1 or later:

from _ _future_ _ import nested_scopes
import UserList
try: list._ _getitem_ _
except: Base = UserList.UserList
else: Base = list

class FunkyList(Base):
    def _ _init_ _(self, initlist=None):
        Base._ _init_ _(self, initlist)
        self._dict_ok = 0
    def _ _contains_ _(self, item):
        if not self._dict_ok:
            self._dict = {}
            for item in self:
                self._dict[item] = 1
            self._dict_ok = 1
        return self._dict.has_key(item)

def _wrapMethod(methname):
    _method = getattr(Base, methname)
    def wrapper(self, *args):
        # Reset 'dictionary OK' flag, then delegate
        self._dict_ok = 0
        return _method(self, *args)
    setattr(FunkyList, methname, wrapper)
for meth in 'setitem delitem setslice delslice iadd'.split(  ):
    _wrapMethod('_ _%s_ _'%meth)
for meth in 'append insert pop remove extend'.split(  ):
    _wrapMethod(meth)
del _wrapMethod

Discussion

Python’s in operator is extremely handy, but it’s O(N) when applied to an N-item sequence. If a sequence is subject to frequent in tests, and the items are hashable, an auxiliary dictionary at the sequence’s side can provide a signficant performance boost. A membership check (using the in operator) on a sequence of N items is O(N); if M such tests are performed, the overall time is O(M x N). Preparing an auxiliary dictionary whose keys are the sequence’s items is also roughly O(N), but the M tests are roughly O(M), so overall we have roughly O(N+M). This is rather less than O(N x M) and can thus offer a very substantial performance boost when M and N are large.

Even better overall performance can often be obtained by permanently placing the auxiliary dictionary alongside the sequence, encapsulating both into one object. However, in this case, the dictionary must be maintained as the sequence is modified, so that it stays in sync with the actual membership of the sequence.

The FunkyList class in this recipe, for example, extends list (UserList in Python 2.1) and delegates every method to it. However, each method that can modify list membership is wrapped in a closure that resets a flag asserting that the auxiliary dictionary is in sync. The in operator calls the _ _contains_ _ method when it is applied to an instance that has such a method. The _ _contains_ _ method rebuilds the auxiliary dictionary, unless the flag is set, proving that the rebuilding is unnecessary.

If our program needs to run only on Python 2.2 and later versions, we can rewrite the _ _contains_ _ method in a much better way:

def __contains__(self, item):
     if not self.dict_ok:
         self._dict = dict(zip(self,self))
         self.dict_ok = 1
     return item in self._dict

The built-in type dict, new in Python 2.2, lets us build the auxiliary dictionary faster and more concisely. Furthermore, the ability to test for membership in a dictionary directly with the in operator, also new in Python 2.2, has similar advantages in speed, clarity, and conciseness.

Instead of building and installing the wrapping closures for all the mutating methods of the list into the FunkyList class with the auxiliary function _wrapMethod, we could simply write all the needed defs for the wrapper methods in the body of FunkyList, with the advantage of extending backward portability to Python versions even older than 2.1. Indeed, this is how I tackled the problem in the first version of this recipe that I posted to the online Python cookbook. However, the current version of the recipe has the important advantage of minimizing boilerplate (repetitious plumbing code that is boring and voluminous and thus a likely home for bugs). Python’s advanced abilities for introspection and dynamic modification give you a choice: you can build method wrappers, as this recipe does, in a smart and concise way, or you can choose to use the boilerplate approach anyway, if you don’t mind repetitious code and prefer to avoid what some would call the “black magic” of advanced introspection and dynamic modification of class objects.

Performance characteristics depend on the actual pattern of membership tests versus membership modifications, and some careful profiling may be required to find the right approach for a given use. This recipe, however, caters well to a rather common pattern of use, where sequence-modifying operations tend to happen in bunches, followed by a period in which no sequence modification is performed, but several membership tests may be performed.

Rebuilding the dictionary when needed is far simpler than incrementally maintaining it at each sequence-modifying step. Incremental maintenance requires careful analysis of what is being removed and of what is inserted, particularly upon such operations as slice assignment. If that strategy is desired, the values in the dictionary should probably be a count of the number of occurrences of each key’s value in the sequence. A list of the indexes in which the value is present is another possibility, but that takes even more work to maintain. Depending on usage patterns, the strategy of incremental maintenance can be substantially faster or slower.

Of course, all of this is necessary only if the sequence itself is needed (i.e., if the order of items in the sequence is significant). Otherwise, keeping just the dictionary is obviously simpler and more effective. Again, the dictionary can map values to counts, if you the need the data structure to be, in mathematical terms, a bag rather than a set.

An important requisite for any of these membership-test optimizations is that the values in the sequence must be hashable (otherwise, of course, they cannot be keys in a dictionary). For example, a list of tuples might be subjected to this recipe’s treatment, but for a list of lists the recipe as it stands is not applicable. You can sometimes use cPickle.dumps to create dictionary keys—or, for somewhat different application needs, the object’s id—but neither workaround is always fully applicable. In the case of cPickle.dumps, even when it is applicable, the overhead may negate some or most of the optimization.

Python Cookbook by

Performing Frequent Membership Tests on a Sequence

Problem

Solution

Discussion

See Also

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly