Credit: Alex Martelli
You have a sequence that may include duplicates, and you need to remove the duplicates in the fastest possible way. Also, the output sequence must respect the item ordering of the input sequence.
The need to respect the item ordering of the input sequence means that picking unique items is a very different problem from the one explored in Recipe 17.4. This kind of need often arises in conjunction with a function f that defines an equivalence relation among items (i.e., x is equivalent to y if and only if f(x)==f(y)), in which case the need to remove duplicates may be better described as picking the first representative of each occurring equivalence class:
# f defines an equivalence relation among items of sequence seq, and
# f(x) must be hashable for each item x of seq (e.g., cPickle.dumps)
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:    # f's default is the identity function
        def f(x): return x
    already_seen = {}
    result = []
    for item in seq:
        marker = f(item)
        # Python 2.2-ism; in older Pythons, use not already_seen.get(marker, 0)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    return result
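For instance, here is a quick, hypothetical check of uniquer (the sample data and the use of str.lower as the equivalence function are illustrative assumptions, not part of the recipe):

# hypothetical demonstration: group words case-insensitively, so the
# first spelling seen of each word is the one that survives
words = ['Spam', 'eggs', 'SPAM', 'Eggs', 'ham']
print uniquer(words, f=str.lower)        # prints ['Spam', 'eggs', 'ham']
print uniquer([1, 2, 1, 3, 2, 1])        # prints [1, 2, 3]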
Picking the most recent (last occurring) representative of each equivalence class is a bit harder:
def uniquest(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    However, it's O(N+N1*log(N1)), in which N1 is the count of "unique" items. """
    import sys
    if f is None:
        def f(x): return x
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        already_seen[marker] = index, item
    auxlist = already_seen.values()
    auxlist.sort()    # the O(N1*log(N1)) step
    return [item for index, item in auxlist]
def uniquique(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    O(N), but slower than uniquest in many practical cases. """
    if f is None:
        def f(x): return x
    already_seen = {}
    result = []
    seq = list(seq)
    seq.reverse()
    for item in seq:
        marker = f(item)
        # Python 2.2-ism; in older Pythons, use not already_seen.get(marker, 0)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    result.reverse()
    return result
def uniquoque(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    Also O(N). """
    import sys
    if f is None:
        def f(x): return x
    where_seen = {}
    output_this_item = [0]*len(seq)
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        previously_seen = where_seen.get(marker)
        if previously_seen is not None:
            output_this_item[previously_seen] = 0
        output_this_item[index] = 1
        where_seen[marker] = index
    return [item for item, output_this in zip(seq, output_this_item)
            if output_this]
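All three last-occurrence functions should agree on their results; a small, hypothetical sanity check (sample data assumed, as above) might be:

# each of the three keeps the *last* spelling of each word, case-insensitively
words = ['Spam', 'eggs', 'SPAM', 'Eggs', 'ham']
for func in (uniquest, uniquique, uniquoque):
    print func(words, f=str.lower)       # each prints ['SPAM', 'Eggs', 'ham']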
These functions can be made more general (without adding substantial complication) by adding another argument p, which is a function that picks the most suitable item of each equivalence class, either when presented with a pair of candidates (index and item) or with a list of indexes and items for each whole equivalence class:
def fancy_unique(seq, f, p):
    """ Keeps "most-appropriate" item of each f-defined equivalence class,
    with precedence function p doing pairwise choice of (index, item) """
    import sys
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        if already_seen.has_key(marker):    # or, "if marker in already_seen"
            # It's NOT a problem to rebind index and item within the
            # for loop: the next leg of the loop does not use their binding
            index, item = p((index, item), already_seen[marker])
        already_seen[marker] = index, item
    auxlist = already_seen.values()
    auxlist.sort()
    return [item for index, item in auxlist]

def fancier_uniquer(seq, f, p):
    """ Keeps "most-appropriate" item of each f-defined equivalence class,
    with precedence function p choosing appropriate (index, item) for each
    equivalence class from the list of candidates passed to it """
    import sys
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        already_seen.setdefault(marker, []).append((index, item))
    auxlist = [p(candidates) for candidates in already_seen.values()]
    auxlist.sort()
    return [item for index, item in auxlist]
Recipe 17.4 is applicable only if you do not care about item ordering or, in other words, if the sequences involved are meaningful only as sets of items, which is often the case. When sequential order is significant, a different approach is needed.
If the items are hashable, it's not hard to maintain sequence order, keeping only the first occurrence of each value. If the items are not hashable, but are of types supported by cPickle.dumps, it might be worth using this function for long-enough sequences.

Another possibility suggested by this approach is to handle uniqueness within equivalence classes. In other words, have the uniqueness function accept as an argument a function f that must return hashable objects, such that f(x)==f(y) if and only if items x and y are equivalent. Identity (in the mathematical sense, not in the Python sense) is used as the default if no argument f is supplied, but the caller can pass cPickle.dumps or whatever other equivalence-defining function is appropriate. This approach is shown in the uniquer function in the solution.
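For example, cPickle.dumps can serve as f when the items themselves are unhashable lists (a hypothetical snippet, not part of the recipe):

import cPickle
# the items (lists) are not hashable, but their pickled string
# representations are, so those strings can act as dictionary keys
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
print uniquer(data, f=cPickle.dumps)     # prints [[1, 2], [3, 4], [5, 6]]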
If you need to keep the last occurrence rather than the earliest occurrence of an item in each equivalence class, a different approach may be appropriate, as shown in the uniquest function in the solution. In this case, we do one pass through the input sequence, associating the latest index in it to each equivalence class, then sort those indexes to reconstruct the ordering for the output sequence. However, the sort degrades performance to O(N1*log(N1)), in which N1 is the number of unique items. To keep the last occurrence with O(N) performance, it's simplest to reverse the input sequence (or a copy thereof into a local list, since the input sequence might be immutable) and reverse the result, as shown in uniquique. An alternative approach, shown in uniquoque, is to build and maintain a list of flags parallel to seq, in which each flag is true if and only if the corresponding item must be part of the output sequence. Then we can use a list comprehension (or a loop) to build the output in a separate second pass. Each of these general idioms has many uses and is worth keeping in mind as a worthwhile sequence-processing technique.
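For instance, the parallel-flags idiom adapts easily to related tasks; here is an illustrative sketch (my own, not part of the recipe) that keeps, in order, only those items occurring exactly once in the sequence:

def only_singletons(seq):
    """ Illustrative sketch: the parallel-flags idiom, here keeping only
    the items that occur exactly once in seq (items must be hashable) """
    import sys
    first_seen = {}
    keep = [1]*len(seq)
    for item, index in zip(seq, xrange(sys.maxint)):
        if item in first_seen:
            keep[first_seen[item]] = 0     # drop the first occurrence...
            keep[index] = 0                # ...and this later one, too
        else:
            first_seen[item] = index
    return [item for item, flag in zip(seq, keep) if flag]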
But coming back to uniquest, it's interesting to notice that it easily generalizes to cases in which the choice among multiple items in the same equivalence class depends on an arbitrary precedence function p that considers both the actual items and their indexes of occurrence. As long as function p can operate pairwise, you only need to replace the simple assignment used in uniquest:
already_seen[marker] = index, item
with a call to the precedence function, which returns the (index, item) pair for the chosen occurrence among the two. Precedence functions that need to examine the whole set of equivalent items to make their choice can also be accommodated, of course, but you need to build the set in one pass and perform only the selections when that pass is finished. These fancy approaches are clearly only useful for substantial equivalence functions (not for identity, nor for functions meant to act as proxies for identity, such as cPickle.dumps), so f defaulting to the identity function has been removed from the fancy_unique and fancier_uniquer functions, which show these (perhaps overgeneralized) approaches.
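For example (hypothetical data and precedence function, purely for illustration), fancier_uniquer lets the precedence function inspect every candidate of a class at once:

def pick_shortest(candidates):
    # hypothetical whole-class precedence: keep the shortest word of the
    # class; among equally short words, keep the earliest occurrence
    best_index, best_word = candidates[0]
    for index, word in candidates[1:]:
        if len(word) < len(best_word):
            best_index, best_word = index, word
    return best_index, best_word

def first_letter(aword): return aword[0].lower()
words = ['parrot', 'ham', 'pie', 'spam', 'hake', 'salmon']
print fancier_uniquer(words, first_letter, pick_shortest)
# prints ['ham', 'pie', 'spam']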
An example of fancy_unique may help. Say we're given a list of words, and we need to get a sublist from it, respecting order, such that no two words on the sublist begin with the same letter. Out of all the words in the original list that begin with each given letter, we need to keep the longest word and, in case of equal lengths, the word appearing later on the list. This sounds complicated, but with fancy_unique to help us, it's really not that bad:
def complicated_choice(words):
    def first_letter(aword):
        return aword[0].lower()
    def prefer((indx1, word1), (indx2, word2)):
        if len(word2) > len(word1):
            return indx2, word2
        else:
            return indx1, word1
    return fancy_unique(words, first_letter, prefer)
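With a hypothetical list of words (the sample data is an assumption for illustration), the result comes out ordered by the index of each chosen word:

words = ['spam', 'sausage', 'ham', 'eggs', 'bacon', 'Egg', 'hamster']
print complicated_choice(words)
# prints ['sausage', 'eggs', 'bacon', 'hamster']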
The prefer function is simplified, because it knows fancy_unique always calls it with indx2<indx1. So the older indx2, word2 pair must be returned only when word2 is longer than word1; otherwise, indx1, word1 is always the proper result. The automatic tuple unpacking in prefer's signature is debatable, stylewise, but I personally like it (it reminds me of Haskell).
Out of all the general programming techniques presented in the various functions of this recipe, the most precious is that of writing higher-order functions, which organize a computation and appropriately call back to functions they receive as arguments. This technique is well worth keeping in mind in several circumstances, not just for old Haskell-heads, as it often works great in Python.