Processing Selected Pairs of Structured Data Efficiently
Credit: Alex Martelli, David Ascher
Problem
You need to efficiently process pairs of data from two large and related data sets.
Solution
Use an auxiliary dictionary to preprocess the data, thereby reducing the
need for iteration over mostly irrelevant data. For instance, if xs and ys
are the two data sets, with matching keys as the first item in each entry,
so that x[0] == y[0] defines an “interesting” pair:
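# Index ys by key, so each x looks up only its matching entries
auxdict = {}
for y in ys: auxdict.setdefault(y[0], []).append(y)
# .get with an empty default skips any x whose key never appears in ys
result = [ process(x, y) for x in xs for y in auxdict.get(x[0], ()) ]
Discussion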
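As a minimal sketch of the auxiliary-dictionary technique run end to end, the snippet below uses hypothetical sample data and a placeholder process function (neither comes from the recipe itself). It looks keys up with .get and an empty default, so an entry in xs whose key never appears in ys is skipped rather than raising KeyError, and it checks the result against the straightforward nested loop.

```python
# Hypothetical sample data: each entry is a tuple whose first item is the key.
xs = [("a", 1), ("b", 2), ("c", 3)]
ys = [("a", "x"), ("a", "y"), ("c", "z"), ("d", "w")]

def process(x, y):
    # Placeholder for the real per-pair computation (an assumption).
    return (x[0], x[1], y[1])

# One pass over ys builds the index: key -> list of ys entries with that key.
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)

# One pass over xs; only the matching ys entries are visited for each x.
result = [process(x, y) for x in xs for y in auxdict.get(x[0], ())]

# Nested-loop version for comparison: it examines every (x, y) combination.
naive = [process(x, y) for x in xs for y in ys if x[0] == y[0]]
```

Both versions yield the same pairs in the same order. The difference is that the indexed version does work roughly proportional to len(xs) + len(ys) plus the number of matching pairs, while the nested loop always performs len(xs) * len(ys) comparisons.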
To make the problem more concrete, let’s look at an
example. Say you need to analyze data about visitors to a web site
who have purchased something online. This means you need to perform
some computation based on data from two log files—one from the
web server and one from the credit-card processing framework. Each
log file is huge, but only a small number of the web server log
entries correspond to credit-card log entries. Let’s
assume that cclog is a sequence of records, one
for each credit-card transaction, and that weblog
is a sequence of records describing each web site hit.
Let’s further assume that each record uses the
attribute ipaddress to refer to the IP address
involved in each event. In this case, a reasonable first approach
would be to do something like:
results = [ process(webhit, ccinfo) for webhit in weblog for ccinfo in cclog \
            if ccinfo.ipaddress==webhit.ipaddress ...