Chapter 4. Finding Things
Computers can compute, but that’s not what people use them for, mostly. Mostly, computers store and retrieve information. Retrieve implies find, and in the time since the advent of the Web, search has become a dominant application for people using computers.
As data volumes continue to grow—both absolutely, and relative to the number of people or computers or anything, really—search becomes an increasingly large part of the life of the programmer as well. A few applications lack the need to locate the right morsel in some information store, but very few.
The subject of search is one of the largest in computer science, and thus I won’t try to survey all of it or discuss the mechanics; in fact, I’ll only consider one simple search technique in depth. Instead, I’ll focus on the trade-offs that go into selecting search techniques, which can be subtle.
On Time
You really can’t talk about search without talking about time. There are two different flavors of time that apply to problems of search. The first is the time it takes the search to run, which is experienced by the user who may well be staring at a message saying something like “Loading…”. The second is the time invested by the programmer who builds the search function, and by the programmer’s management and customers waiting to use the program.
Problem: Weblog Data
Let’s look at a sample problem to get a feel for how a search works in real life. I have a directory containing logfiles from my weblog (http://www.tbray.org/ongoing) from early 2003 to late 2006; as of the writing of this chapter, they recorded 140,070,104 transactions and occupied 28,489,788,532 bytes (uncompressed). All these statistics, properly searched, can answer lots of questions about my traffic and readership.
Let’s look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.
Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:
c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] "GET /ongoing/When/200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 "http://www.tbray.org/ongoing/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Reading from left to right, this tells us that:
Somebody from an organization named chello in Sweden,
who provided neither a username nor a password,
contacted my weblog early in the morning of October 8, 2006 (my server’s time zone is seven hours off Greenwich),
and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons
using the HTTP 1.1 protocol;
the request was successful and returned 5,945 bytes;
the visitor had been referred from my blog’s home page,
and was using Internet Explorer 6 running on Windows XP.
This is an example of the kind of line I want: one that records the actual fetch of an article. There are lots of other lines that record fetching stylesheets, scripts, pictures, and so on, and attacks by malicious users. You can spot the kind of line I want by the fact that the article’s name starts with /ongoing/When/ and continues with elements for the decade, year, month, and day.
Our first step, then, should be to find lines that contain something like:
/ongoing/When/200x/2006/10/08/
Whatever language you’re programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.
Regular Expressions
Regular expressions are special languages designed specifically for matching patterns in text. If you learn how to use them well, you’ll save yourself immense amounts of time and irritation. I’ve never met a really accomplished programmer who wasn’t a master of regular expressions (often called regexps for short). Chapter 1, by Brian Kernighan, is dedicated to the beauty of regular expressions.
Because the filenames on my web site match such a strict, date-based pattern, a very straightforward regular expression can find the logfile lines I’m interested in. Other sites’ logfiles might require a more elaborate one. Here it is:
"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ "
A glance at this line instantly reveals one of the problems with regular expressions; they’re not the world’s most readable text. Some people might challenge their appearance in a book called Beautiful Code. Let’s put that issue aside for a moment and look at this particular expression. The only thing you need to know is that in this particular flavor of regular expression:
\d
Means “match any digit, 0 through 9”

[^ .]
Means “match any character that’s not a space or period”[1]

+
Means “match one or more instances of whatever came just before the +”
That [^ .]+, then, means that the last slash has to be followed by a bunch of nonspace and nonperiod characters. There’s a space after the + sign, so the regular expression stops when that space is found.
This regular expression won’t match a line where the filename contains a period. So it will match Grief-Lessons, the example I showed earlier from my logfile, but not IMG0038.jpg.
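To see the pattern do its job, here is a minimal sketch, assuming the same expression as above; the two sample strings are abbreviated stand-ins that I made up, not lines from the actual logfile.

# Try the pattern against an article fetch and an image fetch.
pattern = %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }

article = '"GET /ongoing/When/200x/2006/10/08/Grief-Lessons HTTP/1.1"'
image   = '"GET /ongoing/When/200x/2006/10/08/IMG0038.jpg HTTP/1.1"'

puts(article =~ pattern ? "article line matches" : "no match")
puts(image   =~ pattern ? "image line matches"   : "no match")
# The first prints "article line matches"; the second prints "no match",
# because the period in .jpg stops [^ .]+ before the required space.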
Putting Regular Expressions to Work
A regular expression standing by itself, as shown above, can be used on the command line to search files. But it turns out that most modern computer languages allow you to use them directly in program code. Let’s do that, and write a program that prints out only the lines that match the expression, which is to say a program that records all the times someone fetched an article from the weblog.
This example (and most other examples in this chapter) is in the Ruby programming language because I believe it to be, while far from perfect, the most readable of languages.
If you don’t know Ruby, learning it will probably make you a better programmer. In Chapter 29, the creator of Ruby, Yukihiro Matsumoto (generally known as “Matz”), discusses some of the design choices that have attracted me and so many other programmers to the language.
Example 4-1 shows our first Ruby program, with added line numbers on the left side. (All the examples in this chapter are available from the O’Reilly web site for this book.)
1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }
3     puts line
4   end
5 end
Running this program prints out a bunch of logfile lines that look like the first example. Let’s have a line-by-line look at it:
Line 1

We want to read all the lines of the input, and we don’t care whether they’re from files named on the command line or are being piped in from another program on the standard input. The designers of Ruby believe strongly that programmers shouldn’t have to write ugly code to deal with common situations, and this is a common situation. So, ARGF is a special variable that represents all the input sources. If the command line includes arguments, ARGF assumes they’re names of files and opens them one by one; if there aren’t any, it uses the standard input.

each_line is a method that you can call on pretty well any file-like object, such as ARGF. It reads the lines of input and passes them, one at a time, to a “block” of following code.

The following do says that the block getting the input stretches from there to the corresponding end, and the |line| asks that the each_line method load each line into the variable line before giving it to the block.

This kind of loop may surprise the eyes of a new convert to Ruby, but it’s concise, powerful, and very easy to follow after just a bit of practice.
Line 2

This is a pretty straightforward if statement. The only magic is the =~, which means “matches” and expects to be followed by a regular expression. You can tell Ruby that something is a regular expression by putting slashes before and after it—for example, /this-is-a-regex/. But the particular regular expression we want to use is full of slashes. So to use the slash syntax, you’d have to “escape” them by turning each / into \/, which would be ugly. In this case, therefore, the %r trick produces more beautiful code.

Line 3

We’re inside the if block now. So, if the current line matches the regexp, the program executes puts line, which prints out the line and a line feed.

Lines 4 and 5

That’s about all there is to it. The first end terminates the if, and the second terminates the do. They look kind of silly dangling off the bottom of the code, and the designers of Python have figured out a way to leave them out, which leads to some Python code being more beautiful than the corresponding Ruby.
So far, we’ve shown how regular expressions can be used to find the lines in the logfile that we’re interested in. But what we’re really interested in is counting the fetches for each article. The first step is to identify the article names. Example 4-2 is a slight variation on the previous program.
1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
3     puts $1
4   end
5 end
The differences are subtle. In line 2, I’ve added a pair of parentheses around the interesting part of the article name in the regular expression. In line 3, instead of printing out the whole value of line, I print out $1, which in Ruby (and several other regular-expression-friendly languages) means “the first place in the regular expression marked with parentheses.” You can mark lots of different pieces of the expression, and thus use $2, $3, and so on.
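As a small illustration of my own (not from the logfile example), here is how the numbered variables line up with the parenthesized groups:

# Three groups, so $1, $2, and $3 get filled in after a successful match.
if "alpha/beta/gamma" =~ %r{(\w+)/(\w+)/(\w+)}
  puts $1   # prints alpha
  puts $2   # prints beta
  puts $3   # prints gamma
end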
The first few lines of output produced by running this program over some logfile data look like this:
2003/10/10/FooCampMacs
2006/11/13/Rough-Mix
2003/05/22/StudentLookup
2003/11/13/FlyToYokohama
2003/07/31/PerlAngst
2003/05/21/RDFNet
2003/02/23/Democracy
2005/12/30/Spolsky-Recursion
2004/05/08/Torture
2004/04/27/RSSticker
Before we go to work determining the popularity of different articles, I’d like to argue that in some important ways, this code is beautiful. Take a moment and think of the code you’d have to write to look at an arbitrary chunk of text and do the same matching and selection work done by the parenthesized regexp. It would be quite a few lines of code, and it would be easy to get wrong. Furthermore, if the format of the logfile changed, fixing the pattern matcher would be error-prone and irritating.
Under the covers, the way that regular expressions work is also among the more wonderful things in computer science. It turns out that they can conveniently be translated into finite automata. These automata are mathematically elegant, and there are astoundingly efficient algorithms for matching them against the text you’re searching. The great thing is that when you’re running an automaton, you have to look only once at each character in the text you’re trying to match. The effect is that a well-built regular expression engine can do pattern matching and selection faster than almost any custom code, even if it were written in hand-optimized assembly language. That’s beautiful.
I think that the Ruby code is pretty attractive, too. Nearly every character of the program is doing useful work. Note that there are no semicolons on the ends of the lines, nor parentheses around the conditional block, and that you can write puts line instead of puts(line). Also, variables aren’t declared—they’re just used. This kind of stripped-down language design makes for programs that are shorter and easier to write, as well as (more important) easier to read and easier to understand.
Thinking in terms of time, regular expressions are a win/win. It takes the programmer way less time to write them than the equivalent code, it takes less time to deliver the program to the people waiting for it, it uses the computer really efficiently, and the program’s user spends less time sitting there bored.
Content-Addressable Storage
Now we’re approaching the core of our problem, computing the popularity of articles. We’ll have to pull the article name out of each line, look it up to see how many times it’s been fetched, add one to that number, and then store it away again.
This may be the most basic of search patterns: we start with a key (what we’re using to search—in this case, an article name), and we’re looking for a value (what we want to find—in this case, the number of times the article has been fetched). Here are some other examples:
Key                Value
Word               List of web pages containing the word
Employee number    Employee’s personnel record
Passport number    “true” or “false,” indicating whether the person with that passport should be subject to extra scrutiny
What programmers really want in this situation is a very old idea in computer science: content-addressable memory, also known as an associative store and various other permutations of those words. The idea is to put the key in and get the value out. There actually exists hardware which does just that; it mostly lives deep in the bowels of microprocessors, providing rapid access to page tables and memory caches.
The good news is that you, the programmer, using any modern computer language, have access to excellent software implementations of associative memory. Different languages call these implementations different things. Often they are implemented as hash tables; in Java, Perl, and Ruby, which use this technique, they are called Hashes, HashMaps, or something similar. In Python, they are called dictionaries, and in the computer algebra language Maple, simply tables.
Now if you’re an eager search-algorithm fan just itching to write your own super-efficient search, this may sound like bad news, not good news. But think about those flavors of time; if you use the built-in associative store, the amount of programmer time and management invested in writing search algorithms goes to nearly zero.
By writing your own search, you might be able to save a little computer (and thus end-user) time, compared to the built-in version, but on the other hand, you might not; the people who write these things tend to be pretty clever. Andrew Kuchling has written Chapter 18 of this book on one such effort.
Associative stores are so important that dynamically typed languages such as Ruby and Python have not only built-in support, but special syntax for defining and using them. Let’s use Ruby’s hashes to count article popularity in Example 4-3.
1 counts = {}
2 counts.default = 0
3
4 ARGF.each_line do |line|
5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
6     counts[$1] += 1
7   end
8 end
This program isn’t that much different from the version in Example 4-2. Line 1 creates an empty Hash called counts. Line 2 gives the Hash a “default value” of zero; hold on for an explanation of that.

Then, in line 6, instead of printing out the article name, we use the name as the key to look up the number of fetches of this article seen so far in counts, add one to it, and store the value back.
Now, consider what happens when the program sees some article name stored in $1 for the first time. I could write code along the lines of “if there is a counts[$1], then add one to it; otherwise, set counts[$1] to one.” The designers of Ruby hate that kind of awkwardness; this is why they provided the notion of a “default value” for a Hash. If you look up a key the Hash doesn’t know about, it says “OK, zero,” allowing you to write counts[$1] += 1 and have it always just work.
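To make the contrast concrete, here is a short sketch of my own; the article name is a made-up stand-in.

# The awkward version: an explicit check for a key we haven't seen yet.
counts = {}
name = "2006/10/08/Grief-Lessons"   # hypothetical article name
if counts[name]
  counts[name] += 1
else
  counts[name] = 1
end

# With a default of zero, the check disappears.
counts = {}
counts.default = 0                  # or, equivalently, counts = Hash.new(0)
counts[name] += 1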
I originally stated the problem as “Which of my articles have been read the most?” That’s kind of fuzzy; let’s interpret it to mean “Print out the top 10 most popular articles.” The resulting program is shown in Example 4-4.
 1 counts = {}
 2 counts.default = 0
 3
 4 ARGF.each_line do |line|
 5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
 6     counts[$1] += 1
 7   end
 8 end
 9
10 keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
11 keys_by_count[0 .. 9].each do |key|
12   puts "#{counts[key]}: #{key}"
13 end
Line 10 looks a little less beautiful to me than most Ruby code, but it’s easy enough to understand. The keys method of counts returns an array containing all of the Hash’s keys. Because of the hash implementation, the keys are stored in no predictable order, and are also returned by the keys method in random order. So, I have to sort them and store them back in a new array.
In Ruby, sort is accompanied by a code block, here enclosed in curly braces. (In Ruby, you can delimit a block either with do and end or with { and }.) The sort works its way back and forth through the array being sorted, passing pairs of elements to the block, which has to return a negative number, 0, or a positive number depending on whether the first element is less than, equal to, or greater than the second.
In this case, we want to get the data out of the hash in an order defined by the values (the counts themselves) rather than by the filenames (the keys), so we have to sort the keys by their values. Have a close look at the code, and you’ll see how it works. Because this is something that people do all the time, I’m surprised that Ruby’s Hash doesn’t come with sort_by_value.

We use a decreasing order for the sort so that, no matter how many articles we’ve found, we know the first 10 items in keys_by_count represent the top 10 articles in popularity.
Now that we have an array of keys (article names) sorted in descending order of how many times they’ve been fetched, we can accomplish our assignment by printing out the first 10. Line 11 is simple, but a word is in order about that each method. In Ruby, you almost never see a for statement because anything whose elements you might want to loop through has an each method that does it for you.
Line 12 may be a little hard to read for the non-Rubyist because of the #{} syntax, but it’s pretty straightforward.
So, let’s declare victory on our first assignment. It took us only 13 lines of easy-to-read code. A seasoned Rubyist would have squeezed the last three lines into one.
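One plausible way to fold lines 11–13 of Example 4-4 into a single line, offered here as a sketch rather than the canonical idiom, is:

keys_by_count[0 .. 9].each { |key| puts "#{counts[key]}: #{key}" }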
Let’s run this thing and see what it reports. Instead of running it over the whole 28 GB, let’s just use it on a week’s data: a mere 1.2 million records comprising 245 MB.
~/dev/bc/ 548>zcat ~/ongoing/logs/2006-12-17.log.gz | \
time ruby code/report-counts.rb
4765: 2006/12/11/Mac-Crash
3138: 2006/01/31/Data-Protection
1865: 2006/12/10/EMail
1650: 2006/03/30/Teacup
1645: 2006/12/11/Java
1100: 2006/07/28/Open-Data
900: 2006/11/27/Choose-Relax
705: 2003/09/18/NXML
692: 2006/07/03/July-1-Fireworks
673: 2006/12/13/Blog-PR
13.54 real 7.49 user 0.73 sys
This run took place on my 1.67 GHz Apple PowerBook. The results are unsurprising, but the program does seem kind of slow. Should we worry about performance?
Time to Optimize?
I was wondering whether my sample run was really unreasonably slow, so I pulled together a very similar program in Perl, a language that is less beautiful than Ruby but is extremely fast. Sure enough, the Perl version took half the time. So, should we try to optimize?
We need to think about time again. Yes, we might be able to make this run faster, and thus reduce the program execution time and the time a user spends waiting for it, but to do this we’d have to burn some of the programmer’s time, and thus the time the user waits for the programmer to get the program written. In most cases, my instinct would be that 13.54 seconds to process a week’s data is OK, so I’d declare victory. But let’s suppose we’re starting to get gripes from people who use the program, and we’d like to make it run faster.
Glancing over Example 4-4, we can see that the program falls into two distinct parts. First, it reads all the lines and tabulates the fetches; then it sorts them to find the top 10.
There’s an obvious optimization opportunity here: why bother sorting all the fetch tallies when all we really want to do is pick the top 10? It’s easy enough to write a little code to run through the array once and pick the 10 highest elements.
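Such a single pass might look roughly like the following sketch; this is an illustration of the idea, not code from the chapter.

# Keep only the ten best [count, key] pairs seen so far while
# scanning the hash once, instead of sorting the whole thing.
top = []
counts.each do |key, count|
  if top.size < 10 || count > top.last[0]
    top << [count, key]
    top.sort! { |a, b| b[0] <=> a[0] }   # only ever sorts ten or eleven entries
    top = top[0 .. 9]
  end
end
top.each { |count, key| puts "#{count}: #{key}" }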
Would that help? I found out by instrumenting the program to find out how much time it spent doing its two tasks. The answer was (averaging over a few runs) 7.36 seconds in the first part and 0.07 in the second. Which is to say, “No, it wouldn’t help.”
Might it be worthwhile to try to optimize the first part? Probably not; all it does is match regular expressions, and store and retrieve data using a Hash, and these are among the most heavily optimized parts of Ruby.
So, getting fancy in replacing that sort would probably waste the time of the programmer and the customer waiting for the code, without saving any noticeable amount of computer or waiting-user time. Also, experience would teach that you’re not apt to go much faster than Perl does for this kind of task, so the amount of speedup you’re going to get is pretty well bounded.
We’ve just finished writing a program that does something useful and turns out to be all about search. But we haven’t come anywhere near actually writing any search algorithms. So, let’s do that.
Problem: Who Fetched What, When?
Running a couple of quick scripts over the logfile data reveals that there are 12,600,064 instances of an article fetch coming from 2,345,571 different hosts. Suppose we are interested in who was fetching what, and when? An auditor, a police officer, or a marketing professional might be interested.
So, here’s the problem: given a hostname, report what articles were fetched from that host, and when. The result is a list; if the list is empty, no articles were fetched.
We’ve already seen that a language’s built-in hash or equivalent data structure gives the programmer a quick and easy way to store and look up key/value pairs. So, you might ask, why not use it?
That’s an excellent question, and we should give the idea a try. There are reasons to worry that it might not work very well, so in the back of our minds, we should be thinking of a Plan B. As you may recall if you’ve ever studied hash tables, in order to go fast, they need to have a small load factor; in other words, they need to be mostly empty. However, a hash table that holds 2.35 million entries and is still mostly empty is going to require the use of a whole lot of memory.
To simplify things, I wrote a program that ran over all the logfiles and pulled out all the article fetches into a simple file; each line has the hostname, the time of the transaction, and the article name. Here are the first few lines:
crawl-66-249-72-77.googlebot.com 1166406026 2003/04/08/Riffs
egspd42470.ask.com 1166406027 2006/05/03/MARS-T-Shirt
84.7.249.205 1166406040 2003/03/27/Scanner
(The second field, the 10-digit number, is the standard Unix/Linux representation of time as the number of seconds since the beginning of 1970.)
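If you want to see such a timestamp as a human-readable date, Ruby’s Time.at does the conversion; this is just an illustrative one-liner, and the exact rendering depends on your time zone.

puts Time.at(1166406026)   # a moment in mid-December 2006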
Then I wrote a simple program to read this file and load a great big hash. Example 4-5 shows the program.
 1 class BigHash
 2
 3   def initialize(file)
 4     @hash = {}
 5     lines = 0
 6     File.open(file).each_line do |line|
 7       s = line.split
 8       article = s[2].intern
 9       if @hash[s[0]]
10         @hash[s[0]] << [ s[1], article ]
11       else
12         @hash[s[0]] = [ s[1], article ]
13       end
14       lines += 1
15       STDERR.puts "Line: #{lines}" if (lines % 100000) == 0
16     end
17   end
18
19   def find(key)
20     @hash[key]
21   end
22
23 end
The program should be fairly self-explanatory, but line 15 is worth a note. When you’re running a big program that’s going to take a lot of time, it’s very disturbing when it works away silently, maybe for hours. What if something’s wrong? What if it’s going incredibly slow and will never finish? So, line 15 prints out a progress report after every 100,000 lines of input, which is reassuring.
Running this program was interesting. It took about 55 minutes of CPU time to load up the hash, and the program grew to occupy 1.56 GB of memory. A little calculation suggests that it costs around 680 bytes to store the information for each host, or slicing the data another way, about 126 bytes per fetch. This is a little scary, but probably reasonable for a hash table.
Retrieval performance was excellent. I ran 2,000 queries, half of which were randomly selected hosts from the log and thus succeeded, while the other half were those same hostnames reversed, none of which succeeded. The 2,000 queries completed in an average of about .02 seconds, so Ruby’s hash implementation can look up records in a hash containing 12 million or so records thousands of times per second.
Those 55 minutes to load up the data are troubling, but there are some tricks to address that. You could, for example, load it up once, then serialize the hash out and read it back in. And I didn’t try particularly hard to optimize the program.
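A sketch of that serialize-and-reload trick, using Ruby’s built-in Marshal; the filenames are my own invention, and whether it actually beats re-parsing the logs would need to be measured.

# Build the hash once, then dump it to disk so later runs can skip the long load.
big = BigHash.new('fetches.txt')    # hypothetical file of host/time/article lines
File.open('fetches.marshal', 'wb') { |f| Marshal.dump(big, f) }

# A later run restores the whole object without touching the logfiles.
big = File.open('fetches.marshal', 'rb') { |f| Marshal.load(f) }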
The program was easy and quick to write, and it runs fast once it’s initialized, so its performance is good both in terms of waiting-for-the-program time and waiting-for-the-programmer time. Still, I’m unsatisfied. I have the feeling that there ought to be a way to get this kind of performance while burning less memory, less startup time, or both. It involves writing our own search code, though.
Binary Search
Nobody gets a Computer Science degree without studying a wide variety of search algorithms: trees, heaps, hashes, lists, and more. My favorite among all these is binary search. Let’s try it on the who-fetched-what-when problem and then look at what makes it beautiful.
My first attempt at putting binary search to use was quite disappointing; while the data took 10 minutes less to load, it required almost 100 MB more memory than with the hash. Clearly, there are some surprising things about the Ruby array implementation. The search also ran several times slower (but still in the range of thousands per second), but this is not surprising at all because the algorithm is running in Ruby code rather than with the underlying hardcoded hash implementation.
The problem is that in Ruby everything is an object, and arrays are fairly abstracted things with lots of built-in magic. So, let’s reimplement the program in Java, in which integers are just integers, and arrays come with very few extras.[2]
Nothing could be simpler, conceptually, than binary search. You divide your search space in two and see whether you should be looking in the top or bottom half; then you repeat the exercise until done. Instructively, there are a great many ways to code this algorithm incorrectly, and several widely published versions contain bugs. The implementation mentioned in “On the Goodness of Binary Search,” and shown in Java below, is based on one I learned from Gaston Gonnet, the lead developer of the Maple language for symbolic mathematics and currently Professor of Computer Science at ETH in Zürich.
 1 package binary;
 2
 3 public class Finder {
 4   public static int find(String[] keys, String target) {
 5     int high = keys.length;
 6     int low = -1;
 7     while (high - low > 1) {
 8       int probe = (low + high) >>> 1;
 9       if (keys[probe].compareTo(target) > 0)
10         high = probe;
11       else
12         low = probe;
13     }
14     if (low == -1 || keys[low].compareTo(target) != 0)
15       return -1;
16     else
17       return low;
18   }
19 }
Key aspects of this program are as follows:
In lines 5–6, note that the high and low bounds are set one off the ends of the array, so neither is initially a valid index. This eliminates all sorts of corner cases.

The loop that starts in line 7 runs until the high and low bounds are adjacent; there is no testing to see whether the target has been found. Think for a minute whether you agree with this choice; we’ll return to the question later.

The loop has two invariants. low is either –1 or points to something less than or equal to the target value. high is either one off the top of the array or points to something strictly greater than the target value.

Line 8 is particularly interesting. In an earlier version it read:
probe = (high + low) / 2;
but in June 2006, Java guru Josh Bloch showed how, in certain obscure circumstances, that code could lead to integer overflow (see http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html). It is sobering indeed that, many decades into the lifetime of computer science, we are still finding bugs in our core algorithms. (The issue is also discussed by Alberto Savoia in Chapter 7.)
At this point, Rubyists will point out that modern dynamic languages such as Ruby and Python take care of integer overflow for you, and thus don’t have this bug.
Because of the loop invariant, once I’m done with the loop, I just need to check low (lines 14–17). If it’s not –1, either it points to something that matches the target, or the target isn’t there.
The Java version took only six and a half minutes to load, and it ran successfully, using less than 1 GB of heap. Also, while it’s harder to measure CPU time in Java than in Ruby, there was no perceptible delay in running the same 2,000 searches.
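As an aside, the same algorithm rendered in Ruby (my own transliteration; the chapter doesn’t show its Ruby attempt) is nearly line for line the same, which says something about how simple binary search really is.

# Binary search over a sorted array, mirroring the Java version above:
# returns the index of target in keys, or -1 if it isn't there.
def find(keys, target)
  high = keys.length
  low = -1
  while high - low > 1
    probe = (low + high) / 2      # Ruby integers don't overflow, so this is safe
    if keys[probe] > target
      high = probe
    else
      low = probe
    end
  end
  if low == -1 || keys[low] != target
    -1
  else
    low
  end
end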
Binary Search Trade-offs
Binary search has some very large advantages. First of all, its performance is O(log2 N). People often don’t really grasp how powerful this is. On a 32-bit computer, the biggest log2 you’ll ever encounter is 32 (similarly, 64 on a 64-bit computer), and any algorithm that completes in an upper bound of a few dozen steps will be “good enough” for many real-world scenarios.
Second, the binary-search code is short and simple. Code that is short and simple is beautiful, for a bunch of reasons. Maybe the most important is that it’s easier to understand, and understanding code is harder than writing it. There are fewer places for bugs to hide. Also, compact code plays better with instruction sets, I-caches, and JIT compilers, and thus tends to run faster.
Third, once you’ve got that sorted array, you don’t need any more index structures; binary search is very space-efficient.
The big downside to binary search is that the data has to be kept in order in memory. There are some data sets for which this is impossible, but fewer than you might think. If you think you have too much data to fit in memory, check the price of RAM these days and make sure. Any search strategy that requires going to disk is going to be immensely more complex, and in many scenarios slower.
Suppose you need to update the data set; you might think that would rule out binary search because you have to update a huge, contiguous array in memory. But that turns out to be easier than you might think. In fact, your program’s memory is scattered randomly all over the computer’s physical RAM, with the operating system’s paging software making it look sequential; you can do the same kind of trick with your own data.
Some might argue that since a hash table is O(1), that has to be better than binary search’s O(log2 N). In practice, the difference may not be that significant; set up an experiment sometime and do some measurements. Also, consider that hash tables, with the necessary collision-resolution code, are considerably more complex to implement.
I don’t want to be dogmatic, but in recent years, I’ve started to take the following approach to search problems:
Try to solve it using your language’s built-in hash tables.
Then try to solve it with binary search.
Only then should you reluctantly start to consider other more complex options.
Escaping the Loop
Some look at my binary-search algorithm and ask why the loop always runs to the end without checking whether it’s found the target. In fact, this is the correct behavior; the math is beyond the scope of this chapter, but with a little work, you should be able to get an intuitive feeling for it—and this is the kind of intuition I’ve observed in some of the great programmers I’ve worked with.
Let’s think about the progress of the loop. Suppose you have n elements in the array, where n is some really large number. The chance of finding the target the first time through is 1/n, a really small number. The next iteration (after you divide the search set in half) is 1/(n/2)—still small—and so on. In fact, the chance of hitting the target becomes significant only when you’re down to 10 or 20 elements, which is to say maybe the last four times through the loop. And in the case where the search fails (which is common in many applications), those extra tests are pure overhead.
You could do the math to figure out when the probability of hitting the target approaches 50 percent, but qualitatively, ask yourself: does it make sense to add extra complexity to each step of an O(log2 N) algorithm when the chances are it will save only a small number of steps at the end?

The take-away lesson is that binary search, done properly, is a two-step process. First, write an efficient loop that positions your low and high bounds properly, then add a simple check to see whether you hit or missed.
Search in the Large
When most people think of search they think of web search, as offered by Yahoo!, Google, and their competitors. While ubiquitous web search is a new thing, the discipline of full-text search upon which it is based is not. Most of the seminal papers were written by Gerard Salton at Cornell as far back as the early 1960s. The basic techniques for indexing and searching large volumes of text have not changed dramatically since then. What has changed is how result ranking is done.[3]
Searching with Postings
The standard approach to full-text search is based on the notion of a posting, which is a small, fixed-size record. To build an index, you read all the documents and, for each word, create a posting that says word x appears in document y at position z. Then you sort all the words together, so that for each unique word you have a list of postings, each a pair of numbers consisting of a document ID and the text’s offset in that document.
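Here is a toy version of that indexing step in Ruby; the two documents, the whitespace tokenization, and the variable names are all my own simplifications.

# Build a tiny postings index: for each word, a list of
# [document_id, word_offset] pairs, in document order.
docs = [
  "hello world",          # document 0
  "goodbye cruel world",  # document 1
]

postings = Hash.new { |h, k| h[k] = [] }
docs.each_with_index do |text, doc_id|
  text.split.each_with_index do |word, offset|
    postings[word] << [doc_id, offset]
  end
end

p postings["world"]   # prints [[0, 1], [1, 2]]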
Because postings are small and fixed in size, and because you tend to have a huge number of them, a natural approach is to use binary search. I have no idea of the details of how Google or Yahoo! do things, but I’d be really unsurprised to hear that those tens of thousands of computers spend a whole lot of their time binary-searching big arrays of postings.
People who are knowledgeable about search shared a collective snicker a few years ago when the number of documents Google advertised as searching, after having been stuck at two billion and change for some years, suddenly became much larger and then kept growing. Presumably they had switched the document ID in all those postings from 32-bit to 64-bit numbers.
Ranking Results
Given a word, searching a list of postings to figure out which documents contain it is not rocket science. A little thought shows that combining the lists to do AND and OR queries and phrase search is also simple, conceptually at least. What’s hard is sorting the result list so that the good results show up near the top. Computer science has a subdiscipline called Information Retrieval (IR for short) that focuses almost entirely on this problem. Historically, the results were quite poor; only in recent years have they improved substantially.
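To make the “combining the lists” point concrete, here is a sketch of my own of an AND merge, under the assumption that each postings list is sorted by document ID, as in the toy index shown earlier.

# Walk two postings lists in step, collecting the IDs of documents
# that appear in both; each posting is a [doc_id, offset] pair.
def and_query(postings_a, postings_b)
  result = []
  i = j = 0
  while i < postings_a.length && j < postings_b.length
    doc_a = postings_a[i][0]
    doc_b = postings_b[j][0]
    if doc_a == doc_b
      result << doc_a unless result.last == doc_a
      i += 1
      j += 1
    elsif doc_a < doc_b
      i += 1
    else
      j += 1
    end
  end
  result
end

# With the toy index above, and_query(postings["hello"], postings["world"])
# returns [0]: only document 0 contains both words.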
Searching the Web
Google and its competitors have been able to produce good results in the face of unimaginably huge data sets and populations of users. When I say “good,” I mean that high-quality results appear near the top of the result list, and that the result list appears quickly.
The promotion of high-quality results is a result of many factors, the most notable of which is what Google calls PageRank, based largely on link counting: pages with lots of hyperlinks pointing at them are deemed to be more popular and thus, by popular vote, winners.
In practice, this seems to work well. A couple of interesting observations follow. First, until the rise of PageRank, the leaders in the search-engine space were offerings such as Yahoo! and DMoz, which worked by categorizing results; so, the evidence seems to suggest that it’s more useful to know how popular something is than to know what it’s about.
Second, PageRank is applicable only to document collections that are richly populated with links back and forth between the documents. At the moment, two document collections qualify: the World Wide Web and the corpus of peer-reviewed academic publications (which have applied PageRank-like methods for decades).
The ability of large search engines to scale up with the size of data and number of users has been impressive. It is based on the massive application of parallelism: attacking big problems with large numbers of small computers, rather than a few big ones. One of the nice things about postings is that each posting is independent of all the others, so they naturally lend themselves to parallel approaches.
For example, an index based on doing binary search in arrays of postings is fairly straightforward to partition. In an index containing only English words, you could easily create 26 partitions (the term used in the industry is shards), one for words beginning with each letter. You can then make as many copies as you need of each shard, and a huge volume of word-search queries can be farmed out across an arbitrarily large collection of cooperating search nodes.
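A toy illustration of my own of routing query words to per-letter shards; shard_for is a stand-in for however a real system names its partitions.

# Send each query word to the shard that owns its first letter.
def shard_for(word)
  "shard-#{word.downcase[0]}"
end

["finding", "things"].each { |word| puts "#{word} -> #{shard_for(word)}" }
# prints:
#   finding -> shard-f
#   things -> shard-t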
This leaves the problem of combining search results for multiword or phrase searches, and this requires some real innovation, but it’s easy to see how the basic word-search function could be parallelized.
This discussion is a little unfair in that it glosses over a huge number of important issues, notably including fighting the Internet miscreants who continually try to outsmart search-engine algorithms for commercial gain.
Conclusion
It is hard to imagine any computer application that does not involve storing data and finding it based on its content. The world’s single most popular computer application, web search, is a notable example.
This chapter has considered some of the issues, notably bypassing the traditional “database” domain and the world of search strategies that involve external storage. Whether operating at the level of a single line of text or billions of web documents, search is central. From the programmer’s point of view, it also needs to be said that implementing searches of one kind or another is, among other things, fun.
[1] People who have used regular expressions know that a period is a placeholder for “any character,” but it’s harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.
[2] This discussion of binary search borrows heavily from my 2003 piece, "On the Goodness of Binary Search,” available online at http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary.
[3] This discussion of full-text search borrows heavily from my 2003 series, On Search, available online at http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC. The series covers the topic of search quite broadly, including issues of user experience, quality control, natural language processing, intelligence, internationalization, and so on.