Chapter 26. Advanced Library Design: Building a Bloom Filter
Introducing the Bloom Filter
A Bloom filter is a set-like data structure that is highly efficient in its use of space. It supports two operations only: insertion and membership querying. Unlike a normal set data structure, a Bloom filter can give incorrect answers. If we query it to see whether an element that we have inserted is present, it will answer affirmatively. If we query for an element that we have not inserted, it might incorrectly claim that the element is present.
For many applications, a low rate of false positives is tolerable. For instance, the job of a network traffic shaper is to throttle bulk transfers (e.g., BitTorrent) so that interactive sessions (such as ssh sessions or games) see good response times. A traffic shaper might use a Bloom filter to determine whether a packet belonging to a particular session is bulk or interactive. If it misidentifies 1 in 10,000 bulk packets as interactive and fails to throttle it, nobody will notice.
The attraction of a Bloom filter is its space efficiency. If we want to build a spell checker and have a dictionary of 500,000 words, a set data structure might consume 20 megabytes of space. A Bloom filter, in contrast, would consume about half a megabyte, at the cost of missing perhaps 1% of misspelled words.
Behind the scenes, a Bloom filter is remarkably simple. It consists of a bit array and a handful of hash functions. We’ll use k for the number of hash functions. If ...
Get Real World Haskell now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.