Skip to Main Content
Data Algorithms
book

Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced content levelIntermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.
Content preview from Data Algorithms

Chapter 31. The Bloom Filter

This chapter introduces the concept of a Bloom filter and uses it in a reduce-side join, which engages the Bloom filter in the map phase of a MapReduce job. So what is a Bloom filter? How it can be used in a MapReduce environment? How can it speed up a join operation between two big relations/tables? According to Wikipedia:

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not....In other words, a query returns either “possibly in set” or “definitely not in set.” Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.

The Bloom filter data structure may return true for elements that are not actually members of the set (this is called a false-positive error), but it will never return false for elements that are in the set; for each element in the set, the Bloom filter must return true. There is a very nice tutorial on the Bloom filter by Bill Mill. If you’d like to explore this data structure in more depth, Jacob Honoroff has written a good introduction to the Bloom filter data structure.

Bloom Filter Properties

We can summarize Bloom filter properties as follows:

  • Given a big set S = {x1, x2, ..., xn}, the Bloom filter is a probabilistic, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Algorithms with Spark

Data Algorithms with Spark

Mahmoud Parsian
Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Datasets

Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert
Data Mesh

Data Mesh

Zhamak Dehghani
Learning Algorithms

Learning Algorithms

George Heineman

Publisher Resources

ISBN: 9781491906170Errata PageSupplemental Content