Jaccard similarity for large sets with MinHash

The Bloom filter is a probabilistic data structure to determine whether an item is a member of a set. While comparing user or item similarities, what we are usually interested in is the intersection between sets, as opposed to their precise contents. MinHash is a technique that enables a large set to be compressed in such a way that we can still perform the Jaccard similarity on the compressed representations.

Let's see how it works with a reference to two of the most prolific raters in the MovieLens dataset. Users 405 and 655 have rated 727 and 675 movies respectively. In the following code, we extract their ratings and convert them into sets before passing to Incanter's jaccard-index function. Recall ...

Get Clojure for Data Science now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.