Building stories

Simhash should be used to detect near-duplicate articles only. Extending the search to a 3-bit or 4-bit difference becomes terribly inefficient: detecting differences of up to 3 bits requires 5,488 distinct queries to Cassandra, and detecting differences of up to 4 bits requires 41,448. It also seems to bring in far more noise than related articles. A user who wants to build larger stories should apply a conventional clustering technique instead.
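The query counts quoted above can be checked with a short calculation. The sketch below assumes a 32-bit Simhash (the hash width is not stated in this passage, but 32 bits is the width that reproduces both figures) and counts every hash that differs from a given one in at least one and at most k bit positions. The function name and its parameters are illustrative only.

```python
from math import comb


def simhash_query_count(max_bits: int, hash_size: int = 32) -> int:
    """Count the hashes that differ from a given hash in 1..max_bits positions.

    Assumption: each candidate hash costs one exact-match query, and the
    hash is hash_size bits wide (32 here, which is an assumption).
    """
    return sum(comb(hash_size, k) for k in range(1, max_bits + 1))


for bits in (1, 2, 3, 4):
    print(f"up to {bits}-bit difference: {simhash_query_count(bits):,} queries")
```

Under these assumptions the count grows roughly by an order of magnitude with each extra bit, which is why a wider search quickly stops being practical.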

Building term frequency vectors

We will start grouping events into stories using a KMeans algorithm, taking the articles' word frequencies as input vectors. TF-IDF is a simple, efficient, and proven technique for building vectors out of text content. The basic idea is to compute a word frequency that we normalize ...
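As a rough illustration of turning articles into term frequency vectors, here is a minimal sketch in plain Python. The passage is cut off before it states the exact normalization, so the weighting used here (term frequency divided by document length, multiplied by log(N / document frequency)) is the common textbook variant and an assumption, not necessarily the one the book goes on to describe. The function name and the sample articles are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumption: weight = (count / document length) * log(N / doc frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical, already tokenized articles.
articles = [
    "police report shooting in paris".split(),
    "paris police confirm shooting".split(),
    "markets rally after rate cut".split(),
]
vocab, vectors = tf_idf_vectors(articles)
```

Each article becomes a vector with one dimension per vocabulary term. Terms shared by many articles receive low weights, while terms specific to one article dominate its vector, and vectors of this shape are what a KMeans algorithm would take as input.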
