Story mutation

We now have enough material to enter the heart of the subject. We were able to detect near-duplicate events and group similar articles within a story. In this section, we will be working in real time (on a Spark Streaming context), listening for news articles, grouping them into stories, but also looking at how these stories may change over time. We appreciate that the number of stories is undefined as we do not know in advance what events may arise in the coming days. As optimizing KMeans for each batch interval (15 mn in GDELT) would not be ideal, neither would it be efficient, we decided to take this constraint not as a limiting factor but really as an advantage in the detection of breaking news articles.

The Equilibrium state ...

Get Mastering Spark for Data Science now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.