Calculating the relative importance of each term

This recipe demonstrates the true power of Trident, using many of its abstractions to calculate the TF-IDF value. Before presenting the recipe, it is important to understand the simple math behind TF-IDF. The calculation needs the following components:

  • tf(t,d): The term frequency, that is, the number of times a given term (t) appears in a given document (d).
  • df(t): The document frequency, that is, the number of documents in which a given term (t) appears.
  • D: The document count, that is, the total number of documents.
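
To make the three components concrete, here is a minimal sketch in plain Java (no Trident involved) that computes each of them over a tiny in-memory corpus. It is only an illustration under stated assumptions: the class name TfIdfSketch, its method names, and the sample corpus are hypothetical, the term frequency is taken as a raw count, and the weighting tf(t,d) * ln(D / (1 + df(t))) is one common formulation that may differ from the one this recipe settles on.

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {

    // tf(t,d): raw count of term t in document d (an assumed definition of tf).
    public static long termFrequency(String term, List<String> document) {
        return document.stream().filter(term::equals).count();
    }

    // df(t): number of documents in the corpus that contain term t at least once.
    public static long documentFrequency(String term, List<List<String>> corpus) {
        return corpus.stream().filter(doc -> doc.contains(term)).count();
    }

    // Assumed weighting: tf * ln(D / (1 + df)). The +1 avoids division by zero
    // for a term that appears in no document.
    public static double tfIdf(long tf, long df, long documentCount) {
        return tf * Math.log((double) documentCount / (1.0 + df));
    }

    public static void main(String[] args) {
        // Hypothetical corpus: each document is already split into terms.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("storm", "trident", "storm"),
                Arrays.asList("trident", "topology"),
                Arrays.asList("kafka", "spout"),
                Arrays.asList("bolt", "tuple"));

        String term = "storm";
        long tf = termFrequency(term, corpus.get(0)); // occurrences in the first document
        long df = documentFrequency(term, corpus);    // documents containing the term
        long d = corpus.size();                       // D, the document count

        System.out.printf("tf=%d df=%d D=%d tf-idf=%.4f%n", tf, df, d, tfIdf(tf, df, d));
    }
}
```

A term that is frequent in one document but rare across the corpus gets a high score, while a term that appears in most documents is pushed toward zero (or below it, with this particular smoothing).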

There are many ways to calculate the term frequency; for this recipe, we will ...
