Deriving a clean stream of terms from the documents

This recipe consumes the URL stream, downloads the content of each document, and derives a clean stream of terms suitable for later analysis. A clean term is defined as a word that:

  • Is not a stop word
  • Is a valid dictionary word
  • Is not a number or URL
  • Is a lemma

A lemma is the canonical form of a word; for example, run, runs, ran, and running are forms of the same lexeme with "run" as the lemma. Lexeme, in this context, refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen by convention to represent the lexeme.
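The four criteria above can be sketched as a small filter. This is a minimal illustration, not the recipe's implementation: the class name `CleanTermFilter`, its methods, and the tiny stop-word, dictionary, and lemma tables are all hypothetical stand-ins for the full word lists and the lemmatizer a real topology would use. The sketch also assumes that an inflected form is reduced to its lemma rather than discarded.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: not part of Storm or of the recipe's source code.
final class CleanTermFilter {

    // Toy data for illustration only; real lists would be loaded from files.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "is", "and", "of");
    private static final Set<String> DICTIONARY = Set.of("run", "storm", "term", "the", "and");
    private static final Map<String, String> LEMMAS = Map.of(
            "runs", "run",
            "ran", "run",
            "running", "run",
            "terms", "term");

    // Returns the lemma of the token if it qualifies as a clean term, otherwise empty.
    static Optional<String> toCleanTerm(String token) {
        String word = token.trim().toLowerCase(Locale.ROOT);
        if (word.isEmpty() || looksLikeNumber(word) || looksLikeUrl(word)) {
            return Optional.empty();
        }
        // Assumption: map each inflected form to its lemma; unknown forms map to themselves.
        String lemma = LEMMAS.getOrDefault(word, word);
        if (STOP_WORDS.contains(lemma) || !DICTIONARY.contains(lemma)) {
            return Optional.empty();
        }
        return Optional.of(lemma);
    }

    // Applies the filter to a list of raw tokens, keeping order and duplicates.
    static List<String> cleanTerms(List<String> tokens) {
        return tokens.stream()
                .map(CleanTermFilter::toCleanTerm)
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }

    // Matches plain integers and simple decimals such as 42, -7 or 3.14.
    private static boolean looksLikeNumber(String word) {
        return word.matches("[+-]?\\d+([.,]\\d+)*");
    }

    // Matches tokens that start with a URI scheme (http://, ftp://) or with "www.".
    private static boolean looksLikeUrl(String word) {
        return word.matches("(?:[a-z][a-z0-9+.-]*://|www\\.).*");
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "The", "storm", "runs", "42", "http://example.com", "running", "xyzzy");
        System.out.println(cleanTerms(tokens));
    }
}
```

Because runs and running both reduce to the lemma "run", the two tokens come out as the same term, which is what allows later steps to treat them as one.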

The lemma is important for this recipe because it enables us to group terms that have the same meaning. Where their frequency ...
