In the previous example, we looked at extending pipe assemblies in Cascading workflows.
Functionally, Example 4: Replicated Joins is only a few changes away from implementing an algorithm called term frequency–inverse document frequency (TF-IDF). This is the basis for many search indexing metrics, such as those in the popular open source search engine Apache Lucene. See the Lucene documentation for a great discussion of the algorithm and its use.
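To make the metric concrete before turning to Cascading, here is a minimal plain-Java sketch of the calculation on a toy corpus. It is not the Cascading subassembly. The class name `TfIdfSketch`, its helper methods, the sample tokens, and the specific weighting `tf * ln(nDocs / (1 + df))` are all assumptions chosen for illustration; TF-IDF has several common variants.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of TF-IDF; hypothetical names, no Cascading involved.
final class TfIdfSketch {

  // Assumed weighting: tf * ln(nDocs / (1 + df)). Other variants exist.
  static double tfidf(long tfCount, long dfCount, long nDocs) {
    return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
  }

  // Term frequency: how often each token occurs within one document.
  static Map<String, Long> termCounts(List<String> tokens) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (String token : tokens) {
      counts.merge(token, 1L, Long::sum);
    }
    return counts;
  }

  // Document frequency: how many documents contain each token at least once.
  static Map<String, Long> documentCounts(List<List<String>> docs) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (List<String> doc : docs) {
      // De-duplicate so a token counts once per document.
      for (String token : new HashSet<>(doc)) {
        counts.merge(token, 1L, Long::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Hypothetical corpus, assumed to be tokenized and stop-word filtered already.
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("rain", "shadow", "rain"),
        Arrays.asList("rain", "lee"),
        Arrays.asList("dry", "air"),
        Arrays.asList("desert", "dry"));

    Map<String, Long> df = documentCounts(docs);

    // Emit one (doc, token, weight) row per distinct token in each document.
    for (int docId = 0; docId < docs.size(); docId++) {
      for (Map.Entry<String, Long> tf : termCounts(docs.get(docId)).entrySet()) {
        double weight = tfidf(tf.getValue(), df.get(tf.getKey()), docs.size());
        System.out.printf("doc%d\t%s\t%.4f%n", docId, tf.getKey(), weight);
      }
    }
  }
}
```

The three quantities the sketch needs (a per-document term count, a per-token document count, and the total number of documents) are the same three inputs any TF-IDF implementation has to produce before the final weight can be calculated.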
For this example, let’s show how to implement TF-IDF in Cascading, which is a useful subassembly to reuse in a variety of apps. Figure 3-1 shows a conceptual diagram for this. With a more complex app to work with, we’ll also begin to examine Cascading features for testing at scale.
Starting from the source code directory that you cloned in Git, connect into the part5 subdirectory. First let’s add another sink tap to write the TF-IDF weights:
Next we’ll modify the existing pipe assemblies for Word Count, beginning immediately after the join used as a “stop words” filter.
We add the following line to retain only the token fields in the output tuple stream, based on the fieldSelector parameter: ...