Chapter 3. Test-Driven Development

Example 5: TF-IDF Implementation

In the previous example, we looked at extending pipe assemblies in Cascading workflows. Functionally, Example 4: Replicated Joins is only a few changes away from implementing an algorithm called term frequency–inverse document frequency (TF-IDF). TF-IDF is the basis for many search indexing metrics, including those used in the popular open source search engine Apache Lucene. See the Similarity class in Lucene for a great discussion of the algorithm and its use.
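To make the metric concrete before moving to pipes, here is a minimal sketch of TF-IDF in plain Java, with no Cascading involved. It assumes one common formulation, tfidf = tf * ln( n_docs / ( 1 + df ) ), where tf is the count of a token within a document, df is the number of documents containing the token, and n_docs is the total number of documents. The class name TfIdfSketch, the method name compute, and the in-memory map layout are hypothetical, chosen only for illustration; other TF-IDF variants use different smoothing and normalization.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Cascading.
final class TfIdfSketch {

    // Input: doc_id -> list of tokens. Output: doc_id -> (token -> weight).
    // Assumed formula: tfidf = tf * ln( n_docs / ( 1 + df ) )
    static Map<String, Map<String, Double>> compute(Map<String, List<String>> docs) {
        int nDocs = docs.size();

        // Document frequency: the number of documents containing each token.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> tokens : docs.values()) {
            Set<String> distinct = new HashSet<>(tokens);
            for (String token : distinct) {
                df.merge(token, 1, Integer::sum);
            }
        }

        Map<String, Map<String, Double>> weights = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Term frequency: occurrences of each token within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String token : doc.getValue()) {
                tf.merge(token, 1, Integer::sum);
            }

            Map<String, Double> docWeights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) nDocs / (1.0 + df.get(e.getKey())));
                docWeights.put(e.getKey(), e.getValue() * idf);
            }
            weights.put(doc.getKey(), docWeights);
        }
        return weights;
    }
}
```

Note the three intermediate quantities: per-document token counts, per-token document counts, and the total number of documents. Under the smoothing assumed here, a token that appears in every document gets a slightly negative weight, which is one reason a stop words filter ahead of this step is helpful.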

For this example, let’s show how to implement TF-IDF in Cascading, which is a useful subassembly to reuse in a variety of apps. Figure 3-1 shows a conceptual diagram for this. Now that we have a more complex app to work with, we’ll begin to examine Cascading features for testing at scale.


Figure 3-1. Conceptual flow diagram for Example 5: TF-IDF Implementation

Starting from the source code directory that you cloned in Git, change into the part5 subdirectory. First, let’s add another sink tap to write the TF-IDF weights:

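// the fourth command-line argument gives the output path for the TF-IDF weights
String tfidfPath = args[ 3 ];
// sink tap that writes tab-delimited text with a header line
Tap tfidfTap = new Hfs( new TextDelimited( true, "\t" ), tfidfPath );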

Next we’ll modify the existing pipe assemblies for Word Count, beginning immediately after the join used as a “stop words” filter. We add the following line to retain only the doc_id and token fields in the output tuple stream, based on the fieldSelector ...
