Enterprise Data Workflows with Cascading by Paco Nathan

Chapter 3. Test-Driven Development

Example 5: TF-IDF Implementation

In the previous example, we looked at extending pipe assemblies in Cascading workflows. Functionally, Example 4: Replicated Joins is only a few changes away from implementing an algorithm called term frequency–inverse document frequency (TF-IDF). TF-IDF is the basis for many search indexing metrics, such as those used in the popular open source search engine library Apache Lucene. See the Similarity class in Lucene for a great discussion of the algorithm and its use.
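To make the metric concrete before moving to Cascading, here is a minimal plain-Java sketch of one common TF-IDF weighting, tf × ln(N / (1 + df)), where tf is the count of a term in one document, df is the number of documents containing the term, and N is the number of documents. The class name `TfIdfSketch`, its method names, and the toy corpus are hypothetical and chosen only for illustration; the exact weighting is an assumption, since several variants of TF-IDF are in use, and this is not the Cascading implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Cascading example code.
class TfIdfSketch {

    // Term frequency: how many times `term` occurs in a single document.
    static long termFrequency(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // Document frequency: how many documents contain `term` at least once.
    static long documentFrequency(List<List<String>> corpus, String term) {
        return corpus.stream().filter(d -> d.contains(term)).count();
    }

    // Assumed weighting: tf * ln(N / (1 + df)). Other variants exist.
    static double tfIdf(List<List<String>> corpus, List<String> doc, String term) {
        long tf = termFrequency(doc, term);
        long df = documentFrequency(corpus, term);
        return tf * Math.log((double) corpus.size() / (1.0 + df));
    }

    public static void main(String[] args) {
        // Toy corpus of already-tokenized documents (made-up data).
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("rain", "shadow", "rain"),
            Arrays.asList("rain", "desert"),
            Arrays.asList("desert", "dry"),
            Arrays.asList("lee", "slope"));

        List<String> doc = corpus.get(0);
        for (String term : Arrays.asList("rain", "shadow")) {
            System.out.println(term + "\t" + tfIdf(corpus, doc, term));
        }
    }
}
```

In this sketch, a term that occurs often in one document but in few documents overall receives a higher weight, while a term spread across most of the corpus is pushed toward zero.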

For this example, let’s implement TF-IDF in Cascading as a subassembly that can be reused in a variety of apps. Figure 3-1 shows a conceptual diagram of the flow. With a more complex app to work with, we’ll also begin to examine Cascading features for testing at scale.

Figure 3-1. Conceptual flow diagram for Example 5: TF-IDF Implementation

Starting from the source code directory that you cloned with Git, change into the part5 subdirectory. First, let’s add another sink tap to write the TF-IDF weights:

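// output path for the TF-IDF weights, taken from the fourth command-line argument
String tfidfPath = args[ 3 ];
// sink tap that writes tab-delimited text with a header line
Tap tfidfTap = new Hfs( new TextDelimited( true, "\t" ), tfidfPath );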

Next we’ll modify the existing pipe assemblies for Word Count, beginning immediately after the join used as a “stop words” filter. We add the following line to retain only the doc_id and token fields in the output tuple stream, based on the fieldSelector parameter: ...
