Chapter 3. Test-Driven Development
Example 5: TF-IDF Implementation
In the previous example, we looked at extending pipe assemblies in Cascading workflows.
Functionally, Example 4: Replicated Joins is only a few changes away from implementing an algorithm called term frequency–inverse document frequency (TF-IDF).
This is the basis for many search indexing metrics, such as those used in the popular open source search engine Apache Lucene.
See the Similarity class in Lucene for a great discussion of the algorithm and its use.
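To make the metric concrete before building it as a pipe assembly, here is a minimal sketch in plain Java rather than Cascading. It assumes one common smoothed variant of the weight, tf × ln(N / (1 + df)), where tf is the count of a token within a document, df is the number of documents containing the token, and N is the total number of documents. The class name TfIdfSketch, its method names, and the sample documents are hypothetical, chosen only for illustration; Lucene's actual scoring includes additional factors not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {
    // Assumed variant: tf * ln(nDocs / (1 + df)). Other formulations exist.
    public static double tfidf(long tfCount, long dfCount, long nDocs) {
        return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }

    // Maps each doc_id to its (token -> weight) table.
    public static Map<String, Map<String, Double>> weights(Map<String, List<String>> docs) {
        int nDocs = docs.size();

        // Document frequency: count each token at most once per document.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> tokens : docs.values()) {
            for (String token : new HashSet<>(tokens)) {
                df.merge(token, 1, Integer::sum);
            }
        }

        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Term frequency: raw count of each token within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String token : doc.getValue()) {
                tf.merge(token, 1, Integer::sum);
            }

            Map<String, Double> docWeights = new TreeMap<>();
            for (Map.Entry<String, Integer> term : tf.entrySet()) {
                String token = term.getKey();
                docWeights.put(token, tfidf(term.getValue(), df.get(token), nDocs));
            }
            result.put(doc.getKey(), docWeights);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: doc_id -> tokens, stop words already removed.
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc01", Arrays.asList("rain", "shadow", "rain"));
        docs.put("doc02", Arrays.asList("rain", "lee"));
        docs.put("doc03", Arrays.asList("dry", "lee"));
        docs.put("doc04", Arrays.asList("dry", "valley"));
        System.out.println(weights(docs));
    }
}
```

With this variant, a token that appears in nearly every document gets a weight near zero or below it, while a token concentrated in a few documents scores higher. The three counts involved (tf, df, and N) are the kind of per-token and per-document aggregates that a workflow has to compute and join to produce the weights.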
For this example, let’s show how to implement TF-IDF in Cascading, which is a useful subassembly to reuse in a variety of apps. Figure 3-1 shows a conceptual diagram for this. Now that we have a more complex app to work with, we’ll begin to examine Cascading features for testing at scale.
Figure 3-1. Conceptual flow diagram for Example 5: TF-IDF Implementation
Starting from the source code directory that you cloned in Git, change into the part5 subdirectory. First, let’s add another sink tap to write the TF-IDF weights:
String tfidfPath = args[ 3 ];
Tap tfidfTap = new Hfs( new TextDelimited( true, "\t" ), tfidfPath );
Next we’ll modify the existing pipe assemblies for Word Count, beginning immediately after the join used as a “stop words” filter.
We add the following line to retain only the doc_id and token fields in the output tuple stream, based on the fieldSelector parameter: ...