Implementing TF-IDF in Hadoop
TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand.
Before you begin, ...