Chapter 7. The Workflow Abstraction

Key Insights

Thus far, we have looked at several examples of how to use Cascading. Now let’s step back a bit and take a look at some of the theory at its foundation.

The author of Cascading, Chris Wensel, was working at a large firm well known for its data products. He was evaluating the Nutch project, which built on Lucene and from which Hadoop later emerged, to see how these open source technologies could be leveraged for Big Data within an Enterprise environment. His takeaway was that it would be difficult to find enough Java developers who could write complex Enterprise apps directly in MapReduce.

An obvious response would have been to build some kind of abstraction layer atop Hadoop. Many variations on this have appeared over the years, and the approach itself dates back to the "fourth-generation languages" (4GLs) of the 1970s. However, another of Wensel's takeaways from the early days of Apache Hadoop was that abstraction layers built by and for early adopters typically would not pass the "bench test" for Enterprise use. The operational complexity of large-scale apps and the need to leverage many existing software engineering practices would be difficult, if not impossible, to manage through a 4GL-styled abstraction layer.

A key insight into this problem was that MapReduce is based on the functional programming paradigm. In the original MapReduce paper by Jeffrey Dean and Sanjay Ghemawat at Google, the authors made clear that ...
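To make that functional framing concrete, here is a minimal word-count sketch written in plain Java streams rather than Cascading or Hadoop code; the class and variable names are illustrative only. The "map" step is a pure function that emits key/value pairs from each input line, and the "reduce" step folds together the values that share a key, which is the same shape MapReduce gives a distributed job.

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the functional structure behind MapReduce:
// a pure map step emits (word, 1) pairs, a reduce step sums per key.
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");

        Map<String, Long> counts = lines.stream()
            // "map" phase: each line becomes a stream of (word, 1) pairs
            .flatMap(line -> Arrays.stream(line.split("\\s+"))
                .map(word -> new SimpleEntry<>(word, 1L)))
            // "shuffle" + "reduce" phase: group pairs by key and sum the values
            .collect(Collectors.groupingBy(SimpleEntry::getKey,
                     Collectors.summingLong(SimpleEntry::getValue)));

        System.out.println(counts);
    }
}

Because both steps are side-effect-free functions over immutable data, the runtime is free to partition the input and run them in parallel, which is precisely the property Dean and Ghemawat's design exploits.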
