Chapter 2. Extending Pipe Assemblies

Example 3: Customized Operations

Cascading provides a wide range of built-in operations to perform on workflows. For many apps, the Cascading API is more than sufficient. However, you may run into cases where a slightly different transformation is needed. Each of the Cascading operations can be extended by subclassing in Java. Let’s extend the Cascading app from Example 2: The Ubiquitous Word Count to show how to customize an operation.

Modifying a conceptual flow diagram is a good way to add new requirements for a Cascading app. Figure 2-1 shows how this iteration of Word Count can be modified to clean up the token stream. A new class for this example will go right after the Tokenize operation so that it can scrub each tuple. In terms of Cascading patterns, this operation needs to be used in an Each operator, so we must implement it as a Function.

Figure 2-1. Conceptual flow diagram for Example 3: Customized Operations
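To make the scrubbing step concrete before turning to the Cascading classes, here is a minimal sketch in plain Java that uses no Cascading API at all. The class and method names (ScrubSketch, scrubText, scrubAll) are hypothetical, and the cleanup rule (trim whitespace, lowercase, drop empty tokens) is an assumption about what "scrubbing" might mean here, not the book's implementation. It only illustrates the shape of a per-tuple function that emits zero or one result for each input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ScrubSketch {
    // Assumed cleanup rule: strip surrounding whitespace and lowercase the token.
    public static String scrubText(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Applies the scrub to each token in turn and drops any token that is
    // empty afterward, so each input yields zero or one output.
    public static List<String> scrubAll(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String scrubbed = scrubText(token);
            if (!scrubbed.isEmpty()) {
                result.add(scrubbed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(" Rain", "SHADOW ", "  ", "rain");
        System.out.println(scrubAll(tokens)); // prints [rain, shadow, rain]
    }
}
```

In a Cascading app, the per-token logic would live inside the custom Function and the loop would be supplied by the Each operator, which calls the function once per tuple.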

Starting from the source code directory that you cloned with Git, change into the part3 subdirectory. We’ll define a new class called ScrubFunction as our custom operation, which extends BaseOperation and implements the Function interface:

public class ScrubFunction extends BaseOperation implements Function { ... }

Next, we need to define a constructor, which specifies how this function consumes from the tuple stream:

public ...
