Enterprise Data Workflows with Cascading by Paco Nathan

Chapter 2. Extending Pipe Assemblies

Example 3: Customized Operations

Cascading provides a wide range of built-in operations to perform on workflows. For many apps, the Cascading API is more than sufficient. However, you may run into cases where a slightly different transformation is needed. Each of the Cascading operations can be extended by subclassing in Java. Let’s extend the Cascading app from Example 2: The Ubiquitous Word Count to show how to customize an operation.

Modifying a conceptual flow diagram is a good way to add new requirements for a Cascading app. Figure 2-1 shows how this iteration of Word Count can be modified to clean up the token stream. A new class for this example will go right after the Tokenize operation so that it can scrub each tuple. In terms of Cascading patterns, this operation needs to be used in an Each operator, so we must implement it as a Function.

Figure 2-1. Conceptual flow diagram for Example 3: Customized Operations
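To make the Each/Function pattern concrete before turning to the Cascading classes, here is a minimal plain-Java sketch. It deliberately uses no Cascading API: the names ScrubSketch, scrub, scrubFunction, and each are hypothetical, and the scrub rule itself (trim whitespace, lowercase, drop empty tokens) is an assumption about what "cleaning up the token stream" might mean, not the book's implementation. It only illustrates the shape of the pattern: a function is invoked once per tuple and emits zero or more results to a collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Plain-Java stand-in for the Each/Function pattern; none of these names are Cascading API.
class ScrubSketch {

    // Assumed scrub rule: trim surrounding whitespace and lowercase the token.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Plays the role of a Function: called once per token, emits zero or one result.
    static void scrubFunction(String token, Consumer<String> collector) {
        String cleaned = scrub(token);
        if (!cleaned.isEmpty()) {
            collector.accept(cleaned);
        }
    }

    // Plays the role of an Each operator: applies the function to every token in the stream.
    static List<String> each(List<String> tokens,
                             BiConsumer<String, Consumer<String>> function) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            function.accept(token, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(" Rain ", "SHADOW", "   ", "lee");
        // prints [rain, shadow, lee]
        System.out.println(each(tokens, ScrubSketch::scrubFunction));
    }
}
```

Because the function receives a collector rather than returning a value, it can drop a token entirely (the whitespace-only entry above), which is the kind of filtering a scrub step placed right after tokenization would likely need.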

Starting from the source code directory that you cloned in Git, change into the part3 subdirectory. We’ll define a new class called ScrubFunction as our custom operation, which subclasses BaseOperation and implements the Function interface:

public class ScrubFunction extends BaseOperation implements Function { ... }

Next, we need to define a constructor, which specifies how this function consumes from the tuple stream:

public ...
