Chapter 6. Orchestration

Workflow orchestration, sometimes called workflow automation or business process automation, is the scheduling, coordination, and management of workflows. A workflow is a sequence of data processing actions. A system that performs orchestration is called an orchestration framework or workflow automation framework.

Workflow orchestration is an important but often neglected part of application architectures. It matters especially in Hadoop because many applications are developed as MapReduce jobs, and a single MapReduce job is very limited in what it can do. However, even as the Hadoop toolset evolves and more flexible processing frameworks like Spark gain prominence, there are benefits to breaking down complex processing workflows into reusable components and using an external engine to handle the details involved in stitching them together.

Why We Need Workflow Orchestration

Developing end-to-end applications with Hadoop usually involves several steps to process the data. You may want to use Sqoop to retrieve data from a relational database and import it into Hadoop, then run a MapReduce job to validate some data constraints and convert the data into a more suitable format. Then, you may execute a few Hive jobs to aggregate and analyze the data, or if the analysis is particularly involved, there may be additional MapReduce steps.
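To make the shape of such a pipeline concrete, here is a minimal sketch in Python of the steps just described, modeled as named actions with dependencies between them. It is an illustration only and not how any particular orchestration framework is implemented: the action names, the placeholder functions, and the run_workflow helper are all hypothetical, and the functions stand in for real Sqoop, MapReduce, and Hive jobs rather than launching them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical placeholder actions. In a real pipeline each one would
# launch an external job; here each just returns a description string.
def sqoop_import():
    return "imported tables from the relational database"

def validate_and_convert():
    return "validated constraints and converted the file format"

def hive_aggregate():
    return "aggregated the converted data"

ACTIONS = {
    "sqoop_import": sqoop_import,
    "validate_and_convert": validate_and_convert,
    "hive_aggregate": hive_aggregate,
}

# Each key lists the actions that must finish before it may start.
DEPENDENCIES = {
    "validate_and_convert": {"sqoop_import"},
    "hive_aggregate": {"validate_and_convert"},
}

def run_workflow(actions, dependencies):
    """Run every action after its prerequisites; return the names in run order."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()  # an exception here stops the rest of the workflow
        executed.append(name)
    return executed

if __name__ == "__main__":
    for step in run_workflow(ACTIONS, DEPENDENCIES):
        print(step)
```

Even this toy version has to decide in what order actions run and what happens when one of them fails. A real orchestration engine handles those concerns so that each action can stay a small, reusable component.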

Each of these jobs can be referred to as an action. These actions have to be scheduled, ...
