Chapter 20. Hive Integration with Oozie
Apache Oozie is a workload scheduler for Hadoop: http://incubator.apache.org/oozie/.
You may have noticed that Hive has its own internal workflow system. Hive converts a query into one or more stages, such as a MapReduce stage or a move task stage. If a stage fails, Hive cleans up the process and reports the errors. If a stage succeeds, Hive executes the subsequent stages until the entire job is done. Multiple Hive statements can also be placed inside an HQL file, and Hive will execute each query in sequence until the file is completely processed.
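The stage-by-stage behavior described above can be pictured with a small toy model. This is only a sketch of the control flow, not Hive's implementation: the `run_stages` function and the stage names are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, action) pair. Both are hypothetical stand-ins
# for the stages Hive plans for a query.
Stage = Tuple[str, Callable[[], None]]


def run_stages(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed, plus an error
    message if a stage failed (None if every stage succeeded).
    """
    completed: List[str] = []
    for name, action in stages:
        try:
            action()
        except Exception as exc:
            # A failed stage ends the job; later stages never run.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


def failing_move() -> None:
    raise RuntimeError("destination not writable")


done, error = run_stages([
    ("mapreduce_stage", lambda: None),
    ("move_task_stage", failing_move),
    ("stats_stage", lambda: None),
])
print(done, error)
```

The point of the model is the ordering rule: a stage runs only when every stage before it has succeeded.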
Hive’s system of workflow management is excellent for single jobs or jobs that run one after the next. Some workflows need more than this. For example, a user may want a process in which step one is a custom MapReduce job, step two processes the output of step one using Hive, and step three copies the output of step two to a remote cluster. These kinds of workflows are candidates for management as Oozie Workflows.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. An important feature of Oozie is that the state of the workflow is detached from the client that launches the job. This detached (fire and forget) job launching is useful; normally a Hive job is attached to the console that submitted it. If that console dies, the job is left half complete. ...
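To make the DAG idea concrete, the three-step process described earlier can be written as a dependency graph and ordered with a topological sort. This is a conceptual sketch using Python's standard `graphlib` module (Python 3.9 or later), not Oozie's own workflow definition format; the action names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each hypothetical action maps to the set of actions it depends on.
workflow = {
    "custom_mapreduce": set(),
    "hive_processing": {"custom_mapreduce"},
    "copy_to_remote_cluster": {"hive_processing"},
}


def execution_order(dag):
    """Return one valid run order for the actions in the graph.

    TopologicalSorter raises graphlib.CycleError if the graph
    contains a cycle, which is why a workflow must be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Because the graph must be acyclic, every action has a well-defined set of predecessors that must finish before it can start.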