Chapter 6. Hadoop Ecosystem Integration
The previous chapters described the various use cases where Sqoop enables highly efficient data transfers between Hadoop and relational databases. This chapter will focus on integrating Sqoop with the rest of the Hadoop ecosystem: we will show you how to run Sqoop from within a specialized Hadoop scheduler named Oozie and how to load your data into Hadoop’s data warehouse system, Apache Hive, and Hadoop’s database, Apache HBase.
Scheduling Sqoop Jobs with Oozie
Problem
You are using Oozie in your environment to schedule Hadoop jobs and would like to call Sqoop from within your existing workflows.
Solution
Oozie includes special Sqoop actions that you can use to call Sqoop in your workflow. For example:
<workflow-app name="sqoop-workflow" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="sqoop-action">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>foo:8021</job-tracker>
      <name-node>bar:8020</name-node>
      <command>import --table cities --connect ...</command>
    </sqoop>
    <ok to="next"/>
    <error to="error"/>
  </action>
  ...
</workflow-app>
Discussion
Starting with version 3.2.0, Oozie has built-in support for Sqoop. You can use the special Sqoop action type in the same way you would use a MapReduce action. You have two options for specifying Sqoop parameters. The first option is to use a single <command> tag, listing all the parameters inside it, for example:
<command>import --table cities --username sqoop --password sqoop ...</command>
In this case, Oozie will take ...
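The second option mentioned above is to pass each parameter as its own <arg> element instead of one space-separated <command> string. A sketch of what that might look like for the same cities import (the connect string and credentials here are placeholders):

```xml
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  <job-tracker>foo:8021</job-tracker>
  <name-node>bar:8020</name-node>
  <!-- Each argument lives in its own <arg> element, so values
       containing spaces are passed to Sqoop unsplit. -->
  <arg>import</arg>
  <arg>--table</arg>
  <arg>cities</arg>
  <arg>--username</arg>
  <arg>sqoop</arg>
  <arg>--password</arg>
  <arg>sqoop</arg>
</sqoop>
```

Because each <arg> element is delivered to Sqoop as one argument verbatim, this form is useful when a parameter value itself contains whitespace, which a single <command> string cannot express unambiguously.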