This chapter will guide you through the basic prerequisites of using Sqoop. You will learn how to download and install Sqoop on your computer or on any node of your Hadoop cluster. Sqoop comes with a very detailed User Guide describing all the available parameters and basic usage. Rather than repeating the guide, this book focuses on applying the parameters to real use cases and helping you to deploy and use Sqoop effectively in your environment.
Sqoop supports the Linux operating system, and there are several installation options. One option is the source tarball that is provided with every release. This tarball contains only the source code of the project. You can’t use it directly and will need to first compile the sources into binary executables. For your convenience, the Sqoop community provides a binary tarball for each major supported version of Hadoop along with the source tarball.
In addition to the tarballs, there are open source projects and commercial companies that provide operating system-specific packages. One such project, Apache Bigtop, provides
rpm packages for Red Hat, CentOS, SUSE, and
deb packages for Ubuntu and Debian. The biggest benefit of using packages over tarballs is their seamless integration with the operating system: for example, configuration files are stored in
/etc/ and logs in
This book focuses on using Sqoop rather than developing for it. If you prefer to compile the source tarball into binaries yourself, the Developer’s Guide is a good resource.
You can download the binary tarballs from the Apache Sqoop website. All binary tarballs contain a
.bin__hadoop string embedded in their name, followed by the Apache Hadoop major version that was used to generate them. For Hadoop 1.x, the archive name will include the string
.bin__hadoop-1.0.0. While the naming convention suggests this tarball only works with version 1.0.0, in fact, it’s fully compatible not only with the entire 1.0.x release branch but also with version 1.1.0. It’s very important to download the binary tarball created for your Hadoop version. Hadoop has changed internal interfaces between some of the major versions; therefore, using a Sqoop tarball that was compiled against Hadoop version 1.x with, say, Hadoop version 2.x, will not work.
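The naming convention above can be checked mechanically before you download or unpack anything. The following is a minimal sketch, not part of Sqoop: the function name `hadoop_version_of_tarball` and the example file name are assumptions made for illustration, and the sketch assumes the archive ends in `.tar.gz`.

```shell
# Print the Hadoop version embedded in a Sqoop binary tarball name.
# The function name and the file name below are illustrative assumptions.
hadoop_version_of_tarball() {
  name=${1##*/}                       # drop any leading directory
  case $name in
    *.bin__hadoop-*) ;;               # looks like a binary tarball
    *) echo "not a Sqoop binary tarball: $name" >&2; return 1 ;;
  esac
  version=${name##*.bin__hadoop-}     # keep what follows the marker
  version=${version%.tar.gz}          # drop the archive extension
  printf '%s\n' "$version"
}

hadoop_version_of_tarball "sqoop-1.4.3.bin__hadoop-1.0.0.tar.gz"
```

Comparing the printed version with the Hadoop version on your machine is a quick way to catch a mismatched download early.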
To install Sqoop, download the binary tarball to any machine from which you want to run Sqoop and extract the archive. You can use Sqoop directly from within the extracted directory without any additional steps. As Sqoop is not a cluster service, you do not need to install it on all the nodes in your cluster. Having the installation available on a single machine is sufficient. As a Hadoop application, Sqoop requires that the Hadoop libraries and configurations be available on that machine. Hadoop installation instructions can be found in the Hadoop project documentation. If you want to import your data into HBase or Hive, Sqoop will also need those libraries. For common functionality, these dependencies are not mandatory.
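The extraction step can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `extract_sqoop_tarball` is hypothetical, and the sketch assumes the archive contains a single top-level directory whose entries are not prefixed with `./`.

```shell
# Extract a Sqoop binary tarball into a target directory and print the
# path of the extracted top-level directory.
# The function name is an illustrative assumption, not a Sqoop tool.
extract_sqoop_tarball() {
  tarball=$1
  dest=$2
  mkdir -p "$dest" || return 1
  tar -xzf "$tarball" -C "$dest" || return 1
  # The first archive entry is assumed to name the top-level directory.
  top=$(tar -tzf "$tarball" | head -n 1)
  printf '%s/%s\n' "$dest" "${top%%/*}"
}

# Hypothetical usage (the download path is an assumption):
#   SQOOP_HOME=$(extract_sqoop_tarball "$HOME/Downloads/<tarball>" "$HOME/opt")
#   "$SQOOP_HOME/bin/sqoop" help
```

Keeping the printed path in a variable such as `SQOOP_HOME` makes later steps, like copying JDBC drivers, easier to script.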
Installing packages is simpler than using tarballs. They are already integrated with the operating system and will automatically download and install most of the required dependencies during the Sqoop installation. Due to licensing, the JDBC drivers won’t be installed automatically. For those instructions, check out the section Installing JDBC Drivers.
Bigtop provides repositories that can be easily added into your system in order to find and install the dependencies. Bigtop installation instructions can be found in the Bigtop project documentation. Once Bigtop is successfully deployed, installing Sqoop is very simple and can be done with the following commands:
To install Sqoop on a Red Hat, CentOS, or other
$ sudo yum install sqoop
To install Sqoop on an Ubuntu, Debian, or other
$ sudo apt-get install sqoop
- To install Sqoop on a SLES system:
$ sudo zypper install sqoop
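If you maintain machines with different distributions, the three commands above can be selected automatically. The sketch below only prints the matching command rather than running it; the function names `sqoop_install_command` and `detect_package_manager` are assumptions made for this example, and it presumes the Bigtop repositories are already configured.

```shell
# Print the install command that matches a package manager name.
# Function names are illustrative assumptions.
sqoop_install_command() {
  case $1 in
    yum)     echo "sudo yum install sqoop" ;;
    apt-get) echo "sudo apt-get install sqoop" ;;
    zypper)  echo "sudo zypper install sqoop" ;;
    *)       echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Report the first supported package manager found on this machine.
detect_package_manager() {
  for pm in yum apt-get zypper; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  return 1
}

if pm=$(detect_package_manager); then
  sqoop_install_command "$pm"
else
  echo "no supported package manager found" >&2
fi
```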
Sqoop’s main configuration file
sqoop-site.xml is available in the configuration directory (
conf/ when using the tarball or
/etc/sqoop/conf when using Bigtop packages). While you can further customize Sqoop, the defaults will suffice in a majority of cases. All available properties are documented in the
sqoop-site.xml file. We will explain the more commonly used properties in greater detail later in the book.
Sqoop requires the JDBC drivers for your specific database server (MySQL, Oracle, etc.) in order to transfer data. They are not bundled in the tarball or packages.
You need to download the JDBC drivers and then install them into Sqoop. JDBC drivers are usually available free of charge from the database vendors’ websites. Some enterprise data stores might bundle the driver with the installation itself. After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s
lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the
lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the
JDBC is a Java specific database-vendor independent interface for accessing relational databases and enterprise data warehouses. Upon this generic interface, each database vendor must implement a compliant driver containing required functionality. Due to licensing, the Sqoop project can’t bundle the drivers in the distribution. You will need to download and install each driver individually.
Each database vendor has a slightly different method for retrieving the JDBC driver. Most of them make it available as a free download from their websites. Please contact your database administrator if you are not sure how to retrieve the driver.
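For a tarball installation, copying a driver into Sqoop's lib/ directory might look like the following sketch. The function name `install_jdbc_driver` and the example JAR name are assumptions for illustration; use the file name your database vendor actually ships.

```shell
# Copy a JDBC driver JAR into the lib/ directory of a Sqoop installation
# and print the resulting path. The function name is an assumption.
install_jdbc_driver() {
  jar=$1
  sqoop_home=$2
  [ -f "$jar" ] || { echo "driver JAR not found: $jar" >&2; return 1; }
  [ -d "$sqoop_home/lib" ] || { echo "no lib/ directory under: $sqoop_home" >&2; return 1; }
  cp "$jar" "$sqoop_home/lib/" || return 1
  printf '%s/lib/%s\n' "$sqoop_home" "${jar##*/}"
}

# Hypothetical usage (both paths are assumptions):
#   install_jdbc_driver "$HOME/Downloads/mysql-connector-java.jar" "$SQOOP_HOME"
```

The two checks fail early with a readable message, which helps when the driver was downloaded under a different name than expected.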
Some database systems provide special connectors, which are not part of the Sqoop distribution, and these take advantage of advanced database features. If you want to take advantage of these optimizations, you will need to individually download and install those specialized connectors.
On the node running Sqoop, you can install the specialized connectors anywhere on the local filesystem. If you plan to run Sqoop from multiple nodes, you have to install the connector on all of those nodes. To be clear, you do not have to install the connector on all nodes in your cluster, as Sqoop will automatically propagate the appropriate JARs as needed throughout your cluster.
In addition to installing the connector JARs on the local filesystem, you also need to register them with Sqoop. First, create a directory
managers.d in the Sqoop configuration directory (if it does not exist already). The configuration directory might be in a different location, based on how you’ve installed Sqoop. With packages, it’s usually in the
/etc/sqoop directory, and with tarballs, it’s usually in the
conf/ directory. Then, inside this directory, you need to create a file (naming it after the connector is a recommended best practice) that contains the following line:
You can find the name of the fully qualified class in each connector’s documentation.
A significant strength of Sqoop is its ability to work with all major and minor database systems and enterprise data warehouses. To abstract the different behavior of each system, Sqoop introduced the concept of connectors: all database-specific operations are delegated from core Sqoop to the specialized connectors. Sqoop itself bundles many such connectors; you do not need to download anything extra in order to run Sqoop. The most general connector bundled with Sqoop is the Generic JDBC Connector that utilizes only the JDBC interface. This will work with every JDBC-compliant database system. In addition to this generic connector, Sqoop also ships with specialized connectors for MySQL, Oracle, PostgreSQL, Microsoft SQL Server, and DB2, which utilize special properties of each particular database system. You do not need to explicitly select the desired connector, as Sqoop will automatically do so based on your JDBC URL.
In addition to the built-in connectors, there are many specialized connectors available for download. Some of them are further described in this book. For example, OraOop is described in Faster Transfers with Oracle, and Cloudera Connector for Teradata is described in Using the Cloudera Teradata Connector. More advanced users can develop their own connectors by following the guidelines listed in the Sqoop Developer’s Guide.
Most, if not all, of the connectors depend on the underlying JDBC drivers in order to make the connection to the remote database server. It’s imperative to install both the specialized connector and the appropriate JDBC driver. It’s also important to distinguish the connector from the JDBC driver. The connector is a Sqoop-specific pluggable piece that is used to delegate some of the functionality that might be done faster when using database-specific tweaks. The JDBC driver is also a pluggable piece. However, it is independent of Sqoop and exposes database interfaces in a portable manner for all Java applications.
Sqoop always requires both the connector and the JDBC driver.
Sqoop is a command-line tool that can be called from any shell implementation such as
zsh. An example Sqoop command might look like the following (all parameters will be described later in the book):
The command-line interface has the following structure:
sqoop TOOL PROPERTY_ARGS SQOOP_ARGS
TOOL indicates the operation that you want to perform. The most important operations are
import for transferring data from a database to Hadoop and
export for transferring data from Hadoop to a database.
PROPERTY_ARGS are a special set of parameters that are entered as Java properties in the format
-Dname=value (examples appear later in the book). Property parameters are followed by
SQOOP_ARGS that contain all the various Sqoop parameters.
Mixing property and Sqoop parameters together is not allowed. Furthermore, all property parameters must precede all Sqoop parameters.
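The ordering rule can be checked before a command is run. The sketch below is a hypothetical helper, not part of Sqoop: it receives the arguments that follow TOOL and assumes that every property is written in the single-word form `-Dname=value`. The host name and property value in the example call are placeholders.

```shell
# Return 0 when every -Dname=value property precedes all other arguments.
# The function name is an illustrative assumption.
sqoop_args_in_order() {
  seen_sqoop_arg=no
  for arg in "$@"; do
    case $arg in
      -D*)
        if [ "$seen_sqoop_arg" = yes ]; then
          echo "property '$arg' appears after a Sqoop parameter" >&2
          return 1
        fi
        ;;
      *) seen_sqoop_arg=yes ;;
    esac
  done
  return 0
}

# Properties first, then Sqoop parameters: accepted.
sqoop_args_in_order -Dmapred.job.name=orders \
  --connect jdbc:mysql://db.example.com/sqoop --table orders && echo "order ok"
```

A check like this is most useful in wrapper scripts that assemble the argument list from several sources.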
You can specify
EXTRA_ARGS for specialized connectors, which can be used to enter additional parameters specific to each connector. The
EXTRA_ARGS parameters must be separated from the
SQOOP_ARGS with a
Sqoop has a bewildering number of command-line parameters (more than 60). Type
sqoop help to retrieve the entire list. Type
sqoop help TOOL (e.g.,
sqoop help import) to get detailed information for a specific tool.
You can ask for help from the Sqoop community via the mailing lists. The Sqoop Mailing Lists page contains general information and instructions for using the Sqoop User and Development mailing lists. Here is an outline of the general process:
- First, you need to subscribe to the User list at the Sqoop Mailing Lists page.
- To get the most out of the Sqoop mailing lists, you may want to read Eric Raymond’s How To Ask Questions The Smart Way.
- Then provide the full context of your problem with details on observed or desired behavior. If appropriate, include a minimal, self-contained example so that others can reproduce the problem you’re facing.
- Finally, email your question to email@example.com.
Before sending email to the mailing list, it is useful to read the Sqoop documentation and search the Sqoop mailing list archives. Most likely your question has already been asked, in which case you’ll be able to get an immediate answer by searching the archives. If it seems that your question hasn’t been asked yet, send it to firstname.lastname@example.org.
If you aren’t already a list subscriber, your email submission will be rejected.
Your question might have to do with your Sqoop command causing an error or giving unexpected results. In the latter case, it is necessary to include enough data to reproduce the error. If the list readers can’t reproduce it, they can’t diagnose it. Including relevant information greatly increases the probability of getting a useful answer.
To that end, you’ll need to include the following information:
- Versions: Sqoop, Hadoop, OS, JDBC
- Console log after running with the
  - Capture the entire output via sqoop import … &> sqoop.log
- Entire Sqoop command including the
- Expected output and actual output
- Table definition
- Small input data set that triggers the problem
  - Especially with export, malformed data is often the culprit
- Hadoop task logs
  - Often the task logs contain further information describing the problem
- Permissions on input files
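Capturing the console output for a report can be wrapped in a small helper. This is a minimal sketch: the function name `run_and_capture` is an assumption, and it uses the portable `> file 2>&1` redirection, which has the same effect as `&>` in bash. It also appends the exit status, which is often useful to list readers.

```shell
# Run a command and save both stdout and stderr to a log file,
# then append the command's exit status to the same file.
# The function name is an illustrative assumption.
run_and_capture() {
  log=$1
  shift
  "$@" > "$log" 2>&1
  status=$?
  echo "exit status: $status" >> "$log"
  return "$status"
}

# Hypothetical usage:
#   run_and_capture sqoop.log sqoop import <your arguments>
```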
While the project has several communication channels, the mailing lists are not only the most active but also the official channels for making decisions about the project itself. If you’re interested in learning more about or participating in the Apache Sqoop project, the mailing lists are the best way to do that.