Chapter 6. Data Transfer

Data transfer deals with three important questions:

  • How do you get data into a Hadoop cluster?

  • How do you get data out of a Hadoop cluster?

  • How do you move data from one Hadoop cluster to another Hadoop cluster?

In general, Hadoop is not a transactional engine, where data is loaded in small, discrete, related bits of information as it would be in an airline reservation system. Instead, data is bulk loaded from external sources: flat files from sensors, bulk downloads from sources like http://www.data.gov for U.S. federal government data, log files, or transfers from relational systems.

The Hadoop ecosystem contains a variety of great tools for working with your data. However, it’s rare for your data to start or end in Hadoop. It’s much more common to have a workflow that starts with data from external systems, such as logs from your web servers, and ends with analytics hosted on a business intelligence (BI) system.

Data transfer tools help move data between those systems. More specifically, data transfer tools provide three basic capabilities:

File transfer

Tools like Flume (described here) and DistCp (described here) help move files and flat text, such as log entries, into your Hadoop cluster (see the file transfer sketch following this list).

Database transfer

Tools like Sqoop (described next) provide a simple mechanism for moving data between traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster (see the Sqoop sketch following this list).

Data triage

Tools like Storm (described here) provide a way to process or filter data as it streams into your Hadoop cluster, so you can act on it before it is stored.
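For file transfer, a couple of one-line commands illustrate the idea. This is a minimal sketch; the paths, hostnames, and port shown are placeholder assumptions, not values from the text:

    # Copy a local log file into HDFS (basic file transfer).
    hadoop fs -put /var/log/httpd/access_log /data/weblogs/

    # Copy a directory from one Hadoop cluster to another with DistCp.
    hadoop distcp hdfs://nn1.example.com:8020/data/weblogs \
                  hdfs://nn2.example.com:8020/data/weblogs

The first command pushes a single file from a local filesystem into HDFS; the second uses DistCp to copy data between two clusters, which is the usual answer to the third question at the start of this chapter.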
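For database transfer, a Sqoop import pulls a relational table into HDFS over JDBC. Again a minimal sketch: the connection string, credentials, table name, and target directory are placeholder assumptions:

    # Import the "customers" table from a MySQL database into HDFS.
    sqoop import \
      --connect jdbc:mysql://dbhost.example.com/sales \
      --username dbuser -P \
      --table customers \
      --target-dir /data/sales/customers

Sqoop also has a corresponding export command for moving results out of Hadoop and back into a relational database, which addresses the second question above.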
