Chapter 6. Data Transfer
Data transfer deals with three important questions:
How do you get data into a Hadoop cluster?
How do you get data out of a Hadoop cluster?
How do you move data from one Hadoop cluster to another Hadoop cluster?
In general, Hadoop is not a transactional engine, where data is loaded in small, discrete, related bits of information as it would be in an airline reservation system. Instead, data is bulk loaded from external sources: flat files from sensors, bulk downloads from sites like http://www.data.gov for U.S. federal government data, log files, or transfers from relational systems.
The Hadoop ecosystem contains a variety of great tools for working with your data. However, it’s rare for your data to start or end in Hadoop. It’s much more common to have a workflow that starts with data from external systems, such as logs from your web servers, and ends with analytics hosted on a business intelligence (BI) system.
Data transfer tools help move data between those systems. More specifically, data transfer tools provide three basic capabilities:
- File transfer
- Database transfer
Tools like Sqoop (described next) provide a simple mechanism for moving data between traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster.
- Data triage
Tools like Storm (described here) can ...
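The first two capabilities above can be sketched with the standard Hadoop and Sqoop command-line tools. This is a minimal illustration, not a recipe from this chapter; the paths, hostnames, JDBC URL, table name, and credentials are placeholders you would replace with your own:

```shell
# File transfer: copy a local web server log into the cluster's
# distributed filesystem using the built-in HDFS client.
hdfs dfs -mkdir -p /data/weblogs
hdfs dfs -put access.log /data/weblogs/

# Database transfer: pull a table from a relational database into
# HDFS with Sqoop. The connection string and table are hypothetical.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username analyst \
  --password-file /user/analyst/.dbpass \
  --table ORDERS \
  --target-dir /data/orders

# Cluster-to-cluster transfer: DistCp runs a MapReduce job that
# copies files between two HDFS clusters in parallel.
hadoop distcp \
  hdfs://cluster-a:8020/data/weblogs \
  hdfs://cluster-b:8020/data/weblogs
```

All three commands assume a running Hadoop cluster (and, for the import, a reachable database with the Sqoop connector installed), so they are meant to be read as a sketch of the workflow rather than run verbatim.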