Chapter 2. Moving data in and out of Hadoop


This chapter covers
  • Understanding key design considerations for data ingress and egress tools
  • Techniques for moving log files into HDFS and Hive
  • Using relational databases and HBase as data sources and data sinks


Moving data in and out of Hadoop, which I’ll refer to in this chapter as data ingress and egress, is the process by which data is transported from an external system into an internal system, and vice versa. Hadoop supports ingress and egress at a low level in HDFS and MapReduce. Files can be moved in and out of HDFS, and data can be pulled from external data sources and pushed to external data sinks using MapReduce. Figure 2.1 shows some of Hadoop’s ingress and egress mechanisms. ...
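As a minimal sketch of the file-level path described above, the following Python helpers build the `hdfs dfs -put` (ingress) and `hdfs dfs -get` (egress) commands from Hadoop's file system shell. The helper names and the example paths are hypothetical, chosen only for illustration, and the sketch assumes an `hdfs` client is installed and configured on the machine that runs it.

```python
import shutil
import subprocess
from typing import List


def hdfs_put_cmd(local_path: str, hdfs_path: str) -> List[str]:
    """Build the argv that copies a local file into HDFS (ingress)."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def hdfs_get_cmd(hdfs_path: str, local_path: str) -> List[str]:
    """Build the argv that copies an HDFS file to local disk (egress)."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def run(cmd: List[str]) -> int:
    """Run cmd and return its exit code, or -1 if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return -1
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical paths: adjust to your own cluster layout.
    print(run(hdfs_put_cmd("/tmp/access.log", "/data/logs/access.log")))
    print(run(hdfs_get_cmd("/data/logs/access.log", "/tmp/access-copy.log")))
```

Building the argument list separately from running it keeps the ingress and egress directions easy to inspect, and it avoids shell quoting problems with unusual file names.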
