Skip to Main Content
Using Flume
book

Using Flume

by Hari Shreedharan
September 2014
Intermediate to advanced content levelIntermediate to advanced
238 pages
6h 17m
English
O'Reilly Media, Inc.
Content preview from Using Flume

Chapter 2. Streaming Data Using Apache Flume

Pushing data to HDFS and similar storage systems using an intermediate system is a very common use case. There are several systems, like Apache Flume, Apache Kafka, Facebook’s Scribe, etc., that support this use case. Such systems allow HDFS and HBase clusters to handle sporadic bursts of data without necessarily having the capacity to handle that rate of writes continuously. These systems act as a buffer between the data producers and the final destination. By virtue of being buffers, they are able to balance out the impedance mismatch between the producers and consumers, thus providing a steady state of flow. Scaling these systems is often far easier than scaling HDFS or HBase clusters. Such systems also allow the applications to push data without worrying about having to buffer the data and retry in case of HDFS downtime, etc.

Most such systems have some fundamental similarities. Usually, these systems have components that are responsible for accepting the data from the producer, through an RPC call or HTTP (which may be exposed via a client API). They also have components that act as buffers where the data is stored until it is removed by the components that move the data to the next hop or destination. In this chapter, we will discuss the basic architecture of a Flume agent and how to configure Flume agents to move data from various applications to HDFS or HBase.

Apache Hadoop is becoming a standard data processing framework in ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Apache Flume: Distributed Log Collection for Hadoop - Second Edition - Second Edition

Apache Flume: Distributed Log Collection for Hadoop - Second Edition - Second Edition

Steven Hoffman
Java Data Objects

Java Data Objects

David Jordan, Craig Russell

Publisher Resources

ISBN: 9781491905326ErrataSupplemental Content