O'Reilly logo

Apache Flume: Distributed Log Collection for Hadoop by Steve Hoffman

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Tiering data flows

In Chapter 1, Overview and Architecture, we talked about tiering your data flows. There are several reasons for wanting to do this. You may want to limit the number of Flume agents that directly connect to your Hadoop cluster to limit the number of parallel requests. You may also lack sufficient disk space on your application servers to store a significant amount of data while you are performing maintenance on your Hadoop cluster. Whatever your reason or use case, the most common mechanism for chaining Flume agents is using the Avro Source/Sink pair.

Avro Source/Sink

We covered Avro a bit in Chapter 4, Sink and Sink Processors, when we discussed using it as an on-disk serialization format for files stored in HDFS. Here we'll put ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required