Apache Flume

Data lakes can be filled with data arriving from multiple sources at different speeds. Tools in the ingestion tier, such as Apache Flume, handle the massive volume of incoming data and store it on HDFS.

Apache Flume is a distributed, scalable tool that reliably collects data from different sources and moves it to a centralized data store on HDFS. Massive volumes of data, such as weblogs or sensor readings, can be gathered this way and stored on HDFS for analysis and distribution. Though the typical use cases of Apache Flume involve the collection and storage of log data, it can be used to ingest any kind of data into HDFS.

Understanding the Design of Flume

Flume is an agent-based system. Each Flume agent contains three components, wired together in a configuration file (a sketch follows the list):

  • Source: The source receives events from an external data generator, such as a web server emitting log records, and places them on one or more channels.
  • Channel: The channel is a passive buffer that holds each event until a sink consumes it, decoupling the rate at which data arrives from the rate at which it is written out.
  • Sink: The sink removes events from its channel and delivers them to the destination store, such as HDFS, or forwards them to the source of another agent.
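
The following is a minimal sketch of such a configuration in Flume's standard properties format. The agent name (agent1), the component names (src1, ch1, sink1), the log file path, and the HDFS URL are all illustrative assumptions, not values from any particular deployment:

    # Name the agent's components (names are illustrative)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: tail a web server log using the exec source
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/httpd/access_log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: write the buffered events to HDFS
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/weblogs
    agent1.sinks.sink1.hdfs.fileType = DataStream

Note that a source is bound to one or more channels (agent1.sources.src1.channels) while a sink consumes from exactly one channel (agent1.sinks.sink1.channel). Assuming the file is saved as weblog.conf, the agent can be started with the standard flume-ng launcher:

    flume-ng agent --conf conf --conf-file weblog.conf --name agent1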
