Data Lakes can be filled with data arriving from multiple sources at different speeds. Tools in the ingestion tier, such as Apache Flume, can handle large volumes of incoming data and store them on HDFS.
Apache Flume is a distributed and scalable tool that reliably collects data from different sources and moves it to a centralized data store on HDFS. Massive volumes of data, generated as weblogs or sensor readings, can be stored on HDFS for analysis and distribution. Although typical Flume use cases involve collecting and storing log data, Flume can ingest any kind of data into HDFS.
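To make the weblog-to-HDFS flow concrete, here is a minimal sketch of a Flume agent configuration file. A Flume agent is configured through a properties file that wires a source to a sink through a channel. The agent name (agent1), the component names, the log file path, and the NameNode address are all hypothetical and would need to match your own environment.

```properties
# Hypothetical agent "agent1": declare one source, one channel, one sink
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Source: follow a web server log file (path is an assumption)
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = mem-channel

# Channel: buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: write events to HDFS (NameNode host and port are assumptions)
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
# Use the agent's clock to resolve the date escapes in the path
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel = mem-channel
```

A memory channel is fast but loses buffered events if the agent stops. Where that loss is not acceptable, a file channel is the usual alternative.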
Understanding the Design of Flume
Flume is an agent-based system. Each agent contains three components:
- Source: The source ...