Chapter 7. Data Ingestion and Streaming Tools
Data ingestion tools link operational applications with the analytical tools that produce reports, and they supply organized data to machine learning models. Ingestion tools also strongly shape an organization's data processing capabilities: data that cannot be ingested accurately, reliably, and in a timely manner loses much of its usefulness. In this chapter, we cover some of the better-known contemporary data ingestion and streaming tools.
By understanding the strengths and trade-offs of these tools, organizations can design robust and scalable ingestion pipelines that align with their strategic objectives. And, as data continues to grow in volume, velocity, and variety, selecting the right tools and frameworks will remain crucial for maintaining competitive advantage.
Apache Beam, Flink, Spark, and Storm
Apache Beam is an open source, unified programming model for defining both batch and streaming data processing pipelines. It allows developers to build pipelines that can run on a variety of execution engines (or runners), such as Apache Flink and Apache Spark (which we’ll go over shortly). Beam abstracts away the complexities of parallel computing and simplifies the development of data-intensive applications. It supports key features such as windowing, event-time processing, and a rich set of built-in transforms, making it flexible enough for both near-real-time and batch workloads.
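To make the windowing and event-time ideas concrete, here is a minimal, library-free sketch of tumbling (fixed-size, non-overlapping) windows, the simplest windowing strategy Beam supports. The function name and event format are hypothetical illustrations, not Beam API; in a real Beam pipeline you would instead apply a windowing transform such as `beam.WindowInto(beam.window.FixedWindows(60))`. The point is that events are grouped by their *event time* (when they occurred), not by when the pipeline happens to process them.

```python
from collections import defaultdict

def tumbling_windows(events, window_size_s):
    """Group (event_time_epoch_s, value) pairs into fixed, non-overlapping
    windows, keyed by each window's start timestamp.

    This is an illustrative sketch of the concept, not Beam's API.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        # Each event lands in the window whose start is the largest
        # multiple of window_size_s that is <= event_time.
        window_start = event_time - (event_time % window_size_s)
        windows[window_start].append(value)
    return dict(windows)

events = [
    (0, "a"),    # falls in window [0, 60)
    (30, "b"),   # falls in window [0, 60)
    (65, "c"),   # falls in window [60, 120)
]
print(tumbling_windows(events, 60))
# {0: ['a', 'b'], 60: ['c']}
```

Beam generalizes this idea with sliding and session windows, plus watermarks and triggers that decide when a window's results can be emitted even though late events may still arrive.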
Beam can read your data from a diverse set ...