Chapter 3. Streaming Data Pipelines

After data is collected from real-time sources, it is added to a data stream. A stream contains a sequence of events, made available over time, with each event containing data from the source, plus metadata identifying source attributes. Streams can be untyped, but more commonly, the data content of streams is described through internal (as part of the metadata) or external data-type definitions. Streams are unbounded, continually changing, potentially infinite sets of data, which are very different from the traditional bounded, static, and limited batches of data, as shown in Figure 3-1. In this chapter, we discuss streaming data pipelines.

Figure 3-1. Difference between streams and batches
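
The contrast between a bounded batch and an unbounded stream can be illustrated with a minimal Python sketch. Everything named here (`Event`, `read_batch`, `read_stream`, the `sensor-1` source, and the `reading` field) is a hypothetical stand-in chosen for illustration, not part of any particular streaming platform.

```python
import itertools
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Event:
    """One stream event: source data plus metadata describing the source."""
    data: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)


def read_batch() -> List[Event]:
    """A batch is bounded: the whole set exists before processing starts."""
    return [Event({"reading": i}, {"source": "sensor-1"}) for i in range(3)]


def read_stream() -> Iterator[Event]:
    """A stream is unbounded: events keep arriving and there is no final one."""
    for i in itertools.count():
        yield Event({"reading": i}, {"source": "sensor-1", "ts": time.time()})


batch = read_batch()
print(len(batch))  # a batch has a known, fixed size

# A stream has no length; a consumer can only take events as they arrive.
first_five = list(itertools.islice(read_stream(), 5))
print([e.data["reading"] for e in first_five])
```

The generator never terminates on its own, so the consumer decides how much to read; calling `list(read_stream())` would never return.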

Here are the major purposes of data streams:

  • Facilitate asynchronous processing

  • Enable parallel processing of data

  • Support time-series analysis

  • Move data between components in a data pipeline

  • Move data between nodes in a clustered processing platform

  • Move data across network boundaries, including datacenter to datacenter, and datacenter to cloud

  • Do this in a reliable and guaranteed fashion that handles failure and enables recovery

Streams facilitate asynchronous handling of data. Data flow, stream processing, and data delivery do not need to be tightly coupled to the ingestion of data: these can work somewhat independently. However, if the ...
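
The loose coupling between ingestion and processing described above can be sketched with an in-memory queue standing in for the stream. This is only an illustration under stated assumptions: the `ingest` and `process` functions, the `reading` field, and the `SENTINEL` end marker are hypothetical, and a real stream would not normally end.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-demo marker; real streams are unbounded


def ingest(stream: queue.Queue, n: int) -> None:
    """Ingestion side: writes events to the stream at its own pace."""
    for i in range(n):
        stream.put({"reading": i})
    stream.put(SENTINEL)


def process(stream: queue.Queue, results: list) -> None:
    """Processing side: reads events whenever they become available."""
    while True:
        event = stream.get()  # blocks until an event arrives
        if event is SENTINEL:
            break
        results.append(event["reading"] * 2)


stream = queue.Queue()
results = []
consumer = threading.Thread(target=process, args=(stream, results))
producer = threading.Thread(target=ingest, args=(stream, 5))

# The consumer can start before any data has been ingested.
consumer.start()
producer.start()
producer.join()
consumer.join()
print(results)
```

Neither side calls the other directly; the queue is the only point of contact, so either side can run faster or slower without the other needing to know.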
