Chapter 3. Streaming Data Pipelines
After data is collected from real-time sources, it is added to a data stream. A stream contains a sequence of events, made available over time, with each event carrying data from the source plus metadata identifying source attributes. Streams can be untyped, but more commonly their data content is described through internal (as part of the metadata) or external data-type definitions. Streams are unbounded, continually changing, potentially infinite sets of data, which makes them very different from traditional bounded, static, and limited batches of data, as shown in Figure 3-1. In this chapter, we discuss streaming data pipelines.
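To make the event structure concrete, here is a minimal sketch of what a single stream event might carry. The class and field names are illustrative assumptions, not any particular platform's API:

```java
import java.time.Instant;
import java.util.Map;

// A minimal sketch of a stream event: the data collected from the
// source (payload) plus metadata identifying source attributes.
// All names here are hypothetical, chosen for illustration only.
public final class StreamEvent {
    final Instant timestamp;            // when the event was observed
    final Map<String, String> metadata; // source attributes, e.g. host, table, position
    final Map<String, Object> payload;  // the data content from the source

    StreamEvent(Instant timestamp,
                Map<String, String> metadata,
                Map<String, Object> payload) {
        this.timestamp = timestamp;
        this.metadata = metadata;
        this.payload = payload;
    }
}
```

A typed stream would pin the payload down to a concrete schema (internally, as part of the metadata, or through an external definition); an untyped stream leaves it as opaque key/value content, as in this sketch.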
Here are the major purposes of data streams:
- Facilitate asynchronous processing
- Enable parallel processing of data
- Support time-series analysis
- Move data between components in a data pipeline
- Move data between nodes in a clustered processing platform
- Move data across network boundaries, including datacenter to datacenter and datacenter to cloud
- Do all of this in a reliable, guaranteed fashion that handles failure and enables recovery
Streams facilitate asynchronous handling of data. Data flow, stream processing, and data delivery do not need to be tightly coupled to the ingestion of data; they can work somewhat independently. However, if the ...
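That decoupling can be illustrated with a small in-process sketch: a bounded queue stands in for the stream, and the ingesting producer and processing consumer run on independent threads. This is an assumption-laden simplification; real streaming platforms add persistence, partitioning, and delivery guarantees on top of the same idea:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A minimal sketch of asynchronous handling via a stream:
// producer (ingestion) and consumer (processing) share only a
// bounded queue, so neither runs in lockstep with the other.
public class AsyncPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> stream = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                try {
                    stream.put("event-" + i); // blocks only if the stream is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    String event = stream.take(); // blocks until an event arrives
                    System.out.println("processed " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```

The bounded capacity also provides natural backpressure: if processing falls behind, ingestion slows rather than overwhelming downstream components.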