Chapter 2. Real-Time Continuous Data Collection

As a starting point for all streaming integration solutions, data needs to be continuously collected in real time. This is referred to as a streaming-first approach, and neither streaming integration nor streaming analytics solutions can function without this initial step. How this is achieved varies by data source, but all sources share some common requirements:

Collect data as soon as it is generated by the source
Capture metadata and schema information from the source to place alongside the data
Turn the data into a common event structure for use in processing and delivery
Record source position if applicable for lineage and recovery purposes
Handle data schema changes
Scale through multithreading and parallelism
Handle error and failure scenarios with recovery to ensure that no data is missed
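
As a rough illustration of several of these requirements together (a common event structure, metadata placed alongside the data, and a recorded source position used for recovery), here is a minimal Python sketch. The `Event` class, the `collect` function, and all field names are hypothetical and chosen only for illustration; they are not the API of any particular streaming platform, and the "schema" captured here is simply the list of field names in each record.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical common event structure; all names are illustrative."""
    source: str            # which source produced the record
    position: int          # source position, kept for lineage and recovery
    data: Dict[str, Any]   # the payload itself
    metadata: Dict[str, Any] = field(default_factory=dict)  # schema and source details


def collect(source: str,
            records: Iterable[Tuple[int, Dict[str, Any]]],
            last_position: Optional[int] = None) -> Iterator[Event]:
    """Wrap raw (position, payload) pairs as events.

    Records at or before last_position are skipped, so a restarted
    collector can resume from its last recorded position.
    """
    for position, payload in records:
        if last_position is not None and position <= last_position:
            continue  # already collected before the restart
        yield Event(
            source=source,
            position=position,
            data=payload,
            # Assumption: the schema is approximated by the sorted field names.
            metadata={"schema": sorted(payload.keys())},
        )


# Example: collect two records, "fail", then resume from the checkpoint.
raw = [
    (1, {"id": 10, "status": "new"}),
    (2, {"id": 11, "status": "new"}),
    (3, {"id": 10, "status": "shipped"}),
]
first_run = list(collect("orders", raw[:2]))
checkpoint = first_run[-1].position
resumed = list(collect("orders", raw, last_position=checkpoint))
```

In this sketch the checkpoint is simply the position of the last event handled. A real collector would need to persist that position durably and coordinate it with delivery so that no data is missed or duplicated after a failure.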

The following sections explain how these requirements can be implemented for several source categories (databases, files and logs, messaging systems, cloud and APIs, and devices and IoT) and provide examples to clarify each case.

Databases and Change Data Capture

A database represents the current state of some real-world application and is most meaningful in the context of transaction processing. Applications submit queries and updates from a number of network endpoints, and the database manages these as a series of transactions that observe and change that state.

From the late 1970s to the beginning ...

Streaming Integration by Steve Wilkes, Alok Pareek
