Chapter 2. Real-Time Continuous Data Collection

As a starting point for all streaming integration solutions, data needs to be continuously collected in real time. This is referred to as a streaming-first approach, and neither streaming integration nor streaming analytics solutions can function without this initial step. The way in which this is achieved varies depending on the data source, but all sources share some common requirements:

  • Collect data as soon as it is generated by the source

  • Capture metadata and schema information from the source to place alongside the data

  • Turn the data into a common event structure for use in processing and delivery

  • Record source position if applicable for lineage and recovery purposes

  • Handle data schema changes

  • Scale through multithreading and parallelism

  • Handle error and failure scenarios with recovery to ensure that no data is missed
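
To make a few of these requirements concrete, here is a minimal Python sketch of a common event structure that carries metadata and a source position, together with a collector that can resume from a recorded position. It is an illustration under stated assumptions, not the API of any particular product: the `Event` class, its field names, and the `collect` function are all hypothetical, and the source is modeled as a simple in-memory list of records.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List


@dataclass(frozen=True)
class Event:
    """Hypothetical common event structure; the field names are illustrative."""
    data: Dict[str, Any]      # the payload as read from the source
    metadata: Dict[str, Any]  # source name, field names, collection time
    position: int             # offset in the source, for lineage and recovery


def collect(source_name: str,
            records: List[Dict[str, Any]],
            start_position: int = 0) -> Iterator[Event]:
    """Wrap each record at or after start_position in an Event.

    Assumption: the source is a list and the position is the list index.
    A real source would expose its own notion of position, such as a
    log sequence number or a file offset.
    """
    for offset in range(start_position, len(records)):
        record = records[offset]
        yield Event(
            data=dict(record),
            metadata={
                "source": source_name,
                # The field names travel with every event, so a schema
                # change in the source is visible downstream.
                "schema": sorted(record.keys()),
                "collected_at": time.time(),
            },
            position=offset,
        )


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "email": "c@example.com"},  # schema change
]

events = list(collect("orders", rows))

# Simulate recovery: suppose the event at position 1 was the last one
# processed before a failure, and resume from the next position.
resumed = list(collect("orders", rows, start_position=events[1].position + 1))
```

Because every event records its own position, a collector that restarts after a failure can pick up immediately after the last processed event rather than rereading the source or skipping data. Scaling through parallelism and richer error handling are left out of this sketch.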

The following sections explain how we can implement these requirements for a variety of source categories – databases, files and logs, messaging systems, cloud and APIs, and devices and IoT – and provide examples to clarify each case.

Databases and Change Data Capture

A database represents the current state of some real-world application and is most meaningful in the context of transaction processing. Applications submit queries and updates from a number of network endpoints, and the database manages these as a series of transactions that observe and change its state.

From the late 1970s to the beginning ...
