Chapter 6. Building Real-Time Data Pipelines

Technologies and Architecture to Enable Real-Time Data Pipelines

Today’s applications thrive using both real-time and historical data for fast, accurate insights. Historical data provides the background for building an initial algorithm to score real-time data as it streams into an application.

To enable real-time machine learning (ML), you will need a modern data processing architecture that takes advantage of technologies available for ingestion, analysis, and visualization. Such topics are discussed with greater depth in the book The Path to Predictive Analytics and Machine Learning (O’Reilly, 2016); however, the overview provided in the sections that follow offers sufficient background to understand the concepts of the next several chapters.

Building any real-time application requires infrastructure and technologies that accommodate ultra-fast data capture and processing. These high-performance technologies share the following characteristics:

  1. In-memory data storage for high-speed ingest

  2. Distributed architecture for horizontal scalability

  3. Queryable for instantaneous, interactive data exploration.

Figure 6-1 illustrates these characteristics.

dwaa 0601
Figure 6-1. Characteristics of real-time technologies

High-Throughput Messaging Systems

The initial input for most real-time applications begins by capturing data at its source and ...

Get Data Warehousing in the Age of Artificial Intelligence now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.