Chapter 3. Moving from Data Silos to Real-Time Data Pipelines

Providing a modern user experience at scale requires a streamlined data processing infrastructure. Users expect tailored content, short load times, and information that is always up-to-date. Applying these same guiding principles to business operations can improve their effectiveness. For example, publishers, advertisers, and retailers can drive higher conversion by targeting display media and recommendations based on users’ history and demographic information. Applications like real-time personalization create problems for legacy data processing systems built around separate operational and analytical data silos.

The Enterprise Architecture Gap

A traditional data architecture uses an OLTP-optimized database for operational data processing and a separate OLAP-optimized data warehouse for business intelligence and other analytics. In practice, these systems are often very different from one another and likely come from different vendors. Transferring data between systems requires ETL (extract, transform, load) (Figure 3-1).

Legacy operational databases and data warehouses ingest data differently. In particular, legacy data warehouses cannot efficiently handle one-off inserts and updates; instead, data must be organized into large batches and loaded all at once. Because of the batch sizes and loading times involved, this is typically not an online operation, and it runs overnight or at the end of the week.
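The batch ETL cycle described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it uses SQLite in place of both the OLTP database and the warehouse, and the table and column names (orders, fact_orders) are assumptions made for the example. The key point is the load step, which inserts rows in one bulk operation rather than one at a time.

```python
import sqlite3

def extract(oltp):
    # Extract: pull the operational rows accumulated since the last run.
    return oltp.execute("SELECT id, amount_cents, status FROM orders").fetchall()

def transform(rows):
    # Transform: keep completed orders and convert cents to dollars.
    return [(oid, cents / 100.0) for (oid, cents, status) in rows
            if status == "completed"]

def load(olap, batch):
    # Load: a single bulk insert -- legacy warehouses ingest efficiently
    # only when rows arrive in large batches, not as one-off inserts.
    olap.executemany("INSERT INTO fact_orders VALUES (?, ?)", batch)
    olap.commit()

# Demo with in-memory databases standing in for the two systems.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, status TEXT)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1250, "completed"), (2, 400, "cancelled"),
                  (3, 999, "completed")])

olap = sqlite3.connect(":memory:")
olap.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_dollars REAL)")

load(olap, transform(extract(oltp)))
print(olap.execute("SELECT COUNT(*), SUM(amount_dollars) FROM fact_orders").fetchone())
# -> (2, 22.49)
```

In a real deployment the extract and load steps would run against different vendors' systems on a nightly schedule, which is precisely why the warehouse lags behind the operational database.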

Figure 3-1. Legacy data processing model ...
