Chapter 7. Product Changes Captured with Change Data Capture

The operations team at AATD can now get a solid overview of order volume and the revenue the business is making. What they can’t yet see is what’s happening at the product level. Complaints from other parts of the business indicate that some products are seeing surges in orders, while other items are overstocked.

The data about individual products is currently stored in the MySQL database, but we need to get it out of there and into our real-time analytics architecture. In this chapter, we’ll learn how to do this using a technique called change data capture (CDC).

Capturing Changes from Operational Databases

Businesses often record their transactions in operational (OLTP) databases, and they usually want to analyze that operational data as well. How should they go about doing that?

Traditionally, extract, transform, load (ETL) pipelines have been used to move data from operational databases to analytical databases such as data warehouses. Those pipelines were executed periodically, extracting data from source databases in large batches. The data was then transformed and loaded into the analytical database.

The problem with this classic approach was the significant latency between data collection and decision making. For example, a typical batch pipeline would take minutes, hours, or days to generate insights from operational data.
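To make the latency problem concrete, here is a minimal sketch of a batch ETL run. It is an illustration under stated assumptions, not the pipeline used in this book: in-memory SQLite stands in for both the operational database and the analytics store, and the `products` schema, the sample row, and the `run_batch_etl` and `warehouse_price` helpers are all hypothetical.

```python
import sqlite3

# Assumptions: in-memory SQLite stands in for both the operational (OLTP)
# database and the analytics store; the `products` schema is hypothetical.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price_cents INTEGER)"
)
warehouse.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)


def run_batch_etl(source, warehouse):
    """One scheduled run: extract every row, transform it, then load it."""
    # Extract: read the whole table in one large batch.
    rows = source.execute("SELECT id, name, price_cents FROM products").fetchall()
    # Transform: tidy the name and convert cents to a decimal price.
    transformed = [
        (pid, name.strip().title(), cents / 100) for pid, name, cents in rows
    ]
    # Load: replace the analytics copy inside a single transaction.
    with warehouse:
        warehouse.execute("DELETE FROM products")
        warehouse.executemany("INSERT INTO products VALUES (?, ?, ?)", transformed)
    return len(transformed)


def warehouse_price(warehouse, product_id):
    """Return the price the analytics store currently holds, or None."""
    row = warehouse.execute(
        "SELECT price FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return row[0] if row else None


# Seed the operational database and run the first batch.
with source:
    source.execute("INSERT INTO products VALUES (1, ' sample product ', 900)")
run_batch_etl(source, warehouse)

# A change lands in the operational database after the batch has run.
with source:
    source.execute("UPDATE products SET price_cents = 1100 WHERE id = 1")

stale_price = warehouse_price(warehouse, 1)  # analytics still holds the old price
run_batch_etl(source, warehouse)
fresh_price = warehouse_price(warehouse, 1)  # new price appears only after the next run
```

Between the two runs, the analytics store keeps serving the old price. However long the schedule leaves between batches is how stale the analytical view can get.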

What if there was a mechanism to capture changes made to source ...
