Chapter 1. Streaming Foundations

The hero’s journey always begins with the call. One way or another, a guide must come to say, “Look, you’re in Sleepy Land. Wake. Come on a trip. There is a whole aspect of your consciousness, your being, that’s not been touched. So you’re at home here? Well, there’s not enough of you there.” And so it starts.

Joseph Campbell, Reflections on the Art of Living: A Joseph Campbell Companion

The streaming database is a concept born from over a decade of processing and serving data. The evolution leading to the advent of streaming databases is rooted in the broader history of database management systems, data processing, and the changing demands of the digital age. To understand this evolution, let’s take a historical journey through the key milestones that have shaped the development of streaming databases.

The rise of the internet and the explosive growth of digital data in the late 20th century created the need for more scalable and flexible data management solutions. Data warehouses and batch-oriented processing frameworks like Hadoop emerged during this era to address the challenges posed by the sheer volume of data.

The term “big data” was, and still is, used to refer not only to the size of the data but also to the solutions that store and process data at extreme scale. Big data cannot fit on a single computer or server; it must be divided into smaller parts and stored across multiple machines. Systems like Hadoop and MapReduce became popular because they enabled this kind of distributed storage and processing.

This led to the idea of using distributed streaming to move large volumes of data into Hadoop. Apache Kafka emerged as one such messaging service that was designed to handle big data. Not only did it provide a way to move data from system to system, but it also provided a way to access data in motion—in real time. It was a development that led to a new wave of demand for real-time streaming use cases.

New technologies, such as Apache Flink and Apache Spark, were developed and were able to meet these new expectations. As distributed frameworks for batch processing and streaming, they could process data across many servers and provide analytical results. When coupled with Kafka, the trio provided a solution that could support streaming real-time analytical use cases. We’ll discuss stream processors in more detail in Chapter 2.

In the mid-2010s, simpler and more approachable streaming paradigms emerged to increase the scale of real-time data processing. These included two stream processing frameworks, Apache Kafka Streams (KStreams) and Apache Samza, which were among the first to implement materialized views, making the stream look and feel more like a database.

Martin Kleppmann took the pairing of databases and streaming even further. In his 2015 talk, “Turning the Database Inside-Out”, he described a way to implement stream processing that takes internal database features and externalizes them in real-time streams. This approach led to more scalable, resilient, and real-time stream processing systems.

One of the problems with stream processing was (and still is) that it’s harder to use than batch processing. There are fewer abstractions, and much more of the low-level technology shines through. To implement stream processing for their use case, data engineers now had to consider data ordering, consistency for accurate processing, fault tolerance, resilience, scalability, and more. This became a hurdle that deterred data teams from attempting to use streaming. As a result, most have opted to continue using databases to transform data and to run the processing in batches, even at the cost of not meeting performance requirements.

In this book, we hope to make streaming and stream processing more accessible to those who are used to working with databases. We’ll start, as Kleppmann did, by talking about how to turn the database inside out.

Turning the Database Inside Out

Martin Kleppmann is a distinguished software developer who gave the thought-provoking talk “Turning the Database Inside-Out.” In it, he presented Apache Samza as a newer way of implementing stream processing, one that takes internal database features and externalizes them as real-time streams. His thought leadership led to the paradigm shift of introducing materialized views to stream processing.

Really it’s a surreptitious attempt to take the database architecture we know and turn it inside out.

Martin Kleppmann, “Turning the Database Inside-Out”

However, stream processing is still hard, and hence many data engineers have, over time, opted to continue using databases to transform data and to run the processing in batches, even if it meant not meeting SLA requirements.

As we move forward in this book, we will attempt to make streaming and stream processing more accessible to data engineers by bringing them back into the database. But before we can do this, we need to understand why Kleppmann decided to take apart the database and why he chose the specific database features in his new paradigm to achieve real-time data processing.

Externalizing Database Features

Kleppmann identified two important features in the database: the write-ahead log (WAL) and the materialized view. As it turns out, these features naturally have streaming characteristics that provide a better way of processing data in real time.

Write-Ahead Log

The WAL is a mechanism that allows databases to ensure data durability and consistency. The disks that databases write data to don’t offer transactional guarantees on their own, so databases face the challenge of providing transactionality atop a device that has none. WALs are how databases provide transactionality without having transactional disks.

A transaction in a database refers to a sequence of one or more database operations executed as a single unit of work. These operations can include data insertion (INSERT), data modification (UPDATE), or data deletion (DELETE) (see Figure 1-1).

Figure 1-1. A write-ahead log that captures change events in a database

The WAL is an append-only record of changes: each change is written to the WAL and persisted to disk before it is applied to the main data files, as shown in Figure 1-2.

Figure 1-2. Database writing to disk through the write-ahead log

When saving transactions on disk, the database follows these steps:

  1. The client starts a transaction by issuing a BEGIN statement.

  2. The database writes a record to the WAL indicating that a transaction has started.

  3. The client makes changes to the database data.

  4. The client commits the transaction by issuing a COMMIT statement.

  5. The database writes a record to the WAL indicating that the transaction has been committed.

  6. The changes made by the transaction are written to disk.

When a transaction starts, the database writes a record to the WAL indicating that the transaction has started. The database then proceeds to make changes to the database data. However, the changes are not written to the main data files until the transaction commits. If the database crashes or loses power, the changes can be replayed from the log, and the database can be restored to a consistent state.1
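The following is a minimal sketch of the client side of these steps, using a hypothetical customer table. The BEGIN and COMMIT statements are what trigger the WAL records written in steps 2 and 5; the flush to the data files in step 6 happens inside the database engine.

-- Hypothetical example: the client side of a WAL-protected transaction
BEGIN;                               -- step 1: start the transaction

UPDATE customer                      -- step 3: change the data
SET city = 'Woodstock'
WHERE customer_id = 42;

COMMIT;                              -- step 4: commit; the database writes the
                                     -- commit record to the WAL (step 5) and
                                     -- later flushes the change to disk (step 6)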

The WAL provides a mechanism to capture database transactions in real time by allowing external systems to subscribe to it. One of the use cases for this is database disaster recovery. By reading the WAL, data can be replicated to a secondary database. If the primary database suffers an outage, database clients can fail over to the secondary database, which is a replica of the primary (see Figure 1-3).

Figure 1-3. The WAL is used to replicate data from a primary database to a secondary in the case of a primary database outage

Since WALs receive transactions in real time, they naturally have the right semantics for streaming. Clients can subscribe to the WAL and forward its transactions to a streaming platform for other systems to consume. These other systems can then build replicas that represent the original primary database. The semantics of the WAL construct are mimicked in the storage implementation of streaming platforms like Kafka. Streaming platforms extend the database WAL externally for other applications and systems to use.

There are other streaming-related concepts about the WAL. After the transactions are committed, the WAL is not cleared immediately. Instead, it follows a process called checkpointing, which involves periodically flushing the transactions of the WAL to the main data files. Checkpointing serves several purposes, one of which is ensuring that some committed changes have been permanently written to the data files, reducing the amount of data that needs to be replayed during recovery after a crash. This helps speed up the recovery process. Also, as transactions are committed, the WAL grows over time. Checkpointing helps control the size of the WAL by flushing some of its contents to the data files. This prevents the WAL from becoming excessively large and consuming too much disk space. Checkpointing and replaying transactions are features you will also find in stream processing for very similar reasons.

We mentioned that the WAL construct that normally lives internally in the database can be represented externally in streaming platforms like Kafka, which provide WAL-like semantics when replicating data from system to system.

Streaming Platforms

Streaming platforms like Apache Kafka are distributed, scalable, and fault-tolerant systems designed to handle real-time data streams. They provide a powerful infrastructure for ingesting, storing, and processing large volumes of continuous data from various sources.

Most streaming platforms have a construct called a partition. Partitions mimic the WAL in a database: transactions are appended to a partition just as they are appended to a WAL. Streaming platforms can hold many partitions, distributing the stream load to promote horizontal scaling. Partitions are grouped into abstractions called topics, to which applications publish transactions and from which they consume them.

By publishing transactions to the streaming platform, you’ve published them to all subscribers who may want to consume them. This is called a publish-subscribe model, and it is critical for allowing multiple disparate consumers to use these transactions.

For other streaming platforms, the names of these constructs may differ. Table 1-1 lists some alternative streaming platforms. Apache Kafka is the most popular streaming platform used today; in Kafka, the higher-level abstraction is called a topic, and the lower-level constructs are simply called partitions.

Table 1-1. Alternative streaming platforms

| Streaming platform name | Description | Implementation | Topic name | Partition name | Kafka compliant |
| --- | --- | --- | --- | --- | --- |
| Memphis | An open source, next-generation alternative to traditional message brokers. | GoLang | Station | Stream | No |
| Apache Pulsar | An open source distributed messaging and streaming platform originally developed at Yahoo! | Java | Topic | Ledger | Yes; the Pulsar Kafka wrapper currently supports most of the operations offered by the Kafka API. |
| Redpanda | An open source streaming platform designed to provide a high-performance, scalable, and reliable way to handle real-time data streams. | C++ | Topic | Partition | Yes |
| WarpStream | A Kafka-compatible data streaming platform built directly on top of S3. | GoLang | Topic | Partition | Yes |
| Gazette | A lightweight open source streaming platform. | GoLang | Selector | Journal | No |
| Pravega | Provides a streaming storage abstraction for continuously generated and unbounded data. | Java | Stream | Stream Segment | Kafka adapter available |

Note

In this book, we’ll use the terms “topic” and “partition” as the names of the streaming platform constructs that hold real-time streaming data.

Since Kafka is the most popular streaming platform used today, the last column in Table 1-1 indicates if the streaming platform supports Kafka clients. This will allow applications to swap out Kafka for another Kafka-compliant streaming platform.

As stated, a partition is a mechanism that streaming platforms use to scale themselves out. The more partitions a topic has, the more it can distribute the data load. This enables more consumer instances to process the transactions in parallel. The way transactions get distributed across partitions is by using a key assigned to the transaction. In Figure 1-4, the WAL in the database is read and stored into a topic in a streaming platform—on a higher level of abstraction than just a disk.

Figure 1-4. Topics in a streaming platform can mimic a WAL and externalize it so other systems can build replicas of the original source database

Instead of storing the data for others to query, the streaming platform reconstructs the WAL and distributes the transactions onto separate partitions.2 Reconstructing the WAL exposes the transactions to other data systems to build replicas of the primary database.

Partitions are immutable, append-only logs that streaming platforms use to capture and serve transactions. Many consumers can subscribe to them using offsets. An offset corresponds to the index or position of a transaction in the partition.3 Every consumer of the topic has an offset pointer to keep track of its position in the partition. This allows consumers to read and process the transactions in the partition at their own pace. A side effect is that the streaming platform has to retain transactions in its partitions for longer than databases retain their transactions in a WAL. The default retention in Kafka is 7 days, which gives slow consumers plenty of time to process the transactions in the topic. The retention period is also configurable to allow even longer retention.

Returning to Figure 1-4, publishing transactions to a topic should be thought of quite differently from writing to disk. The important point about topics in a streaming platform is that when transactions are published into them, they are still considered streaming. Let’s use a water metaphor to explain this. When you consume water from a faucet, you consider it fresh water. The same goes for streaming platforms: when you consume transactions from a topic, they too are considered fresh. Conversely, if you bring a liter of water into your home and don’t drink it for a while, it becomes stale. Stale or stagnant water is susceptible to bacterial growth and is unclean. The stale liter of water is more akin to batched data.

On the other hand, if you haven’t used your faucets in over a month, the water coming out of them may contain rust or debris, meaning even faucet water is not always fresh. Streaming platforms have a mechanism that protects them from serving stale transactions: retention is applied to the topics, and transactions are purged after a retention period configured by the user of the streaming platform.

To recap, primary OLTP databases naturally write to a WAL when storing to spinning disks. WALs can be used to replicate data to a secondary OLTP database for disaster recovery scenarios. Streaming platforms like Kafka can be used to externalize the database WAL using partitions abstracted by topics to provide the transactions that were originally in the WAL to other systems. These systems subscribe to the topic so that they can build their replica of the tables in the original primary OLTP database just like the secondary OLTP database did (see Figure 1-5). Hence, streaming platforms can be used to make the WALs previously hidden in your OLTP database systems publicly available—becoming a tool for synchronizing your database systems across your entire organization.

Figure 1-5. The partitions in a topic can hold the transactions from a source OLTP database and publish the transactions for other systems to build replica tables

With a similar approach, we can build materialized views in stream processing platforms.

Materialized Views

In typical OLTP databases, materialized views are special types of database objects that store the results of a precomputed query or aggregation. Unlike regular views, which are virtual and dynamically generate their results based on the underlying data, materialized views store the actual data, making them physically stored in the database.

The purpose of materialized views is to improve the performance of complex queries or aggregations by precomputing and storing the results. When a query references a materialized view, the database can quickly retrieve the precomputed data from the materialized view instead of recalculating it from the base data tables. This can significantly reduce the query execution time and improve overall database performance, especially for large and resource-intensive queries.
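As a concrete but hypothetical illustration, a Postgres materialized view that precomputes sales totals per product might be defined as follows; the orders and product tables and their columns are assumptions made for this example:

-- Hypothetical example: precompute sales totals per product
CREATE MATERIALIZED VIEW product_sales AS
SELECT
    p.product_id,
    p.name,
    SUM(o.quantity)           AS units_sold,
    SUM(o.quantity * o.price) AS revenue
FROM product p
JOIN orders o ON o.product_id = p.product_id
GROUP BY p.product_id, p.name;

Queries against product_sales read the stored results directly instead of re-aggregating the orders table every time.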

Materialized views in databases usually need to be refreshed manually to keep the stored results fresh. Example 1-1 shows how to refresh a materialized view in Postgres, a popular OLTP database.

Example 1-1. Refreshing a materialized view in Postgres
REFRESH MATERIALIZED VIEW CONCURRENTLY product_sales;

If materialized views can be updated continuously, the stored data will always be fresh; that is, the stored data is real-time data. This characteristic makes materialized views fit naturally into streaming frameworks.

In the previous section, we saw that streaming platforms can hold transactions from OLTP WALs in partitions. Because these partitions mimic the WAL construct, other systems can build replicas of tables in the original OLTP database. The same approach can be applied in stream processors to build tabular structures (see Figure 1-6).

Figure 1-6. Replicas can also be built in stream processors the same way other systems can build replicas

We’ll talk more about stream processing in Chapter 2. We have also dedicated Chapter 3 to materialized views because of their substantial importance to streaming databases. To best explain streaming databases, it helps to set up a simple use case that we can follow to the end. Along the way, we’ll identify each system needed to accomplish the goal of the use case.

Use Case: Clickstream Analysis

Let’s start by defining a simple use case. This use case will help create a path to a better understanding of streaming databases, how they can resolve a real-time use case, and the advantages they bring when architecting a real-time solution.

Our use case will involve clickstream data. Clickstream data refers to a sequence of recorded events that capture the actions and interactions of users as they navigate through a website, application, or digital platform. It provides a detailed record of the clicks, page views, and other interactions performed by users during their online sessions.

Clickstream data can be used for various purposes, like personalization, targeted advertising, user segmentation, fraud detection, and conversion rate optimization. It plays a crucial role in web analytics, marketing analytics, user experience research, and other data-driven disciplines. In Figure 1-7, a customer clicks on a product, generating a click event that is captured by a microservice. That click event is sent to downstream analytical consumption.

Figure 1-7. A user clicks on a green T-shirt, generating a click event captured by a microservice

In our use case, a 24-year-old male customer who lives in Woodstock, NY, clicks on a green T-shirt using a phone application. Our goal is to provide clickstream data to end users so that they can perform analytics and derive insights that help make data-driven decisions.

Let’s say in this example, we want to capture click events and associate them with existing customers. This will help analytics to provide targeted marketing and to create a more personalized experience.

We call the data going into a WAL in an OLTP database transactions. We call the clicks we capture from the user-facing application in our use case events. Both will eventually end up in a streaming platform like Kafka so that we can join them together.

Understanding Transactions and Events

So far, we have called the data that originated from a database a transaction. These are the inserts, updates, and deletes that occurred, were written to the WAL, and were subsequently published to a topic in a streaming platform. We can also call these transactions change events, or just events. They are insert events, update events, and delete events, just as a click on an application is an event.

Even though they are both events, it’s extremely important to understand that they are still different types of events. One comes from changes to tables in a database, and the other comes from the actions taken on an application. To discern their differences, we’ll need to briefly go over domain-driven design.

Domain-Driven Design

In software, engineers will model their applications using objects that exist in their business domain. For example, if part of your business includes customers, then you will create an object in the application that represents a customer. You would do this for every object included in the business domain.

Let’s build a model that describes the objects in our use case. Customers and products are objects that would be part of the domain model that defines this application. These objects are called entities. Entities live in the OLTP database and undergo change events like inserts, updates, and deletes.
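A hypothetical relational model for these two entities might look like the following; the exact columns are assumptions made for illustration, but they are the attributes we’ll want later when enriching click events:

-- Hypothetical entity tables in the OLTP database
CREATE TABLE customer (
    customer_id BIGINT PRIMARY KEY,
    name        TEXT NOT NULL,
    age         INT,
    city        TEXT,
    ip_address  TEXT
);

CREATE TABLE product (
    product_id  BIGINT PRIMARY KEY,
    name        TEXT NOT NULL,
    category    TEXT,
    color       TEXT
);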

Events like click events capture the interactions between the entities in the application. In our example, a customer clicks on a product. The customer and product are the objects, and the action is the click of the product. This is represented in Figure 1-7.

We can use the structure of a sentence to describe this relationship. A sentence contains a subject, a verb, and an object. The subject is generally the entity carrying out an action. The verb describes the action. Lastly, the object is the entity upon which the action is being applied. In our use case, the sentence is:

The customer clicked on a product.

Click events tend to provide a lot more information, so we can expand this sentence with more description:

The customer with IP 111.11.1111 clicked on product 12345 on 07/21/2023 at
11:20 am Eastern time.

Notice that we don’t know the name of the customer or product, nor do we know the customer’s location or age. We also don’t know the product type or its color. We have to enrich the clickstream event with customer and product information before delivering it for analysis.

One question you could ask is, “Why can’t the click event also be stored in the database?” This is a valid question. Why not use the WAL to read the click events together with the entities? One major reason is that the OLTP database could run out of space. If you consider how many times a customer clicks on items in an application, it would not be sensible to store all that data in an OLTP database. Entities tend to change slowly and can be updated or deleted, whereas click events are immutable and are only ever inserted into a table. This pattern is also called append-only. Click events are better captured by a microservice that writes to a streaming platform directly.

Another difference to note is that the action event is the one being enriched, while the entity events are used to do the enriching. Knowing the difference between action events and entity change events will be significant throughout this book. Each type is handled differently as it flows through the streaming data pipeline until it is served to the end user.

Context Enrichment

All forms of analytical consumption need the context in which an event occurred. The click event, as noted earlier, contains only the information related to the click, not the customer or product information. Typically, entity information is not available at the time of the click event, and even if it were, collecting and enriching clickstream data inside the application would not be economical or scalable because of the size of the data and the latency it would add.

The better way to enrich the click event is to do it downstream, in a real-time data pipeline. Having this additional information helps make more informed decisions. For example, if the customer likes green shirts and is a male in his 20s, knowing that information enables smarter decisions and makes the application more personalized.

In our use case, the click event is associated with two other entities in the business domain: the customer and the clicked product. Combining these details of the entities with the click event will create a more compelling context needed for real-time analytics. Compelling analytics can tell us more about the event and how to quickly react to issues like deciding to increase the inventory of men’s green shirts.
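To make this concrete, here is a hedged sketch of what the enrichment could look like when expressed in SQL, as many stream processors and streaming databases covered in later chapters allow. The column names follow the hypothetical entity tables shown earlier, and the click_events relation is assumed to be fed from the click event topic:

-- Hypothetical enrichment: join each click event with the customer
-- and product entities to add the missing context
SELECT
    c.name    AS customer_name,
    c.age     AS customer_age,
    c.city    AS customer_city,
    p.name    AS product_name,
    p.color   AS product_color,
    e.clicked_at
FROM click_events e
JOIN customer c ON c.ip_address = e.customer_ip
JOIN product  p ON p.product_id = e.product_id;

In a batch database this query would run once and finish; in a stream processor it runs continuously, enriching every new click event as it arrives.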

We know entities that are part of the domain of the application exist in the OLTP database. We also know that changes to these entities are written to the WAL. But we did not talk about how the events in the WAL make it into a topic in a streaming platform, where other systems can consume change events and build their replica of the entities in the application. The replica will enable this enrichment of the click event with product and customer information in a stream processor downstream. The process of creating this replica is called change data capture.

Change Data Capture

Change data capture (CDC) is a technique used in databases and data integration systems to capture and track changes made to data in real time. The primary goal of CDC is to identify and capture change transactions (inserts, updates, or deletes) made to specific tables and make the change events available for consumption by downstream systems or processes.

When performing CDC, you can either subscribe to a stream of transactions as they are executed or capture a snapshot. Snapshots are not change events like those you would see in a WAL. In database terminology, a snapshot refers to a copy of a database (or a table in a database) taken at a particular point in time, just like taking a snapshot with a camera. Streams sourced from databases are more akin to compressed video, where each frame isn’t a full picture (or snapshot) but the pixel changes from one frame to the next.

Note

The type of video that provides only changes in each frame versus snapshots to save processing time is called delta encoding. Delta encoding is a video compression technique that stores only the differences between consecutive frames. This can significantly reduce the size of the video file while still preserving the original video quality.

CDC can be implemented in a few ways:

Listening to the WAL

This is the approach we’ve been discussing in this chapter and the preferred way of capturing changes in a database. It’s done in real time and is naturally streaming (a minimal Postgres sketch follows this list).

Note

The WAL approach to capturing change transactions is typically used by relational OLTP databases like PostgreSQL and MySQL. We talk about it because of how similar it is to the constructs that hold streams in streaming platforms. Some NoSQL transactional databases may not follow this approach but have some other mechanism to capture changes.

Comparing snapshots

This involves taking a snapshot of a table and comparing it to a previous snapshot to filter out the changes. This can be compute intensive, especially if the table is large. It is also not truly real time: snapshots are taken at intervals, so a change that is reverted between two snapshots is lost, and suspicious change-and-revert events might go undetected.

Comparing update timestamps

This approach saves the timestamp of the last batch of changes and filters for records with update timestamps that occur after it. It requires an update-timestamp column in the table that must be updated whenever a record changes. This approach is also not real time.
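As a hedged sketch of the WAL-listening approach, Postgres exposes its WAL through logical decoding. The slot name below is hypothetical, and the test_decoding plugin is used only because its output is human readable; in practice, a connector such as Debezium decodes the WAL with the pgoutput plugin and streams the change events into Kafka continuously.

-- Hypothetical Postgres logical decoding setup (requires wal_level = logical).
-- Create a replication slot that decodes WAL records into readable change events.
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- Peek at the change events the WAL has accumulated since the slot was created.
SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);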

Fortunately, most OLTP databases have some way of reading their WAL. Some OLTP databases also have native support for submitting events to streaming platforms or other systems. For example, CockroachDB provides a way to create a change feed from itself to:

  • Kafka

  • Google Cloud Pub/Sub

  • Cloud Storage (Amazon S3, Google Cloud Storage, Azure Storage)

  • Webhook

This avoids requiring a client to subscribe to CockroachDB’s WAL; instead, CockroachDB pushes change events to Kafka directly (see Example 1-2). This is a preferred pattern because it significantly reduces the architectural complexity of the streaming data pipeline.

Example 1-2. Creating a change feed from CockroachDB to Kafka
CREATE CHANGEFEED FOR TABLE customer, product INTO 'kafka://localhost:9092';

Having this feature natively in OLTP databases fundamentally brings them closer to streaming databases. We’ll discuss streaming databases in Chapter 5.

Warning

Even if you were Martin Kleppmann himself, Chapters 1 to 4 are critical reading before Chapter 5. Please don’t skip ahead because they provide foundational information supporting the introduction of streaming databases in Chapter 5.

As stated, this push mechanism reduces architectural complexity. Other OLTP databases that don’t have this feature require additional components called connectors to extract data and publish them into a topic in a streaming platform.

Connectors

In streaming, we distinguish two main types of connectors:

Source connectors

Source connectors read data from a data source system (e.g., a database) and make that data available as an event stream.

Sink connectors

Sink connectors consume data from an event stream and write that data to a sink system (a database again or a data warehouse, a data lake, etc.).

The two types of connectors are depicted in Figure 1-8. In most cases, source connectors transform data at rest into streaming data (aka data in motion), whereas sink connectors transform streaming data into data at rest.

Figure 1-8. Source connectors (top) and sink connectors (bottom)

By data at rest, we mean that the data is sitting in a database or a filesystem and not moving. Data at rest tends to get processed using batching or microbatching techniques.4 A dataset batched from one source system to another has a beginning and an end. The applications that process batched data can be started using a job scheduler like cron, and the data processing ends when the dataset ends.

This is the opposite of streaming, or data in motion. Data in motion implies that there is neither a beginning nor an end to the data. Applications that process streaming data are always running and listening for new data to arrive on the stream.

Now let’s dive into how source and sink connectors can be implemented.

Connector Middleware

Connector middleware solutions such as Kafka Connect, Meroxa, Striim, or StreamSets already provide a large number of connectors out of the box and are often extensible to serve further sources and sinks. Connector middlewares also offer horizontal scaling, monitoring, and other required features, especially for production deployments.

Kafka Connect is part of the Apache Kafka project. It is a distributed cluster in which Kafka connectors are deployed to run in parallel. These types of deployments add complexity to the streaming architecture: the clusters are bulky, and maintaining them is arduous.

If you have a large number of data sources and sinks, these clusters often become costly and consume a lot of resources. This integration is often better solved by embedding the connectors into the systems themselves.

Embedded

An increasing number of databases offer embedded connectors that push data to streaming platforms; as we stated earlier, CockroachDB is an example of this. An even larger set of databases has implemented embedded connectors that pull, that is, they can consume data off the event stream themselves. Examples are Apache Druid, Apache Pinot, ClickHouse, StarRocks, Apache Doris, and Rockset.
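As a hedged illustration of what an embedded connector can look like, ClickHouse ships a Kafka table engine that consumes a topic directly. The broker address, topic name, consumer group, and columns below are assumptions made for this example:

-- Hypothetical embedded connector: ClickHouse consuming the click event topic
CREATE TABLE click_events_stream
(
    customer_ip String,
    product_id  UInt64,
    clicked_at  DateTime
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list  = 'click-events',
    kafka_group_name  = 'clickhouse-consumer',
    kafka_format      = 'JSONEachRow';

No external connector cluster is involved; the database itself subscribes to the stream.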

As we stated, having databases solve the integration to streaming platforms gets them closer to becoming streaming databases. If you enable databases with the ability to pull and push data into streaming platforms, streaming will naturally become a first-class citizen in the database.

Custom-Built

Connectors can be custom-built, for example, by implementing a dedicated microservice. The advantage of this approach is its flexibility; the downside is clearly the need to “reinvent the wheel”—it often doesn’t make sense to implement connectors from scratch, especially in light of the plethora of existing powerful and scalable open source connectors (e.g., the Debezium source connectors for the Kafka Connect middleware).

In Figure 1-9, we have illustrated the three ways of implementing connectors (the illustration only shows source connectors for simplicity; the corresponding sink connector implementations would just be a mirror image of this illustration).

Figure 1-9. Ways of implementing connectors via connector middleware, built-in connector, or custom-built connector
Note

In the remainder of this book, we’ll abstract away from the actual implementation of the connectors. When we speak about a “connector,” this can be a connector based on a connector middleware, a built-in connector, or a custom-built one.

Back to our example use case. Here, we want to enrich click events with product and customer information. Most likely, this data resides in a transactional, or online transaction processing (OLTP), database. To make this data available as an event stream, we need to use a source connector for that database.

Note

An OLTP database is also called an operational database, and it refers to a type of database designed to handle a high volume of transactions. OLTP databases are designed to provide fast data access and updates, which is important for applications that require real-time data processing.

In Figure 1-10, you can see that the product and customer information is stored in an OLTP database. Two database source connectors read from this database and write the data to topics (“Product topic” and “Customer topic”). The click events are written to the “Click event topic.”

Figure 1-10. Customer and product data in the database

Summary

In this chapter, we introduced some foundational streaming concepts through Martin Kleppmann and his approach to turning the database inside out. In doing so, we identified two features that lay the foundation for streaming and stream processing: the database WAL and the materialized view.

We learned that the topics in streaming platforms are externalized database WALs for other systems to subscribe to. These other systems can then build replicas of tables from the source database using, for example, CDC (or other forms of connectors) and perform their processing of the real-time data.

In the next chapter, we’ll continue with the clickstream use case example and bring it to the next step—the stream processing platform where the enrichment will occur.

1 There are other types of recovery algorithms in addition to what we describe (roll forward/replay). Changes to the actual data could also be applied before the COMMIT is done (given that the WAL contains the old value), and uncommitted transactions could then be rolled back.

2 Transactions can span multiple records. In this case, the key for distributing the data onto partitions does not directly correspond to the primary key in the source database.

3 Again, for simplicity, our transaction here only holds one record.

4 It’s debatable whether microbatching is actually closer to streaming or batch.
