Chapter 1. Introduction

While the main focus of this book is the building of event-driven systems of different sizes, there is a deeper focus on software that spans many teams. This is the realm of service-oriented architectures: an idea that arose around the start of the century, where a company reconfigures itself around shared services that do commonly useful things.

This idea became quite popular. Amazon famously banned all intersystem communications by anything that wasn’t a service interface. Later, upstart Netflix went all in on microservices, and many other web-based startups followed suit. Enterprise companies did similar things, but often using messaging systems, which have a subtly different dynamic. Much was learned during this time, and there was significant progress made, but it wasn’t straightforward.

One lesson learned, which was pretty ubiquitous at the time, was that service-based approaches significantly increased the probability of you getting paged at 3 a.m., when one or more services go down. In hindsight, this shouldn’t have been surprising. If you take a set of largely independent applications and turn them into a web of highly connected ones, it doesn’t take too much effort to imagine that one important but flaky service can have far-reaching implications, and in the worst case bring the whole system to a halt. As Steve Yegge put it in his famous Amazon/Google post, “Organizing into services taught teams not to trust each other in most of the same ways they’re not supposed to trust external developers.”

What did work well for Amazon, though, was the element of organizational change that came from being wholeheartedly service based. Service teams think of their software as being a cog in a far larger machine. As Ian Robinson put it, “Be of the web, not behind the web.” This was a huge shift from the way people built applications previously, where intersystem communication was something teams reluctantly bolted on as an afterthought. But the services model made interaction a first-class entity. Suddenly your users weren’t just customers or businesspeople; they were other applications, and they really cared that your service was reliable. So applications became platforms, and building platforms is hard.

LinkedIn felt this pain as it evolved away from its original, monolithic Java application into 800–1,100 services. Complex dependencies led to instability, versioning issues caused painful lockstep releases, and early on, it wasn’t clear that the new architecture was actually an improvement.

One difference in the way LinkedIn evolved its approach was its use of a messaging system built in-house: Kafka. Kafka added an asynchronous publish-subscribe model to the architecture that enabled trillions of messages a day to be transported around the organization. This was important for a company in hypergrowth, as it allowed new applications to be plugged in without disturbing the fragile web of synchronous interactions that drove the frontend.

But this idea of rearchitecting a system around events isn’t new—event-driven architectures have been around for decades, and technologies like enterprise messaging are big business, particularly with (unsurprisingly) enterprise companies. Most enterprises have been around for a long time, and their systems have grown organically, over many iterations or through acquisition. Messaging systems naturally fit these complex and disconnected worlds for the same reasons observed at LinkedIn: events decouple, and this means different parts of the company can operate independently of one another. It also means it’s easier to plug new systems into the real time stream of events.

A good example is the regulation that hit the finance industry in January 2018, which states that trading activity has to be reported to a regulator within one minute of it happening. A minute may seem like a long time in computing terms, but it takes only one batch-driven system, on the critical path in one business silo, for that to be unattainable. So the banks that had gone to the effort of installing real-time trade eventing, and plumbed it across all their product-aligned silos, made short work of these regulations. For the majority that hadn’t it was a significant effort, typically resulting in half-hearted, hacky solutions.

So enterprise companies start out complex and disconnected: many separate, asynchronous islands—often with users of their own—operating independently of one another for the most part. Internet companies are different, starting life as simple, front-facing web applications where users click buttons and expect things to happen. Most start as monoliths and stay that way for some time (arguably for longer than they should). But as internet companies grow and their business gets more complex, they see a similar shift to asynchronicity. New teams and departments are introduced and they need to operate independently, freed from the synchronous bonds that tie the frontend. So ubiquitous desires for online utilities, like making a payment or updating a shopping basket, are slowly replaced by a growing need for datasets that can be used, and evolved, without any specific application lock-in.

But messaging is no panacea. Enterprise service buses (ESBs), for example, have vocal detractors and traditional messaging systems have a number of issues of their own. They are often used to move data around an organization, but the absence of any notion of history limits their value. So, even though recent events typically have more value than old ones, business operations still need historical data—whether it’s users wanting to query their account history, some service needing a list of customers, or analytics that need to be run for a management report.

On the other hand, data services with HTTP-fronted interfaces make lookups simple. Anyone can reach in and run a query. But they don’t make it so easy to move data around. To extract a dataset you end up running a query, then periodically polling the service for changes. This is a bit of a hack, and typically the operators in charge of the service you’re polling won’t thank you for it.

But replayable logs, like Kafka, can play the role of an event store: a middle ground between a messaging system and a database. (If you don’t know Kafka, don’t worry—we dive into it in Chapter 4.) Replayable logs decouple services from one another, much like a messaging system does, but they also provide a central point of storage that is fault-tolerant and scalable—a shared source of truth that any application can fall back to.

A shared source of truth turns out to be a surprisingly useful thing. Microservices, for example, don’t share their databases with one another (referred to as the IntegrationDatabase antipattern). There is a good reason for this: databases have very rich APIs that are wonderfully useful on their own, but when widely shared they make it hard to work out if and how one application is going to affect others, be it data couplings, contention, or load. But the business facts that services do choose to share are the most important facts of all. They are the truth that the rest of the business is built on. Pat Helland called out this distinction back in 2006, denoting it “data on the outside.”

But a replayable log provides a far more suitable place to hold this kind of data because (somewhat counterintuitively) you can’t query it! It is purely about storing data and pushing it to somewhere new. This idea of pure data movement is important, because data on the outside—the data services share—is the most tightly coupled of all, and the more services an ecosystem has, the more tightly coupled this data gets. The solution is to move data somewhere that is more loosely coupled, so that means moving it into your application where you can manipulate it to your heart’s content. So data movement gives applications a level of operability and control that is unachievable with a direct, runtime dependency. This idea of retaining control turns out to be important—it’s the same reason the shared database pattern doesn’t work out well in practice.

So, this replayable log–based approach has two primary benefits. First, it makes it easy to react to events that are happening now, with a toolset specifically designed for manipulating them. Second, it provides a central repository that can push whole datasets to wherever they may be needed. This is pretty useful if you run a global business with datacenters spread around the world, need to bootstrap or prototype a new project quickly, do some ad hoc data exploration, or build a complex service ecosystem that can evolve freely and independently.

So there are some clear advantages to the event-driven approach (and there are of course advantages for the REST/RPC models too). But this is, in fact, only half the story. Streaming isn’t simply an alternative to RPCs that happens to work better for highly connected use cases; it’s a far more fundamental change in mindset that involves rethinking your business as an evolving stream of data, and your services as functions that transform these streams of data into something new.

This can feel unnatural. Many of us have been brought up with programming styles where we ask questions or issue commands and wait for answers. This is how procedural or object-oriented programs work, but the biggest culprit is probably the database. For nearly half a century databases have played a central role in system design, shaping—more than any other tool—the way we write (and think about) programs. This has been, in some ways, unfortunate.

As we move from chapter to chapter, this book builds up a subtly different approach to dealing with data, one where the database is taken apart, unbundled, deconstructed, and turned inside out. These concepts may sound strange or even novel, but they are, like many things in software, evolutions of older ideas that have arisen somewhat independently in various technology subcultures. For some time now, mainstream programmers have used event-driven architectures, Event Sourcing, and CQRS (Command Query Responsibility Segregation) as a means to break away from the pains of scaling database-centric systems. The big data space encountered similar issues as multiterabyte-sized datasets highlighted the inherent impracticalities of batch-driven data management, which in turn led to a pivot toward streaming. The functional world has sat aside, somewhat knowingly, periodically tugging at the imperative views of the masses.

But these disparate progressions—turning the database inside out, destructuring, CQRS, unbundling—all have one thing in common. They are all simple metaphors for the need to separate the conflation of concepts embedded into every database we use, to decouple them so that we can manage them separately and hence efficiently.

There are a number of reasons for wanting to do this, but maybe the most important of all is that it lets us build larger and more functionally diverse systems. So while a database-centric approach works wonderfully for individual applications, we don’t live in a world of individual applications. We live in a world of interconnected systems—individual components that, while all valuable in themselves, are really part of a much larger puzzle. We need a mechanism for sharing data that complements this complex, interconnected world. Events lead us to this. They constantly push data into our applications. These applications react, blending streams together, building views, changing state, and moving themselves forward. In the streaming model there is no shared database. The database is the event stream, and the application simply molds it into something new.

In fairness, streaming systems still have database-like attributes such as tables (for lookups) and transactions (for atomicity), but the approach has a radically different feel, more akin to functional or dataflow languages (and there is much cross-pollination between the streaming and functional programming communities).

So when it comes to data, we should be unequivocal about the shared facts of our system. They are the very essence of our business, after all. Facts may be evolved over time, applied in different ways, or even recast to different contexts, but they should always tie back to a single thread of irrevocable truth, one from which all others are derived—a central nervous system that underlies and drives every modern digital business.

This book looks quite specifically at the application of Apache Kafka to this problem. In Part I we introduce streaming and take a look at how Kafka works. Part II focuses on the patterns and techniques needed to build event-driven programs: Event Sourcing, Event Collaboration, CQRS, and more. Part III takes these ideas a step further, applying them in the context of multiteam systems, including microservices and SOA, with a focus on event streams as a source of truth and the aforementioned idea that both systems and companies can be reimagined as a database turned inside out. In the final part, we take a slightly more practical focus, building a small streaming system using Kafka Streams (and KSQL).

Get Designing Event-Driven Systems now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.