Chapter 4. Decomposing the Database
As we’ve already explored, there are a host of ways to extract functionality into microservices. However, we need to address the elephant in the room: namely, what do we do about our data? Microservices work best when we practice information hiding, which in turn typically leads us toward microservices totally encapsulating their own data storage and retrieval mechanisms. This leads us to the conclusion that when migrating toward a microservice architecture, we need to split our monolith’s database apart if we want to get the best out of the transition.
Splitting a database apart is far from a simple endeavor, however. We need to consider issues of data synchronization during transition, logical versus physical schema decomposition, transactional integrity, joins, latency, and more. Throughout this chapter, we’ll take a look at these issues and explore patterns that can help us.
Before we start with splitting things apart, though, we should look at the challenges—and coping patterns—for managing a single shared database.
But It Can’t Be Done!
So, ideally, we want our new services to have their own independent schemas. But that’s not where we start with an existing monolithic system. Does that mean we should always split these schemas apart? I remain convinced that in most situations this is the appropriate thing to do, but it isn’t always feasible initially.
Sometimes, as we’ll explore shortly, the work involved will take too long, or involves changing especially sensitive parts of the system. In such cases, it can be useful to use a variety of coping patterns that will at the very least stop things from getting any worse, and at best can be sensible stepping stones toward something better in the future.
You’re going to encounter problems with your current system that seem impossible to deal with right now. Address the problem with the rest of your team so that everyone can come to an agreement that this is a problem you’d like to fix, even if you can’t see how right now. Then make sure you at least start doing the right thing now. Over time, problems that initially seemed insurmountable will become easier to deal with once you have some new skills and experience.
Pattern: Database View
In a situation where we want a single source of data for multiple services, a view can be used to mitigate the concerns regarding coupling. With a view, a service can be presented with a schema that is a limited projection from an underlying schema. This projection can limit the data that is visible to the service, hiding information it shouldn’t have access to.
The Database as a Public Contract
Back in Chapter 3, I discussed my experiences in helping re-platform an existing credit derivative system for a now defunct investment bank. We hit the issue of database coupling in a big way: we had a need to increase the throughput of the system in order to provide faster feedback to the traders who used the system. After some analysis, we found that the bottleneck in the processing was the writes being made into our database. After a quick spike, we realized we could drastically increase the write performance of our system if we restructured the schema.
It was at this point we found that multiple applications outside our control had read access to our database, and in some cases read/write access. Unfortunately, we found that all these external systems had been given the same username and password credentials, so it was impossible to understand who these users were, or what they were accessing. We had an estimate of “over 20” applications as being involved, but that was derived from some basic analysis of the inbound network calls.1
If each actor (e.g., a human or an external system) has a different set of credentials, it becomes much easier to restrict access to certain parties, reduce the impact of revoking and rotating credentials, and better understand what each actor is doing. Managing different sets of credentials can be painful, especially in a microservice system that may have multiple sets of credentials to manage per service. I like the use of dedicated secret stores to solve this problem. HashiCorp’s Vault is an excellent tool in this space, as it can generate per-actor credentials for things like databases that can be short lived and limited in scope.
So we didn’t know who these users were, but we had to contact them. Eventually, someone had the idea of disabling the shared account they were using, and waiting for people to contact us to complain. This is clearly not a great solution to a problem we shouldn’t have had in the first place, but it worked—mostly. However, we soon realized that most of these applications weren’t undergoing active maintenance, meaning there was no chance that they would be updated to reflect a new schema design.2 In effect, our database schema had become a public-facing contract that couldn’t change: we had to maintain that schema structure going forward.
Views to Present
Our solution was to first resolve those situations where external systems were writing into our schema. Luckily, in our case they were easy to resolve. For all those clients who wanted to read data, though, we created a dedicated schema hosting views that looked like the old schema, and had clients point at that schema instead, as Figure 4-3 shows. That allowed us to make changes in our own schema, as long as we could maintain the view. Let’s just say that lots of stored procedures were involved.
In our investment banking example, the view and the underlying schema ended up differing a fair amount. You can, of course, use a view much more simply, perhaps to hide pieces of information you don’t want made visible to outside parties. As a simple example, in Figure 4-4, our loyalty service just was a list of loyalty cards in our system. Presently, this information is stored in our customer table as a column. So we define a view that exposes just the customer ID and the loyalty ID mapping in a single table, without exposing any other information in the customer table. Likewise, any other tables that may be in the monolith’s database are entirely hidden from the Loyalty service.
The ability of a view to project only limited information from the underlying source allows us to implement a form of information hiding. It gives us control over what is shared, and what is hidden. This is not a perfect solution, however-there are restrictions with this approach.
Depending on the nature of the database, you may have the option to create a materialized view. With a materialized view, the view is precomputed—typically, through the use of a cache. This means a read from a view doesn’t need to generate a read on the underlying schema, which can improve performance. The trade-off then is around how this pre-computed view is updated; it may well mean you could be reading a “stale” set of data from the view.
How views are implemented can vary, but typically they are the result of a query. This means that the view itself is read-only. This immediately limits their usefulness. In addition, while this is a common feature for relational databases, and many of the more mature NoSQL databases support views (both Cassandra and Mongo do, for example), not all do. Even if your database engine does support views, there will likely be other limitations, such as the need for both the source schema and view to be in the same database engine. This could increase your physical deployment coupling, leading to a potential single point of failure.
It’s worth noting that changes to the underlying source schema may require the view to be updated, and so careful consideration should be given to who “owns” the view. I suggest considering any published database views to be akin to any other service interface, and therefore something that should be kept up-to-date by the team looking after the source schema.
Where to Use It
I typically make use of a database view in situations where I think it is impractical to decompose the existing monolithic schema. Ideally, you should try to avoid the need for a view if possible, if the end goal is to expose this information via a service interface. Instead, it’s better to push forward with proper schema decomposition. The limitations of this technique can be significant. Nonetheless, if you feel that the effort of full decomposition is too great, then this can be a step in the right direction.
Pattern: Database Wrapping Service
Sometimes, when something is too hard to deal with, hiding the mess can make sense. With the database wrapping service pattern, we do exactly that: hide the database behind a service that acts as a thin wrapper, moving database dependencies to become service dependencies, as we see in Figure 4-5.
I was working at a large bank in Australia a few years ago on a short engagement to help one part of the organization implement an improved path to production. On the first day, we set up a few interviews with key people to understand the challenges they were facing and to build up an overview of the current process. Between meetings, someone came up and introduced themselves as the head DBA for that area of the company. “Please” he said, “Stop them from putting things into the database!”
We grabbed a coffee, and the DBA laid out the problems. Over something like a 30-year period, a business banking system, one of the crown jewels of the organization, had taken shape. One of the more important parts of this system was managing what they called “entitlements.” In business banking, managing which individuals can access which accounts, and working out what they are allowed to do with those accounts, was very complex. To give you an idea of what these entitlements might look like, consider a bookkeeper who is allowed to view the accounts for companies A, B, and C, but for company B they can transfer up to $500 between accounts, and for company C they can make unlimited transfers between accounts but also withdraw up to $250. The maintenance and application of these entitlements were managed almost exclusively in stored procedures in the database. All data access was gated through this entitlement logic.
As the bank had scaled, and the amount of logic and state had grown, the database had started to buckle under the load. “We’ve given all the money it’s possible to give to Oracle, and it’s still not enough.” The worry was that given projected growth, and even counting for the improvements in performance of hardware, they would eventually get to a place where the needs of the organization would outstrip the capabilities of the database.
As we explored the problem further, we discussed the idea of splitting out parts of the schema to reduce load. The issue was that the tangled nest in the middle of all of this was this entitlements system. It would be a nightmare to try to untangle, and the risks associated with making mistakes in this area were huge: make a wrong step, and someone could be blocked from accessing their accounts, or worse still, someone could gain access to your money who shouldn’t.
We came up with a plan to try to resolve the situation. We accepted that in the near term, we weren’t going to be able to make changes to the entitlements system, so it was imperative that we at least not make the problem any worse. So we needed to stop people from putting more data and behavior into the entitlements schema. Once this was in place, we could consider removing those parts of the entitlements schema that were easier to extract, hopefully reducing the load enough that the concerns about the long-term viability were reduced. That would then give some breathing room to consider the next steps.
We discussed introducing a new Entitlements service, which would allow us to “hide” the problematic schema. This service would have very little behavior at first, as the current database already had implemented a lot of behavior in the form of stored procedures. But the goal was to encourage the teams writing the other applications to think of the entitlements schema as someone else’s, and encourage them to store their own data locally, as we see in Figure 4-6.
Just as with our use of database views, the use of a wrapping service allows us to control what is shared and what is hidden. It presents an interface to consumers that can be fixed, while changes are made under the hood to improve the situation.
Where to Use It
This pattern works really well where the underlying schema is just too hard to consider pulling apart. By placing an explicit wrapper around the schema, and making it clear that the data can be accessed only through that schema, you at the very least can put a brake on the database growing any further. It clearly delineates what is “yours” versus what is “someone else’s.” I think this approach also works best when you align ownership of both the underlying schema and the service layer to the same team. The service API needs to be properly embraced as a managed interface with appropriate oversight over how this API layer changes. This approach also has benefits for the upstream applications, as they can more easily understand how they are using the downstream schema. This makes activities like stubbing for test purposes much more manageable.
This pattern has advantages over the use of a simple database view. First, you aren’t constrained to presenting a view that can be mapped to existing table structures; you can write code in your wrapping service to present much more sophisticated projections on the underlying data. The wrapping service can also take writes (via API calls). Of course, adopting this pattern does require upstream consumers to make changes; they have to shift from direct DB access to API calls.
Ideally, using this pattern would be a stepping stone to more fundamental changes, giving you time to break apart the schema underneath your API layer. It could be argued we’re just putting a bandage over the problem, rather than addressing the underling issues. Nonetheless, in the spirit of making incremental improvement, I think this pattern has a lot going for it.
Pattern: Database-as-a-Service Interface
Sometimes, clients just need a database to query. It could be because they need to query or fetch large amounts of data, or perhaps because external parties are already using tool chains that require a SQL endpoint to work against (think about tools like Tableau, which are often used to gain insights into business metrics). In these situations, allowing clients to view data that your service manages in a database can make sense, but we should take care to separate the database we expose from the database we use inside our service boundary.
One approach I have seen work well is to create a dedicated database designed to be exposed as a read-only endpoint, and have this database populated when the data in the underlying database changes. In effect, in the same way that a service could expose a stream of events as one endpoint, and a synchronous API as another endpoint, it could also expose a database to external consumers. In Figure 4-7, we see an example of the Orders service, which exposes a read/write endpoint via an API, and a database as a read-only interface. A mapping engine takes changes in the internal database, and works out what changes need to be made in the external database.
The mapping engine could ignore the changes entirely, expose the change directly, or something in between. The key thing is that the mapping engine acts as an abstraction layer between the internal and external databases. When our internal database changes structure, the mapping engine will need to change to ensure that the public-facing database remains consistent. In virtually all cases, the mapping engine will lag behind writes made to the internal database; typically, the choice of mapping engine implementation will determine this lag. Clients reading from the exposed database need to understand that they are therefore seeing potentially stale data, and you may find it appropriate to programmatically expose information regarding when the external database was last updated.
Implementing a Mapping Engine
The detail here is in working out how to update—namely, how you implement the mapping engine. We’ve already looked at a change data capture system, which would be an excellent choice here. In fact, that is likely to be the most robust solution while also providing the most up-to-date view. Another option would be to have a batch process just copy the data over, although this can be problematic as it is likely to lead to a longer lag between internal and external databases, and determining which data needs to be copied over can be difficult with some schemas. A third option could be to listen to events fired from the service in question, and use that to update the external database.
In the past, I would have used a batch job to handle this. Nowadays, though, I’d probably utilize a dedicated change data capture system, perhaps something like Debezium. I’ve been bitten too many times by batch processes not running or taking too long to run. With the world moving away from batch jobs, and wanting data faster, batch is giving way to real time. Getting a change data capture system in place to handle this makes sense, especially if you are considering using it to expose events outside your service boundary.
Compared to Views
This pattern is more sophisticated than a simple database view. Database views are typically tied to a particular technology stack: if I want to present a view of an Oracle database, both the underlying database and the schema hosting the views both run on Oracle. With this approach, the database we expose can be a totally different technology stack. We could use Cassandra inside our service, but present a traditional SQL database as a public-facing endpoint.
This pattern gives more flexibility than database views, but at an added cost. If the needs of your consumers can be satisfied with a simple database view, this is likely to be less work to implement in the first instance. Just be aware that this may place restrictions on how this interface can evolve. You could start with the use of a database view and consider a shift to a dedicated reporting database later on.
Where to Use It
Obviously, as the database that is exposed as an endpoint is read-only, this is useful only for clients who need read-only access. It fits reporting use cases very well—situations where your clients may need to join across large amounts of data that a given service holds. This idea could be extended to then import this database’s data into a larger data warehouse, allowing for data from multiple services to be queried. I discuss this in more detail in Chapter 5 of Building Microservices.
Don’t underestimate the work required to ensure that this external database projection is kept properly up-to-date. Depending on how your current service is implemented, this might be a complex undertaking.
So far, we’ve really not tackled the underlying problem. We’ve just put a variety of different bandages on a big, shared database. Before we start considering the tricky task of pulling data out of the giant monolithic database, we need to consider where the data in question should actually live. As you split services out from the monolith, some of the data should come with you—and some of it should stay where it is.
If we embrace the idea of a microservice encapsulating the logic associated with one or more aggregates, we also need to move the management of their state and associated data into the microservice’s own schema. On the other hand, if our new microservice needs to interact with an aggregate that is still owned by the monolith, we need to expose this capability via a well-defined interface. Let’s look at these two options now.
Pattern: Aggregate Exposing Monolith
In Figure 4-8, our new Invoicing service needs to access a variety of information that isn’t directly related to managing invoicing. At the very least, it needs information on our current Employees to manage approval workflows. This data is currently all inside the monolith database. By exposing information about our Employees via a service endpoint (it could be an API or a stream of events) on the monolith itself, we make explicit what information the Invoice service needs.
We want to think of our microservices as combinations of behavior and state; I’ve already discussed the idea of thinking of microservices as containing one or more state machines that manage domain aggregates. When exposing an aggregate from the monolith, we want to think in the same terms. The monolith still “owns” the concept of what is and isn’t an allowable change in state; we don’t want to treat it just like a wrapper around a database.
Beyond just exposing data, we’re exposing operations that allow external parties to query the current state of an aggregate, and to make requests for new state transitions. We can still decide to restrict what state of an aggregate is exposed from our service boundary and to limit what state transition operations can be requested from the outside.
As a pathway to more services
By defining the needs of the Invoice service, and explicitly exposing the information needed in a well-defined interface, we’re on a path to potentially discovering future service boundaries. In this example, an obvious step might be to next extract an Employee service, as we see in Figure 4-9. By exposing an API to employee-related data, we’ve already gone a long way to understanding what the needs of the consumers of the new Employee service might be.
Of course, if we do extract those employees from the monolith, and the monolith needs that employee data, it may need to be changed to use this new service!
Where to use it
When the data you want to access is still “owned” by the database, this pattern works well to allow your new services the access they need. When extracting services, having the new service call back to the monolith to access the data it needs is likely little more work than directly accessing the database of the monolith—but in the long term is a much better idea. I’d consider using a database view over this approach only if the monolith in question cannot be changed to expose these new endpoints. In such cases, a database view on the monolith’s database could work, as could the previously discussed change data capture pattern (see the section “Pattern: Change Data Capture”), or creating a dedicated database wrapping service pattern (see the section “Pattern: Database Wrapping Service”) on top of the monolith’s schema, exposing the Employee information we want.
Where to use it
When the data you want to access is still “owned” by the database, this pattern works well to allow your new services the access they need. When extracting services, having the new service call back to the monolith to access the data it needs is likely little more work than directly accessing the database of the monolith—but in the long term is a much better idea. I’d consider using a database view over this approach only if the monolith in question cannot be changed to expose these new endpoints. In such cases, a database view on the monolith’s database could work, as could the previously discussed change data capture pattern (see the section “Pattern: Change Data Capture”), or creating a dedicated database wrapping service pattern (see the section “Pattern: Database Wrapping Service”) on top of the monolith’s schema, exposing the Employee information we want.
Pattern: Change Data Ownership
We’ve looked at what happens when our new Invoice service needs to access data that is owned by other functionality, as in the previous section, where we needed to access Employee data. However, what happens when we consider data that is currently in the monolith that should be under the control of our newly extracted service?
In Figure 4-10, we outline the change that needs to happen. We need to move our invoice-related data out of the monolith and into our new Invoice, as that is where the life cycle of the data should be managed. We’d then need to change the monolith to treat the Invoice service as the source of truth for invoice-related data, and change it such that it called out to the Invoice service endpoint to read the data or request changes.
Untangling the invoicing data from the existing monolithic database can be a complex problem, however. We may have to consider the impact of breaking foreign-key constraints, breaking transactional boundaries, and more—all topics we’ll be coming back to later in this chapter. If the monolith can be changed such that it needs only read access to Invoice-related data, you could consider projecting a view from the Invoice service’s database, as Figure 4-11 shows. All the limitations of database views will apply, however; changing the monolith to make calls to the new Invoice service directly is greatly preferred.
Where to use it
This one is a little more clear-cut. If your newly extracted service encapsulates the business logic that changes some data, that data should be under the new service’s control. The data should be moved from where it is, over into the new service. Of course, the process of moving data out of an existing database is far from a simple process. In fact, this will be the focus of the remainder of this chapter.
As we discussed in Chapter 3, one of the benefits of something like a strangler fig pattern is that when we switch over to the new service, we can then switch back if there is an issue. The problem occurs when the service in question manages data that will need to be kept in sync between both the monolith and the new service.
In Figure 4-12, we see an example of this. We are in the process of switching over to a new Invoice service. But the new service, and the existing equivalent code in the monolith also manages this data. To maintain the ability to switch between implementations, we need to ensure that both sets of code can see the same data, and that this data can be maintained in a consistent way.
So what are our options here? Well, first, we need to consider the degree to which the data needs to be consistent between the two views. If either set of code needs to always see a totally consistent view of invoice data, one of the most straightforward approaches would be to ensure the data is kept in one place. This would lead us toward probably having our new Invoice service read its data directly from the monolith for a short space of time, perhaps making use of a view, as we explored in the section “Pattern: Database View”. Once we are happy that the switch has been successful, we can then migrate the data, as we discussed earlier in the section “Pattern: Change Data Ownership”. However, the concerns about using a shared database cannot be overstated: you should consider this only as a very short-term measure, as part of a more complete extraction; leaving a shared database in place for too long can lead to significant long-term pain.
If we were doing a big-bang switchover (something I’d try to avoid), migrating both the application code and the data at the same time, we could use a batch process to copy the data over in advance of switching to the new microservice. Once the invoice-related data has been copied over into our new microservice, it can start serving traffic. However, what happens if we need to fall back to using the functionality in the existing monolithic system? Data changed in the microservices’ schema will not be reflected in the state of the monolithic database, so we could end up losing state.
Another approach could be to consider keeping the two databases in sync via our code. So we would have either the monolith or the new Invoice service make writes to both databases. This involves some careful thought.
Pattern: Synchronize Data in Application
Switching data from one location to another can be a complex undertaking at the best of times, but it can be even more fraught the more valuable the data is. When you start thinking about looking after medical records, thinking carefully about how you migrate data is even more important.
Several years ago, the consultancy Trifork was involved in a project to help store a consolidated view of Danish citizens’ medical records.3 The initial version of this system had stored the data in a MySQL database, but over time it became clear that this may not be suitable for the challenges the system would face. A decision was made to use an alternative database, Riak. The hope was that Riak would allow the system to better scale to handle expected load, but would also offer improved resiliency characteristics.
An existing system stored data in one database, but there were limits to how long the system could be offline, and it was vital that data wasn’t lost. So a solution was needed that allowed the company to move the data to a new database, but also build in mechanisms to verify the migration, and have fast rollback mechanisms along the way.
The decision was made that the application itself would perform the synchronization between the two data sources. The idea is that initially the existing MySQL database would remain the source of truth, but for a period of time the application would ensure that data in MySQL and Riak were kept in sync. After a period of time, Riak would move to being the source of truth for the application, prior to MySQL being retired. Let’s look at this process in a bit more detail.
Step 1: Bulk Synchronize Data
The first step is to get to the point where you have a copy of the data in the new database. For the medical record project, this involved doing a batch migration of data from the old system into the new Riak database. While the batch import was going on, the existing system was kept running, so the source of data for the import was a snapshot of data taken from the existing MySQL system (Figure 4-13). This causes a challenge, as when the batch import finishes, the data in the source system could well have changed. In this case, however, it wasn’t practical to take the source system offline.
Once the batch import completed, a change data capture process was implemented whereby changes since the import could be applied. This allowed Riak to be brought in sync. Once this was achieved, it was time to deploy the new version of the application.
Step 2: Synchronize on Write, Read from Old Schema
With both databases now in sync, a new version of the application was deployed that would write all data to both databases, as we see in Figure 4-14. At this stage, the goal was to ensure that the application was correctly writing to both sources and make sure that Riak was behaving within acceptable tolerances. By still reading all data from MySQL, this ensured that even if Riak fell over in a heap, data could still be retrieved from the existing MySQL database.
Only once enough confidence had been built up in the new Riak system did they move to the next step.
Step 3: Synchronize on Write, Read from New Schema
At this stage, it’s been verified that reads to Riak are working fine. The last step is to make sure that reads work too. A simple change to the application now has Riak as being the source of truth, as we see in Figure 4-15. Note that we still write to both databases, so if there is an issue, you have a fallback option.
Once the new system has bedded in enough, the old schema could be safely removed.
Where to Use This Pattern
With the Danish medical record system, we had a single application to deal with. But we’ve been talking about situations where we are looking to split out microservices. So does this pattern really help? The first thing to consider is that this pattern may make a lot of sense if you want to split the schema before splitting out the application code. In Figure 4-16, we see exactly such a situation, where we duplicate the invoice-related data first.
If implemented correctly, both data sources should always be in sync, offering us significant benefits in situations where we need fast switching between sources for rollback scenarios, etc. The use of this pattern in the example of the Danish medical records system seems sensible because of the inability to take the application offline for any length of time.
Where to Use It
Now you could consider using this pattern where you have both your monolith and microservice accessing the data, but this gets extremely complicated. In Figure 4-17, we have such a situation. Both the monolith and microservice have to ensure proper synchronization across the databases for this pattern to work. If either one makes a mistake, you could be in trouble. This complexity is greatly mitigated if you can be sure that at any point in time either the Invoice service is making writes, or the monolith’s Invoice functionality is—which would work well if using a simple switchover technique, as we discussed with the strangler fig pattern. If, however, requests could hit either the monolith’s Invoice functionality or the new Invoice functionality, perhaps as part of a canary, then you may not want to use this pattern, as the resulting synchronization will be tricky.
Pattern: Tracer Write
The tracer write pattern, outlined in the section Figure 4-18, is arguably a variation of the synchronize data in application pattern (see the section “Pattern: Synchronize Data in Application”. With a tracer write, we move the source of truth for data in an incremental fashion, tolerating there being two sources of truth during the migration. You identify a new service that will host the relocated data. The current system still maintains a record of this data locally, but when making changes also ensures this data is written to the new service via its service interface. Existing code can be changed to start accessing the new service, and once all functionality is using the new service as the source of truth, the old source of truth can be retired. Careful consideration needs to be given regarding how data is synchronized between the two sources of truth.
Wanting a single source of truth is a totally rational desire. It allows us to ensure consistency of data, to control access to that data, and can reduce maintenance costs. The problem is that if we insist on only ever having one source of truth for a piece of data, then we are forced into a situation that changing where this data lives becomes a single big switchover. Before the release, the monolith is the source of truth. After the release, our new microservice is the source of truth. The issue is that various things can go wrong during this change over. A pattern like the tracer write allows for a phased switchover, reducing the impact of each release, in exchange for being more tolerant of having more than one source of truth.
The reason this pattern is called a tracer write is that you can start with a small set of data being synchronized and increase this over time, while also increasing the number of consumers of the new source of data. If we take the example outlined in Figure 4-12, where invoice-related data was being moved from the monolith over to our new Invoice microservice, we could first synchronize the basic invoice data, then migrate the contact information for the invoice, and finally synchronize payment records, as outlined in Figure 4-19.
Other services that wanted invoice-related information would have a choice to source this from either the monolith or the new service itself, depending on what information they need. If they still needed information available only in the monolith, they would have to wait until that data and the supporting functionality was moved. Once the data and functionality are available in the new microservice, the consumers can switch to the new source of truth.
The goal in our example is to migrate all consumers to use the Invoice service, including the monolith itself. In Figure 4-20, we see an example of a couple of stages during the migration. Initially, we’re writing only basic invoice information to both sources of truth. Once we’ve established that this information is being properly synchronized, the monolith can start to read its data from the new service. As more data is synchronized, the monolith can use the new service as a source of truth for more and more of the data. Once all the data is synchronized, and the last consumer of the old source of truth has been switched over, we can stop synchronizing the data.
- Write to one source
All writes are sent to one of the sources of truth. Data is synchronized to the other source of truth after the write occurs.
- Send writes to both sources
All write requests made by upstream clients are sent to both sources of truth. This occurs by making sure the client makes a call to each source of truth itself, or by relying on an intermediary to broadcast the request to each downstream service.
- Seed writes to either source
Clients can send write requests to either source of truth, and behind the scenes the data is synchronized in a two-way fashion between the systems.
The two separate options of sending writes to both sources of truth, or sending to one source of truth and relying on some form of background synchronization, seem like workable solutions, and the example we’ll explore in a moment uses both of these techniques. However, although it’s technically an option, this situation—where writes are made to either one source of truth or the other—should be avoided, as it requires two-way synchronization (something that can be very difficult to achieve).
In all of these cases, there will be some delay in the data being consistent in both sources of truth. The duration of this window of inconsistency will depend on several factors. For example, if you use a nightly batch process to copy updates from one source to another, the second source of truth could contain data that is up to 24 hours out-of-date. If you are constantly streaming updates from one system to another, perhaps using a change data capture system, the windows of inconsistency could be measured in seconds or less.
However long this window of inconsistency is, such synchronization gives us what is called eventual consistency—eventually, both sources of truth will have the same data. You will have to understand what period of inconsistency is appropriate in your case, and use that to drive how you implement the synchronization.
It’s important that when maintaining two sources of truth like this that you have some kind of reconciliation process to ensure that the synchronization is working as intended. This may be something as simple as a couple of SQL queries you can run against each database. But without checking that the synchronization is working as expected, you may end up with inconsistencies between the two systems and not realize it until it is too late. Running your new source of truth for a period of time when it has no consumers until you are satisfied with how things are working—which, as we’ll explore in the next section, is something that Square did, for example—is very sensible.
Example: Orders at Square
This pattern was originally shared with me by Derek Hammer, a developer at Square, and since then I’ve found other examples of this pattern being used in the wild.4 He detailed its usage to help untangle part of Square’s domain related to ordering take-out food for delivery. In the initial system, a single Order concept was used to manage multiple workflows: one for customers ordering food, another for the restaurant preparing the food, and a third workflow-managed state related to delivery drivers picking up the food and dropping it off to customers. The needs of the three stakeholders are different, and although all these stakeholders work with the same Order, what that Order means to each of them is different. For the customer, it’s something they have chosen to be delivered, and something they need to pay for. For a restaurant it’s something that needs to be cooked and picked up. And for the delivery driver, it’s something that needs to be taken from the restaurant to the customer in a timely manner. Despite these different needs, the code and associated data for the order was all bound together.
Having all these workflows bundled into this single Order concept was the source of a great degree of what I’ve previously referred to as delivery contention—different developers trying to make changes for different use cases would get in each other’s way, as they all needed to make changes in the same part of the codebase. Square wanted to break apart an Order so changes to each workflow could be made in isolation, and also enable different scaling and robustness needs.
Creating the new service
The first step was to create a new Fulfillments service as seen in Figure 4-21, which would manage the Order data associated with restaurant and delivery drivers. This service would become the new source of truth going forward for this subset of the Order data. Initially, this service just exposed functionality to allow for Fulfillments-related entities to be created. Once the new service was live, the company had a background worker copy the Fulfillments-related data from the existing system to the new Fulfillments service. This background worker just made use of an API exposed by the Fulfillments service rather than doing a direct insertion into the database, avoiding the need for direct database access.
The background worker was controlled via a feature flag that could be enabled or disabled to stop this copying process. This allowed them to ensure that if the worker caused any issues in production, the process would be easy to turn off. They ran this system in production for sufficient time to be confident that the synchronization was working correctly. Once they were happy that the background worker was working as expected, they removed the feature flag.
Synchronizing the data
One of the challenges with this sort of synchronization is that it is one way. Changes made to the existing system resulted in the fulfillment-related data being written to the new Fulfillments service via its API. Square resolved this by ensuring that all updates were made to both systems, as in Figure 4-22. Not all updates needed to be made to both systems, however. As Derek explained, now that the Fulfillments service represented only a subset of the Order concept, only changes made to the order that delivery or restaurant clients cared about needed to be copied.
Any code that changed restaurant- or delivery-oriented information needed to be changed to make two sets of API calls—one to the existing system, the other to the same microservice. These upstream clients would also need to handle any error conditions if the write to one worked but the other failed. These changes to the two downstream systems (the existing order system and the new Fulfillments service) were not done in an atomic fashion. This means there could be a brief window in which a change would be visible in one system, but not yet the other. Until both changes have been applied, you can see an inconsistency between the two systems; this is a form of eventual consistency, which we discussed earlier.
In terms of the eventual consistent nature of the Order information, this wasn’t a problem for this particular use case. Data was synchronized quickly enough between the two systems that it didn’t impact the users of the system.
If Square had been using an event-driven system for managing Order updates, rather than making use of API calls, they could have considered an alternative implementation. In Figure 4-23, we have a single stream of messages that could trigger changes in the Order state. Both the existing system and the new Fulfillments service receive the same messages. Upstream clients don’t need to know that there are multiple consumers for these messages; this is something that could be handled through the use of a pub-sub style broker.
Retrofitting Square’s architecture to be event-based just to satisfy this sort of use case would be a lot of work. But if you are already making use of an event-based system, you may have an easier time managing the synchronization process. It’s also worth noting that such an architecture would still exhibit eventually consistent behavior, as you cannot guarantee that both the existing system and Fulfillments service would process the same event at the same time.
With the new Fulfillments service now holding all the required information for the restaurant and delivery driver workflows, code that managed those workflows could start switching over to use the new service. During this migration, more functionality can be added to support these consumers’ needs; initially, the Fulfillments service needed only to implement an API that enabled creation of new records for the background worker. As new consumers migrate, their needs can be assessed and new functionality can be added to the service to support them.
This incremental migration of both data, as well as changing the consumers to use the new source of truth, proved in the case of Square to be highly effective. Derek said that getting to the point where all consumers had switched over ended up being pretty much a non-event. It was just another small change done during a routine release (another reason I’ve been advocating for incremental migration patterns so strongly in this book!).
From a domain-driven design point of view, you could certainly argue that the functionality associated with delivery drivers, customers, and restaurants all represented different bounded contexts. From that viewpoint, Derek suggested that he would ideally have considered splitting this Fulfillments service further, into two services—one for Restaurants, and another for the Delivery Drivers. Nonetheless, although there is scope for further decomposition, this migration seemed to be very successful.
In Square’s case, it decided to keep the duplicated data. Leaving restaurant- and delivery-related order information in the existing system allowed the company to provide visibility of this information in the event of the Fulfillments service being unavailable. This requires keeping the synchronization in place, of course. I do wonder if this will be revisited over time. Once there is sufficient confidence in the availability of the Fulfillments service, removing the background worker and need for consumers to make two sets of update calls may well help streamline the architecture.
Where to Use It
Implementation of the synchronization is likely to be where most of the work lies. If you can avoid the need for two-way synchronization, and instead use some of the simpler options outlined here, you’ll likely find this pattern much easier to implement. If you are already making use of an event-driven system, or have a change data capture pipeline available, then you probably already have a lot of the building blocks available to you to get the synchronization working.
Careful thought does need to be given regarding how long you can tolerate inconsistency between the two systems. Some use cases might not care, others may want the replication to be almost immediate. The shorter the window of acceptable inconsistency, the more difficult this pattern will be to implement.
Splitting Apart the Database
We’ve already discussed at length the challenges of using databases as a method of integrating multiple services. As should by now be very clear, I am not a fan! This means we need to find seams in our databases too so we can split them out cleanly. Databases, however, are tricky beasts. Before we get into some examples of approaches, we should briefly discuss how logical separation and physical deployment may be related.
Physical Versus Logical Database Separation
When we talk about breaking apart our databases in this context, we’re primarily trying to achieve logical separation. A single database engine is perfectly capable of hosting more than one logically separated schema, as we see in Figure 4-24.
We could take this further, and have each logical schema also on separate database engines, giving us physical separation too, as we see in Figure 4-25.
Why would you want to logically decompose your schemas but still have them on a single database engine? Well, fundamentally, logical and physical decomposition achieve different goals. Logical decomposition allows for simpler independent change and information hiding, whereas physical decomposition potentially improves system robustness, and could help remove resource contention allowing for improved throughput or latency.
When we logically decompose our database schemas but keep them on the same physical database engine, as in Figure 4-24, we have a potential single point of failure. If the database engine goes down, both services are affected. However, the world isn’t that simple. Many database engines have mechanisms to avoid single points of failure, such as multiprimary database modes, warm failover mechanisms, and the like. In fact, significant effort may have been put into creating a highly resilient database cluster in your organization, and it may be hard to justify having multiple clusters because of the time, effort, and cost that may be involved (those pesky license fees can add up!).
Another consideration is that having multiple schemas sharing the same database engine may be required if you want to expose views of your database. Both the source database and the schemas hosting the views may need to be located on the same database engine.
Of course, for you to even have the option of running separate services on different physical database engines, you need to have already logically decomposed their schemas!
Splitting the Database First, or the Code?
So far, we’ve spoken about patterns to help work with shared databases, and hopefully move on to less coupled models. In a moment, we need to look in detail at patterns around database decomposition. Before we do that, though, we need to discuss sequencing. Extracting a microservice isn’t “done” until the application code is running in its own service, and the data it controls is extracted into its own logically isolated database. But with this being a book largely about enabling incremental change, we have to explore a little how this extraction should be sequenced. We have a few options:
Split the database first, then the code.
Split the code first, then the database.
Split them both at once.
Each has its pros and cons. Let’s look at these options now, along with some patterns that may help, depending on the approach you take.
Split the Database First
With a separate schema, we’ll be potentially increasing the number of database calls to perform a single action. Whereas before we might have been able to have all the data we wanted in a single SELECT statement, now we may need to pull the data back from two locations and join in memory. Also, we end up breaking transactional integrity when we move to two schemas, which could have significant impact on our applications; we’ll be discussing these challenges later in this chapter, as we cover topics like distributed transactions and sagas, and how they might be able to help. By splitting the schemas out but keeping the application code together, as in Figure 4-26, we give ourselves the ability to revert our changes or continue to tweak things without impacting any consumers of our service if we realize we’re heading down the wrong path. Once we are satisfied that the DB separation makes sense, we could then think about splitting out the application code into two services.
The flip side is that this approach is unlikely to yield much short-term benefit. We still have a monolithic code deployment. Arguably, the pain of a shared database is something you feel over time, so we’re spending time and effort now to give us return in the long run, without getting enough of the short-term benefit. For this reason, I’d likely go this route only if I’m especially concerned about the potential performance or data consistency issues. We also need to consider that if the monolith itself is a black-box system, like a piece of commercial software, this option isn’t available to us.
Pattern: Repository per bounded context
A common practice is to have a repository layer, backed by some sort of framework like Hibernate, to bind your code to the database, making it easy to map objects or data structures to and from the database. Rather than having a single repository layer for all our data access concerns, there is value in breaking down these repositories along the lines of bounded contexts, as shown in Figure 4-27.
Having the database mapping code colocated inside the code for a given context can help us understand what parts of the database are used by what bits of code. Hibernate, for example, can make this very clear if you are using something like a mapping file per bounded context. We can see, therefore, which bounded contexts access which tables in our schema. This can help us greatly understand what tables need to move as part of any future decomposition.
This doesn’t give us the whole story, however. For example, we may be able to tell that the finance code uses the ledger table, and that the catalog code uses the line item table, but it might not be clear that the database enforces a foreign-key relationship from the ledger table to the line item table. To see these database-level constraints, which may be a stumbling block, we need to use another tool to visualize the data. A great place to start is to use a tool like the freely available SchemaSpy, which can generate graphical representations of the relationships between tables.
All this helps you understand the coupling between tables that may span what will eventually become service boundaries. But how do you cut those ties? And what about cases where the same tables are used from multiple bounded contexts? We’re going to explore this topic in great length later in this chapter.
Where to use it
This pattern is really effective in any situation where you are looking to rework the monolith in order to better understand how to split it apart. Breaking down these repository layers along the lines of domain concepts will go a long way to helping you understand where seams for microservices may exist not only in your database, but also in the code itself.
Pattern: Database per bounded context
Once you’ve clearly isolated data access from the application point of view, it makes sense to continue this approach into the schema. Central to the idea of microservices’ independent deployability is the fact that they should own their own data. Before we get to separating out the application code, we can start this decomposition by clearly separating our databases around the lines of our identified bounded contexts.
At ThoughtWorks, we were implementing some new mechanisms to calculate and forecast revenue for the company. As part of this, we’d identified three broad areas of functionality that needed to be written. I discussed the problem with the lead for this project, Peter Gillard-Moss. Peter explained that the functionality felt quite separate, but that he had concerns regarding the extra work that having this functionality would bring if they were put in separate microservices. At this time, his team was small (only three people), and it was felt that the team couldn’t justify splitting out these new services. In the end, they settled on a model in which the new revenue functionality was effectively deployed as a single service, containing three isolated bounded contexts (each of which ended up as separate JAR files), as shown in Figure 4-28.
Each bounded context had its own, totally separate databases. The idea was that if there was a need to separate them into microservices later, this would be much easier. However, it turned out that this was never needed. Several years later, this revenue service remains as it is, a monolith with multiple associated databases—a great example of a modular monolith.
Where to use it
At first glance, the extra work in maintaining the separate databases doesn’t make much sense if you keep things as a monolith. I view this as a pattern that is all about hedging your bets. It’s a bit more work than a single database, but keeps your options open regarding moving to microservices later. Even if you never move to microservices, having the clear separation of schema backing the database can really help, especially if you have lots of people working on the monolith itself.
This is a pattern I nearly always recommend for people building brand-new systems (as opposed to reimplementing an existing system). I’m not a fan of implementing microservices for new products or startups; your understanding of the domain is unlikely to be mature enough to identify stable domain boundaries. With startups especially, the nature of the thing you are building can change drastically. This pattern can be a nice halfway point, though. Keep schema separation where you think you may have service separation in the future. That way, you get some of the benefits of decoupling these ideas, while reducing the complexity of the system.
Split the Code First
In general, I find that most teams split the code first, then the database, as shown in Figure 4-29. They get the short-term improvement (hopefully) from the new service, which gives them confidence to complete the decomposition by separating out the database.
By splitting out the application tier, it becomes much easier to understand what data is needed by the new service. You also get the benefit of having an independently deployable code artifact earlier. The concerns I’ve always had with this approach is that teams may get this far and then stop, leaving a shared database in play on an ongoing basis. If this is the direction you take, you have to understand that you’re storing up trouble for the future if you don’t complete the separation into the data tier. I’ve seen teams that have fallen into this trap, but can happily also report speaking to organizations that have done the right thing here. JustSocial is one such organization that used this approach as part of its own microservices migration. The other potential challenge here is that you may be delaying finding out nasty surprises caused by pushing join operations up into the application tier.
If this is the direction you take, be honest with yourself: are you confident that you will be able to make sure that any data owned by the microservice gets split out as part of the next step?
Pattern: Monolith as data access layer
Rather than accessing the data from the monolith directly, we can just move to a model in which we create an API in the monolith itself. In Figure 4-30, the Invoice service needs information about employees in our Customer service, so we create an Employee API allowing for the Invoice service to access them. Susanne Kaiser from JustSocial shared this pattern with me as that company had used successfully as part of its microservices migration. The pattern has so many things going for it that I’m surprised it doesn’t seem as well-known as it should be.
Part of the reason this isn’t used more widely is likely because people sort of have in their minds the idea that the monolith is dead, and of no use. They want to move away from it. They don’t consider making it more useful! But the upsides here are obvious: we don’t have to tackle data decomposition (yet) but get to hide information, making it easier to keep our new service isolated from the monolith. I’d be more inclined to adopt this model if I felt that the data in the monolith was going to stay there. But it can work well, especially if you think that your new service will effectively be pretty stateless.
It’s not too hard to see this pattern as a way of identifying other candidate services. Extending this idea, we could see the Employee API splitting out from the monolith to become a microservice in its own right, as shown in Figure 4-31.
Where to use it
This pattern works best when the code managing this data is still in the monolith. As we talked about previously, one way to think of a microservice when it comes to data is the encapsulation of the state and the code that manages the transitions of that state. So if the state transitions of this data are still provided in the monolith, it follows that the microservice that wants to access (or change) that state needs to go via the state transitions in the monolith.
If the data you’re trying to access in the monolith’s database should really be “owned” by the microservice instead, I’m more inclined to suggest skipping this pattern and instead looking to split the data out.
Pattern: Multischema storage
As we’ve already discussed, it’s a good idea not to make a bad situation any worse. If you are still making direct use of the data in a database, it doesn’t mean that new data stored by a microservice should go in there too. In Figure 4-32, we see an example of our Invoice service. The invoice core data still lives in the monolith, which is where we (currently) access it from. We’ve added the ability to add reviews to Invoices; this represents brand-new functionality not in the monolith. To support this, we need to store a table of reviewers, mapping employees to Invoice IDs. If we put this new table in the monolith, we’d be helping grow the database! Instead, we’ve put this new data in our own schema.
In this example, we have to consider what happens when a foreign-key relationship effectively spans a schema boundary. Later in this chapter, we’ll explore this very problem in more depth.
Pulling out the data from a monolithic database will take time, and may not be something you can do in one step. You should therefore feel quite happy to have your microservice access data in the monolithic DB while also managing its own local storage. As you manage to drag clear the rest of the data from the monolith, you can migrate it a table at a time into your new schema.
Where to use it
This pattern works well when adding brand-new functionality to your microservice that requires the storage of new data. It’s clearly not data the monolith needs (the functionality isn’t there), so keep it separate from the beginning. This pattern also makes sense as you start moving data out of the monolith into your own schema—a process that may take some time.
If the data you are accessing in the monolith’s schema is data that you never planned to move into your schema, I strongly recommend use of the monolith as data access layer pattern (see the section “Pattern: Monolith as data access layer”) in conjunction with this pattern.
Split Database and Code Together
From a staging point of view, of course, we have the option to just break things apart in one big step as in Figure 4-33. We split both the code and data at once.
My concern here is that this is a much bigger step to take, and it will be longer before you can assess the impact of your decision as a result. I strongly suggest avoiding this approach, and instead splitting either the schema or application tier first.
So, Which Should I Split First?
I get it: you’re probably tired of all this “It depends” stuff, right? I can’t blame you.The problem is, everyone’s situation is different, so I’m trying to arm you with enough context and discussion of various pros and cons to help you make up your own mind. However, I know sometimes what people want is a bit of a hot take on these things, so here it is.
If I’m able to change the monolith, and if I am concerned about the potential impact to performance or data consistency, I’ll look to split the schema apart first. Otherwise, I’ll split the code out, and use that to help understand how that impacts data ownership. But it’s important that you also think for yourself and take into account any factors that might impact the decision-making process in your particular situation.
Schema Separation Examples
So far, we’ve looked at schema separation at a fairly high level, but there are complex challenges associated with database decomposition and a few tricky issues to explore. We’re going to look at some more low-level data decomposition patterns now and explore the impact they can have.
Pattern: Split Table
Sometimes you’ll find data in a single table that needs to be split across two or more service boundaries, and that can get interesting. In Figure 4-34, we see a single shared table, Item, where we store information about not only the item being sold, but also the stock levels.
In this example, we want to split out Catalog and Warehouse as new services, but the data for both is mixed into this single table. So, we need to split this data into two separate tables, as Figure 4-34 shows. In the spirit of incremental migration, it may make sense to split the tables apart in the existing schema, before separating the schemas. If these tables existed in a single schema, it would make sense to declare a foreign-key relationship from the Stock Item SKU column to the Catalog Item table. However, because we plan to move these tables ultimately into separate databases, we may not gain much from this as we won’t have a single database enforcing the data consistency (we’ll explore this idea in more detail shortly).
This example is fairly straightforward; it was easy to separate ownership of data on a column-by-column basis. But what happens when multiple pieces of code update the same column? In Figure 4-35, we have a Customer table, which contains a Status column.
This column is updated during the customer sign-up process to indicate that a given person has (or hasn’t) verified their email, with the value going from NOT_VERIFIED → VERIFIED. Once a customer is VERIFIED, they are able to shop. Our finance code handles suspending customers if their bills are unpaid, so they will on occasion change the status of a customer to SUSPENDED. In this instance, a customer’s Status still feels like it should be part of the customer domain model, and as such it should be managed by the soon-to-be-created Customer service. Remember, we want, where possible, to keep the state machines for our domain entities inside a single service boundary, and updating a Status certainly feels like part of the state machine for a customer! This means that when the service split has been made, our new Finance service will need to make a service call to update this status, as we see in Figure 4-36.
A big problem with splitting tables like this is that we lose the safety given to us by database transactions. We’ll explore this topic in more depth toward the end of this chapter, when we look at “Transactions” and “Sagas”.
Where to Use It
On the face of it, these seem pretty straightforward. When the table is owned by two or more bounded contexts in your current monolith, you need to split the table along those lines. If you find specific columns in that table that seem to be updated by multiple parts of your codebase, you need to make a judgment call as to who should “own” that data. Is it an existing domain concept you have in scope? That will help determine where this data should go.
Pattern: Move Foreign-Key Relationship to Code
We’ve decided to extract our Catalog service—something that can manage and expose information about artists, tracks, and albums. Currently, our catalog-related code inside the monolith uses an Albums table to store information about the CDs which we might have available for sale. These albums end up getting referenced in our Ledger table, which is where we track all sales. You can see this in Figure 4-37. The rows in the Ledger table just record the amount we receive for a CD, along with an identifier that refers to the item sold; the identifier in our example is called a SKU (a stock keeping unit), a common practice in retail systems.
At the end of each month, we need to generate a report outlining our biggest-selling CDs. The Ledger table helps us understand which SKU sold the most copies, but the information about that SKU is over in the Albums table. We want to make the reports nice and easy to read, so rather than saying, “We sold 400 copies of SKU 123 and made $1,596,” we’d like to add more information about what was sold, instead saying, “We sold 400 copies of Bruce Springsteen’s Born to Run and made $1,596.” To do this, the database query triggered by our finance code needs to join information from the Ledger table to the Albums table, as Figure 4-37 shows.
We have defined a foreign-key relationship in our schema, such that a row in the Ledger table is identified as having a relationship to a row in the Albums table. By defining such a relationship, the underlying database engine is able to ensure data consistency—namely, that if a row in the Ledger table refers to a row in the Albums table, we know that row exists. In our situation, that means we can always get information about the album that was sold. These foreign-key relationships also let the database engine carry out performance optimizations to ensure that the join operation is as fast as possible.
We want to split out the Catalog and Finance code into their own corresponding services, and that means the data has to come too. The Albums and Ledger tables will end up living in different schemas, so what happens with our foreign-key relationship? Well, we have two key problems to consider. First, when our new Finance service wants to generate this report in the future, how does it retrieve Catalog-related information if it can no longer do this via a database join? The other problem is, what do we do about the fact that data inconsistency could now exist in the new world?
Moving the Join
Let’s look at replacing the join first. With a monolithic system, in order to join a row from the Album table with the sales information in the Ledger, we’d have the database perform the join for us. We’d perform a single SELECT query, where we’d join to the Albums table. This would require a single database call to execute the query and pull back all the data we need.
In our new microservice-based world, our new Finance service has the responsibility of generating the best-sellers report, but doesn’t have the album data locally. So it will need to fetch this from our new Catalog service, as we see in Figure 4-38. When generating the report, the Finance service first queries the Ledger table, extracting the list of best-selling SKUs for the last month. At this point, all we have is a list of SKUs, and the number of copies sold for each; that’s the only information we have locally.
Next, we need to call the Catalog service, requesting information on each of these SKUs. This request in turn would cause the Catalog service to make its own local SELECT on its own database.
Logically, the join operation is still happening, but it is now happening inside the Finance service, rather than in the database. Unfortunately, it isn’t going to be anywhere near as efficient. We’ve gone from a world where we have a single SELECT statement, to a new world where we have a SELECT query against the Ledger table, followed by a service call to the Catalog service, which in turn triggers a SELECT statement against the Albums table, as we see in Figure 4-38.
In this situation, I’d be very surprised if the overall latency of this operation didn’t increase. This may not be a significant problem in this particular case, as this report is generated monthly, and could therefore be aggressively cached. But if this is a frequent operation, that could be more problematic. We can mitigate the likely impact of this increase in latency by allowing for SKUs to be looked up in the Catalog service in bulk, or perhaps even by caching the required album information locally.
Ultimately, whether or not this increase in latency is a problem is something only you can decide. You need to have an understanding of acceptable latency for key operations, and be able to measure what the latency currently is. Distributed systems like Jaeger can really help here, as they provide the ability to get accurate timing of operations that span multiple services. Making an operation slower may be acceptable if it is still fast enough, especially if as a trade-off you gain some other benefits.
A trickier consideration is that with Catalog and Finance being separate services, with separate schemas, we may end up with data inconsistency. With a single schema, I wouldn’t be able to delete a row in the Albums table if there was a reference to that row in the Ledger table. My schema was enforcing data consistency. In our new world, no such enforcement exists. What are our options here?
Check before deletion
Our first option might be to ensure that when removing a record from the Albums table, we check with the Finance service to ensure that it doesn’t already have a reference to the record. The problem here is that guaranteeing we can do this correctly is difficult. Say we want to delete SKU 683. We send a call to Finance saying, “Are you using 683?” It responds that this record is not used. We then delete the record, but while we are doing it, a new reference to 683 gets created in the Finance system. To stop this from happening, we’d need to stop new references being created on record 683 until the deletion has happened—something that would likely require locks, and all the challenges that implies in a distributed system.
Another issue with checking if the record is already in use is that creates a de facto reverse dependency from the Catalog service. Now we’d need to check with any other service that uses our records. This is bad enough if we have only one other service using our information, but becomes significantly worse as we have more consumers.
I strongly urge you not to consider this option, because of the difficulty in ensuring that this operation is implemented correctly as well as the high degree of service coupling that this introduces.
Handle deletion gracefully
A better option is just to have the Finance service handle the fact that the Catalog service may not have information on the Album in a graceful way. This could be as simple as having our report show “Album Information Not Available” if we can’t look up a given SKU.
In this situation, the Catalog service could tell us when we request a SKU that used to exist. This would be the good use of a
410 GONE response code if using HTTP, for example. A 410 response code differs from the commonly used 404. A 404 denotes that the requested resource is not found, whereas a 410 means that the requested resource was available but isn’t any longer. The distinction can be important, especially when tracking down data inconsistency issues! Even if not using an HTTP-based protocol, consider whether or not you’d benefit from supporting this sort of response.
If we wanted to get really advanced, we could ensure that our Finance service is informed when a Catalog item is removed, perhaps by subscribing to events. When we pick up a Catalog deletion event, we could decide to copy the now deleted Album information into our local database. This feels like overkill in this particular situation, but could be useful in other scenarios, especially if you wanted to implement a distributed state machine to perform something like a cascading deletion across service boundaries.
Don’t allow deletion
One way to ensure that we don’t introduce too much inconsistency into the system could be to simply not allow records in the Catalog service to be deleted. If in the existing system deleting an item was akin to ensuring it wasn’t available for sale or something similar, we could just implement a soft delete capability. We could do this by using a status column to mark that row as being unavailable, or perhaps even moving the row into a dedicated “graveyard” table.5 The album’s record could still be requested by the Finance service in this situation.
So how should we handle deletion?
Basically, we have created a failure mode that couldn’t exist in our monolithic system. In looking at ways to solve this, we must consider the needs of our users, as different solutions could impact our users in different ways. Choosing the right solution therefore requires an understanding of your specific context.
Personally, in this specific situation, I’d likely solve this in two ways: by not allowing deletion of album information in the Catalog, and by ensuring that the Finance service can handle a missing record. You could argue that if a record can’t be removed from the Catalog service, the lookup from Finance could never fail. However, there is a possibility that, as a result of corruption, the Catalog service may be recovered to an earlier state, meaning the record we are looking for no longer exists. In that situation, I wouldn’t want the Finance service to fall over in a heap. It seems an unlikely set of circumstances, but I’m always looking to build in resiliency, and consider what happens if a call fails; handling this gracefully in the Finance service seems a pretty easy fix.
Where to Use It
When you start considering effectively breaking foreign-key relationships, one of the first things you need to ensure is that you aren’t breaking apart two things that really want to be one. If you’re worried that you are breaking apart an aggregate, pause and reconsider. In the case of the Ledger and Albums here, it seems clear we have two separate aggregates with a relationship between them. But consider a different case: an Order table, and lots of associated rows in an Order Line table containing details of the items we have ordered. If we split out order lines into a separate service, we’d have data integrity issues to consider. Really, the lines of an order are part of the order itself. We should therefore see them as a unit, and if we wanted to move them out of the monolith, they should be moved together.
Sometimes, by taking a bigger bite out of the monolithic schema, you may be able to move both sides of a foreign-key relationship with you, making your life much easier!
Example: Shared Static Data
Static reference data (which changes infrequently, yet is typically critical) can create some interesting challenges, and I’ve seen multiple approaches for managing it. More often than not, it finds its way into the database. I have seen perhaps as many country codes stored in databases (shown in Figure 4-39) as I have written my own
StringUtils classes for Java projects.
I’ve always questioned why small amounts of infrequently changing data like country codes need to exist in databases, but whatever the underlying reason, these examples of shared static data being stored in databases come up a lot. So what do we do in our music shop many parts of our code need the same static reference data? Well, it turns out we have a lot of options here.
Pattern: duplicate static reference data
Why not just have each service have its own copy of the data, as in Figure 4-40? This is likely to elicit a sharp intake of breath from many of you. Duplicate data? Am I mad? Hear me out! It’s less crazy than you’d think.
Concerns around duplication of data tend to come down to two things. First, each time I need to change the data, I have to do so in multiple places. But in this situation, how often does the data change? The last time a country was created and officially recognized was in 2011, with the creation of South Sudan (the short code of which is SSD). So I don’t think that’s much of a concern, is it? The bigger worry is, what happens if the data is inconsistent? For example, the Finance service knows that South Sudan is a country, but inexplicably, the Warehouse service is living in the past and knows nothing about it.
Whether or not inconsistency is an issue comes down to how the data is used. In our example, consider that the Warehouse uses this country code data to record where our CDs are manufactured. It turns out that we don’t stock any CDs that are made in South Sudan, so the fact that we’re missing this data isn’t an issue. On the other hand, the Finance service needs country code information to record information about sales, and we have customers in South Sudan, so we need that updated entry.
When the data is used only locally within each service, the inconsistency is not a problem. Think back to our definition of a bounded context: it’s all about information being hidden within boundaries. If, on the other hand, the data is part of the communication between these services, then we have different concerns. If both Warehouse and Finance need the same view of country information, duplication of this nature is definitely something I’d worry about.
We could also consider keeping these copies in sync using some sort of background process, of course. In such a situation, we are unlikely to guarantee that all copies will be consistent, but assuming our background process runs frequently (and quickly) enough, then we can reduce the potential window of inconsistency in our data, and that might be good enough.
As developers, we often react badly when we see duplication. We worry about the extra cost of managing duplicate copies of information, and are even more concerned if this data diverges. But sometimes duplication is the lesser of two evils. Accepting some duplication in data may be a sensible trade-off if it means we avoid introducing coupling.
Where to use it
This pattern should be used only rarely, and you should prefer some of the options we consider later. It is sometimes useful for large volumes of data, when it’s not essential for all services to see the exact same set of data. Something like postal code files in the UK might be a good fit, where you periodically get updates of the mapping from postal codes to addresses. That’s a fairly large dataset that would probably be painful to manage in a code form. If you want to join to this data directly, that may be another reason to select this option, but I’ll be honest and say I can’t remember ever doing it myself!
Pattern: Dedicated reference data schema
If you really want one source of truth for your country codes, you could relocate this data to a dedicated schema, perhaps one set aside for all static reference data, as we can see in Figure 4-41.
We do have to consider all the challenges of a shared database. To an extent, the concerns around coupling and change are somewhat offset by the nature of the data. It changes infrequently, and is simply structured, and therefore we could more easily consider this Reference Data schema to be a defined interface. In this situation, I’d manage the Reference Data schema as its own versioned entity, and ensure that people understood that the schema structure represents a service interface to consumers. Making breaking changes to this schema is likely to be painful.
Having the data in a schema does open up the opportunity for services to still use this data as part of join queries on their own local data. For this to happen, though, you’d likely need to ensure that the schemas are located on the same underlying database engine. That adds a degree of complexity in terms of how you map from the logical to the physical world, quite aside from the potential single-point-of-failure concerns.
Where to use it
This option has a lot of merits. We avoid the concerns around duplication, and the format of the data is highly unlikely to change, so some of our coupling concerns are mitigated. For large volumes of data, or when you want the option of cross-schema joins, it’s a valid approach. Just remember, any changes to the schema format will likely cause significant impact across multiple services.
Pattern: Static reference data library
When you get down to it, there aren’t many entries in our country code data. Assuming we’re using the ISO standard listing, we’re looking at only 249 countries being represented.6 This would fit nicely in code, perhaps as a simple static enumerated type. In fact, storing small volumes of static reference data in code form is something that I’ve done a number of times, and something I’ve seen done for microservice architectures.
Of course, we’d rather not duplicate this data if we don’t have to, so this leads us to consider placing this data into a library that can be statically linked by any services that want this data. Stitch Fix, a US-based fashion retailer, makes frequent use of shared libraries like this to store static reference data.
Randy Shoup, who was VP of engineering at Stitch Fix said the sweet spot for this technique was for types of data that were small in volume and that changed infrequently or not at all (and if it did change, you had a lot of up-front warning about the change). Consider classic clothes sizing—XS, S, M, L, XL for general sizes, or inseam measurements for trousers.
In our case, we define our country code mappings in a
Country enumerated type, and bundle this into a library for use in our services, as shown in Figure 4-42.
This is a neat solution, but it’s not without drawbacks. Obviously, if we have a mix of technology stacks, we may not be able to share a single shared library. Also, remember the golden rule of microservices? We need to ensure that our microservices are independently deployable. If we needed to update our country codes library, and have all services pick up the new data immediately, we’d need to redeploy all services at the moment the new library is available. This is a classic lock-step release, and exactly what we’re trying to avoid with microservice architectures.
In practice, if we need the same data to be available everywhere, then sufficient notice of the change may help. An example Randy gave was the need to add a new color to one of Stitch Fix’s datasets. This change needed to be rolled out to all services that made use of this datatype, but they had significant lead time to make sure all the teams pulled in the latest version. If you consider the country codes example, again we’d likely have a lot of advanced notice if a new country needed to be added. For example, South Sudan became an independent state in July 2011 after a referendum six months earlier, giving us a lot of time to roll out our change. New countries are rarely created on a whim!
This means that if we need to update our country codes library, we would need to accept that not all microservices can be guaranteed to have the same version of the library, as we see in Figure 4-43. If this doesn’t work for you, perhaps the next option may help.
In a simple variation of this pattern, the data in question is held in a configuration file, perhaps a standard properties file or, if required, in a more structured JSON format.
Where to use it
For small volumes of data, where you can be relaxed about different services seeing different versions of this data, this is an excellent but often overlooked option. The visibility regarding which service has what version of data is especially useful.
Pattern: Static reference data service
I suspect you can see where this is ending up. This is a book about creating microservices, so why not consider creating a dedicated service just for country codes, as in Figure 4-44?
I’ve run through this exact scenario with groups all over the world, and it tends to divide the room. Some people immediately think, “That could work!” Typically, though, a larger portion of the group will start shaking their heads and saying something along the lines of, “That looks crazy!” When digging deeper, we get to the heart of their concern; this seems like a lot of work and potential added complexity for not much benefit. The word “overkill” comes up frequently!
So let’s explore this a bit further. When I chat to people and try to understand why some people are fine with this idea, and others are not, it typically comes down to a couple of things. People who work in an environment where creating and managing a microservice is low are much more likely to consider this option. If, on the other hand, creating a new service, even one as simple as this, requires days or even weeks of work, then people will understandably push back on creating a service like this.
Ex-colleague and fellow O’Reilly author Kief Morris7 told me about his experiences at a project for a large international bank based in the UK, where it took nearly a year to get approval for the first release of some software. Over 10 teams inside the bank had to be consulted first before anything could go live—everything from getting designs signed off, to getting machines provisioned for deployment. Such experiences are far from uncommon in larger organizations, unfortunately.
In organizations where deploying new software requires lots of manual work, approvals, and perhaps even the need to procure and configure new hardware, the inherent cost of creating services is significant. In such an environment, I would therefore need to be highly selective in creating new services; they’d have to be delivering a lot of value to justify the extra work. This may make the creation of something like the country code unjustifiable. If, on the other hand, I could spin up a service template and push it to production in the space of a day or less, and have everything done for me, then I’d be much more likely to consider this as a viable option.
Even better, a Country Code service would be a great fit for something like a Function-as-a-Service platform like Azure Cloud Functions or AWS Lambda. The lower operations cost for functions is attractive, and they’re a great fit for simple services like the Country Code service.
Another concern cited is that by adding a service for country codes, we’d be adding yet another networked dependency that could impact latency. I think that this approach is no worse, and may be faster, than having a dedicated database for this information. Why? Well, as we’ve already established, there are only 249 entries in this dataset. Our Country Code service could easily hold this in memory and serve it up directly. Our Country Code service would likely just store these records in code, no baking datastore needed.
This data can, of course, also be aggressively cached at the client side. We don’t add new entries to this data often, after all! We could also consider using events to let consumers know when this data has changed, as shown in Figure 4-45. When the data changes, interested consumers can be alerted via events and use this to update their local caches. I suspect that a traditional TTL-based client cache is likely to be good enough in this scenario, given the low change frequency, but I have used a similar approach for a general-purpose Reference Data service many years ago to great effect.
Where to use it
I’d reach for this option if I was managing the life cycle of this data itself in code. For example, if I wanted to expose an API to update this data, I’d need somewhere for that code to live, and putting that in a dedicated microservice makes sense. At that point, we have a microservice encompassing the state machine for this state. This also makes sense if you want to emit events when this data changes, or just where you want to provide a more convenient contact against which to stub for testing purposes.
The major issue here always seems to come down to the cost of creating yet another microservice. Does it add enough to justify the work, or would one of these other approaches be a more sensible option?
What would I do?
OK, again I’ve given you lots of options. So what would I do? I suppose I can’t sit on the fence forever, so here goes. If we assume that we don’t need to ensure that the country codes are consistent across all services at all times, then I’d likely keep this information in a shared library. For this sort of data, it seems to make much more sense than duplicating this data in local service schemas; the data is simple in nature, and small in volume (country codes, dress sizes, and the like). For more complex reference data or for larger volumes, this might tip me toward putting this into the local database for each service.
If the data needs to be consistent between services, I’d look to create a dedicated service (or perhaps serve up this data as part of a larger-scoped static reference service). I’d likely resort to having a dedicated schema for this sort of data only if it was difficult to justify the work to create a new service.
When breaking apart our databases, we’ve already touched on some of the problems that can result. Maintaining referential integrity becomes problematic, latency can increase, and we can make activities like reporting more complex. We’ve looked at various coping patterns for some of these challenges, but one big one remains: what about transactions?
The ability to make changes to our database in a transaction can make our systems much easier to reason about, and therefore easier to develop and maintain. We rely on our database ensuring the safety and consistency of our data, leaving us to worry about other things. But when we split data across databases, we lose the benefit of using a database transaction to apply changes in state in an atomic fashion.
Before we explore how to tackle this issue, let’s look briefly at what a normal database transaction gives us.
Typically, when we talk about database transactions, we are talking about ACID transactions. ACID is an acronym outlining the key properties of database transactions that lead to a system we can rely on to ensure the durability and consistency of our data storage. ACID stands for atomicity, consistency, isolation, and durability, and here is what these properties give us:
Ensures that all operations completed within the transaction either all complete or all fail. If any of the changes we’re trying to make fail for some reason, then the whole operation is aborted, and it’s as though no changes were ever made.
Allows multiple transactions to operate at the same time without interfering. This is achieved by ensuring that any interim state changes made during one transaction are invisible to other transactions.
It’s worth noting that not all databases provide ACID transactions. All relational database systems I’ve ever used do, as do many of the newer NoSQL databases like Neo4j. MongoDB for many years supported ACID transactions around only a single document, which could cause issues if you wanted to make an atomic update to more than one document.8
This isn’t the book for a detailed, deep dive into these concepts; I’ve certainly simplified some of these descriptions for the sake of brevity. For those of you who would like to explore these concepts further, I recommend Designing Data-Intensive Applications.9 We’ll mostly concern ourselves with atomicity in what follows. That’s not to say that the other properties aren’t also important, but that atomicity tends to be the first issue we hit when splitting apart transactional boundaries.
Still ACID, but Lacking Atomicity?
I want to be clear that we can still use ACID-style transactions when we split databases apart, but the scope of these transactions is reduced, as is their usefulness. Consider Figure 4-46. We are keeping track of the process involved in onboarding a new customer to our system. We’ve reached the end of the process, which involves changing the Status of the customer from PENDING to VERIFIED. As the enrollment is now complete, we also want to remove the matching row from the PendingEnrollments table. With a single database, this is done in the scope of a single ACID database transaction—either both the new rows are written, or neither are written.
Compare this with Figure 4-47. We’re making exactly the same change, but now each change is made in a different database. This means there are two transactions to consider, each of which could work or fail independently of the other.
We could decide to sequence these two transactions, of course, removing a row from the PendingEnrollments table only if we were able to change the row in the Customer table. But we’d still have to reason about what to do if the deletion from the PendingEnrollments table then failed—all logic that we’d need to implement ourselves. Being able to reorder steps to make handling these use cases can be a really useful idea, though (one we’ll come back to when we explore sagas). But fundamentally by decomposing this operation into two separate database transactions, we have to accept that we’ve lost guaranteed atomicity of the operation as a whole.
This lack of atomicity can start to cause significant problems, especially if we are migrating systems that previously relied on this property. It’s at this point that people start to look for other solutions to give them some ability to reason about changes being made to multiple services at once. Normally, the first option that people start considering is distributed transactions. Let’s look at one of the most common algorithms for implementing distributed transactions, the two-phase commit, as a way of exploring the challenges associated with distributed transactions as a whole.
The two-phase commit algorithm (sometimes shortened to 2PC) is frequently used to attempt to give us the ability to make transactional changes in a distributed system, where multiple separate processes may need to be updated as part of the overall operation. I want to let you know up front that 2PCs have limitations, which we’ll cover, but they’re worth knowing about. Distributed transactions, and two-phased commits more specifically, are frequently raised by teams moving to microservice architectures as a way of solving challenges they face. But as we’ll see, they may not solve your problems, and may bring even more confusion to your system.
The algorithm is broken into two phases (hence the name two-phase commit): a voting phase and a commit phase. During the voting phase, a central coordinator contacts all the workers who are going to be part of the transaction, and asks for confirmation as to whether or not some state change can be made. In Figure 4-48, we see two requests, one to change a customer status to VERIFIED, another to remove a row from our PendingEnrollments table. If all the workers agree that the state change they are asked for can take place, the algorithm proceeds to the next phase. If any workers say the change cannot take place, perhaps because the requested state change violates some local condition, the entire operation aborts.
It’s important to highlight that the change does not take effect immediately after a worker indicates that it can make the change. Instead, the worker is guaranteeing that it will be able to make that change at some point in the future. How would the worker make such a guarantee? In Figure 4-48, for example, Worker A has said it will be able to change the state of the row in the Customer table to update that specific customer’s status to be VERIFIED. What if a different operation at some later point deletes the row, or makes another smaller change that nonetheless means that a change to VERIFIED later is invalid? To guarantee that this change can be made later, Worker A will likely have to lock that record to ensure that such a change cannot take place.
If any workers didn’t vote in factor of the commit, a rollback message needs to be sent to all parties, to ensure that they can clean up locally, which allows the workers to release any locks they may be holding. If all workers agreed to make the change, we move to the commit phase, as in Figure 4-49. Here, the changes are actually made, and associated locks are released.
It’s important to note that in such a system, we cannot in any way guarantee that these commits will occur at exactly the same time. The coordinator needs to send the commit request to all participants, and that message could arrive at and be processed at different times. This means it’s possible that we could see the change made to Worker A, but not yet see the change to Worker B, if we allow for you to view the states of these workers outside the transaction coordinator. The more latency there is between the coordinator, and the slower it is for the workers to process the response, the wider this window of inconsistency might be. Coming back to our definition of ACID, isolation ensures that we don’t see intermediate states during a transaction. But with this two-phase commit, we’ve lost that.
When two-phase commits work, at their heart they are very often just coordinating distributed locks. The workers need to lock local resources to ensure that the commit can take place during the second phase. Managing locks, and avoiding deadlocks in a single-process system, isn’t fun. Now imagine the challenges of coordinating locks among multiple participants. It’s not pretty.
There are a host of failure modes associated with two-phase commits that we don’t have time to explore. Consider the problem of a worker voting to proceed with the transaction, but then not responding when asked to commit. What should we do then? Some of these failure modes can be handled automatically, but some can leave the system in such a state that things need to be manually unpicked.
The more participants you have, and the more latency you have in the system, the more issues a two-phase commit will have. They can be a quick way to inject huge amounts of latency into your system, especially if the scope of locking is large, or the duration of the transaction is large. It’s for this reason two-phase commits are typically used only for very short-lived operations. The longer the operation takes, the longer you’ve got resources locked for!
Distributed Transactions—Just Say No
For all these reasons outlined so far, I strongly suggest you avoid the use of distributed transactions like the two-phase commit to coordinate changes in state across your microservices. So what else can you do?
Well, the first option could be to just not split the data apart in the first place. If you have pieces of state that you want to manage in a truly atomic and consistent way, and you cannot work out how to sensibly get these characteristics without an ACID-style transaction, then leave that state in a single database, and leave the functionality that manages that state in a single service (or in your monolith). If you’re in the process of working out where to split your monolith, and working out what decompositions might be easy (or hard), then you could well decide that splitting apart data that is currently managed in a transaction is just too hard to handle right now. Work on some other area of the system, and come back to this later.
But what happens if you really do need to break this data apart, but you don’t want all the pain of managing distributed transactions? How can we carry out operations in multiple services but avoid locking? What if the operation is going to take minutes, days, or perhaps even months? In cases like this, we can consider an alternative approach: sagas.
Unlike a two-phase commit, a saga is by design an algorithm that can coordinate multiple changes in state, but avoids the need for locking resources for long periods of time. We do this by modeling the steps involved as discrete activities that can be executed independently. It comes with the added benefit of forcing us to explicitly model our business processes, which can have significant benefits.
The core idea, first outlined by Hector Garcia-Molina and Kenneth Salem,10 reflected on the challenges of how best to handle operations of what they referred to as long lived transactions (LLTs). These transactions might take a long time (minutes, hours, or perhaps even days), and as part of that process require changes to be made to a database.
If you directly mapped an LLT to a normal database transaction, a single database transaction would span the entire life cycle of the LLT. This could result in multiple rows or even full tables being locked for long periods of time while the LLT is taking place, causing significant issues if other processes are trying to read or modify these locked resources.
Instead, the authors of the paper suggest we should break down these LLTs into a sequence of transactions, each of which can be handled independently. The idea is that the duration of each of these “sub” transactions will be shorter lived, and will modify only part of the data affected by the entire LLT. As a result, there will be far less contention in the underlying database as the scope and duration of locks is greatly reduced.
While sagas were originally envisaged as a mechanism to help with LLTs acting against a single database, the model works just as well for coordinating change across multiple services. We can break a single business process into a set of calls that will be made to collaborating services as part of a single saga.
Before we go any further, you need to understand that a saga does not give us atomicity in ACID terms we are used to with a normal database transaction. As we break the LLT into individual transactions, we don’t have atomicity at the level of the saga itself. We do have atomicity for each subtransaction inside the LLT, as each one of them can relate to an ACID transactional change if needed. What a saga gives us is enough information to reason about which state it’s in; it’s up to us to handle the implications of this.
Let’s take a look at a simple order fulfillment flow, outlined in Figure 4-50, which we can use to further explore sagas in the context of a microservice architecture.
Here, the order fulfillment process is represented as a single saga, with each step in this flow representing an operation that can be carried out by a different service. Within each service, any state change can be handled within a local ACID transaction. For example, when we check and reserve stock using the Warehouse service, internally the Warehouse service might create a row in its local Reservation table recording the reservation; this change would be handled within a normal transaction.
Saga Failure Modes
With a saga being broken into individual transactions, we need to consider how to handle failure—or, more specifically, how to recover when a failure happens. The original saga paper describes two types of recovery: backward recovery and forward recovery.
Backward recovery involves reverting the failure, and cleaning up afterwards—a rollback. For this to work, we need to define compensating actions that allow us to undo previously committed transactions. Forward recovery allows us to pick up from the point where the failure occurred, and keep processing. For that to work, we need to be able to retry transactions, which in turn implies that our system is persisting enough information to allow this retry to take place.
Depending on the nature of the business process being modeled, you may consider that any failure mode triggers a backward recovery, a forward recovery, or perhaps a mix of the two.
With an ACID transaction, a rollback occurs before a commit. After the rollback, it is like nothing ever happened: the change we were trying to make didn’t take place. With our saga, though, we have multiple transactions involved, and some of those may have already committed before we decide to roll back the entire operation. So how can we roll back transactions after they have already been committed?
Let’s come back to our example of processing an order, as outlined in Figure 4-50. Consider a potential failure mode. We’ve gotten as far as trying to package the item, only to find the item can’t be found in the warehouse, as shown in Figure 4-51. Our system thinks the item exists, but it’s just not on the shelf!
Now, let’s assume we decide we want to just roll back the entire order, rather than giving the customer the option for the item to be placed on back order. The problem is that we’ve already taken payment and awarded loyalty points for the order.
If all of these steps had been done in a single database transaction, a simple rollback would clean this all up. However, each step in the order fulfillment process was handled by a different service call, each of which operated in a different transactional scope. There is no simple “rollback” for the entire operation.
Instead, if you want to implement a rollback, you need to implement a compensating transaction. A compensating transaction is an operation that undoes a previously committed transaction. To roll back our order fulfillment process, we would trigger the compensating transaction for each step in our saga that has already been committed, as shown in Figure 4-52.
It’s worth calling out the fact that these compensating transactions can’t have exactly the same behavior as that of a normal database rollback. A database rollback happens before the commit; and after the rollback, it is as though the transaction never happened. In this situation, of course, these transactions did happen. We are creating a new transaction that reverts the changes made by the original transaction, but we can’t roll back time and make it as though the original transaction didn’t occur.
Because we cannot always cleanly revert a transaction, we say that these compensating transactions are semantic rollbacks. We cannot always clean up everything, but we do enough for the context of our saga. As an example, one of our steps may have involved sending an email to a customer to tell them their order was on the way. If we decide to roll that back, you can’t unsend an email!11 Instead, your compensating transaction could cause a second email to be sent to the customer, informing them that there had been a problem with the order and it had been canceled.
It is totally appropriate for information related to the rollback saga to persist in the system. In fact, this may be very important information. You may want to keep a record in the Order service for this aborted order, along with information about what happened, for a whole host of reasons.
Reordering steps to reduce rollbacks
In Figure 4-52, we could have made our likely rollback scenarios somewhat simpler by reordering the steps. A simple change would be to award points only when the order was actually dispatched, as seen in Figure 4-53. This way, we’d avoid having to worry about that stage being rolled back if we had a problem while trying to package and send the order. Sometimes you can simplify your rollback operations just by tweaking how the process is carried out. By pulling forward those steps that are most likely to fail and failing the process earlier, you avoid having to trigger later compensating transactions as those steps weren’t even triggered in the first place.
These changes, if they can be accommodated, can make your life much easier, avoiding the need to even create compensating transactions for some steps. This can be especially important if implementing a compensating transaction is difficult. You may be able to move the step later in the process to a stage where it never needs to be rolled back.
Mixing fail-backward and fail-forward situations
It is totally appropriate to have a mix of failure recovery modes. Some failures may require a rollback; others may be fail forward. For the order processing, for example, once we’ve taken money from the customer, and the item has been packaged, the only step left is to dispatch the package. If for whatever reason we can’t dispatch the package (perhaps the delivery firm we have doesn’t have space in their vans to take an order today), it seems very odd to roll the whole order back. Instead, we’d probably just retry the dispatch, and if that fails, require human intervention to resolve the situation.
So far, we’ve looked at the logical model for how sagas work, but we need to go a bit deeper to examine ways of implementing the saga itself. We can look at two styles of saga implementation. Orchestrated sagas more closely follow the original solution space and rely primarily on centralized coordination and tracking. These can be compared to choreographed sagas, which avoid the need for centralized coordination in favor of a more loosely coupled model, but which can make tracking the progress of a saga more complicated.
Orchestrated sagas use a central coordinator (what we’ll call an orchestrator from now on) to define the order of execution and to trigger any required compensating action. You can think of orchestrated sagas as a command-and-control approach: the central orchestrator controls what happens and when, and with that comes a good degree of visibility as to what is happening with any given saga.
Here, our central Order Processor, playing the role of the orchestrator, coordinates our fulfillment process. It knows what services are needed to carry out the operation, and it decides when to make calls to those services. If the calls fail, it can decide what to do as a result. These orchestrated processors tend to make heavy use of request/response calls between services: the Order Processor sends a request to services (such as a Payment Gateway), and expects a response letting it know if the request was successful and providing the results of the request.
Having our business process explicitly modeled inside the Order Processor is extremely beneficial. It allows us to look at one place in our system and understand how this process is supposed to work. That can make onboarding of new people easier, and help impart a better understanding of the core parts of the system.
There are a few downsides to consider, though. First, by its nature, this is a somewhat coupled approach. Our Order Processor needs to know about all the associated services, resulting in a higher degree of what we discussed in Chapter 1 as domain coupling. While not inherently bad, we’d still like to keep domain coupling to a minimum if possible. Here, our Order Processor needs to know about and control so many things that this form of coupling is hard to avoid.
The other issue, which is more subtle, is that logic that should otherwise be pushed into the services can start to instead become absorbed in the orchestrator. If this starts happening, you may find your services becoming anemic, with little behavior of their own, just taking orders from orchestrators like the Order Processor. It’s important you still consider the services that make up these orchestrated flows as entities that have their own local state and behavior. They are in charge of their own local state machines.
If logic has a place where it can be centralized, it will become centralized!
One of the ways to avoid too much centralization with orchestrated flows can be to ensure you have different services playing the role of the orchestrator for different flows. You might have an Order Processor service that handles placing an order, a Returns service to handle the return and refund process, a Goods Receiving service that handles new stock arriving at the warehouse and being put on the shelves, and so on. Something like our Warehouse service may be used by all those orchestrators; such a model makes it easier for you to keep functionality in the Warehouse service itself to allow you to reuse functionality across all those flows.
Choreographed sagas aim to distribute responsibility for the operation of the saga among multiple collaborating services. If orchestration is command-and-control, choreographed sagas represent a trust-but-verify architecture. As we’ll see in our example in Figure 4-55, choreographed sagas will often make heavy use of events for collaboration between services.
There’s quite a bit going on here, so it’s worth exploring in more detail. First, these services are reacting to events being received. Conceptually, events are broadcast in the system, and interested parties are able to receive them. You don’t send events to a service; you just fire them out, and the services that are interested in these events are able to receive them and act accordingly. In our example, when the Warehouse service receives that first Order Placed event, it knows its job to reserve the appropriate stock and fire an event once that is done. If the stock couldn’t be received, the Warehouse would need to raise an appropriate event (an Insufficient Stock event perhaps), which might lead to the order being aborted.
Typically, you’d use some sort of message broker to manage the reliable broadcast and delivery of events. It’s possible that multiple services may react to the same event, and that is where you would use a topic. Parties interested in a certain type of event would subscribe to a specific topic without having to worry about where these events came from, and the broker ensures the durability of the topic and that the events on it are successfully delivered to subscribers. As an example, we might have a Recommendation service that also listens to Order Placed events and uses that to construct a database of music choices you might like.
In the preceding architecture, no one service knows about any other service. They only need to know what to do when a certain event is received. Inherently, this makes for a much less coupled architecture. As the implementation of the process is decomposed and distributed among the four services here, we also avoid the concerns about centralization of logic (if you don’t have a place where logic can be centralized, then it won’t be centralized!).
The flip side of this is that it can now be harder to work out what is going on. With orchestration, our process was explicitly modeled in our orchestrator. Now, with this architecture as it is presented, how would you build up a mental model of what the process is supposed to be? You’d have to look at the behavior of each service in isolation and reconstitute this picture in your own head—far from a straightforward process even with a simple business process like this one.
The lack of an explicit representation of our business process is bad enough, but we also lack a way of knowing what state a saga is in, which can also deny us the chance to attach compensating actions when required. We can push some responsibility to the individual services for carrying out compensating actions, but fundamentally we need a way of knowing what state a saga is in for some kinds of recovery. The lack of a central place to interrogate around the status of a saga is a big problem. We get that with orchestration, so how do we solve that here?
One of the easiest ways of doing this is to project a view regarding the state of a saga from the existing system by consuming the events being emitted. If we generate a unique ID for the saga, we can put this into all of the events that are emitted as part of this saga-this is what is known as a correlation ID. We could then have a service whose job it is to just vacuum up all these events and present a view of what state each order is in, and perhaps programmatically carry out actions to resolve issues as part of the fulfillment process if the other services couldn’t do it themselves.
While it may seem that orchestrated and choreographed sagas are diametrically opposing views on how sagas could be implemented, you could easily consider mixing and matching models. You may have some business processes in your system that more naturally fit one model or another. You may also have a single saga that has a mix of styles. In the order fulfillment use case, for example, inside the boundary of the Warehouse service, when managing the packaging and dispatch of a package, we may use an orchestrated flow even if the original request was made as part of a larger choreographed saga.12
If you do decide to mix styles, it’s important that you still have a clear way to understand what has happened as part of the saga. Without this, understanding failure modes becomes complex, and recovery from failure difficult.
Should I use choreography or orchestration?
Implementing choreographed sagas can bring with it ideas that may be unfamiliar to you and your team. They typically assume heavy use of event-driven collaboration, which isn’t widely understood. However, in my experience, the extra complexity associated with tracking the progress of a saga is almost always outweighed by the benefits associated with having a more loosely coupled architecture.
Stepping aside from my own personal tastes, though, the general advice I give regarding orchestration versus choreography is that I am very relaxed in the use of orchestrated sagas when one team owns implementation of the entire saga. In such a situation, the more inherently coupled architecture is much easier to manage within the team boundary. If you have multiple teams involved, I greatly prefer the more decomposed choreographed saga as it is easier to distribute responsibility for implementing the saga to the teams, with the more loosely coupled architecture allowing these teams to work more in isolation.
Sagas Versus Distributed Transactions
As I hope I have broken down by now, distributed transactions come with some significant challenges, and outside of some very specific situations are something I tend to avoid. Pat Helland, a pioneer in distributed systems, distills the fundamental challenges with implementing distributed transactions for the kinds of applications we build today:13
In most distributed transaction systems, the failure of a single node causes transaction commit to stall. This in turn causes the application to get wedged. In such systems, the larger it gets, the more likely the system is going to be down. When flying an airplane that needs all of its engines to work, adding an engine reduces the availability of the airplane.
Pat Helland, Life Beyond Distributed Transactions
In my experience, explicitly modeling business processes as a saga avoids many of the challenges of distributed transactions, while at the same time has the added benefit of making what might otherwise be implicitly modeled processes much more explicit and obvious to your developers. Making the core business processes of your system a first-class concept will have a host of benefits.
A fuller discussion of implementing orchestration and choreography, along with the various implementation details, is outside the scope of this book. It is covered in Chapter 4 of Building Microservices, but I also recommend Enterprise Integration Patterns for a deep dive into many aspects of this topic.14
We decompose our system by finding seams along which service boundaries can emerge, and this can be an incremental approach. By getting good at finding these seams and working to reduce the cost of splitting out services in the first place, we can continue to grow and evolve our systems to meet whatever requirements come down the road. As you can see, some of this work can be painstaking, and it can cause significant issues that we will need to address. But the fact that it can be done incrementally means there is no need to fear this work.
In splitting our services, we’ve introduced some new problems too. In our next chapter, we’ll take a look at the various challenges that will emerge as you break down your monolith. But don’t worry, I’ll also give you a host of ideas to help you deal with these problems as they arise.
1 When you’re relying on network analysis to determine who is using your database, you’re in trouble.
2 It was rumored that one of the systems using our database was a Python-based neural net that no one understood but “just worked.”
5 Maintaining historical data in a relational database like this can get complicated, especially if you need to programmatically reconstitute old versions of your entities. If you have heavy requirements in this space, exploring event sourcing as an alternative way of maintaining state would be worthwhile.
6 That’s ISO 3166-1 for all you ISO fans out there!
7 Kief wrote Infrastructure as Code: Managing Servers in the Cloud (Sebastopol: O’Reilly, 2016).
8 This has now changed with support for multidocument ACID transactions, which was released as part of Mongo 4.0. I haven’t used this feature of Mongo myself; I just know it exists!
9 See Martin Kleppmann, Designing Data-Intensive Applications (Sebastopol, O’Reilly Media, Inc., 2017).
10 See Hector Garcia-Molina and Kenneth Salem, “Sagas,” in ACM Sigmod Record 16, no. 3 (1987): 249–259.
11 You really can’t. I’ve tried!
12 It’s outside the scope of this book, but Hector Garcia-Molina and Kenneth Salem went on to explore how multiple sagas could be “nested” to implement more complex processes. To read more on this topic, see Hector Garcia-Molina et al, “Modeling Long-Running Activities as Nested Sagas,” Data Engineering 14, no. 1 (March 1991: 14–18.
13 See Pat Helland, “Life Beyond Distributed Transactions,” acmqueue 14, no. 5.
14 Sagas are not mentioned explicitly in either book, but orchestration and choreography are both covered. While I can’t speak to the experience of the authors of Enterprise Integration Patterns, I personally was unaware of sagas when I wrote Building Microservices.