Chapter 4. Slack’s Disasterpiece Theater

How do you get into Chaos Engineering if your team and tools weren’t born with it? It can seem like an overwhelming and insurmountable task to retrofit chaos into systems designed with the mindset that computers can and should last a long time. Complex systems born from this mindset tend to be less accommodating of extreme transience in the underlying computers than their cloud native successors. Such systems probably perform very well in optimal conditions but degrade quickly and sometimes catastrophically in case of failure.

You may be the proud owner of just such a system. It wasn’t designed to accommodate chaos, but whether you like it or not, chaos is coming as its scale increases and ongoing development asks it to do more, faster, and more reliably. There isn’t time for a rewrite—the system is under duress already. Applying new Chaos Engineering practices to old systems is liable to make the situation worse. You need a different strategy.

This chapter describes one strategy for safely and systematically testing complex systems that weren’t necessarily designed with Chaos Engineering in mind by introducing failures and network partitions in a thoughtful and controlled manner. This is a process, not automation, that helps your team understand how your software is vulnerable, motivates improvements, and validates that systems tolerate the faults you can anticipate. This process has been in active use at Slack since the beginning of 2018. Through more than twenty exercises it has discovered vulnerabilities, proven the safety of systems new and old, and affected the roadmaps of many engineering teams.

The first step, though, is ensuring the systems in question are even theoretically ready to tolerate the kinds of faults you expect them to encounter.

Retrofitting Chaos

The tools and techniques you might use to make a system more fault tolerant are the same as you might use to modernize it, make it cloud native, make it more reliable, or make it more highly available. Let’s review.

Design Patterns Common in Older Systems

Existing systems, especially older existing systems, are more likely than new systems being built today to assume that individual computers last a long time. This simple assumption is at the heart of many systems that are fault intolerant. We made this assumption in an era in which spare computers were to be avoided—that was wasteful—and it has lingered in our systems design since then.

When computers were scarce they were likely to be provisioned with an operating system and all the trimmings approximately once, shortly after they were purchased, and upgraded in place throughout their useful life. The provisioning process might have been heavily automated, especially if many computers showed up on the loading dock all at once, but initiating that process was probably manual. In smaller installations it wasn’t uncommon for much more of that provisioning process to be manual.

Failover, too, was commonly a manual action taken by a human who judged that to be the appropriate response to some fault or deviation from normal operations. In particularly old systems, the period between the fault and the failover was an outage inflicted on customers. Failover was thought to be rare so automation and, in some cases, even documentation and training weren’t obviously worthwhile.

Backup and restore is another area in which existing systems may trail behind the state of the art. On a positive note, backups are almost certainly being taken. It’s not as certain, though, that those backups can be restored or that they can be restored quickly. As with failover, restoring from backup was at one time a rare event for which automation wasn’t obviously worthwhile.

We more readily accept the potential impact of unlikely events—maybe they’ll never happen at all! Existing systems built to accept these risks have a tough time coping when the rate of faults increases with scale or when the impact becomes less acceptable to the business.

For completeness, I want to address monoliths briefly. There is no precise threshold a system crosses to become a monolith—it’s relative. Monolithic systems are not inherently more or less fault tolerant than service-oriented architectures. They may, though, be harder to retrofit because of the sheer surface area, difficulty in effecting incremental change, and difficulty limiting the blast radius of failures. Maybe you decide you’re going to break up your monolith, maybe you don’t. Fault tolerance is reachable via both roads.

Design Patterns Common in Newer Systems

By contrast, systems being designed today are likely to assume individual computers come and go frequently. There are many consequences of this new mindset but perhaps the most important is that systems are being designed to run on n computers simultaneously and to continue running when one fails and only n - 1 remain.

Health checks that are deep enough to detect problems but shallow enough to avoid cascading failures from a service’s dependencies play a critical role. They remove failing computers from service and in many cases automatically initiate replacement.
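To make “deep enough but shallow enough” concrete, here is a minimal sketch in Python of such a health check endpoint; it is not Slack’s code, and the two local checks are hypothetical placeholders. The handler inspects its own process and host but never calls a database or downstream service, so a dependency outage cannot fail every caller’s health check at once.

```python
# Minimal sketch: a health check that is "deep enough but shallow enough."
# The two check_* helpers are hypothetical placeholders for your own checks.
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_worker_pool():
    # Placeholder: verify this process's own workers are making progress.
    return True


def check_scratch_disk():
    # Placeholder: verify a local resource (disk, config, etc.) is usable.
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        # Deep enough: inspect this process and host. Shallow enough: never
        # call databases or downstream services here, or their outage will
        # fail every caller's health check and cascade.
        healthy = check_worker_pool() and check_scratch_disk()
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok\n" if healthy else b"unhealthy\n")


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```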

Instance replacement—individual computers tend to be called instances by cloud service providers—is a powerful strategy employed by modern systems. It enables the fault tolerance I just described as well as steady-state operational patterns like blue–green deployment. And within systems that store data, instance replacement provides capacity and motivation to automatically and frequently test that backups can be restored.

Once again, I want to emphasize that a system being more monolithic does not preclude it taking advantage of these design patterns. However, it is a tried-and-true architectural choice to expose new functionality as a service that cooperates with an existing monolith.

Getting to Basic Fault Tolerance

Chaos experiments should be run in production (in addition to development and staging environments), and you should be able to confidently assert that the impact on customers will be negligible, if there is any at all. Here are a few high-leverage changes you can make if any of the systems you operate align with the design patterns common in older systems.

First and foremost, keep spare capacity online. Having at least an extra computer around during normal operation is the beginning of fault tolerance (and covers more kinds of hardware failures than RAID, which only covers hard disks, or application-level graceful degradation, which may not be possible in your particular application). Use that spare capacity to service requests that arrive while one or a few computers are malfunctioning.

Once you have spare capacity available, consider how to remove malfunctioning computers from service automatically (before you dive into Chaos Engineering). Don’t stop at automatic removal, though. Carry on to automatic replacement. Here, the cloud provides some distinct advantages. It’s easy (and fun) to get carried away optimizing instance provisioning, but a basic implementation of autoscaling that replaces instances as they’re terminated to hold the total number constant will suit most systems. Automated instance replacement must be reliable. Replacement instances must enter service in less than the mean time between failures.
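As one concrete (and hypothetical) illustration of that basic implementation, the boto3 sketch below creates a fixed-size AWS Auto Scaling group: the group name, sizes, subnets, and target group ARN are placeholders, and the launch template is assumed to already exist. Holding the minimum, maximum, and desired capacity equal means autoscaling makes no scaling decisions at all; its only job is to replace instances as they are terminated or fail their health checks.

```python
# Illustrative sketch only: a fixed-size Auto Scaling group whose sole job
# is replacing instances. Names, sizes, subnets, and ARNs are hypothetical;
# the launch template and target group are assumed to exist already.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",  # hypothetical
    LaunchTemplate={"LaunchTemplateName": "web-fleet", "Version": "$Latest"},
    # min == max == desired: no scaling decisions, only replacement of
    # terminated or unhealthy instances to hold the total constant.
    MinSize=12,
    MaxSize=12,
    DesiredCapacity=12,
    VPCZoneIdentifier="subnet-aaaa,subnet-bbbb,subnet-cccc",  # spread across AZs
    # "ELB" replaces instances that fail load balancer health checks, not
    # just EC2 status checks, catching "running but broken" instances too.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:region:account:targetgroup/web/abc123"  # hypothetical
    ],
)
```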

Some systems, especially ones that store data, may differentiate between a leader and many followers. It’s easy (and fun) to get carried away with leader election and consensus, but here too an implementation that merely keeps human actions off the critical path is likely sufficient. The introduction of automated failover is the perfect time to audit the timeout and retry policies in dependent services. You should be looking for short but reasonable timeouts that are long enough to allow the automated failover to complete and retries that exponentially back off with a bit of jitter.
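A minimal sketch of such a timeout-and-retry policy, assuming the Python requests library and hypothetical numbers: a short per-request timeout plus exponential backoff with full jitter, tuned so the total retry window comfortably covers the time automated failover needs to complete.

```python
# Illustrative sketch: short timeouts plus exponential backoff with jitter.
# The URL, timeout, and retry budget are hypothetical; tune them so the
# total retry window comfortably covers your automated failover time.
import random
import time

import requests


def get_with_retries(url, attempts=4, base_delay=0.2, timeout=2.0):
    """The timeout should be long enough for a healthy service but short
    enough that callers aren't stuck waiting on a dead leader."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


# Example usage: get_with_retries("https://example.com/health")
```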

Tabletop exercises, in which your team talks through the details of a system’s expected behavior in the presence of some fault, are useful for convincing yourselves a system is ready. This academic confidence is far from enough in a complex system, though. The only way to earn true confidence is to incite failure in production. The rest of this chapter introduces Slack’s process for doing this safely.

Disasterpiece Theater

I call this process Disasterpiece Theater. When you’re competing with other valuable concerns for your fellow engineers’ time and asking them to change the way they develop and operate software systems, a memorable brand really helps. Disasterpiece Theater was first introduced as a forum on the topic of system failure. It is an ongoing series of exercises in which we get together and purposely cause a part of Slack to fail.

Goals

Every Disasterpiece Theater exercise is a bit different, digging up different hacks from our past, rekindling different fears, and taking different risks. All of them, though, can be walked back to the same fundamental goals.

Outside of the most extreme devotees of crash-only software, most systems are deployed far more often than their underlying network and server infrastructure fails; the development environment therefore gets constant practice mimicking deploys but very little mimicking failures. When we design a Disasterpiece Theater exercise we pay very close attention to how faithfully the development environment matches the production environment. It’s important that all software changes be testable in the development environment, but it is critical that failures can be practiced there, too. By forcing deviations between the two environments to be rectified, Disasterpiece Theater pays dividends during every test suite run and every deploy cycle.

More obviously, a marquee goal when we incite controlled failures is to discover vulnerabilities in our production systems. The planning that goes into these exercises helps mitigate (though never completely) the risk that any unknown vulnerability cascades into customer impact. We’re looking for vulnerabilities to availability, correctness, controllability, observability, and security.

Disasterpiece Theater is an ongoing series of exercises. When an exercise discovers a vulnerability we plan to rerun the exercise to verify that remediation efforts were effective in the same way you rerun a program’s test suite to confirm you’ve fixed a bug that caused a test to fail. More generally, the exercises validate system designs and the assumptions that are embedded within them. Over time, a complex system evolves and may inadvertently invalidate an assumption made long ago in a dependent part of the system. For example, the timeout one service places on requests to a dependency may not be sufficient once that dependency deploys to multiple cloud regions. Organizational and system growth decrease the accuracy of any individual’s model of the system (per the STELLA report); those individuals become less and less likely to even know about all the assumptions made in the design of the system. Regularly validating fault tolerance helps the organization ensure its assumptions hold.

Anti-Goals

So, Disasterpiece Theater is meant to promote parity between development and production environments, motivate reliability improvements, and demonstrate a system’s fault tolerance. I find it is also helpful to be explicit about what a process or tool is not supposed to be.

One size doesn’t fit all but, for Slack, I decided that Disasterpiece Theater exercises should be planned and run to minimize the chance of causing a production incident. Slack is a service used by companies small and large to conduct their business; it is critical that the service is there for them all the time. Stated more formally, Slack does not have sufficient error budget to accept severe or lengthy customer impact as a result of one of these planned exercises. You may have more of an error budget or risk tolerance and, if you wield them effectively, end up learning more, more quickly, thanks to the exercises they allow you to plan.

Data durability is even more of a priority. That isn’t to say that storage systems are not exercised by this process. Rather, it simply means the plans and contingencies for those plans must ensure that data is never irrecoverably lost. This may influence the techniques used to incite failure, or motivate holding an extra replica in reserve or manually taking a backup during the exercise. Whatever the benefits of Disasterpiece Theater, they aren’t worth losing a customer’s data.

Disasterpiece Theater is not exploratory. When introducing a little failure where none (or very little) has been experienced before, planning is key. You should have a detailed, credible hypothesis about what will happen before you incite the failure. Gathering all the experts and interested humans together in the same room or on the same video conference helps to temper the chaos, educates more engineers on the details of the systems being exercised, and spreads awareness of the Disasterpiece Theater program itself. The next section describes the process from idea to result in detail.

The Process

Every Disasterpiece Theater exercise begins with an idea. Or maybe, more accurately, a worry. It could come from the author and longtime owner of a system, from a discovery made during some unrelated work, as a follow-up to a postmortem—anywhere, really. Armed with this worry and the help of one or more experts on the system in question, an experienced host guides us all through the process.

Preparation

You and your cohosts should get together in the same room or on the same video conference to prepare for the exercise. My original Disasterpiece Theater checklist suggests the following, each of which I’ll describe in detail:

  1. Decide on a server or service that will be caused to fail, the failure mode, and the strategy for simulating that failure mode.

  2. Survey the server or service in dev and prod; note your confidence in our ability to simulate the failure in dev.

  3. Identify alerts, dashboards, logs, and metrics that you hypothesize will detect the failure; if none exist, consider inciting the failure anyway and working backwards into detection.

  4. Identify redundancies and automated remediations that should mitigate the impact of the failure and runbooks that may be necessary to respond.

  5. Invite all the relevant people to the event, especially those who will be on call at the time, and announce the exercise in #disasterpiece-theater (a channel in Slack’s own Slack).

I’ve found that most of the time an hour together is enough to get started and the final preparation can be handled asynchronously. (Yes, we use Slack for this.)

Sometimes the worry that inspired the whole exercise is specific enough that you’ll already know precisely the failure you’ll incite, like arranging for a process to pass its health checks but nonetheless fail to respond to requests. Other times, there are many ways to achieve the desired failure and they’re all subtly different. Typically the easiest failure mode to incite, to repair, and to tolerate is stopping a process. Then there is instance termination (especially if you’re in the cloud), which can be expedient if instances are replaced automatically. Using iptables(8) to simulate a computer’s network cable being unplugged is a fairly safe failure mode that’s different from process stoppage and (sometimes) instance termination because the failures manifest as timeouts instead of ECONNREFUSED. And then you can get into the endless and sometimes terrifying world of partial and even asymmetric network partitions, which can usually be simulated with iptables(8).
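For illustration only, and certainly not Slack’s tooling: a Python sketch that shells out to iptables(8) to simulate an unplugged network cable or, by applying only one of the two rules, an asymmetric partition. The CIDR block is hypothetical, the script must run as root, and on hosts with an overlay network you would need rules that match that traffic too; exempt your SSH or management network first so you can always get back in to undo the rules.

```python
# Illustrative sketch, not Slack's tooling: simulate an "unplugged network
# cable" (or, with only one rule applied, an asymmetric partition) by
# shelling out to iptables(8). Must run as root; the CIDR is hypothetical.
# Exempt your SSH/management network first so you can still get in to heal.
import subprocess
import sys

SERVICE_NETWORK = "10.0.0.0/8"  # hypothetical: the rest of your fleet

PARTITION_RULES = [
    # DROP (not REJECT) so failures manifest as timeouts, not ECONNREFUSED.
    ["iptables", "-I", "INPUT", "-s", SERVICE_NETWORK, "-j", "DROP"],
    ["iptables", "-I", "OUTPUT", "-d", SERVICE_NETWORK, "-j", "DROP"],
]


def run(rule):
    print("+", " ".join(rule))
    subprocess.run(rule, check=True)


def partition():
    # Apply both rules for a full partition; apply only one of them to
    # simulate an asymmetric partition.
    for rule in PARTITION_RULES:
        run(rule)


def heal():
    # Delete the same rules: "-I" (insert) becomes "-D" (delete).
    for rule in reversed(PARTITION_RULES):
        run(["-D" if arg == "-I" else arg for arg in rule])


if __name__ == "__main__":
    heal() if "--heal" in sys.argv else partition()
```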

Then, too, there is the question of where in the system one of these techniques is applied. Single computers are a good start, but consider working your way up to whole racks, rows, datacenters, availability zones, or even regions. Larger failures can help us discover capacity constraints and tight coupling between systems. Consider introducing the failure between load balancers and application servers, between some application servers (but not all) and their backing databases, and so on. You should leave this step with very specific actions to take or, better yet, exact commands to run.

Next, make sure to ground yourself in what’s really possible to safely exercise. Take a close, dispassionate look at your development environment to determine whether the failure you want to introduce can actually be introduced there. Consider, too, whether there is (or can be) enough traffic in your development environment to detect the failure and to experience any potential negative effects like resource exhaustion in a dependent service with poorly configured timeouts, in remaining instances of the degraded service, or in related systems like service discovery or log aggregation.

Pretend for a moment that your development environment tolerates the failure just fine. Does that give you confidence to incite the same failure in your production environment? If not, consider aborting the exercise and investing in your development environment. If so, good work on having a confidence-inspiring development environment! Now take a moment to formalize that confidence. Identify any alerts you expect to be triggered when you incite this failure, as well as all the dashboards, logs, and/or metrics you hypothesize will detect the failure and those you hypothesize will hold steady. Think of this a bit like “priming” your incident response process. You’re not planning to need it, but it’s a worthwhile contingency to ensure your time to detect and time to assemble will be effectively zero should the exercise not go as planned. I hope that, most of the time, you instead need these logs and metrics to confirm your hypothesis.

What is that hypothesis, though? Take some time to write down precisely what you and your cohosts expect will happen. Take several perspectives. Consider the operation of health checks, load balancers, and service discovery around the failure. Think about the fate of the requests that are interrupted by the failure as well as those that arrive shortly after. How does a requesting program learn about the failure? How long does this take? Do any of those programs retry their requests? If so, how aggressively? Will the combination of these timeouts and retries push anything close to the point of resource exhaustion? Now extend your model of the situation to include humans and note any points in which human intervention may be necessary or desirable. Identify any runbooks or documentation that may be necessary. (This, too, serves to “prime” the incident response process.) Finally, try to quantify what customer impact you expect and confirm that it is sufficiently minimal to proceed.

Conclude your preparation by working out the logistics of the exercise. I recommend scheduling at least three hours in a big conference room. In my experience, it’s rare to actually use all three hours but it would be a distraction to have to move during an exercise that’s not going according to plan. If there are any remote participants, use a video conferencing system with a good microphone that picks up the entire room. Convene the cohosts, all the other experts on the system being exercised and its clients, anyone who’s on call, and anyone who wants to learn. These exercises are expensive in human hours, which underscores the importance of thorough preparation. Now that you’re prepared, it’s time for Disasterpiece Theater.

The Exercise

I try to make something of a spectacle of each exercise to maximize awareness of Disasterpiece Theater across the company. This program competes for people’s time; it’s very important for everyone to understand that spending time in Disasterpiece Theater results in a more reliable system with a more confidence-inspiring development environment.

You should designate a note taker. (I have historically played this role during Slack’s Disasterpiece Theater exercises but there is no reason you couldn’t decide differently.) I recommend they take notes in a chat channel or some similar medium that timestamps every message automatically. We take notes in the #disasterpiece-theater channel in Slack’s own Slack.

If at any point during an exercise you find yourself deviating uncomfortably from the plan or encountering unanticipated customer impacts, abort. Learn what you can, regroup, and try again another day. You can learn quite a lot without crossing the threshold into incident response.

My original Disasterpiece Theater checklist continues into the exercise itself and, as with the preparation checklist, I’ll describe each step in detail:

  1. Ensure everyone is comfortable being recorded and, if so, start recording the video conference if possible.

  2. Review the preparation and amend it as necessary.

  3. Announce the dev exercise in #ops (a channel in Slack’s own Slack where we discuss production changes and incidents).

  4. Incite the failure in dev. Note the time.

  5. Receive alerts and inspect dashboards, logs, and metrics. Note the time when they provide definitive evidence of the failure.

  6. If applicable, give automated remediations time to be triggered. Note the time they are.

  7. If necessary, follow runbooks to restore service in dev. Note the time and any deviations required.

  8. Make a go or no-go decision to proceed to prod. If no-go, announce the all clear in #ops, debrief, and stop. If go, go.

  9. Announce the prod exercise in #ops.

  10. Incite the failure in prod. Note the time.

  11. Receive alerts and inspect dashboards, logs, and metrics. Note the time when they provide definitive evidence of the failure.

  12. If applicable, give automated remediations time to be triggered. Note the time they are.

  13. If necessary, follow runbooks to restore service in prod. Note the time and any deviations required.

  14. Announce the all clear in #ops.

  15. Debrief.

  16. If there is one, distribute the recording once it’s available.

I like to have an audio recording of the exercise to refer back to as insurance in case something important isn’t captured or isn’t captured clearly in the notes taken in real time. It’s important, though, to be sure everyone who’s participating is OK being recorded. Get this out of the way first and, if possible, start recording.

Begin with a thorough review of the plan. This is likely some of the participants’ first exposure to it. Their unique perspectives may improve the plan. Incorporate their feedback, especially when it makes the exercise safer or the results more meaningful. We publish plans ahead of time in shared documents and update them with these changes. Be wary, though, of deviating too far from the plan on a whim, as this can turn a safe and well-planned exercise into an obstacle course.

When the plan is ratified, announce the exercise in a very public place like a chat channel where updates about system operations are expected, an engineering-wide mailing list, or the like. This first announcement should say the exercise is commencing in the development environment and direct onlookers to follow along there. See Example 4-1 for what a typical announcement looks like at Slack.

Example 4-1. A typical initial Disasterpiece Theater announcement at Slack

Richard Crowley 9:50 AM #disasterpiece-theater is on again and we’re about to mostly unplug the network cables on 1/4 of the Channel Servers in dev. Follow along in the channel or await my announcement here when we’re moving on to prod.

Now for the moment of truth (in the development environment, at least). One of your cohosts should run the prepared command to incite the failure. Your note taker should record the time.

This is the time for all the participants (except the note taker) to spring into action. Collect evidence of the failure, the recovery, and the impact on adjacent systems. Confirm or disconfirm all the details of your hypothesis. Make specific note of how long automated remediations take and what your customers experienced in the interim. And if you do need to intervene to restore service, take especially detailed notes on the actions of you and your fellow participants. Throughout, make sure the note taker can capture your observations and post screenshots of the graphs you’re examining.

At this time, your development environment should have returned to a steady state. Take stock. If your automated remediations didn’t detect the failure or they otherwise malfunctioned in some way, you should probably stop here. If the failure was too noticeable to customers (however you extrapolate that from your development environment) or was noticeable for too long, you may decide to stop here. If you assess the risk and decide to abort, announce that wherever you announced the beginning of the exercise. See Example 4-2 for what such a rare retreat looks like at Slack.

Example 4-2. An aborted Disasterpiece Theater announcement at Slack

Richard Crowley 11:22 AM Disasterpiece Theater has ended for the day without even making it to prod.

When the exercise in your development environment goes as planned, you get to announce that the exercise is moving on to your production environment. See Example 4-3 for what a typical announcement looks like at Slack.

Example 4-3. A typical announcement when Disasterpiece Theater moves on to prod

Richard Crowley 10:10 AM #disasterpiece-theater did two rounds in dev and is done there. Now we’re moving on to prod. Expect a bunch of channels to be redistributed in the Channel Server ring in the near future. I’ll post again when all’s clear.

This is the real moment of truth. All the preparation and the exercise in the development environment have led you to this moment in which one of your cohosts should incite the failure in the production environment using the steps or command prepared ahead of time. This will feel like nothing during some exercises and be truly terrifying during others. Take note of these feelings—they’re telling you where the risk lies in your systems.

Once again, it’s time for all the participants (except the note taker) to spring into action gathering evidence of the failure, the recovery, and the impact on adjacent systems. The evidence tends to be much more interesting this time around with real customer traffic on the line. Confirm or disconfirm your hypothesis in production. Watch the system respond to the fault, noting the time that you observe automated remediation. And, of course, if you need to intervene to restore service, do so quickly and decisively—your customers are counting on you! Here, too, make sure the note taker captures your observations and post screenshots of the graphs you’re examining.

When your production environment has returned to its steady state, give the all clear in the same place you announced the exercise in your development environment and the transition to your production environment. If you can make any preliminary comment on the success of the exercise, that’s great, but at a minimum the announcement keeps teammates making changes in production situationally aware.

Before all the participants scatter to the winds, take time for some immediate sense-making: understand, or at least document, any lingering ambiguity about the exercise.

Debriefing

Shortly after the exercise while memories are still fresh and high fidelity, I like to summarize the exercise—just the facts—for a wide audience. It helps to form a narrative around the summary that explains why the failure mode being exercised is important, how systems tolerated (or didn’t) the failure, and what that means for customers and the business. It also serves to reinforce to the rest of the company why doing these exercises is so important. My original Disasterpiece Theater checklist offers the following prompts:

  • What were the time to detect and time to recover?

  • Did any users notice? How do we know? How can we get that to “no”?

  • What did humans have to do that computers should have done?

  • Where are we blind?

  • Where are our dashboards and docs wrong?

  • What do we need to practice more often?

  • What would on-call engineers have to do if this happened unexpectedly?

We capture the answers to these questions in Slack or in a summary document shared in Slack. More recently we’ve started recording audio from exercises and archiving them for posterity, too.

After the summary, the host offers conclusions and recommendations on behalf of the exercise. Your job, as host of the exercise, is to draw these conclusions and make these recommendations in service of the reliability of the system and the quality of the development environment based on the evidence presented dispassionately in the summary. These recommendations take on elevated importance when the exercise did not go according to plan. If even the most expert minds incorrectly or incompletely understood the system before the exercise, it’s likely that everyone else is even further off. This is your opportunity to improve everyone’s understanding.

The debriefing and its outputs offer yet another opportunity to influence your organization by educating even more people about the kinds of failures that can happen in production and the techniques your organization uses to tolerate them. In fact, this benefit is remarkably similar to one of the benefits of publishing detailed incident postmortems internally.

How the Process Has Evolved

Disasterpiece Theater was initially conceived as a complement to the incident response process and even a forum for practicing incident response. Early lists of potential exercises included quite a few failures that were known even at the time to require human intervention. This was at least theoretically acceptable because those failure modes were also the sort that relied on assumptions that may have been invalidated as the environment evolved.

More than a year later, Slack has never run a Disasterpiece Theater exercise that planned on human intervention being necessary, though there have been cases in which human intervention was necessary, nonetheless. Instead, we have developed another program for practicing incident response: Incident Management Lunch. It’s a game in which a group of people try to feed themselves by following the incident response process. They periodically draw cards that introduce curveballs like sudden restaurant closures, allergies, and picky eaters. Thanks to this practice and the training that precedes it, Disasterpiece Theater no longer needs to fill this void.

Disasterpiece Theater has evolved in a few other ways, too. The earliest iterations were entirely focused on results and left a lot of educational opportunities on the table. The debriefings and, especially, the written summaries, conclusions, and recommendations were introduced specifically for their educational value. Likewise, the recent introduction of recordings allows future observers to go deeper than they can with the summary and chat history alone.

It can be tough for a remote participant on a video conference to follow who’s speaking, doubly so if they cannot see the video because someone’s sharing their screen. That’s why, when I started Disasterpiece Theater, I recommended against screen sharing. On the other hand, it can be incredibly powerful for everyone to look at the same graph together. I’m still searching for the right balance between screen sharing and video that creates the best experience for remote participants.

Finally, my original Disasterpiece Theater checklist prompted the hosts to come up with synthetic requests they could make in a tight loop to visualize the fault and the tolerance. This practice never turned out to be as useful as a well-curated dashboard that covered request and error rate, a latency histogram, and so on. I’ve removed this prompt from the checklist at Slack to streamline the process.

These certainly won’t be the last evolutions of this process at Slack. If you adopt a similar process at your company, pay attention to what feels awkward so you can smooth it out, and to who’s not getting value so you can make the process more inclusive.

Getting Management Buy-In

Once again, a narrative is key. You might begin with a rhetorical device: “Hi there, CTO and VP of Engineering. Wouldn’t you like to know how well our system tolerates database master failure, network partitions, and power failures?” Paint a picture that includes some unknowns.

And then bring the uncomfortable truth. The only way to understand how a system copes with a failure in production is to have a failure in production. I should admit here that this was an incredibly easy sell to Slack’s executives who already believed this to be true.

In general, though, any responsible executive will need to see evidence that you’re managing risks effectively and appropriately. The Disasterpiece Theater process is designed specifically to meet this bar. Emphasize that these exercises are meticulously planned and controlled to maximize learning and minimize (or, better yet, eliminate) customer impact.

Then plan your first exercise and show off some results like the ones in the next section.

Results

I’ve run dozens of Disasterpiece Theater exercises at Slack. The majority of them have gone roughly according to plan, expanding our confidence in existing systems and proving the correct functioning of new ones. Some, though, have identified serious vulnerabilities to the availability or correctness of Slack and given us the opportunity to fix them before impacting customers.

Avoid Cache Inconsistency

The first time Disasterpiece Theater turned its attention to Memcached, it was to demonstrate in production that automatic instance replacement worked properly. The exercise was simple: disconnect a Memcached instance from the network and observe a spare take its place. Next, we restored its network connectivity and terminated the replacement instance.

During our review of the plan we recognized a vulnerability in the instance replacement algorithm and soon confirmed its existence in the development environment. As originally implemented, if an instance lost its lease on a range of cache keys and then got that same lease back, it did not flush its cache entries. But if another instance had served that range of cache keys in the interim, the data in the original instance could be stale and possibly incorrect.

We addressed this in the exercise by manually flushing the cache at the appropriate moment and then, immediately after the exercise, changed the algorithm and tested it again. Without this result, we may have lived unknowingly with a small risk of cache corruption for quite a while.
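For readers who want to picture the class of fix, here is a hypothetical sketch; it is not Slack’s actual cache-routing code and omits all the real lease-coordination machinery. The essential change is that a node flushes its entries for a key range whenever it acquires that range’s lease, even if it held the same lease before, because another node may have served and changed those keys in the interim.

```python
# Illustrative sketch of the class of fix, not Slack's actual code: flush
# the entries for a key range whenever its lease is acquired, even if this
# node held the same lease before.
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRange:
    low: int
    high: int

    def contains(self, key_hash: int) -> bool:
        return self.low <= key_hash < self.high


class CacheNode:
    def __init__(self):
        self.entries = {}    # key_hash -> cached value
        self.leases = set()  # KeyRanges this node currently serves

    def acquire_lease(self, key_range: KeyRange):
        # The original bug: skipping this flush when the node had held the
        # same lease before, leaving possibly stale entries in place after
        # another node served (and changed) those keys in the interim.
        self.flush_range(key_range)
        self.leases.add(key_range)

    def release_lease(self, key_range: KeyRange):
        self.leases.discard(key_range)

    def flush_range(self, key_range: KeyRange):
        self.entries = {k: v for k, v in self.entries.items()
                        if not key_range.contains(k)}
```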

Try, Try Again (for Safety)

In early 2019 we planned a series of ten exercises to demonstrate Slack’s tolerance of zonal failures and network partitions in AWS. One of these exercises concerned Channel Server, a system responsible for broadcasting newly sent messages and metadata to all connected Slack client WebSockets. The goal was simply to partition 25% of the Channel Servers from the network to observe that the failures were detected and the instances were replaced by spares.

The first attempt to create this network partition failed to fully account for the overlay network that provides transparent transit encryption. In effect, we isolated each Channel Server far more than anticipated, creating a situation closer to disconnecting them from the network than a network partition. We stopped early to regroup and get the network partition just right.

The second attempt showed promise but was also cut short before reaching production. It did offer a positive result, though: it showed Consul was quite adept at routing around network partitions. This inspired confidence but doomed the exercise, because none of the Channel Servers actually failed.

The third and final attempt finally brought along a complete arsenal of iptables(8) rules and succeeded in partitioning 25% of the Channel Servers from the network. Consul detected the failures quickly and replacements were thrown into action. Most importantly, the load this massive automated reconfiguration brought on the Slack API was well within that system’s capacity. At the end of a long road, it was positive results all around!

Impossibility Result

There have also been negative results. Once, while responding to an incident, we were forced to make and deploy a code change to effect a configuration change because the system meant to be used to make that configuration change, an internally developed system called Confabulator, didn’t work. I thought this was worthy of further investigation. The maintainers and I planned an exercise to directly mimic the situation we encountered. Confabulator would be partitioned from the Slack service but otherwise be left completely intact. Then we would try to make a no-op configuration change.

We reproduced the error without any trouble and started tracing through our code. It didn’t take too long to find the problem. The system’s authors anticipated the situation in which Slack itself was down and thus unable to validate the proposed configuration change; they offered an emergency mode that skipped that validation. However, both normal and emergency modes attempted to post a notice of the configuration change to a Slack channel. There was no timeout on this action but there was a timeout on the overall configuration API. As a result, even in emergency mode, the request could never make it as far as making the configuration change if Slack itself was down. Since then we’ve made many improvements to code and configuration deploys and have audited timeout and retry policies in these critical systems.
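A hypothetical sketch of the kind of fix this result motivates; the endpoints, names, and timeouts are invented, not Confabulator’s actual code. The configuration change is applied before the Slack notification, and the notification gets a short timeout and is treated as best-effort, so a Slack outage can never keep an emergency change from landing.

```python
# Hypothetical sketch, not Confabulator's code: keep the Slack notification
# bounded and off the critical path of the configuration change itself.
import requests

NOTIFY_TIMEOUT = 2  # seconds; short, because the notice is best-effort


def validate_with_slack(change):
    # Hypothetical stand-in for validation that requires Slack to be up.
    requests.post("https://slack.example.com/api/validate",
                  json={"change": change}, timeout=5)


def commit(change):
    # Hypothetical stand-in for actually writing the configuration.
    print("committed:", change)


def apply_config_change(change, emergency=False):
    if not emergency:
        validate_with_slack(change)  # may fail when Slack is down

    commit(change)  # the change itself no longer waits on any Slack call

    try:
        # The original bug: an unbounded post to a Slack channel sat on the
        # critical path while the overall API call had a timeout.
        requests.post("https://slack.example.com/api/notify",
                      json={"text": f"config change applied: {change}"},
                      timeout=NOTIFY_TIMEOUT)
    except requests.RequestException:
        pass  # best effort; the change already succeeded
```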

Conclusion

The discoveries made during these exercises and the improvements to Slack’s reliability they inspired were only possible because Disasterpiece Theater gave us a clear process for testing the fault tolerance of our production systems.

Disasterpiece Theater exercises are meticulously planned failures that are introduced first in the development environment and then, if that goes well, in the production environment by a group of experts all gathered together. The process helps minimize the risk inherent in testing fault tolerance, especially when that tolerance rests on assumptions made long ago in older systems that weren’t necessarily designed to be so fault tolerant.

The process is intended to motivate investment in development environments that faithfully match the production environment and to drive reliability improvements throughout complex systems.

Your organization and systems will be better for a regular cadence of Disasterpiece Theater exercises. Your confidence that something that works in the development environment will also work in the production environment should be higher. You should be able to regularly validate assumptions from long ago to stave off bit rot. And your organization should have a better understanding of risk, especially when it comes to systems that require human intervention to recover from failure. Most importantly, though, Disasterpiece Theater should be a convincing motivator for your organization to invest in fault tolerance.
