We worked as a team across all of the workstreams to make sure that we stayed available in every instance. We found ways for people who were generally not affected by outages to be the first line of defense for external queries, allowing everyone to own the process. We re-learned that, in the end, most of handling an outage is about communication: within the technical group, with the stakeholders, and most importantly with the users. As a team-building exercise, the shared burden was in my eyes a wonderful success.
There are technical failures and process failures, and in testing the former, we found how to improve on the latter.
Forcing a timeline meant that we were able to focus on the things that matter. In our case, we had a deadline that couldn’t possibly move. However, that wasn’t the deadline that actually mattered for defining that focus; the game day date was the real cutoff. Manufactured deadlines can be just as effective at creating that focus, but some relation to reality is helpful. If you find yourself getting bogged down trying to failure-proof 100% of your application, you’ll never get there. Do monthly game days and actively choose a small piece to have failure modes each time. Think of it this way: if you bite off 20% failure coverage for each game day, you’ll be in infinitely better shape than trying to failure-proof everything before you test it.
Responding to these simulated and legitimate incidents helped us in innumerable ways, but perhaps the most valuable lesson was how it solidified the mechanics of our incident response process. An initial draft of a document on how incidents should be handled was written in June 2011. However, it wasn’t until we had a mature team working through actual incidents that we were able to turn that document into a process that worked for our team.
There were a couple of takeaways on the process side of dealing with an incident that are worth highlighting.
Having a defined chain of command was absolutely necessary for getting anything done. An influx of a dozen smart people, all with great ideas, trying to steer a ship in a dozen directions is a recipe for disaster. There needs to be someone who can listen to everyone and say “Yes, this is what we are going to do.” Establish up front who that person is.
Having one person keep a running shared document (Google Docs works great for this) of the incident meant that communication to stakeholders was as easy as pointing them to a web link and answering whatever questions they had about it. Having that document also meant that as people joined the channel they were able to catch up on the situation without having to interrupt the problem solving happening in channel. Pro tip: change the subject of the room to the current incident, and add a link to the current incident document.
We also found some flaws in our process that we didn’t have viable solutions for. For instance, we would regularly have several incidents in flight at the same time. (Hilariously, less so during the game day. It turns out that, as sadistic as I was, I was nicer than reality.) As soon as there was a second incident, communication in a single channel became overly confusing. Our workaround was to move the discussion of secondary incidents to other rooms that made sense, and to have someone (usually me) act as the dispatcher, redirecting conversations to the appropriate rooms. Not a scalable system, but we were able to make it through.
In addition, the results of the failures were documented per application, and runbooks were generated. Runbooks are essentially incredibly simple “if this, then that” instructions for assorted failure conditions. We bundled these runbooks with the documentation and READMEs in the git repositories for the respective applications so that anyone responding to an incident would know how to respond.
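To make the “if this, then that” shape concrete, a runbook entry can be as simple as a lookup from symptom to documented action. This is a hypothetical sketch; the symptom names and actions below are invented for illustration, not the campaign’s actual runbooks:

```python
# Hypothetical runbook sketch: "if this, then that" entries keyed by symptom.
# Symptom names and actions are illustrative only.
RUNBOOK = {
    "replica_lag_high": "Pull the lagging replica from the read pool; reads shift to the master.",
    "master_unreachable": "Flip the app to read-only mode and queue writes for replay.",
    "db_cpu_pegged": "Disable the offending endpoint via its feature flag.",
}

def next_action(symptom):
    """Return the documented response, or escalate when the runbook has no entry."""
    return RUNBOOK.get(symptom, "No runbook entry: escalate to the incident commander.")
```

Keeping entries this mechanical is the point: a responder at 3 a.m. should be executing a decision that was already made, not making one.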
In the future, I will use game days to inform the incident response process, not the other way around. Accepting that there are different responses for each application is something that needs to happen. While a baseline set of guidelines for responding to any incident is necessary, we also need a runbook for each application.
On the technical side, we were able to validate that many of our assumptions were correct and learned that some of them were incredibly far off. Our assumption that using a service-oriented architecture would mean that we could isolate cross-application issues to that layer was demonstrably false, hoisted by our own technical debt.
The most obvious takeaway from the game day was identifying exactly where we failed unexpectedly and where we had more work to do. We took this as direction on where to focus efforts for the next sprint, and retested. We ended up doing a scaled-back version of the game day nearly every week for the following three weeks to validate that we were getting more and more coverage in our failsafes.
The often-repeated adage of “if you don’t measure it, it didn’t happen” held true, but it also led us to the corollary: “if you don’t alert, it doesn’t matter if you measure.” We learned that there was great benefit in measuring both success and failure, and that alerting on the absence of success could find problems that alerting only on failure would miss.
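Alerting on the absence of success can be as simple as a heartbeat check: record a timestamp on every successful run, and page when that timestamp goes stale. A minimal sketch, assuming a five-minute window (the window and the shape of the check are my assumptions, not our actual monitoring configuration):

```python
def success_is_stale(last_success_ts, now, window_s=300):
    """True when no success has been recorded within the window.

    A job that silently stops running emits no failure metric at all,
    so an alert on failures alone would never fire; this check does.
    """
    return (now - last_success_ts) > window_s
```

The failure-side alert still matters for fast, noisy breakage; the staleness check catches the quiet kind, where the pipeline simply stops reporting anything.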
As far as failovers go, we found our basic structure was sound. For databases: on slave failure, shift seamlessly to the master; on master failure, fall back to the slaves and fail to read-only (or to other fallbacks above the database level, like queuing writes where appropriate); and on failure of both master and replicas, fail to cache. This isn’t really novel in any way, and that’s the point. Having clearly defined failure modes was far more successful than trying to be clever and having it all appear to be working.
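That ladder of failure modes can be sketched as explicit read and write paths. The connection objects and exception type here are stand-ins, assumed for illustration rather than taken from our actual client code:

```python
class Unavailable(Exception):
    """Raised by a store that cannot serve the request (stand-in exception type)."""

def read(key, replicas, master, cache):
    # 1. Prefer replicas for reads.
    for replica in replicas:
        try:
            return replica(key)
        except Unavailable:
            continue
    # 2. All replicas down: read from the master.
    try:
        return master(key)
    except Unavailable:
        pass
    # 3. Master and replicas down: serve (possibly stale) data from cache.
    return cache.get(key)

def write(stmt, master, write_queue):
    # Master down: the app goes read-only and queues the write for later replay.
    try:
        master(stmt)
        return "applied"
    except Unavailable:
        write_queue.append(stmt)
        return "queued"
```

Each rung degrades service a little further, but every rung is a state we chose in advance, which is exactly what made the behavior predictable under load.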
Having the API return headers explaining the health of the API was not necessarily obvious, but it was an incredibly helpful way to make sure that clients were able to do any higher level failure handling.
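One way to expose that health is a small set of response headers the client inspects before deciding how aggressively to degrade. The header names below are hypothetical; the actual headers we returned aren’t recorded here:

```python
def with_health_headers(headers, db_ok, cache_ok):
    """Annotate a response with backend health so clients can degrade gracefully.

    Header names are invented for this sketch.
    """
    headers["X-Health-DB"] = "ok" if db_ok else "degraded"
    headers["X-Health-Cache"] = "ok" if cache_ok else "degraded"
    return headers

def client_should_write(headers):
    # A client that sees a degraded database can skip writes instead of erroring.
    return headers.get("X-Health-DB") == "ok"
```

The payoff is that clients don’t have to infer backend health from timeouts and error codes; the API tells them directly, on every response.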
Feature flags that are configurable at runtime proved to be an incredibly important tool in our toolset. While simulating load, we pushed on a single endpoint that had very expensive queries, which had incredibly adverse effects on the database, specifically pegging its CPU. In an emergency like that, the ability to turn off a single endpoint instead of failing over the entire data infrastructure is invaluable.

Our game day validated that some of our initial choices about how to implement those switches were sound. A first draft of the feature flags set them in a configuration file in the code, so changing a flag necessitated a full deploy. We switched to a simple flat file in Amazon’s S3, cached in memcache under a known key for a set number of minutes. Having it be a simple file was an important decision: being able to make changes with only a text editor theoretically empowered even non-technical people to make the change in an emergency. In reality, a lack of documentation about how to edit the file meant that the people who could actually make a worthwhile change were limited to those intimately familiar with the raw code. We also had feature flags across many of our apps but failed to standardize them in any way. More standardization and documentation would have increased our emergency bus number significantly.
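The flag-loading mechanism described above might look roughly like this, with the S3 fetch and the memcache client stubbed out as plain callables. The cache key name, the JSON file format, and the five-minute TTL are all assumptions for the sketch:

```python
import json
import time

FLAG_CACHE_KEY = "feature_flags"  # the "known key" in memcache (name assumed)

def get_flags(fetch_flag_file, cache, ttl_s=300, now=None):
    """Return feature flags, re-reading the flat file when the cached copy expires.

    fetch_flag_file: callable returning the raw S3 file contents (JSON assumed;
    the actual file format isn't specified in the text).
    cache: dict-like stand-in for memcache.
    """
    now = time.time() if now is None else now
    entry = cache.get(FLAG_CACHE_KEY)
    if entry is not None and now - entry["fetched_at"] < ttl_s:
        return entry["flags"]  # still fresh: no round trip to S3
    flags = json.loads(fetch_flag_file())
    cache[FLAG_CACHE_KEY] = {"fetched_at": now, "flags": flags}
    return flags
```

Because the source of truth is a plain text file, an emergency change is an edit and an upload, and it propagates to every app server within one cache TTL, with no deploy.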
Possibly the most important takeaway from this whole experience is that we should have been doing these exercises more often, and that we needed to continue doing them. And so we did, until we were confident that all of our systems and applications would act as we expected, even in failure, for election day.
And when election day weekend arrived and our scale jumped two to five times every day for four days, we knew not only how to handle any problems, but that our applications and infrastructure would react predictably.