Planning

Moving election day back a bit to accommodate this plan was a non-starter, as you might imagine. A concerted and hasty effort had to be organized to pull off a game day only a month prior to the election. The management team dedicated time to horse-trading and smoothing ruffled feathers to buy the engineering team the time it needed to prepare for failure.

The eventual compromise reached for buying that preparation time was finishing up another couple of weeks of feature work, followed by a considerable (for the campaign) two-week feature freeze to code around failure states. This was done both to keep the engineers on their toes and to keep the teams from shifting their focus too early. Engineers weren’t told about the game day until about two weeks before it would take place; in other words, the teams weren’t spending any effort on preparation until the feature freeze began.

The teams were informed on October 2nd that game day would take place on October 19th. That gave them 17 days to failure-proof software they had been building for 16 months. There was absolutely no way the teams could failure-proof everything, and that was a good thing. If a feature wasn’t absolutely important to our “Get Out The Vote” efforts, the application should fail back to a simple read-only version, or we should repoint the DNS to the homepage and call it a day. Because of this hard deadline, the teams did not waste time solving problems that didn’t matter to the core functionality of their applications.

Lead engineers, management, and project managers met with stakeholders and put together a list of the applications that needed to be covered and, more importantly, the specific functionality that needed to be covered in each. Any scope creep that may have been pervasive in the initial development easily went by the wayside when the conversation turned to what absolutely needed to work in an emergency. In a way, the strict timeline forced management to define the infrastructure’s “hierarchy of needs” and allowed everyone involved to focus relentlessly on those needs during the exercise. In a normal organization where time is more or less infinite, it’s very easy to attempt to bite off too much work or get bogged down in endless minutiae during this process. Consider imposing strict deadlines on a game day to enforce this level of focus.

For example, features that motivate and build communities around phone banking were incredibly important and vital to the growth of the OFA Call Tool. However, those same features could easily be shed in an emergency if it meant that people could continue to actually make phone calls, which was the application’s core purpose. While each application’s core functionality had to be identified and made failure-resistant, it was also beneficial to define exactly which features could be shed gracefully during a failure event.
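
To make the idea of graceful shedding concrete, here is a minimal sketch of a feature kill switch. The flag names and helper functions are hypothetical illustrations, not the campaign’s actual code; in a real deployment the flags would typically be read from a config service or environment so they can be flipped during an incident without a deploy.

```python
# Minimal sketch of shedding non-core features behind kill switches.
# Flag names and helpers are hypothetical, not the campaign's actual code.
FEATURE_FLAGS = {
    "call_tool.make_calls": True,     # core purpose: keep serving calls
    "call_tool.leaderboards": False,  # community features: shed during a failure
}

def feature_enabled(name):
    # Unknown or unset flags default to off, so a forgotten flag fails safe.
    return FEATURE_FLAGS.get(name, False)

def load_leaderboard():
    # Stand-in for the expensive, shed-able community feature.
    return [{"caller": "volunteer-123", "calls": 42}]

def render_call_tool_page(next_voter):
    page = {"next_call": next_voter}  # the core call flow always renders
    if feature_enabled("call_tool.leaderboards"):
        page["leaderboard"] = load_leaderboard()
    return page
```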

The feature-set hierarchy of needs was then compared against our infrastructure to determine which pieces of the infrastructure each feature relied on, and how reliable each of those pieces was deemed to be. For example, given a function that relied on writing information to a database, we would first have to determine how reliable that database was.

In our case, we used Amazon Relational Database Service (RDS), which took a lot of the simple database failure worries out of the way. However, there were still plenty of ways we could run into database problems: read replica failures could force all traffic to the master and overrun it; endpoints that weren’t optimized could be exercised at rates we had never seen; RDS itself could have issues or, worse, EBS (Amazon’s Elastic Block Store) could have issues; or we could scale the API so high that we would exhaust connections to the database.

With that many possible paths to failure, the database write path had to be considered risky, so we would either need an alternate write path or need to get agreement that the functionality was non-essential. In our case, we relied heavily on Amazon Simple Queue Service (SQS) for delayed writes in our applications, so that was a reasonable first approach. If the data being written could not be queued, or if queueing would introduce enough confusion that it was more prudent not to write at all (password changes fell into that category), the feature was simply disabled during the outage.
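
As a rough illustration of that delayed-write pattern, the sketch below tries the database first and, if the write fails, queues the payload on SQS for a worker to replay later. It uses boto3 and hypothetical queue, table, and column names; the campaign’s actual code, written against the AWS libraries of 2012, looked different.

```python
import json
import boto3

sqs = boto3.client("sqs")  # assumes AWS credentials and region are configured
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/call-results"  # hypothetical

def record_call_result(payload, db_conn):
    """Write to the primary database if possible; otherwise queue the write on SQS.

    A separate worker drains the queue and applies the writes once the
    database is healthy again.
    """
    try:
        with db_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO call_results (voter_id, outcome) VALUES (%s, %s)",
                (payload["voter_id"], payload["outcome"]),
            )
        db_conn.commit()
    except Exception:
        # Database is overloaded or unreachable: queue the write instead of failing.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
```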

Alternatively, given a function that needed to read data, we would go through the same assessment, determine that the database was risky, and assess the options for fallbacks. In our case that usually meant caching in Amazon ElastiCache on the API side as well as in each of the clients. The dual caches plus the database fallback together made for a pretty reliable setup, but on the off chance that both the database and the caches failed, we would be stuck. At that point, we would either need to find other fallbacks (reading from a static file in S3, or failing over to a completely different region) or decide whether this was an acceptable point of failure for that function.
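
Here is a sketch of that read-side fallback chain, again with hypothetical names: check the API-side cache first, then the database, and finally a static snapshot exported to S3. The client libraries (pymemcache for an ElastiCache memcached endpoint, boto3 for S3) are assumptions for the sake of the example, not the campaign’s actual stack.

```python
import json
import boto3
from pymemcache.client.base import Client as MemcacheClient

cache = MemcacheClient(("cache.example.internal", 11211))  # hypothetical ElastiCache endpoint
s3 = boto3.client("s3")
FALLBACK_BUCKET = "gameday-static-fallback"  # hypothetical bucket of exported snapshots

def get_polling_location(voter_id, db_conn):
    """Read through the fallback chain: cache, then database, then a static S3 snapshot."""
    key = "polling-location:%s" % voter_id

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    try:
        with db_conn.cursor() as cur:
            cur.execute(
                "SELECT location FROM polling_locations WHERE voter_id = %s",
                (voter_id,),
            )
            row = cur.fetchone()
        if row is not None:
            cache.set(key, json.dumps(row[0]), expire=300)
            return row[0]
    except Exception:
        pass  # database (or cache write) failed; fall through to the static snapshot

    # Last resort: a periodically exported, read-only snapshot stored in S3.
    obj = s3.get_object(Bucket=FALLBACK_BUCKET, Key="polling_locations/%s.json" % voter_id)
    return json.loads(obj["Body"].read())
```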

The teams spent a frantic two weeks implementing fail-safes for as many core features as possible. The service-based architecture that backed nearly every application made this a much simpler task than it would have been with monolithic applications. Because failures in downstream applications and infrastructure were abstracted away at the API layer, attention could be concentrated almost entirely on a single project. The major benefit was that applications higher in the stack could focus on problems within their own domain rather than duplicate effort on complex and tedious problems.

As this frantic sprint came to an end, the engineers had made some rather extreme but simple changes to how failures were handled, changes that would allow various dependencies to fail without taking everything else down. People were still putting the finishing touches on their work when they learned in a team meeting that game day would take place on Sunday instead of Friday. That was the only information the engineers received. Management was purposefully vague about the particulars to avoid “studying for the test” rather than learning the material, as it were.

On Saturday, the team was sent a schedule for the game day. The schedule outlined the different failures that would be simulated, in what order, what the response plan should be, and when each test would begin and end.

Almost everything in the email was a lie.
