Taming chaos: Preparing for your next incident

Tim Craig and Gustavo Franco on establishing robust and well-supported incident response processes.

By Andy Oram

August 1, 2019


Incident response is like security: when you do it well, no one notices because everything just works the way it should. With both incident response and security, the costs are obvious to the organization while the benefits remain amorphous. And as with security, you may come to regret neglecting incident response: you can lose a good deal of money while your systems are down. Even worse, your clients and customers might lose confidence in your product or organization, potentially costing you the entire business.

In this interview, Tim Craig and fellow Googler Gustavo Franco, a site reliability engineer (SRE), discuss the wide range of events that qualify as “incidents”; the need for a conscious, robust, and well-defined process for understanding them; the role of training; and how to get buy-in from management so you can spread incident response training throughout an organization.

The concept of incident response is very broad, according to Craig. It goes far beyond the people usually considered, such as SREs and network administrators. Imagine a major incident requires a statement from an executive who happens to be eight time zones away or on an airplane. Can you reach them quickly? Are your legal and PR teams ready? Thus, incident response can cross many teams and involve an entire organization.

Craig may also surprise you by elevating processes above tools. But this is natural because incident response is an organizational issue, not just a technical one. Tools are important, of course—more on that in a moment—but people are even more important. When you choose the types of incidents for which you need to train and prepare, consider not just what’s most important to the business, but also what can best teach your staff.

Ideally, when a disaster happens, everybody who can help will immediately take their places and perform a useful role, like the crew of an airplane or ship. This requires regular training, just as people who earn first aid and CPR badges must complete ongoing follow-up training.

Tools enter the picture in order to automate disaster recovery testing. Teams create software to aid as much as possible in the design, scheduling, and evaluation of side effects from tests. Incident response is often practiced and measured during these tests as well. Automation is important for several reasons:

To scale up and protect as many aspects of your systems as possible, you need to shorten the amount of time a responder requires to create and run tests.
To persuade busy coworkers to adopt incident response, you need to make it easy for them.
To get approval for the incident response program from managers, you need to minimize costs, an aspect with which automation can help.

Many organizations can start small incident response initiatives without approval from higher management. But to move beyond isolated teams and to try out incident response where it really matters, on the production systems facing your clients, you’ll need buy-in from high up in the organization.

Craig endorses starting small and simple. He reminds the audience that Google has been practicing its current incident response program for 15 years. Don’t try to attain Google’s level of organization and automation from the start. It can be useful just to sit with the people responsible for handling incidents and talk through what they do, while making notes on paper. The chapter on incident response in The Site Reliability Workbook offers additional detail.

Craig and Franco use a couple of abbreviations that I’ll define here:

MTTM: Mean time to mitigate, one of several metrics describing how long it takes to recover from an incident.
SLO: Service-level objective, part of a service-level agreement (SLA).
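
To make the time-to-mitigate metric concrete, here is a minimal Python sketch of how a team might compute it from its own incident log. Nothing in it comes from the interview: the function names, the (detected, mitigated) record layout, and the sample timestamps are all hypothetical. It reports the arithmetic mean alongside the median, since the median is less sensitive to a single very long outage.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident log: (time detected, time mitigated) pairs.
# These timestamps are made up for illustration only.
INCIDENTS = [
    (datetime(2019, 6, 3, 9, 0), datetime(2019, 6, 3, 9, 30)),     # 30 minutes
    (datetime(2019, 6, 17, 14, 0), datetime(2019, 6, 17, 15, 0)),  # 60 minutes
    (datetime(2019, 7, 8, 22, 0), datetime(2019, 7, 9, 1, 0)),     # 180 minutes
]


def minutes_to_mitigate(detected, mitigated):
    """Return the minutes between detection and mitigation of one incident."""
    if mitigated < detected:
        raise ValueError("mitigation time precedes detection time")
    return (mitigated - detected).total_seconds() / 60


def summarize(incidents):
    """Return the mean and median time to mitigate, in minutes."""
    durations = [minutes_to_mitigate(d, m) for d, m in incidents]
    return {
        "mean_minutes": mean(durations),
        "median_minutes": median(durations),
    }


if __name__ == "__main__":
    print(summarize(INCIDENTS))
```

With the sample data above, one three-hour incident pulls the mean well above the median, which is one reason to look at both figures before comparing the result against an objective.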

The term chaos engineering comes up during the interview, but Craig points out that incident response is a broader activity, and much more under the team’s control than the term “chaos” would suggest.

Take a listen for more interesting details about incident response processes that have worked at Google.

This post is a collaboration between O’Reilly and Google. See our statement of editorial independence.
