Chapter 4. Mitigation and Recovery
We’ve talked about scaling incident management and about using component responders and SoS responders to help manage incidents as your company scales. We’ve also covered the characteristics of a successful incident response organization, and discussed managing risk and preventing on-call burnout. Here, we talk about recovery after an incident has occurred. We’ll start by focusing on urgent mitigations.
Urgent Mitigations
Previously, we encouraged you to “stop the bleeding” during a service incident. We also established that recovery includes the urgent mitigations needed in order to avoid impact or prevent growth in impact severity. Let’s touch on what that means and some ways to make mitigation easier during urgent circumstances.
Imagine that your service is having a bad time. The outage has begun, it’s been detected, it’s causing user impact, and you’re at the helm. Your first priority should always be to stop or lessen the user impact, not to figure out what’s causing the issue. Imagine you’re in a house and the roof begins to leak. The first thing you’re likely to do is place a bucket under the dripping water to prevent further water damage, before you grab your roofing supplies and head upstairs to figure out what’s causing the leak. (As we’ll find out later, if the roofing failures are the root cause, the rain is the trigger.) The bucket reduces the impact until the roof is fixed and the sky clears. To stop or lessen user impact during a ...