Chapter 6. Outage: A Case Study Examining the Unique Phases of an Incident

Day One

Detection

Around 4 p.m., Gary, a member of the support team for a growing company, begins receiving notifications from Twitter that the company is being mentioned more than usual. After finishing his responses to a few support cases, Gary logs into Twitter and sees several users complaining that they cannot access the service’s login page.

Gary then reaches out to Cathy, who happens to be the first engineer he sees online and logged into the company chat tool. She says she’ll take a look and reach out to others on the team if she can’t figure out what’s going on and fix it. Gary then files a ticket in the customer support system for follow-up and reporting.

Response

Cathy attempts to verify the complaint by accessing the login page herself. Sure enough, it’s throwing an error. She then sets out to determine which systems are affected and how to get access to them. After several minutes of searching her inbox, she locates a Google Document explaining ...
