Chapter 3. Scaling Incident Management (Response)
We’ve discussed building incident response readiness through incident response exercises, role-playing, and regular tests. These tactics prepare you for the moment a real incident occurs and you start managing it (see “Establish an organized incident response procedure”). But how do you manage incidents once your organization starts to grow? In this section, we discuss how to scale incident management.
At Google, we’re organized to provide optimal incident management coverage for all systems, and Google has grown very large. To serve more than 2 trillion queries per year, we rely on many data centers, at least a million computers, and more than 80,000 employees. All this activity is routed through a massive, highly interconnected system of systems (SoS) that depends critically on its technical stack being in active production. Supporting this technical stack means that appropriate personnel must be reliably available to troubleshoot and correct issues as they arise. These are the responders in our Site Reliability Engineering organization; they provide incident management coverage for systems and respond when an incident occurs.
Component Responders
Within the Site Reliability Engineering organization, we also have component responders, who are incident responders on call for one component or system within Google’s overall technical infrastructure (Figure 3-1).