SRE incident response
Building a successful postincident culture
Incidents are costly. When your system goes down, you must work quickly, efficiently, and effectively to get things back up. The gold standard process is the incident management system (IMS), developed by American firefighters in the 1970s. IMS is now used by militaries, emergency personnel, and—in the domain of site reliability engineering (SRE)—companies like Google. Responding efficiently and effectively can make the difference between meeting your service-level objectives (SLOs) and blowing right past them—which is why effective incident response is a core pillar of SRE.
Just as important are the preparation done beforehand and the analysis that occurs afterward. During nonincident times, organizations should be safely testing how services may fail (such as with game days), planning who responds when things break, and crafting playbooks for common actions and responses. Postincident, measuring and evaluating incident response is crucial to determine what works and what doesn’t.
Incident Labs’ Emil Stolarsky and Jaime Woo show you how to create a successful incident response strategy, from preparation and training to running IMS during the incident to evaluating the response and sharing lessons learned throughout your organization. Our services will never be perfect, and they’ll all break eventually. What makes us SREs is how we prepare for those days when things break, how we respond, and what we learn.
What you'll learn and how you can apply it
By the end of this live online course, you’ll understand:
- The importance of a centralized command structure for incident response
- The value of preparation for incident response through training
- Why resilience engineering principles are key to successful postincident response
And you’ll be able to:
- Run IMS within your company’s incident response strategy
- Prepare for incidents through the use of game days and chaos engineering
- Run effective postincident review meetings and share those lessons company-wide
This training course is for you because...
- You’re an operator or SRE who wants to better respond to incidents.
- You work within a company that does incident response well but could do it better.
- You want to become a leader inside your company in helping teams learn from incidents.
Prerequisites
- An understanding of core SRE principles (as covered by either of the recommended resources below)
Recommended preparation
- Read “Introduction” (chapter in Site Reliability Engineering)
- Take Site Reliability Engineering Fundamentals (live online training course with Jaime Woo and Emil Stolarsky)
Recommended follow-up
- Read Site Reliability Engineering (book)
- Read The Site Reliability Workbook (book)
- Read Seeking SRE (book)
- Watch Spotlight on Cloud: Reducing the Impact of Service Outages with Generic Mitigations with Jennifer Mace (video)
- Read Chaos Engineering (book)
About your instructors
Emil Stolarsky is a site reliability engineer. Previously, he worked on caching, performance, and disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He’s the program cochair for SREcon EMEA 2019 and SREcon Americas West 2020 and contributed a chapter to the O’Reilly book Seeking SRE.
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He started his career as a molecular biologist before working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
During the incident (30 minutes)
- Presentation: What is IMS?; getting organizational buy-in, including business goals and tooling
- Hands-on exercise: Explore a quick example of IMS
Before the incident (40 minutes)
- Presentation: Preparation through training
- Hands-on exercise: Work through on-call schedules
Break (5 minutes)
After the incident (45 minutes)
- Presentation: What is a successful postincident meeting?; How does an organization learn from incidents?
- Group discussion: Key metrics after an incident