Incident management
Published by O'Reilly Media, Inc.
Protocols and practice
Dealing with an incident or outage can be one of the most stressful aspects of supporting any service. If you’re part of a site reliability engineering team, you know that outages and production incidents are unavoidable, no matter how resilient your system is. These outages hurt the business, revenue, user satisfaction, and your team’s work life. The good news is that there are many practices you can use to reduce the impact of these outages and the time it takes to fix them.
Join expert Cindy Quach to learn the fundamentals of incident management: how to respond to outages more quickly, limit the damage they cause, and learn from each incident to help prevent similar ones in the future. Along the way, you’ll gain the skills to bring incident management practices into your organization and maintain them consistently across individuals and teams.
What you’ll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- Each step in the production incident cycle
- How to detect incidents quickly using effective alerts, service level indicators (SLIs), and service level objectives (SLOs)
- The steps needed to minimize the time it takes to resolve an incident
- When to use incident management protocols
And you’ll be able to:
- Incorporate incident management into healthy postmortem practices
- Identify incident commanders, communicators, and other key roles for incident management
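As a rough illustration of the SLI- and SLO-based detection listed above (not taken from the course materials), the Python sketch below computes an availability SLI from request counts and flags a possible incident when the error budget is burning too fast. The function names, the 99.9% SLO target, and the burn-rate threshold of 10 are all assumptions chosen for the example.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # Assumption: with no traffic, treat the service as healthy.
        return 1.0
    return good_requests / total_requests


def burn_rate(sli: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return (1.0 - sli) / error_budget


def should_alert(good_requests: int, total_requests: int,
                 slo_target: float = 0.999, threshold: float = 10.0) -> bool:
    """Flag a possible incident when the burn rate reaches the threshold.

    Both default values are hypothetical and would be tuned per service.
    """
    sli = availability_sli(good_requests, total_requests)
    return burn_rate(sli, slo_target) >= threshold


if __name__ == "__main__":
    # 200 failures out of 10,000 requests against a 99.9% SLO.
    print(should_alert(9_800, 10_000))
    # 5 failures out of 10,000 requests against the same SLO.
    print(should_alert(9_995, 10_000))
```

In practice, a check like this would typically run over one or more time windows of monitoring data rather than a single pair of counts.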
This live event is for you because...
- You’re a site reliability engineer or work in DevOps, systems engineering, or system administration.
- You manage site reliability engineers.
Prerequisites
- Familiarity with an on-call system
Recommended preparation:
- Read Managing Incidents (Chapter 14 in Site Reliability Engineering)
Recommended follow-up:
- Read Site Reliability Engineering (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Fundamentals of incident management (55 minutes)
- Lecture: Overview of incident management; defining an incident; understanding how to minimize the duration of an incident
- Hands-on exercise: Identify effective and ineffective postmortem action items
- Q&A
Break (5 minutes)
Practical incident management (55 minutes)
- Lecture: Incident management and common pitfalls to avoid
- Hands-on exercise: Implement practices from incident management at Google
- Q&A
Break (5 minutes)
Incident management and beyond (20 minutes)
- Lecture: Overview of other techniques and key performance indicators
Wrap-up and Q&A (10 minutes)
Your Instructor
Cindy Quach
Cindy Quach is a site reliability engineer at Google, where she helps Google Cloud customers adopt SRE practices and principles so they can scale their services. In her time at Google, she has worked on the company’s internal Linux distribution and on its mobile infrastructure and virtualization teams.