Incident management

Intermediate

Protocols and practice

Dealing with an incident or outage can be one of the most stressful aspects of supporting any service. If you’re part of a site reliability engineering team, you know that outages or production incidents are unavoidable—no matter how resilient your system. These outages negatively affect the business, revenue, user satisfaction, and your team’s work life. The good news is that there are many practices you can use to alleviate the negative impact of these outages and reduce the time it takes to fix them.

Join expert Cindy Quach to learn the fundamentals of incident management so you can respond to outages more quickly, minimize the damage they cause, and learn from each incident to help prevent similar ones in the future. Along the way, you’ll gain the skills to bring incident management practices into your organization and maintain them consistently across individuals and teams.

What you’ll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

Each step in the production incident cycle
How to detect incidents quickly using effective alerts, service level indicators (SLIs), and service level objectives (SLOs)
The necessary steps to minimize the time it takes to repair an incident
When to use incident management protocols

And you’ll be able to:

Incorporate incident management into healthy postmortem practices
Identify incident commanders, communicators, and other key roles for incident management

This live event is for you because...

You’re a site reliability engineer or work in DevOps, systems engineering, or system administration.
You manage site reliability engineers.

Prerequisites

Familiarity with an on-call system

Recommended preparation:

Read Managing Incidents (Chapter 14 in Site Reliability Engineering)

Recommended follow-up:

Read Site Reliability Engineering (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Fundamentals of incident management (55 minutes)

Lecture: Overview of incident management; defining an incident; understanding how to minimize the duration of an incident
Hands-on exercise: Identify effective and ineffective postmortem action items
Q&A

Break (5 minutes)

Practical incident management (55 minutes)

Lecture: Incident management and common pitfalls to avoid
Hands-on exercise: Implement practices from incident management at Google
Q&A

Break (5 minutes)

Incident management and beyond (20 minutes)

Lecture: Overview of other techniques and key performance indicators

Wrap-up and Q&A (10 minutes)

Your Instructor

Cindy Quach
Cindy Quach is a site reliability engineer at Google, where she helps Google Cloud customers adopt SRE practices and principles so they can scale their services. In her time at Google, she’s worked on the company’s internal Linux distribution and on its mobile infrastructure and virtualization teams.
