Live Online Training

Incident Management

Protocols and practice

Cindy Quach

Dealing with an incident or outage can be one of the most stressful parts of supporting any service. Outages and production incidents take a toll on the business, revenue, user happiness, and engineers’ cortisol levels, and the kicker is that they are unavoidable. Outages will occur no matter how resilient you think your system is, and there’s not much you can do about that. There are, however, many practices you can use to lessen the impact and reduce the time it takes to fix an outage.

In this course, you’ll learn the fundamentals of incident management so that you can respond to outages more quickly, minimize the damage, and learn from each incident to help prevent a repeat.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How the production incident cycle works
  • How to minimize the time to detect an incident using effective alerts, Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • How to minimize the time to repair an incident
  • How to reliably identify when to use incident-management protocols
  • How to identify incident commanders, communicators, and other key roles for incident management
  • How to maximize the time between outages using postmortems

And you’ll be able to:

  • Bring incident management practices to your organization
  • Maintain consistency across individuals and teams for incident management practices
  • Incorporate incident management into healthy postmortem practices

This training course is for you because...

  • You’re a site reliability engineer (SRE) or work in a related discipline such as DevOps, systems engineering, or system administration
  • You manage SREs

Prerequisites

  • Familiarity with an on-call system

About your instructor

  • Cindy Quach is a Site Reliability Engineer at Google. She has worked as an SRE on various Google products and teams, including the internal Linux distribution, mobile infrastructure, and virtualization. She currently works on the customer reliability engineering team, where she helps Google Cloud customers adopt SRE practices and principles to scale their services.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Part 1: Fundamentals of Incident Management (55 minutes)

  • Presentation: An overview of incident management, covering what an incident is, why we want to reduce the time an incident takes, and how we can accomplish that.
  • Exercise: Identify good and not-so-good postmortem action items.
  • Q&A
  • Break (5 minutes)

Part 2: Practical Incident Management (55 minutes)

  • Presentation: Hands-on incident management
  • Exercise: Walk through an incident using IMAG (Incident Management at Google).
  • Q&A
  • Break (5 minutes)

Part 3: Incident Management and Beyond (30 minutes)

  • Presentation: Overview of other techniques you can use to reduce mean time to detect (MTTD) and mean time to repair (MTTR), and to increase mean time between failures (MTBF).
  • Q&A