Live online training

Running Effective Postincident Reviews in SRE

Topic: Security
Jaime Woo
Emil Stolarsky

After an incident, it’s not enough to simply write a postmortem. The best organizations have a learning culture that can draw lessons from failures and distribute them across the entire company to level everyone up. Organizations will never implement their systems perfectly. Instead, it’s critical to build and establish the cultures necessary to facilitate learning after an incident.

Join Incident Labs’ Emil Stolarsky and Jaime Woo to discover how teams can build an effective postincident review culture where everyone understands the value of the process. You’ll learn how to gather all the information necessary to build a full picture with informational interviews, develop a culture where employees feel safe expressing their concerns and sharing sources of potential issues, and explore best practices for ensuring that the lessons learned from the hard work put into the postincident review process are fed back into the organization to improve over time.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The hallmarks of a learning culture
  • The steps to building a learning culture
  • What a blame-aware culture is and the common setbacks that prevent one from taking hold
  • The roles and responsibilities in a postmortem
  • Root-cause analysis and the five whys (and why these are considered shallower forms of analysis)
  • How to examine the human factors at play during and after an incident

And you’ll be able to:

  • Share with stakeholders the importance of a learning organization, postmortems, and a blame-aware culture
  • Identify key metrics for analysis to understand the quality of incident response
  • Run a postmortem in multiple ways and implement the lessons learned within your organization
  • Ensure that your learning organization fits your culture

This training course is for you because...

  • You understand that high-performing companies must be learning organizations, where you analyze failures and implement lessons learned.
  • You need to create a blame-aware culture so that lessons can surface without fear of retribution or punishment.
  • You need to know how to run a proper postmortem so that it doesn’t get ignored or sidelined.

Prerequisites

  • Experience running software in production environments
  • Familiarity with postincident reviews (also known as RCAs, postmortems, etc.)

About your instructors

  • Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He started his career as a molecular biologist before working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function.

  • Emil Stolarsky is a site reliability engineer. Previously, he worked on caching, performance, and disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He’s the program cochair for SREcon EMEA 2019 and SREcon Americas West 2020 and contributed a chapter to the O’Reilly book Seeking SRE.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to postincident reviews (60 minutes)

  • Group discussion: How does your company handle postincident reviews?
  • Presentation: The importance of postincident reviews; the stages of a postincident review; using the right incident data for effective reports; the power of storytelling (and how to do it right)
  • Hands-on exercise: Generate informative interview questions
  • Q&A

Break (5 minutes)

Putting postincident reviews to work (60 minutes)

  • Presentation: Facilitating postincident review meetings; the ineffectiveness of looking for a root cause or the five whys; evaluating your incident review culture; the politics of incident reviews
  • Group discussion: Which incident review faux pas have you committed?
  • Hands-on exercise: Identify how to improve your postincident review
  • Q&A