Post-Incident Reviews

Book description

Anyone who works with technology knows that eventually something will go wrong—even in today’s complex, distributed, and highly available IT systems. In the battle to maintain uninterrupted service, your DevOps teams require updated methods for detecting and solving problems fast. In this report, author Jason Hand explains that effective post-incident reviews today encourage team members to play a key role in continuously improving the system.

Traditional techniques for conducting post-incident analyses don’t work well in modern IT organizations, mainly because the command-and-control approach offers team members no incentive to explore the system and detect flaws when they occur. This report presents an up-to-date approach to post-incident reviews that embraces the human element and adds more eyes for discovering system flaws and potential improvements.

  • Understand why sustained success depends on a core value of continuous improvement
  • Examine why traditional post-incident approaches, such as Root Cause Analysis, do little to provide greater availability and reliability of IT services
  • Understand the role that team members can play in discovering system flaws
  • Learn why it’s often difficult to determine the cause and effect of outages in complex systems
  • Walk through a case study that examines the unique phases of an outage incident
  • Explore post-incident analysis in depth by moving away from causes and going deeper into the phases of the incident lifecycle

Table of contents

  1. Foreword
  2. Introduction
    1. Incident Detection
    2. Incident Response
    3. Incident Remediation
    4. Incident Analysis
    5. Incident Readiness
    6. Acknowledgments
  3. 1. Broken Incentives and Initiatives
    1. Control
    2. A Systems Thinking Lens
  4. 2. Old-View Thinking
    1. What’s Broken?
    2. The Way We’ve Always Done It
      1. Sample RCA (Using the “5 Whys” Format)
    3. Change
  5. 3. Embracing the Human Elements
    1. Celebrate Discovery
    2. Transparency
      1. Make Work (and Analysis) Visible
  6. 4. Understanding Cause and Effect
    1. Cynefin
    2. From Sense-Making to Explanation
    3. Evaluation Models
  7. 5. Continuous Improvement
    1. Creating Flow
    2. Eliminating Waste
    3. Feedback Loops
      1. Retrospectives
      2. Learning Reviews
      3. Objectives
  8. 6. Outage: A Case Study Examining the Unique Phases of an Incident
    1. Day One
      1. Detection
      2. Response
      3. Remediation
    2. Day Two
      1. Analysis
      2. Recap
  9. 7. The Approach: Facilitating Improvements
    1. Discovering Areas of Improvement
    2. Facilitating Improvements in Development and Operational Processes
      1. Identifying Trade-offs and Shortcomings in IT
  10. 8. Defining an Incident and Its Lifecycle
    1. Severity and Priority
      1. Priority
      2. Severity
    2. Lifecycle of an Incident
      1. Detection
      2. Response
      3. Remediation
      4. Analysis
      5. Readiness
  11. 9. Conducting a Post-Incident Review
    1. Who
      1. The Facilitator
    2. What
    3. When
    4. Where
    5. How
      1. Establish a Timeline
      2. Human Interactions
      3. Remediation Tasks
      4. ChatOps
      5. Metrics
      6. Time to Acknowledge (TTA) and Time to Recover (TTR)
      7. Status Pages
      8. Severity and Impact
      9. Contributing Factors
      10. Action Items
    6. Internal and External Reports
  12. 10. Templates and Guides
    1. Sample Guide
      1. Establish and Document the Timeline
      2. Plot Tasks and Impacts
      3. Learnings
      4. Contributing Factors
      5. Action Items
      6. Summaries and Public Reports
  13. 11. Readiness
    1. Next Best Steps

Product information

  • Title: Post-Incident Reviews
  • Author(s): Jason Hand
  • Release date: August 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491986950