Chapter 11. Readiness

You can’t step in the same river twice.

Heraclitus (Greek Philosopher)

We never had a name for that huddle and discussion after I’d lost months’ worth of customer data. It was just, “Let’s talk about last night.” That was the first time I’d ever been a part of that kind of investigation into an IT-related problem.

At my previous company, we would perform root cause analyses (RCAs) following incidents like this. I didn’t know there was another way to go about it. This time, we determined that the proximate cause was a bug in a backup script unique to Open CRM installations on AWS. More importantly, we all walked away with much more knowledge about how the system worked, armed with new action items to help us detect and recover from future problems like this much faster. As with the list of action items in Chapter 6, we set in motion many ways to improve the system as a whole rather than focusing solely on the one part of the system that had failed under unusual circumstances.

It wasn’t until over two years later, after completely immersing myself in the DevOps community, that I realized the exercise we had performed (intentionally or not) was my very first post-incident review. I had already read blog posts and absorbed presentation after presentation about the absence of root cause in complex systems. But it wasn’t until I made the connection back to that first post-incident review that I realized it’s not about the report or discovering the root cause—it’s about learning more about ...
