At 10:30 a.m. Pacific time, eight hours after the outage started, Tom,[10] our account representative, called me to come down for a post-mortem. Because the failure occurred so soon after the database failover and maintenance, suspicion naturally condensed around that action. In operations, “post hoc, ergo propter hoc”[11] turns out to be a good starting point most of the time. It’s not always right, but it certainly provides a place to begin looking. In fact, when Tom called me, he asked me to fly there to find out why the database failover caused this outage.

Once I was airborne, I started reviewing the problem ticket and preliminary incident report on my laptop.

My agenda was simple: conduct a post-mortem investigation, ...
