At 10:30 a.m. Pacific time, eight hours after the outage started, Tom, our account representative, called me to come down for a post-mortem. Because the failure occurred so soon after the database failover and maintenance, suspicion naturally condensed around that action. In operations, “post hoc, ergo propter hoc” (“after this, therefore because of this”) turns out to be a good starting point most of the time. It’s not always right, but it certainly provides a place to begin looking. In fact, when Tom called me, he asked me to fly there to find out why the database failover caused this outage.
Once I was airborne, I started reviewing the problem ticket and preliminary incident report on my laptop.
My agenda was simple: conduct a post-mortem investigation, and ...