What’s Missing from Postmortem Investigations and Write-Ups?
How would you feel if you had to write a postmortem containing statements like these?
“We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.”
“We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.”
“We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.”
While these scenarios are entirely realistic, I challenge the reader to find many postmortem write-ups that even hint at these “human factors.” A rare and notable exception might be Heroku’s “Widespread Application Outage”[1] from the April 21, 2011, “absolute disaster” of an EC2 outage, which dryly notes:
Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all time.
The absence of such statements from postmortem write-ups may be due in part to the social stigma associated with publicly acknowledging the contribution of human factors to outages. And yet, people dealing with outages are subject to physical exhaustion and psychological stress, and they suffer from communication breakdowns, not to mention impaired reasoning due to a host of cognitive biases.
What actually happens ...