Chapter 5. Postmortems and Beyond
In the previous chapter, we covered several ways to reduce customer impact through both technology and people, since both affect the time to detect, the time to mitigate or recover, and the time between failures. In this chapter, we talk about what happens after an incident has concluded: writing postmortems and using them as a powerful tool to analyze what went wrong and learn from mistakes.
After an incident has concluded, how do you know where to focus your efforts to minimize future incidents? We recommend taking a data-driven approach (Figure 5-1). The data can come from a risk analysis process or from the measurements we mentioned earlier. It’s important to rely on data collected from postmortems and on learnings from previous incidents that impacted customers.
Once you have a critical mass of postmortems, you can identify patterns. It’s important to let the postmortems be your guide; investing in the analysis of failure can lead you to success. For that purpose, we recommend creating a shared repository and sharing the postmortems broadly across internal teams.
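As a rough illustration of what identifying patterns across a shared repository could look like, the following Python sketch counts how often each contributing factor recurs across a set of postmortems. It is a minimal sketch under stated assumptions, not a prescribed tool: the `Postmortem` record, its fields, the `recurring_factors` helper, and the sample factor labels are all hypothetical and would need to match whatever your own postmortem template records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Postmortem:
    # Hypothetical fields; adapt to your own postmortem template.
    title: str
    contributing_factors: list[str]


def recurring_factors(postmortems, min_count=2):
    """Return (factor, count) pairs that appear in at least `min_count`
    postmortems, most frequent first (ties broken alphabetically)."""
    counts = Counter()
    for pm in postmortems:
        # Use a set so a factor is counted once per postmortem.
        counts.update(set(pm.contributing_factors))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(factor, n) for factor, n in ranked if n >= min_count]


if __name__ == "__main__":
    # Illustrative data only; not drawn from real incidents.
    repository = [
        Postmortem("Checkout outage", ["config change", "missing alert"]),
        Postmortem("Login latency", ["config change", "capacity"]),
        Postmortem("Search errors", ["missing alert", "config change"]),
    ]
    for factor, n in recurring_factors(repository):
        print(f"{factor}: {n} postmortems")
```

With the sample data above, "config change" appears in three postmortems and "missing alert" in two, so those are the factors the sketch would surface as candidates for investment. A tally like this only works if teams label contributing factors consistently, which is one more reason to keep postmortems in a shared repository.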
It’s hard to talk about postmortems without discussing psychological safety. Therefore, before diving into the details of writing postmortems, let’s first talk about ...