After an alert has been triggered and the team has responded and remediated the situation, it is time to evaluate what happened. This evaluation is called a live site incident review. In it, the whole team gathers to address the following:
- What happened: to start, a timeline is constructed from the moment the incident was discovered to the point at which normal operations were restored. Next, the timeline is expanded with the events that led up to the situation that triggered the incident.
- Next, the series of events is evaluated to learn what worked well in the response. If one member of the team used a new tool to diagnose a problem quickly, sharing that tool can benefit the other members of the team as well.
- Only then is it time to look at the possible points ...