Chapter 3. Risk Management

Operations is a set of promises and the work it takes to fulfill them. In Chapter 2, we discussed how to create, monitor, and report on those promises. Risk management is what we do to identify, assess, and prioritize the uncertainties that could cause us to violate the promises we’ve made. It is also the application of resources (technology, tools, people, and processes) to monitor and mitigate these uncertainties and to reduce the probability of their coming to pass.

This is not a perfect science! The goal is not to eliminate all risks; that is a quixotic goal that will waste resources. The goal is to bake the assessment and mitigation of risk into all of our processes and to iteratively reduce the impact of risks through mitigation and prevention techniques. This process should be performed continually, with inputs from the observation of incidents, the introduction of new architectural components, and the increase or decrease in impact as an organization evolves. The cycle can be broken down into seven steps:

  • Identify possible hazards/threats that create operational risk to the service

  • Conduct assessment of each risk, looking at likelihood and impacts

  • Categorize the likelihood and outcome of the risks

  • Identify controls for mitigating consequences or reducing likelihood of the risk

  • Prioritize which risks to tackle first

  • Implement controls and monitor effectiveness

  • Repeat process
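The assessment, categorization, and prioritization steps above can be sketched as a small risk register. This is a minimal illustration, not a prescribed method: the 1-to-5 scales, the likelihood-times-impact score, the category thresholds, and the example risks are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)
    mitigated: bool = False  # set once a control is implemented and verified

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def categorize(risk: Risk) -> str:
    """Bucket a risk by score; the thresholds are illustrative only."""
    if risk.score >= 15:
        return "high"
    if risk.score >= 6:
        return "medium"
    return "low"


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return unmitigated risks, highest score first."""
    open_risks = [r for r in risks if not r.mitigated]
    return sorted(open_risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries for a database service.
register = [
    Risk("disk fills on replica", likelihood=4, impact=2),
    Risk("primary database host failure", likelihood=3, impact=5),
    Risk("backup restore fails", likelihood=2, impact=5),
]

for risk in prioritize(register):
    print(f"{risk.score:>2}  {categorize(risk):<6}  {risk.name}")
```

Rerunning `prioritize` after each incident review or architectural change, with updated likelihood and impact values, is one way to carry out the "repeat" step.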

By repeating this process, you are exercising Kaizen, or continuous ...
