Chapter 3. Risk Management
Operations is a set of promises and the work it takes to fulfill them. In Chapter 2, we discussed how to create, monitor, and report on those promises. Risk management is what we do to identify, assess, and prioritize the uncertainties that could cause us to violate the promises we've made. It is also the application of resources (technology, tools, people, and processes) to monitor, mitigate, and reduce the probability of these uncertainties coming to pass.
This is not a perfect science! The goal is not to eliminate all risks; that is a quixotic aim that will waste resources. The goal is to bake the assessment and mitigation of risk into all of our processes and to iteratively reduce the impact of risks through mitigation and prevention techniques. This process should be performed continually, with inputs from the observation of incidents, the introduction of new architectural components, and the increase or decrease in impact as an organization evolves. The cycle can be broken down into seven steps:
Identify possible hazards/threats that create operational risk to the service
Conduct an assessment of each risk, looking at likelihood and impact
Categorize the likelihood and outcome of the risks
Identify controls for mitigating consequences or reducing likelihood of the risk
Prioritize which risks to tackle first
Implement controls and monitor effectiveness
Repeat process
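To make the cycle concrete, here is a minimal sketch of a risk register in Python. Everything in it is an assumption made for illustration: the `Risk` class, the 1-5 likelihood and impact scales, the likelihood-times-impact score, the category thresholds, and the sample risks and controls. The chapter does not prescribe a particular scoring scheme, so treat this as one possible way to assess, categorize, and prioritize rather than a definitive method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One entry in a hypothetical risk register."""
    name: str
    likelihood: int    # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int        # assumed scale: 1 (negligible) to 5 (severe)
    control: str = ""  # mitigation or prevention we plan to apply

    @property
    def score(self) -> int:
        # Assess: combine likelihood and impact into a single number.
        return self.likelihood * self.impact

    @property
    def category(self) -> str:
        # Categorize: thresholds are arbitrary and chosen for illustration.
        if self.score >= 15:
            return "high"
        if self.score >= 6:
            return "medium"
        return "low"


def prioritize(risks):
    """Prioritize: return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Identify: sample hazards, invented for this example.
register = [
    Risk("replica lag during peak load", likelihood=4, impact=2,
         control="lag alerting"),
    Risk("primary disk failure", likelihood=2, impact=5,
         control="automated failover"),
    Risk("accidental table drop", likelihood=1, impact=5,
         control="tested backups"),
]

ranked = prioritize(register)
for risk in ranked:
    print(f"{risk.category:6} {risk.score:2} {risk.name} -> {risk.control}")
```

Re-scoring the register after each incident or architectural change, and then re-running `prioritize`, is one simple way to approximate the "repeat process" step.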
By repeating this process, you are exercising Kaizen, or continuous ...