What is risk management?
Practical strategies for managing risk in your systems.
All complex systems carry risk. It is inevitable, and you cannot remove all of it from a complex system such as a web application. However, examining your risk and determining how much of it is acceptable is important in keeping your system healthy.
In the sections that follow, I’ll provide an overview of what risk is and how we can identify it. I’ll also introduce a process called risk management, which helps us reduce the effect of risk on our applications.
Risk management involves determining where the risk is within your system, determining which risks must be removed and which remain, and then mitigating the remaining risks to reduce their likelihood and severity.
When a risk triggers (or occurs), you or your system suffer a loss. This loss can be data lost by your company or a customer. It can be your application becoming unavailable to your customers. It can be invalid or erroneous results. Any of these can result in your customers losing trust in your ability to manage their data and their business, which will ultimately cost you money.
However, you must weigh this loss against a competing cost: what does it cost to remove the risk so that it cannot occur? Risk management is a matter of balancing the cost of removing a risk against the cost of having the risk occur.
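One rough way to make this comparison concrete is to estimate an expected loss and set it against the cost of removal. The sketch below is only an illustration: the function names, the idea of multiplying frequency by loss per occurrence, and every number in it are assumptions, not a method prescribed here.

```python
def expected_annual_loss(occurrences_per_year, loss_per_occurrence):
    """Estimated yearly loss if the risk is left in place (assumed simple model)."""
    return occurrences_per_year * loss_per_occurrence


def worth_removing(occurrences_per_year, loss_per_occurrence,
                   removal_cost, horizon_years=1.0):
    """True if the estimated loss over the horizon exceeds the one-time removal cost."""
    loss = expected_annual_loss(occurrences_per_year, loss_per_occurrence)
    return loss * horizon_years > removal_cost


# Hypothetical figures: the risk triggers about once every two years,
# costs 40,000 each time, and would cost 50,000 to remove.
print(worth_removing(0.5, 40_000, 50_000))                   # one-year view
print(worth_removing(0.5, 40_000, 50_000, horizon_years=3))  # three-year view
```

The same risk can look acceptable over one year and worth removing over three, which is why the time horizon is an explicit parameter in this sketch.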
Your first step in managing risk is creating a list of all known risks, along with their severity and their likelihood of occurring.
We call this list a risk matrix, an example of which is shown in Figure 1.
Creating the matrix initially involves brainstorming. You can get ideas for what to put in your risk matrix from multiple sources:
- Collective wisdom of the developers
- Known high-support areas
- Known threat vectors or vulnerabilities
- Known areas where the system is incomplete or missing capabilities
- Known poor performance areas
- Known traffic spikes and patterns
- Specific concerns from business owners, support personnel, or users
- Known technical debt in your system
You will likely find that there are obvious entries in the list, but there should also be entries that surprise you. This is good. You want to uncover as many of your risks as possible, and if none of them come as a surprise to you, you probably haven’t dug deep enough.
Creating the risk matrix involves assigning each risk two values: one for the likelihood of the risk occurring, and one for the impact (severity) of the problem caused if the risk does occur.
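As a minimal sketch, a risk matrix can be held as a list of records, each carrying a likelihood and a severity. The three-point `Level` scale, the `Risk` field names, and the values assigned below are assumptions for illustration; they are not taken from Figure 1.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale used for both likelihood and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    """One row of the risk matrix."""
    risk_id: int
    description: str
    likelihood: Level
    severity: Level
    mitigation: str = ""  # empty until a mitigation has been identified


# Likelihood and severity values here are assumed, not read from Figure 1.
risk_matrix = [
    Risk(1, "Frontend fails if user identity service is down",
         Level.MEDIUM, Level.HIGH),
    Risk(2, "Hypothetical entry: reports are slow during traffic spikes",
         Level.HIGH, Level.LOW),
]

for risk in risk_matrix:
    print(f"{risk.risk_id}: {risk.description} "
          f"(likelihood={risk.likelihood.name}, severity={risk.severity.name})")
```

Using an ordered scale rather than free text makes it possible to sort and compare entries later.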
Remove Worst Offenders
After compiling your initial list, review it and identify the risk entries that are your worst unmitigated offenders. How do you know which risks are the worst offenders? Look for risks that occur often, or risks that haven’t occurred yet but would cause serious problems to your system if they did. The absolute worst offenders are risks that are both highly likely to occur (or occur often) and cause serious harm to your system. Chapter 6 discusses the difference between severity and likelihood, and how to use this information to help manage your risks. This information will help you find your worst offenders.
In Figure 1, an example risk that might be one of our worst offenders is “Frontend fails if user identity service is down.”
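One possible way to surface worst offenders is to rank unmitigated risks by a combined score. The sketch below assumes a 1 to 3 scale and uses likelihood multiplied by severity as the score; both are illustrative choices, and the sample entries other than the identity service risk are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # 1 (low) to 3 (high), an assumed scale
    severity: int    # 1 (low) to 3 (high), an assumed scale
    mitigated: bool = False


def worst_offenders(risks, top_n=3):
    """Return the unmitigated risks with the highest likelihood * severity."""
    unmitigated = [r for r in risks if not r.mitigated]
    ranked = sorted(unmitigated,
                    key=lambda r: r.likelihood * r.severity,
                    reverse=True)
    return ranked[:top_n]


risks = [
    Risk("Frontend fails if user identity service is down", 2, 3),
    Risk("Hypothetical: nightly backup occasionally skipped", 1, 3),
    Risk("Hypothetical: search page slow under load", 3, 1),
    Risk("Hypothetical: stale config after deploy", 3, 3, mitigated=True),
]

for risk in worst_offenders(risks, top_n=2):
    print(risk.likelihood * risk.severity, risk.description)
```

A single product score can hide rare but catastrophic risks, so it is worth also scanning the list for any entry with the highest severity regardless of its rank.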
Once you’ve identified a few of the top offenders, add items to your roadmap to make sure these are addressed in a timely manner.
For all risks, whether they are the worst offenders or not, brainstorm whether there is anything you can do to reduce either the frequency or likelihood of the risk occurring, or the severity of the problem if the risk does occur. These actions are called risk mitigators.
Risk mitigators can be highly valuable. Look especially for mitigators that reduce the risk (severity, likelihood, or both) yet are simple or inexpensive to implement.
Let’s take a look at the risk “Frontend fails if user identity service is down” shown in Figure 1. For this risk, a potential mitigation to consider is to cache user identity information so that some information may be available for the frontend to use, even if the user identity service is down.
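A minimal sketch of that caching mitigation follows. It assumes a hypothetical `fetch` callable that raises `IdentityServiceDown` when the service is unreachable; the class names, the one-hour maximum age, and the in-memory dictionary are all illustrative choices rather than a description of any real identity client.

```python
import time


class IdentityServiceDown(Exception):
    """Raised by the (hypothetical) identity client when the service is unreachable."""


class CachedIdentityLookup:
    def __init__(self, fetch, max_age_seconds=3600, clock=time.monotonic):
        self._fetch = fetch            # callable: user_id -> identity info
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}               # user_id -> (stored_at, identity)

    def get(self, user_id):
        try:
            identity = self._fetch(user_id)
        except IdentityServiceDown:
            cached = self._cache.get(user_id)
            if cached is not None and self._clock() - cached[0] <= self._max_age:
                return cached[1]       # possibly stale, but usable
            raise                      # nothing usable cached: failure still surfaces
        self._cache[user_id] = (self._clock(), identity)
        return identity


# Simulated service for demonstration only.
state = {"up": True}


def fetch_identity(user_id):
    if not state["up"]:
        raise IdentityServiceDown(user_id)
    return {"id": user_id, "name": "Example User"}


lookup = CachedIdentityLookup(fetch_identity)
lookup.get(42)            # service is up: result is cached
state["up"] = False
print(lookup.get(42))     # service is down: served from the cache
```

This reduces the severity of the risk rather than removing it: a user who has never been seen before still fails while the service is down.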
You can focus on your worst offenders, finding ways to reduce the severity of those risks. But also look at risks that you might not be able to fix any time soon. Finding mitigations for these risks that reduce their severity or likelihood can be nearly as valuable as fixing the risk altogether.
The risk matrix can quickly become stale if you don’t review it regularly. You should review your risk matrix as a team at least quarterly, but perhaps monthly for very active and highly critical systems. Additionally, review it after each incident. Was the incident properly covered by a known risk?
When you review the matrix, first look for new risks that have been recently introduced or newly identified. Add new entries for these risks. Also, remove old entries for items that are no longer risks.
Then look for severity or likelihood changes. Often, mitigations have helped and reduced the severity or likelihood. Or, more knowledge has come forward that makes a risk either more likely to occur or perhaps more severe. This is frequently the case if a risk actually triggered since your last review; a risk marked as low likelihood that actually did occur should perhaps be restated as a risk with a higher likelihood. Next, ask whether there are risks that you can remove (fix) by putting them on your roadmap.
Finally, look for new or updated mitigations that you can put into play.
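If the matrix is kept in a machine-readable form, part of the review can be supported by comparing the current matrix with the one from the previous review. The sketch below assumes each snapshot is a dictionary from risk description to a `(likelihood, severity)` pair; that representation and the sample entries are hypothetical.

```python
def diff_matrices(previous, current):
    """Compare two snapshots mapping description -> (likelihood, severity).

    Returns three sorted lists: risks added, risks removed, and risks
    whose likelihood or severity changed since the previous review.
    """
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return added, removed, changed


previous = {
    "Frontend fails if user identity service is down": ("medium", "high"),
    "Hypothetical: search page slow under load": ("high", "low"),
}
current = {
    "Frontend fails if user identity service is down": ("low", "high"),
    "Hypothetical: nightly backup occasionally skipped": ("low", "high"),
}

added, removed, changed = diff_matrices(previous, current)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

A diff like this only shows what was edited; deciding whether a rating should change after an incident still takes the team discussion described above.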