Building a scalable application that has high availability is not easy and does not come automatically. Problems can crop up in unexpected ways and cause your beautifully functioning application to stop working for some or all of your customers.
These availability problems often arise from the areas you least expect, and some of the most serious availability problems can originate from extremely benign sources. No one can anticipate where problems will come from, and no amount of testing will find all issues. Many of these are systemic problems, not merely code problems.
To find these availability problems, we need to step back and take a systemic look at your application and how it works. Here are five things you can and should focus on when building a system to make sure that, as its use scales upwards, availability remains high:
- Build with failure in mind
- Always think about scaling
- Mitigate risk
- Monitor availability
- Respond to availability issues in a predictable and defined way
Let’s look at each of these individually.
Tip #1: Build with failure in mind
As Werner Vogels, CTO of Amazon, says, “Everything fails all the time.” Plan on your applications and services failing. It will happen. Now, deal with it.
Assuming your application will fail, how will it fail? As you build your system, consider availability concerns during all aspects of your system design and construction. For example:
What design constructs and patterns have you considered or are you using that will help improve the availability of your software?
Design constructs and patterns such as simple error catching deep within your application, retry logic, and circuit breakers allow you to catch errors when they have affected the smallest available subset of functionality. This limits the scope of a problem so that your application can still provide useful capabilities even if part of it is failing.
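As a minimal sketch of retry logic, the following Python wraps a call and retries only errors that are worth retrying. The names `SoftFailure`, `HardFailure`, and `call_with_retry`, as well as the attempt count and delays, are hypothetical choices for illustration, not part of any particular library.

```python
import time


class SoftFailure(Exception):
    """Hypothetical: a recoverable error, such as a timeout, worth retrying."""


class HardFailure(Exception):
    """Hypothetical: an unrecoverable error; retrying will not help."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying soft failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SoftFailure:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller degrade gracefully
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
        # HardFailure (and any other exception) propagates immediately.
```

Catching the final exception close to the call site lets the caller disable only the affected feature instead of failing the whole request.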
What do you do when a component you depend on fails? How do you retry? What do you do if the problem is an unrecoverable (hard) failure, rather than a recoverable (soft) failure?
Circuit breaker patterns are particularly useful for handling dependency failures because they can reduce the impact a dependency failure has on your system.
Without a circuit breaker, a dependency failure can degrade the performance of your application (for example, because an unacceptably long timeout is required to detect the failure). With a circuit breaker, you can “give up” and stop using a dependency until you are certain that dependency has recovered.
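One way to picture a circuit breaker is the small Python sketch below. It is a simplified, single-threaded illustration under assumed defaults (three consecutive failures open the circuit, and a trial call is allowed after 30 seconds); the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a long timeout.
                raise CircuitOpenError("dependency is assumed to be down")
            # Timeout elapsed: allow one trial call (half-open state).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: the dependency has recovered, so close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can fall back to a cached or reduced response rather than tying up resources on a dependency that is known to be failing.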
What do you do when a component that is a customer of your system behaves poorly? Can you handle excessive load on your system? Can you throttle excessive traffic? Can you handle garbage data passed in? What about excessive data?
Sometimes, denial-of-service attacks can come from “friendly” sources. For example, a customer of your application may see a sudden surge in activity that requires a significant increase in the volume of requests to your application. Alternatively, a bug in your customer’s application may cause them to call your application at an unacceptably high rate. What do you do when this happens? Does the sudden increase in traffic bring your application down? Or can you detect this problem and throttle the request rate, limiting or removing the impact to your application?
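Throttling can take many forms. A common one is a token bucket, sketched below in Python under the assumption that you keep one bucket per customer; the class name and the rate and capacity values are illustrative only.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject instead of overloading the system
```

A request handler would call `allow()` before doing any real work and reject the request (for example, with an HTTP 429 response) when it returns `False`, so one misbehaving customer cannot consume capacity that other customers need.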
Tip #2: Always think about scaling
Just because your application works now does not mean it will work tomorrow. Most web applications have increasing traffic patterns. A website that generates a certain amount of traffic today might generate significantly more traffic sooner than you anticipate. As you build your system, don’t build it for today’s traffic; build it for tomorrow’s traffic.
Specifically, this might mean:
- Architect in the ability to increase the size and capacity of your databases.
- Think about what logical limits exist to your data scaling. What happens when your database tops out in its capabilities? Identify these and remove them before your usage approaches those limits.
- Build your application so that you can add additional application servers easily. This often involves being observant about where and how state is maintained, and how traffic is routed.
- Redirect static traffic to offload providers. This allows your system to deal only with the dynamic traffic that it is designed to handle. Using external content delivery networks (CDNs) not only reduces the traffic your network has to handle, but also lets you benefit from the efficiencies of scale that CDNs provide in order to get that static content to your customers more quickly.
- Think about whether specific pieces of dynamic content can actually be generated statically. Often, content that appears dynamic is actually mostly static, and the scalability of your application can be increased by making this content static. This “dynamic that can be static” data is sometimes hidden where you don’t expect it.
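As one illustration of the last point in the list above, content that changes only occasionally can be rendered once and reused for a fixed period. The following Python is a minimal in-process sketch; `TTLCache`, its `ttl` value, and the `render` callback are hypothetical, and a production system might instead pre-generate files or rely on a CDN.

```python
import time


class TTLCache:
    """Cache rendered content and regenerate it at most once per `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get(self, key, render):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        value = render()
        self.entries[key] = (now + self.ttl, value)
        return value
```

With a 60-second `ttl`, a page requested thousands of times per minute is rendered once per minute, at the cost of content that can be up to a minute stale.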
Tip #3: Mitigate risk
Keeping a system highly available requires removing risk from the system. When a system fails, often the cause of the failure could have been identified as a risk before the failure actually occurred. Identifying risk is a key method of increasing availability. All systems have risk in them:
- There is risk that a server will crash.
- There is risk that a database will become corrupted.
- There is risk that a returned answer will be incorrect.
- There is risk that a network connection will fail.
- There is risk that a newly deployed piece of software will fail.
Keeping a system available requires removing risk. But as systems become more and more complicated, this becomes less and less possible. Keeping a large system available is more about managing what your risk is, how much risk is acceptable, and what you can do to mitigate that risk.
We call this risk management, and it is at the heart of building highly available systems.
Part of risk management is risk mitigation. Risk mitigation is knowing what to do when a problem occurs in order to reduce the impact of the problem as much as possible. Mitigation is about making sure your application works as well and as completely as possible, even when services and resources fail. Risk mitigation requires thinking about the things that can go wrong and putting a plan together, now, to handle the situation when it does happen. Risk management is the process of identifying the risk, determining what to do, and implementing these mitigations.
This process will often uncover unknown problems in your application that you will want to fix immediately instead of waiting for them to occur. It also can create processes and procedures to handle known failure modes so that the cost of that failure is reduced in duration or severity.
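A simple way to start this process is to write the risks down and rank them. The Python sketch below assumes a scoring scheme of likelihood multiplied by severity on a 1 to 3 scale; that scheme, the field names, and any scores you assign are illustrative choices rather than a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # 1 (rare) to 3 (frequent); illustrative scale
    severity: int    # 1 (minor) to 3 (outage); illustrative scale
    mitigation: str = ""  # what to do when the problem occurs

    @property
    def score(self):
        return self.likelihood * self.severity


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)
```

Reviewing the ranked list regularly makes it easier to decide which risks to remove, which to mitigate, and which to accept.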
Availability and risk management go hand in hand. Building a highly available system is largely a matter of managing risk.
Tip #4: Monitor availability
You can’t know if there is a problem in your application unless you can see there is a problem. Make sure your application is properly instrumented so that you can see how it is performing both from an external perspective and through internal monitoring.
Proper monitoring depends on the specifics of your application and needs, but usually entails some of the following capabilities:
- Server monitoring: To monitor the health of your servers and make sure they keep operating efficiently.
- Configuration change monitoring: To monitor your system configuration to identify if and when changes to your infrastructure impact your application.
- Application performance monitoring: To look inside your application and services to make sure they are operating as expected.
- Synthetic testing: To examine in real time how your application is functioning from the perspective of your users, in order to catch problems customers might see before they actually see them.
- Alerting: To inform appropriate personnel when a problem occurs so that it can be quickly and efficiently resolved, minimizing the impact to your customers.
There are many good monitoring systems available, both free and paid services. I personally recommend New Relic. It provides all of the aforementioned monitoring and alerting capabilities. As a Software as a Service (SaaS) offering, it can support the monitoring needs at pretty much any scale your application may require.
After you have started monitoring your application and services, start looking for trends in your performance. When you have identified the trends, you can look for outliers and treat them as potential availability issues. Have your monitoring tools send you an alert when an outlier is identified, so that you hear about it before your application fails. Additionally, you can track how your system grows and make sure your scalability plan will continue to work.
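Monitoring products typically provide outlier detection for you, but the idea can be sketched in a few lines of Python. The function below flags samples that sit well above the average of the preceding window; the window size, the threshold of three standard deviations, and the `min_spread` floor are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_outliers(samples, window=5, threshold=3.0, min_spread=1.0):
    """Return indexes of samples far above the trailing-window average."""
    outliers = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        baseline = mean(recent)
        # Use a floor so a perfectly flat window does not flag tiny jitter.
        spread = max(stdev(recent), min_spread)
        if samples[i] > baseline + threshold * spread:
            outliers.append(i)
    return outliers
```

Applied to response times in milliseconds such as `[100, 102, 98, 101, 99, 100, 250, 101]`, the sketch flags the 250 ms sample at index 6, which is the kind of event you would want an alert for.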
Establish internal private operational goals for service-to-service communications, and monitor them continuously. This way, when you see a performance or availability-related problem, you can quickly diagnose which service or system is responsible and address the problem. Additionally, you can see “hot spots” (areas where your performance is not what it could be) and put development plans in place to address these issues.
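An internal operational goal can be as simple as a latency limit and an error-rate limit for calls between two services. The Python sketch below checks recent measurements against such a goal; the `Goal` fields, the choice of the 95th percentile, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Goal:
    """Hypothetical internal goal for calls from one service to another."""
    max_p95_latency_ms: float
    max_error_rate: float  # fraction of requests, e.g. 0.01 for 1%


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_goal(goal, latencies_ms, errors, requests):
    """Return the list of goals this service pair is currently missing."""
    missed = []
    if latencies_ms and percentile(latencies_ms, 95) > goal.max_p95_latency_ms:
        missed.append("latency")
    if requests and errors / requests > goal.max_error_rate:
        missed.append("error rate")
    return missed
```

Running a check like this for every pair of communicating services shows which link is missing its goal, which narrows down where to look during an incident.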
Tip #5: Respond to availability issues in a predictable and defined way
Monitoring systems are useless unless you are prepared to act on the issues that arise. This means being alerted when problems occur so that you can take action. Additionally, you should establish processes and procedures that your team can follow to help diagnose issues and easily fix common failure scenarios.
For example, if a service becomes unresponsive, you might have a set of remedies to try to make the service responsive. This might include tasks such as running a test to help diagnose where the problem is, restarting a daemon that is known to cause the service to become unresponsive, or rebooting a server if all else fails. Having standard processes in place for handling common failure scenarios will decrease the amount of time your system is unavailable. Additionally, they can provide useful follow-up diagnosis information to your engineering teams to help them deduce the root cause of common ailments.
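A set of remedies like this can also be captured as an ordered list that is tried until the service is healthy again. The following Python is only a sketch: `run_remedies`, the health check, and the individual actions are placeholders you would replace with your own tooling.

```python
def run_remedies(is_healthy, remedies):
    """Try each (name, action) in order until is_healthy() returns True.

    Returns (recovered, log); the log records what was attempted so that
    engineering teams can use it for follow-up diagnosis.
    """
    log = []
    for name, action in remedies:
        if is_healthy():
            return True, log
        try:
            action()
            log.append((name, "ran"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
    return is_healthy(), log
```

Ordering the list from least to most disruptive (diagnostic test, daemon restart, server reboot) means the drastic steps run only if the gentler ones did not restore the service.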
When an alert is triggered for a service, the owners of that service must be the first ones alerted. They are, after all, the ones responsible for fixing any issues with their service. However, other teams who are closely connected to the troubled service and depend on it might also want to be alerted of problems when they occur. For example, if a team makes use of a particular service, they may want to know when that service fails so that they can potentially be more proactive in keeping their systems active during the dependent service outage.
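This routing rule is straightforward to express in code. In the Python sketch below, the `owners` and `dependents` mappings and the team names in the usage note are hypothetical; a real alerting tool would hold this information in its own configuration.

```python
def alert_recipients(service, owners, dependents):
    """Return who to notify: the owning team first, then dependent teams."""
    recipients = [owners[service]]
    for team in dependents.get(service, []):
        if team not in recipients:  # avoid notifying the same team twice
            recipients.append(team)
    return recipients
```

For example, with `owners = {"checkout": "checkout-team"}` and `dependents = {"checkout": ["storefront-team"]}`, an alert on `checkout` goes to `checkout-team` first and then to `storefront-team`.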
These standard processes and procedures should be part of a support manual available to all team members who handle on-call responsibility. This support manual should also contain contact lists for owners of related services and systems, as well as contacts to call to escalate the problem if a simple solution is not possible.
All of these processes, procedures, and support manuals should be prepared ahead of time so that during an outage your on-call personnel know exactly what to do in various circumstances to restore operations quickly. These processes and procedures are especially useful because outages often occur at inconvenient times, such as the middle of the night or on weekends, when your on-call team might not perform at peak mental efficiency. These recommendations will assist your team in making smarter and safer moves toward restoring your system to operational status.
No one can anticipate where and when availability issues will occur. But you can assume that they will occur, especially as your system scales to larger customer demands and more complex applications. Being prepared in advance to handle availability concerns is the best way to reduce the likelihood and severity of problems. The five techniques discussed here offer a solid strategy for keeping your applications highly available.