Chapter 1. What Is Availability?
One of the most important topics in architecting scalable systems is availability. Although there are some companies and some services for which a certain amount of downtime is reasonable and expected, most businesses cannot have any downtime at all without it affecting their customers’ satisfaction, and ultimately, the company’s bottom line.
The following are fundamental questions that all companies must ask themselves when they determine how important system availability is to their company and their customers. It is these questions, and the inevitable answers to them, that are the core of why availability is critical to highly scaled applications:
Why should someone buy your service if it is not operational when they need it?
What do your customers think or feel when they need to use your service and it’s not operational?
How can you make your customers happy, make your company money, and meet your business promises and requirements, if your service is down?
Keeping your customers happy and engaged with your system is only possible if your system is operational. There is a direct and meaningful correlation between system availability and customer satisfaction.
High availability is such a critical component for building highly scalable systems that we will devote a significant amount of time to the topic in this book. How do you build a system (a service or application or environment) that is highly available even when a wide range of demands is placed on it?
In this chapter, we’ll define what availability is and how it compares to reliability. We use these definitions in future chapters as we discuss the role availability plays in building highly scalable applications.
Availability Versus Reliability
Reliability, in our context, generally refers to the quality of a system. Typically, it means the ability of a system to consistently perform according to specifications. You speak of software as reliable if it passes its test suites, and does generally what you think it should do.
Availability, in our context, generally refers to the ability of your system to perform the tasks it is capable of doing. Is the system around? Is it operational? Is it responding? If the answer is “yes,” it is available.
As you can see, availability and reliability are very similar. It is hard for a system to be available if it is not also reliable, and it is hard for a system to be reliable if it is not also available.
However, when we talk about reliability in software, we generally mean the ability of the software to do what it is supposed to do. By and large, the main indicator of reliability is whether the software passes all of its test suites.
When we think about availability, by contrast, we think about whether the system is “up” and functional. If I send it a query, will it respond?
Here is what we mean when we use these terms:
A system that adds 2 + 3 and gets 6 has poor reliability. A system that adds 2 + 3 and never returns a result at all has poor availability. Reliability can often be fixed by testing. Availability is usually much harder to solve.
A software bug in your application can cause 2 + 3 to produce the answer 6. A test suite can easily catch this kind of error so that it can be fixed.
However, assume you have an application that reliably produces the result 2 + 3 = 5. Now imagine running this application on a computer that has a flaky network connection. The result? You run the application and sometimes it returns 5 and sometimes it doesn’t return anything. The application may be reliable, but it is not available.
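The distinction can be made concrete with a minimal Python sketch. Everything in it is a hypothetical illustration rather than code from a real system: `add_buggy` stands in for an unreliable application, `add_correct` for a reliable one, and `call_over_flaky_network` simulates the flaky connection by returning `None` whenever a response never arrives. The 50 percent failure rate and the fixed random seed are assumptions chosen only to make the effect visible and repeatable.

```python
import random


def add_buggy(a, b):
    """Always answers, but with the wrong result: a reliability problem."""
    return a + b + 1  # deliberate bug: 2 + 3 comes back as 6


def add_correct(a, b):
    """Correct logic: a reliable application."""
    return a + b


def call_over_flaky_network(func, a, b, rng, failure_rate=0.5):
    """Hypothetical stand-in for a flaky network connection.

    Returns func(a, b) when the simulated connection works, and None
    when the request is lost and no result ever arrives.
    """
    if rng.random() < failure_rate:
        return None  # no response: an availability problem
    return func(a, b)


def summarize(results, expected):
    """Return (number of requests answered, number answered correctly)."""
    answered = [r for r in results if r is not None]
    correct = [r for r in answered if r == expected]
    return len(answered), len(correct)


rng = random.Random(42)  # fixed seed (an assumption) so runs are repeatable
flaky = [call_over_flaky_network(add_correct, 2, 3, rng) for _ in range(1000)]
buggy = [add_buggy(2, 3) for _ in range(1000)]

print("flaky network (answered, correct):", summarize(flaky, 5))
print("buggy code    (answered, correct):", summarize(buggy, 5))
```

The buggy application answers every request and gets every answer wrong, which a single test asserting that 2 + 3 equals 5 would catch. The correct application behind the flaky connection never gives a wrong answer, but a large share of requests get no answer at all, and no test of the addition logic would reveal that.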
In this book, we focus almost exclusively on architecting highly available systems. We will assume your system is reliable, we will assume you know how to build and run test suites, and we will only discuss reliability when it has a direct impact on your system architecture or its availability.
What Causes Poor Availability?
- Resource exhaustion
- Unplanned load-based changes
Increases in the popularity of your application might require code and application changes to handle the increased load. These changes, often implemented quickly and at the last minute with little or no forethought or planning, increase the likelihood of problems occurring.
- Increased number of moving parts
As an application gains popularity, it is often necessary to assign more and more developers, designers, testers, and other individuals to work on and maintain it. This larger number of individuals working on the application creates a large number of moving parts, whether those moving parts are new features, changed features, or just general application maintenance. The more individuals working on the application, the more moving parts within the application and the greater the chance for bad interactions to occur in it.
- Outside dependencies
The more dependencies your application has on external resources, the more it is exposed to availability problems caused by those resources.
- Technical debt
Increases in the application’s complexity typically increase technical debt (i.e., the accumulation of desired software changes and pending bug fixes that typically build up over time as an application grows and matures). Technical debt increases the likelihood of a problem occurring.
All fast-growing applications have one, some, or all of these problems. As such, availability problems can begin occurring in applications that previously performed flawlessly. Sometimes the problems will creep up on you; sometimes they will start suddenly.
But most growing applications share the same fate: eventually, they begin having availability problems.
Availability problems cost you money, they cost your customers money, and they cost you your customers’ trust and loyalty. Your company cannot survive for long if you constantly have availability problems.