Your application is operational and online. Your systems are in place, and your team is operating efficiently. Everything seems to be going well. Your traffic is steadily increasing and your sales organization is very happy. All is well.
Then there’s a bit of a slip. Your system suffers an unanticipated outage. But that’s OK; your availability has been fantastic until now. A little outage is no big deal. Your traffic is still increasing. Everyone shrugs it off—it was just “one of those things.”
Then it happens again—another outage. Oops. Well, OK. Overall, we’re still doing well. No need to panic; it was just another “one of those things.”
Then another outage...
Now your CEO is a bit concerned. Customers are beginning to ask what’s going on. Your sales team is starting to worry.
Then another outage...
Suddenly, your once stable and operational system is becoming less and less stable; your outages are getting more and more attention.
Now you’ve got real problems.
What happened? Keeping your system highly available is a daunting task. What do you do if availability begins to slip? What do you do if your application availability has fallen or begins to fall, and you need to improve it to keep your customers satisfied?
Knowing what you can do when your availability begins to slip will help you to avoid falling into a vicious cycle of problems. The following sections outline some steps you can take when your availability begins to falter.
Measure and track your current availability
To understand what is happening to your availability, you must first measure what your current availability is. (See Chapter 3 for a refresher on how to measure availability.) Tracking when your application is available and when it isn’t gives you an availability percentage that can show how you are performing over a specific period of time. You can use this to determine if your availability is improving or faltering.
You should continuously monitor your availability percentage and report the results on a regular basis. On top of this, overlay key events in your application, such as when you performed system changes and improvements. This way you can see whether there is a correlation over time between system events and availability issues. This can help you to identify risks to your availability.
Next, you must understand how your application can be expected to perform from an availability standpoint. A tool that you can use to help manage your application availability is service tiers. These are simply labels associated with services that indicate how critical a service is to the operation of your business. This allows you and your teams to distinguish between mission-critical services and those that are valuable but not essential. I’ll discuss service tiers in more depth in Chapter 16.
Finally, create and maintain a risk matrix. With this tool, you can gain visibility into the technical debt and associated risk present in your application. Risk matrices are covered more fully in Chapter 7 and risk in general is discussed in Chapters 5 and 6.
Now that you have a way to track your availability and a way of identifying and managing your risk, you will want to review your risk management plans on a regular basis.
Additionally, you should create and implement mitigation plans to reduce your application risks. This will give you a concrete set of tasks you and your development teams can implement to tackle the riskiest parts of your application. This is discussed in detail in Chapter 8.
Automate your manual processes
To maintain high availability, you need to remove unknowns and variables. Performing manual operations is a common way to insert variable results and/or unknown results into your system.
You should never perform a manual operation on a production system.
When you make a change to your system, the change might improve or it might compromise your system. Using only repeatable tasks gives you the following:
- The ability to test a task before implementing it. Testing what happens when you make a specific change is critical to avoiding mistakes that cause outages.
- The ability to tweak the task to perform exactly what you want the task to do. This lets you implement improvements to the change you are about to make, before you make them.
- The ability to have the task reviewed by a third party. This increases the likelihood that the task will not have unexpected side effects.
- The ability to put the task under version control. Version control systems allow you to determine when the task is changed, by who, and for what reasons.
- The ability to apply the task to related resources. Making a change to a single server that improves how that server works is great. Being able to apply the same change to every affected server in a consistent way makes the task even more useful.
- The ability to have all related resources act consistently. If you continuously make “one off” changes to resources such as servers, the servers will begin to drift and act differently from one another. This makes it difficult to diagnose problematic servers because there will be no baseline of operational expectation that you can use for comparison.
- The ability to implement repeatable tasks. Repeatable tasks are auditable tasks. Auditable tasks are tasks that you can analyze later for their impact, positive or negative, on the system as a whole.
There are many systems for which no one has access to the production environment. Period. The only access to production is through automated processes and procedures. The owners of these systems lock down their environments like this specifically for the aforementioned reasons.
In summary, if you can’t repeat a task, it isn’t a useful task. There are many places where adding repeatability to changes will help keep your system and application stable. This includes server configuration changes, performance tuning tweaks and adjustments, restarting servers, restarting jobs and tasks, changing routing rules, and upgrading and deploying software packages.
By automating deploys, you guarantee changes are applied consistently throughout your system, and that you can apply similar changes later with known results. Additionally, rollbacks to known good states become more reliable with automated deployment systems.
Rather than “tweaking a configuration variable” in the kernel of a server, use a process to apply the change in an automated manner. For example, write a script that will make the change, and then check that script into your software change management system. This enables you to make the same change to all servers in your system uniformly. Additionally, when you need to add new servers to your system or replace old ones, having a known configuration that can be applied improves the likelihood that you can add the new server to your system safely, with minimal impact. Tools like Puppet and Chef can help make this process easier to manage.
The same applies to all infrastructure components, not just servers. This includes switches, routers, network components, and monitoring applications and systems.
For configuration management to be useful, it must be used for all system changes, all the time. It is never acceptable to bypass the configuration management system to make a change under any circumstances. Ever.
Don’t worry, I fixed it
You would be surprised the number of times I have received an operational update email that said something like: “We had a problem with one of our servers last night. We hit a limit to the maximum number of open files the server could handle. So I tweaked the kernel variable and increased the maximum number of open files, and the server is operational again.”
That is, it is operating correctly until someone accidentally overwrites the change because there was no documentation of the change. Or, until one of the other servers running the application has the same problem, but did not have this change applied.
Consistency, repeatability, and unfaltering attention to detail is critical to make a configuration management process work. And a standard and repeatable configuration management process such as I describe here is critical to keeping your scaled system highly available.