Outages are the exception, not the norm, but though they are rare, they are inevitable. This is something we all have to deal with.
Amazon AWS has been designed with failure in mind, and you can do a number of things to survive outages and other kinds of problems.
At this point, very few public cloud vendors offer a suite of services comparable to Amazon AWS. It is only a matter of time before that changes.
There are several cloud management tools that advertise multi-vendor (multi-cloud) as their core feature. They try to frighten you into purchasing their added complexity. The result is that you can’t use the more sophisticated services of one cloud, because they are not supported on another.
The core reason to use a public cloud is to get rid of the burden of owning and managing these types of infrastructures. You pay your cloud vendor to take responsibility for providing the facilities to handle failure properly. If they do not deliver, you move on.
Instead of wasting time supporting multiple clouds, we choose Amazon AWS. Because we make this choice, we can use all of their services if they make sense for us. We do not worry if they are supported on other clouds; we just use them because they save us time and money.
Even though the cloud has dominated technology news for years, this is still uncharted territory. Existing systems reflect this and are not well prepared for the nature of this type of infrastructure platform.
Features that are considered old (tablespaces in Postgres) can be given new life with services like EBS from AWS. Existing automation tools feel awkward in dynamic environments built on top of AWS with things like images and snapshots.
Also the work changes. Infrastructure literally moves into software engineering. And we are not ready yet.
So be ready for change. Embrace it. You might have spent three months building a workflow tool only to learn about SWF. Embrace this, and refactor if you think it will improve your system. Look for these opportunities, and take full advantage of them.
Failure is something we have always known but never accepted. We have been buying system components with dual-anything, to prevent them from breaking. We have spent a lot of time trying to get rid of malfunction.
Well, the world is not perfect, and we will never be able to completely avoid malfunctions. Things will break; systems will fail.
You may lose a snapshot of your system, which may not be a big problem if you have more recent snapshots. But if you lose a snapshot from an image that you rely on for recovery you really do have to act on that.
You may lose a volume, too, which is not a problem if it is not used and you have a recent snapshot. But it is definitely a problem if you have data you didn’t persist somewhere else. (It turns out that snapshotting rejuvenates your volumes.)
You can lose instances. You can even lose entire Availability Zones. You know this, so you shouldn’t act surprised when it happens. AWS has multiple Availability Zones, and you should use them. They have an almost unlimited supply of instances, so make sure you can replace your instances easily. And a snapshot doesn’t cost very much, so make sure you take them often enough.
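What "often enough" means can be made checkable. The following is a minimal sketch, not a definitive implementation: it takes plain lists of dictionaries that mimic the shape of EC2 volume and snapshot descriptions (the field names `VolumeId` and `StartTime` are assumptions here, as is the function name `volumes_needing_snapshot`) and reports the volumes whose newest snapshot is missing or older than a chosen maximum age.

```python
from datetime import datetime, timedelta, timezone


def volumes_needing_snapshot(volumes, snapshots, max_age, now=None):
    """Return IDs of volumes with no snapshot newer than max_age.

    volumes and snapshots are lists of dicts; the keys used here are
    assumed to match what your inventory tooling returns.
    """
    now = now or datetime.now(timezone.utc)

    # Remember the most recent snapshot start time per volume.
    newest = {}
    for snap in snapshots:
        vol_id = snap["VolumeId"]
        started = snap["StartTime"]
        if vol_id not in newest or started > newest[vol_id]:
            newest[vol_id] = started

    # A volume is stale if it has no snapshot or its newest one is too old.
    stale = []
    for vol in volumes:
        last = newest.get(vol["VolumeId"])
        if last is None or now - last > max_age:
            stale.append(vol["VolumeId"])
    return stale


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    volumes = [{"VolumeId": "vol-a"}, {"VolumeId": "vol-b"}, {"VolumeId": "vol-c"}]
    snapshots = [
        {"VolumeId": "vol-a", "StartTime": now - timedelta(hours=2)},
        {"VolumeId": "vol-b", "StartTime": now - timedelta(days=3)},
    ]
    print(volumes_needing_snapshot(volumes, snapshots, timedelta(days=1), now=now))
```

Run on a schedule, a check like this turns "take them often enough" into a list of volumes to act on. The one-day threshold is only an example; pick the age that matches how much data you can afford to lose.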
For the sake of our top 10 tips, we’ll consider AWS our enemy. They do not make many promises, and in fact they stress as often as possible that things will break. Make sure everyone has a basic understanding of AWS-style cloud engineering. All your software developers need to understand the different ways to persist data in AWS. Everyone needs to have a basic architectural understanding. They need to know about services like SQS and ELB. It is a good idea to share operational responsibility, as that is the fastest way to disseminate this information.
Netflix introduced a rather radical approach to this. Instead of failure scenarios, they introduced failure itself. Their Chaos Monkey wreaks havoc continuously in their production infrastructure.
A good understanding of AWS alone is not enough. You need to know yourself as well. Be sure to spend time building failure scenarios and testing them to see what happens. If you have practiced with these scenarios your teams will be trained, there will be fewer surprises, and even the surprises will be handled much better.
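A failure drill can start very small. The sketch below is loosely inspired by the idea of injecting failure, but it is not Netflix's tool and makes no claim about how Chaos Monkey works. It only selects candidates: given a hypothetical mapping of group names to instance IDs, it picks one instance per group to take down, and skips any group that would drop below a minimum number of survivors. The names `pick_victims` and `min_survivors` are assumptions made for this illustration, and actually terminating the instance is left to whatever tooling you already use.

```python
import random


def pick_victims(groups, min_survivors=1, rng=None):
    """Pick at most one instance per group for a failure drill.

    groups maps a group name to a list of instance IDs. A group is
    skipped unless losing one instance still leaves min_survivors.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instance_ids in groups.items():
        # Only touch groups that can afford to lose an instance.
        if len(instance_ids) > min_survivors:
            # Sort first so a seeded rng gives a repeatable choice.
            victims[name] = rng.choice(sorted(instance_ids))
    return victims


if __name__ == "__main__":
    # Hypothetical groups, for illustration only.
    groups = {
        "web": ["i-web-1", "i-web-2", "i-web-3"],
        "db": ["i-db-1"],  # a single instance: never selected
    }
    print(pick_victims(groups, rng=random.Random(42)))
```

Passing a seeded `random.Random` makes a drill repeatable, which helps when you want to rerun the same scenario after fixing what broke the first time.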
There is always something that can be improved upon in terms of Resilience and Reliability. Make it a priority for everyone to have a personal top 3!
If you build your stuff for the problems you have now, you will stay focused and do the right things. This doesn’t mean that you can only solve small problems—it means that you can steer clear of doing unnecessary or expensive things that do not add value.
This is the main reason why we always try to use as many off-the-shelf components as possible, preferably from AWS. Perhaps ELB is not perfect yet, but we choose to work on features instead of operating multi-availability-zone load balancers.
Same goes for RDS. You can debate whether Postgres has better transaction support than MySQL. But with RDS we move on to working on application functionality, instead of building a highly available Postgres database that looks like RDS.
So, pick your targets and stay focused. Don’t spend time trying to build a better queuing system with RabbitMQ if SQS does the trick.
I wish everyone were curious by nature: not only curious, but also interested in the inner workings of things. If you are curious by nature, then you should regularly browse your AWS accounts looking for anomalies. This curiosity should be cultivated because there are interesting events happening all the time. The first thing we need to do is identify them, and then we can monitor for their occurrences.
If you question everything you can easily identify waste, which is something we categorically try to prevent.
The more resources you have, the more they can (and will) fail. Start by minimizing the different types of resources. Why do you need Postgres and MySQL? Why do you need MongoDB and DynamoDB?
Don’t waste your resources; make sure they do their fair share of work. Tune your instances for reliable operation (stay within the instance resources of CPU and memory). And always try to run the minimum number of instances necessary.
Do not keep unused resources. Not only do they cost money but it is also more difficult to find problems with a lot of rubbish around. Services like Auto Scaling are free and force you to think about flexible resource allocation.
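Finding the rubbish is easy to automate once you have an inventory. As a minimal sketch, assume you have lists of dictionaries describing your volumes and Elastic IP addresses, shaped roughly like EC2 describe responses. The keys `State`, `VolumeId`, `PublicIp`, and `AssociationId` are assumptions about that shape, and `unused_resources` is a hypothetical helper, not part of any AWS library.

```python
def unused_resources(volumes, addresses):
    """Report volumes that are not attached and addresses that are not associated.

    A volume is treated as unused when its State is "available";
    an address is treated as unused when it has no AssociationId.
    """
    idle_volumes = [v["VolumeId"] for v in volumes if v.get("State") == "available"]
    idle_addresses = [a["PublicIp"] for a in addresses if not a.get("AssociationId")]
    return {"volumes": idle_volumes, "addresses": idle_addresses}


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    volumes = [
        {"VolumeId": "vol-1", "State": "in-use"},
        {"VolumeId": "vol-2", "State": "available"},
    ]
    addresses = [
        {"PublicIp": "203.0.113.10", "AssociationId": "eipassoc-1"},
        {"PublicIp": "203.0.113.11"},
    ]
    print(unused_resources(volumes, addresses))
```

The report is deliberately read-only. Deleting things is a decision for a person, at least until you trust the list.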
AWS is constantly looking to innovate in the space of resource utilization. They introduced Spot Instances, for example, to optimize utilization distribution by introducing a marketplace. You can bid on instances, which will be launched when the chosen price point is reached. Another example is Reduced Redundancy Storage with S3, less reliable but significantly cheaper. And very recently they introduced Glacier, a storage service like S3, but analogous to tape backups.
But, there is another, even more important reason to keep your nest clean. It is a much more pleasant place to work. It will feel like home to you and everyone else on your team.
Learning starts with respect. Your colleagues often deal with similar problems. You might think your problems are the most difficult in the world. Well, they probably are not. Talk to your colleagues. Be honest and share your doubts and worries. They will tell you theirs. And together you will figure out how to deal with the next outage.
Amazon AWS forums
The Amazon AWS forums might be their most important asset. There are thousands of people discussing services and finding help for the problems they encounter.
If you can’t find your answer directly, you will find many helpful people with potentially the same experience as you.
Finally, make sure you build a team. Running operations, large or small, is difficult. Having everything under control makes this fun, but doing it together takes the edge off.
There are different event streams with more or less important bits of information. You can try to build a tool to sift out every meaningful bit. You can also rely on your team to pick up weak signals in all the noise. Outages and routine operational work are much more fun to do with your colleagues.
Remember, engineering is teamwork!