Chapter 5. Managing the Inevitable Downtime

With smaller infrastructures, downtime is inevitable. Small infrastructures differ from large ones in that you simply do not have the means to eliminate every single point of failure. With physical systems, downtime due to hardware failure is a big problem, and waiting for replacement parts is a nerve-racking experience. If you had the funds to stock replacements, you might as well put them in production and remove your single points of failure. With a cloud infrastructure, you don’t have this problem; you can replace most of your assets whenever you want. This characteristic is central to our approach to managing small infrastructures. You might say we plan to fail.

In cloud infrastructures, as in physical ones, failing hardware is one cause of trouble. Insufficient capacity is another. In this chapter, we will look at how to measure your system. Is the app up or down? Are the disks over capacity? Is the load breaching expected thresholds? What is the CPU utilization of the RDS instance? We will show you how to monitor your systems from the inside and the outside. We’ll take a close look at CloudWatch. We will describe the tools you can use to understand what your system is doing. With this understanding, you can manage your infrastructure when it goes down, or help it cope with increasing demand. Having limited resources is an opportunity to optimize your system and get the most out of your hardware. ...
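Two of the questions above, whether the disks are over capacity and whether the load is breaching a threshold, can be answered from inside an instance with nothing but the Python standard library. What follows is a minimal sketch, not the tooling this chapter goes on to describe: the threshold values and the function names (`disk_used_fraction`, `load_per_cpu`, `breaches`) are assumptions made for illustration, and `os.getloadavg` is available only on Unix-like systems.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your own instances.
DISK_USED_LIMIT = 0.90     # fraction of the volume in use
LOAD_PER_CPU_LIMIT = 1.5   # 1-minute load average divided by CPU count


def disk_used_fraction(path="/"):
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_per_cpu():
    """Return the 1-minute load average per CPU (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def breaches(disk_fraction, load):
    """List the names of the thresholds that the measurements exceed."""
    problems = []
    if disk_fraction > DISK_USED_LIMIT:
        problems.append("disk")
    if load > LOAD_PER_CPU_LIMIT:
        problems.append("load")
    return problems


if __name__ == "__main__":
    # Measure this machine and report any breached thresholds.
    print(breaches(disk_used_fraction(), load_per_cpu()))
```

Keeping the measurements separate from the threshold comparison means the same `breaches` check could be fed numbers from another source, such as an external monitoring service, without changing the logic.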
