Chapter 13

Disaster Recovery Planning

Every big computing disaster has come from taking too many ideas and putting them in one place.

—Gordon Bell

When it comes to technology, everything can and will fail. In distributed environments, like many cloud-based solutions, there are many moving parts, and any part of the system can fail at any time. The secret to surviving failures is to expect everything to fail and to design for those failures. Failures come in many forms. The damage caused by a web server crashing is easily mitigated by running multiple web servers behind a load balancer. A database server crash is more severe and takes more careful system design to recover from properly. A data center going down is more severe still and can ruin a business if a well-designed disaster recovery solution is not in place.
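As a rough illustration of the web server case, the sketch below shows how a load balancer might route around a crashed server. It is a minimal, in-memory example and not how any particular cloud load balancer is implemented; the class name RoundRobinBalancer, its methods, and the server names are all hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked as down (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)      # fixed rotation order
        self.healthy = set(self.servers)  # servers currently eligible for traffic
        self._next = 0                    # index of the next server to try

    def mark_down(self, server):
        # A failed health check would call this in a real system.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Only servers that belong to the pool can be restored.
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the rotation pointer.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    lb.mark_down("web-2")  # simulate a crashed web server
    print([lb.pick() for _ in range(4)])
```

With one of three servers down, requests keep flowing to the remaining two. Only when every server is down does the balancer raise an error, which is the point at which a larger recovery plan has to take over.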

Each cloud service model has a different set of challenges when it comes to disasters. In the following paragraphs we will discuss some best practices for each cloud service model when dealing with disaster situations in the cloud.

What Is the Cost of Downtime?

Cloud computing allows us to build systems faster and cheaper than ever before. There are countless stories of how companies rapidly built and deployed solutions that would have taken many months or years to deploy in precloud days. But getting a solution to the marketplace is only half of the story. The other half of the story is deploying a solution that can recover from disasters, big or small. ...
