A Hadoop cluster running in the cloud has some intrinsic reliability built in, thanks to the robustness of the cloud provider. Providers’ data centers are built with redundant power sources and network connections, and they accommodate the constant parade of failing motherboards, memory chips, and hard drives that comes with running huge numbers of servers. Often, when these “normal” hardware failures occur, your instances and infrastructure components are automatically migrated to alternative resources; sometimes you won’t even notice that anything went wrong.
Still, there are some failures a cloud provider can’t hide from its customers. Disks can become corrupted due to either software or hardware failures. Although rare, network hiccups or power outages at a cloud provider data center can cause instances, or even whole availability zones, to disappear for some amount of time. Even some of those “normal” hardware failures can’t be automatically handled every time.
Given that the risk of cluster failures is not completely eliminated by running on a cloud provider, it is reasonable to have a strategy in place to reduce their impact.
If a cluster running in the cloud fails, it’s completely feasible to simply spin up a new one to take its place. As long as the data the cluster was operating on is preserved, for example in cloud provider storage services, a new cluster can be created in the same or a different availability zone, or even a different region ...
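As a concrete illustration of this replacement strategy, the sketch below uses the AWS CLI to launch a fresh EMR (managed Hadoop) cluster whose data lives in durable S3 storage, making the cluster itself disposable. The cluster name, bucket, region, and instance settings are all hypothetical placeholders, and exact flags can vary by EMR release.

```shell
# Sketch only: spin up a replacement managed Hadoop cluster. Because job input
# and output live in S3 rather than on cluster-local HDFS, a new cluster can
# pick up where the failed one left off. All names, the bucket, and the region
# below are hypothetical placeholders.
aws emr create-cluster \
  --name "replacement-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --region us-west-2 \
  --log-uri s3://my-bucket/emr-logs/
```

The same pattern applies on other providers: keep persistent data in object storage and treat the compute cluster as replaceable infrastructure.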