Chapter 10. High Availability
A Hadoop cluster running in the cloud has some intrinsic reliability built in, due to the robustness of the cloud provider. Their data centers are built with reliable power sources and network connections. They also accommodate the constant parade of failing motherboards, memory chips, and hard drives that come with running huge numbers of servers. Often, when these ânormalâ hardware failures occur, your instances and infrastructure components are automatically migrated over to alternative resources. Sometimes, you wonât even notice that anything went wrong.
Still, there are some failures a cloud provider canât hide from its customers. Disks can become corrupted due to either software or hardware failures. Although rare, network hiccups or power outages at a cloud provider data center can cause instances, or even whole availability zones, to disappear for some amount of time. Even some of those ânormalâ hardware failures canât be automatically handled every time.
Given that the risk of cluster failures is not completely eliminated by running on a cloud provider, it is reasonable to have a strategy in place to reduce their impact.
Running in the cloud, if a cluster fails, itâs completely feasible to simply spin up a new one to take its place. As long as the data the cluster was operating on is preserved, for example, in cloud provider storage services, a new cluster can be created in the same or a different availability zone, or even ...