Despite the reliability and availability benefits of running your clusters on a cloud provider, failures can still occur. It’s essential to include disaster recovery procedures as part of your maintenance process, and that starts with safe backups of your cluster data. Even if you trust your cloud provider to keep your resources operational for as long as you need them, rules and regulations pertaining to your data can compel you to have a backup process in place.
There are many general techniques for performing backup and restoration for Hadoop clusters. This chapter focuses on aspects of those techniques that are relevant for cloud clusters.
Besides explicit backup procedures, there are other measures you can, and should, employ to provide greater assurance that your cluster data, and the clusters themselves, remain available when problems arise.
“Long-Running or Transient?” discusses the trade-offs between long-running clusters and transient clusters. By their nature, long-running clusters become increasingly critical as they accumulate important data as well as unique configurations or software installations. Transient clusters, on the other hand, cannot become the permanent home of any data, and can be spun up easily with the proper automation. Adopting transient clusters, therefore, can lead to an architecture that is more resilient to failure. When a transient cluster fails, a new similar ...