Chapter 13. Backup and Disaster Recovery

This chapter outlines the concerns involved in building a sound strategy for keeping data in a Hadoop-based system safe and available, so that in the event of data loss, whether through user error (erroneously deleted data) or a disaster (such as the loss of an entire cluster), a restore can be initiated and completed. A successful restore leaves cluster users with a known, reliable state from which they can resume their business tasks.

Note that this is necessary even with high availability (see Chapter 12) enabled, because restoring data addresses problems that go beyond keeping a service responsive. Quite the contrary: even with redundant components at every level of the stack, losing metadata or data can cause a disruption that can be mitigated only if a proper backup or disaster recovery strategy was put in place beforehand.
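To make the idea of having a safeguard in place beforehand concrete, HDFS snapshots offer one such mechanism against accidental deletion. The following is a minimal Java sketch, not a prescribed method: the dataset path /data/warehouse/orders and the class name are hypothetical, and marking a directory snapshottable requires HDFS administrator privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotBeforeCleanup {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from core-site.xml/hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/data/warehouse/orders"); // hypothetical dataset path

        // Mark the directory as snapshottable (requires HDFS administrator privileges).
        ((DistributedFileSystem) fs).allowSnapshot(dir);

        // Take a point-in-time snapshot before any destructive operation.
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Snapshot created at " + snapshot);

        // Files deleted under dir can now be restored by copying them back from
        // the read-only path dir/.snapshot/before-cleanup/...
    }
}

Because snapshots live on the same cluster, they protect against user error but not against the loss of the cluster itself; the latter scenario is what the disaster recovery portion of this chapter addresses.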

Context

Before we can look into the particular approaches, we first need to establish some context.

Many Distributed Systems

Hadoop is a complex system, comprising many open source projects that work together to form a unique data processing platform. It is the foundation for many large-scale installations across numerous industry verticals, storing and processing anywhere from hundreds of terabytes to multiple petabytes of data. At any scale, the data must be kept safe, and the customary approach is to invest in some kind of backup technology. The difficulties with this approach are manifold.

First, data ...
