Chapter 7. Backup and Recovery

In Chapters 5 and 6, we focused on infrastructure design and management. At this point, we have a good feel for how to build, deploy, and manage distributed infrastructures running databases, including techniques for rapidly adding new nodes for capacity or to replace a failed node. Now, it’s time to discuss the serious meat and potatoes: data backup and recovery.

Let’s face it. Everyone considers backup and recovery dull and tedious. Most think of it as the epitome of toil. It is often relegated to junior engineers, outside contractors, and third-party tooling that the team is loath to interact with. We’ve worked with some pretty horrible backup software before. Trust us, we empathize.

Still, this is one of the most crucial processes in your operations toolkit. Moving data between nodes, across datacenters, and into long-term archives is the constant movement of your business’ most precious commodity. Rather than relegating this to second-class status in Ops, we strongly suggest you treat it as a VIP. Everyone should not only understand the recovery targets but also be intimately familiar with operating and monitoring the processes. Many DevOps philosophies propose that everyone should have an opportunity to write and push code to production. We propose that every engineer should participate at least once in the recovery processes of critical data.

We create and store copies of data, otherwise ...
