Chapter 17. Engineering for Data Durability

SREs live and breathe reliability, but to many engineers the word reliability is synonymous with availability: “How do we keep the site up?” Reliability is a multifaceted concern, however, and one of its most important facets is durability: “How do we avoid losing or corrupting our data?”

Engineering for durability is of paramount importance for any company that stores user data. Most companies can survive a period of downtime, but few can survive losing a significant fraction of user data. Building expertise in durable systems is particularly challenging, however; most companies improve availability over time as they grow and as their systems mature, but a single durability mistake can be a company-ending event. It’s therefore important to invest effort ahead of time to understand real-world durability threats and how to engineer against them.

Replication Is Table Stakes

If you don’t want to lose your data, you should store multiple copies of it. You probably didn’t need a book to tell you this. We’ll breeze through most of this pretty quickly because these are really just the basic requirements when it comes to durability.
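As a rough illustration of why multiple copies help, here is a minimal sketch of the arithmetic. It assumes that each copy is lost independently with the same probability over some period; both the independence assumption and the 1% figure are illustrative choices made for this example, not numbers from the text, and real-world failures are often correlated. The function name loss_probability is hypothetical.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Probability that every copy is lost in the same period.

    Assumes each copy fails independently with probability p_single.
    That independence is an assumption of this sketch; correlated
    failures make the real risk higher than this estimate.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


if __name__ == "__main__":
    # Illustrative only: a 1% chance of losing any single copy.
    for n in (1, 2, 3):
        print(f"{n} copies -> P(all lost) = {loss_probability(0.01, n):.0e}")
```

Under these assumptions, each extra copy multiplies the chance of total loss by the single-copy failure probability, which is why replication is treated here as the baseline rather than the whole answer.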

Backups

Back up your data. The great thing about backups is that they’re logically and physically disjoint from your primary data store: an operational error that results in loss or corruption of database state probably won’t impact your backups. Ideally, these should be stored both ...
