Chapter 6. Replication
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.
Douglas Adams, Mostly Harmless (1992)
Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in “Distributed Versus Single-Node Systems”, there are several reasons you might want to replicate data, including:
-
To keep the data geographically close to your users (and thus reduce access latency)
-
To allow the system to continue working even if some of its parts have failed (and thus increase availability and durability)
-
To scale out the number of machines that can serve read queries (and thus increase read throughput)
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of the entire dataset. In Chapter 7 we will relax that assumption and discuss sharding (partitioning) of datasets that are too big for a single machine. In later chapters we will discuss various kinds of faults that can occur in a replicated data system and how to deal with them.
If the data that you’re replicating does not change over time, replication is easy; you just need to copy the data to every node once, and you’re done. All the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access