Chapter 14. Systems That Never Stop

You need at least two computers to make a fault-tolerant system. Built-in Erlang distribution, no shared memory, and asynchronous message passing give you the foundations needed for replicating data across these computers, so if one computer crashes, the other can take over. The good news is that the error-handling techniques, fault isolation, and self-healing that apply to single-node systems also help immensely when multiple nodes are involved, allowing you to transparently distribute your processes across clusters and use the same failure detection techniques you use on a single node. This makes the creation of fault-tolerant systems much easier and more predictable than having to write your own libraries to handle semantic gaps, which is typically what’s required with other languages. The catch is that Erlang on its own will not give you a fault-tolerant system out of the box—but its programming model will, and at a fraction of the effort required by other current technologies.

In this chapter, we continue explaining approaches to distributed programming commonly used in Erlang systems. We focus on data replication and retry strategies across nodes and computers, and the compromises and tradeoffs needed to build systems that never stop. These approaches affect how you distribute your data, and how you retry requests if they have failed for reasons out of your control.


Availability defines the uptime of a system over a certain period ...

Get Designing for Scalability with Erlang/OTP now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.