Designing Reliability

So, to make things brutally simple, reliability is “keeping things working properly when code freezes or crashes,” a situation we’ll shorten to “when code dies.” However, the things we want to keep working properly are more complex than just messages. We need to take each core ØMQ messaging pattern and see how to make it work (if we can) even when code dies.

Let’s take them one by one:


Request-reply
If the server dies while processing a request, the client can figure that out because it won’t get an answer back. Then it can give up in a huff, wait and try again later, find another server, etc. As for the client dying, we can brush that off as “someone else’s problem” for now.
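To make those client-side options concrete, here is a minimal sketch of the "wait, retry, then find another server" logic. It is deliberately transport-agnostic: `transport` is a hypothetical callable (not part of any ØMQ API) that sends one request and returns the reply, or `None` if nothing came back within the timeout. The function name, the default timeout, and the retry count are all assumptions chosen for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical transport hook: sends `request` to `server` and returns the
# reply, or None if no reply arrived within `timeout_s` seconds.
Transport = Callable[[str, bytes, float], Optional[bytes]]


def reliable_request(
    transport: Transport,
    servers: Sequence[str],
    request: bytes,
    timeout_s: float = 2.5,
    retries_per_server: int = 3,
) -> Optional[bytes]:
    """Try each server in turn; return the first reply, or None if all fail."""
    for server in servers:
        for _attempt in range(retries_per_server):
            reply = transport(server, request, timeout_s)
            if reply is not None:
                return reply
            # No answer within the timeout: the server may have died or the
            # message may have been lost, so try again.
        # This server used up its retries; move on to the next one.
    return None  # every server failed: give up in a huff


if __name__ == "__main__":
    # Simulated transport: "tcp://a" never answers, "tcp://b" does.
    def fake_transport(server: str, request: bytes, timeout_s: float) -> Optional[bytes]:
        return request.upper() if server == "tcp://b" else None

    print(reliable_request(fake_transport, ["tcp://a", "tcp://b"], b"hello"))
```

In a real client the transport would wrap a socket send followed by a receive with a timeout. The key design point is simply that the client never blocks without a deadline.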


Pub-sub
If the client dies (having gotten some data), the server won’t know about it. Pub-sub doesn’t send any information back from the client to the server. However, the client can contact the server out-of-band—e.g., via request-reply—and say, “Please resend everything I missed.” As for the server dying, that’s outside the scope of this discussion. Subscribers can also self-verify that they’re not running too slowly, and take action (e.g., warn the operator and die) if they are.
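One way a subscriber could work out what it missed is sketched below. It assumes the publisher stamps every message with a sequence number that increases by one, which the text above does not specify. The function `find_gaps` and its parameters are hypothetical names for illustration.

```python
from typing import Iterable, List, Tuple


def find_gaps(sequences: Iterable[int], last_seen: int = 0) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges of sequence numbers that never arrived.

    `last_seen` is the highest sequence number processed before this run,
    for example a value the subscriber saved before it died.
    """
    gaps: List[Tuple[int, int]] = []
    for seq in sequences:
        if seq > last_seen + 1:
            # Everything between the last message and this one is missing.
            gaps.append((last_seen + 1, seq - 1))
        if seq > last_seen:
            last_seen = seq  # duplicates and stale messages are ignored
    return gaps


if __name__ == "__main__":
    # Messages 3-4 and 7-8 were lost, so these are the ranges to ask for.
    print(find_gaps([1, 2, 5, 6, 9]))  # prints [(3, 4), (7, 8)]
```

The subscriber would then send each missing range to the publisher over a separate request-reply channel, since the pub-sub channel itself carries nothing upstream.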


Pipeline
If a worker dies (while working), the ventilator doesn’t know about it. Pipelines, like pub-sub and the grinding gears of time, only work in one direction. But the downstream collector can detect that one task didn’t get done, and send a message back to the ventilator saying, “Hey, resend task 324!” If the ventilator or collector dies, whatever upstream client originally sent the work batch can get tired of waiting and resend the whole lot. It’s not elegant, but system code should really not die often enough for this to matter.
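The collector-side bookkeeping might look something like the sketch below. It assumes each task carries a numeric ID, that the collector is told which IDs to expect, and that a task counts as lost once a fixed timeout passes. None of this is prescribed by the text, and `TaskTracker` is a hypothetical helper, not an ØMQ class.

```python
import time
from typing import Dict, List, Optional


class TaskTracker:
    """Collector-side record of which tasks are still outstanding."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._deadlines: Dict[int, float] = {}  # task ID -> time it is due

    def expect(self, task_id: int, now: Optional[float] = None) -> None:
        """Register a task the ventilator has handed out."""
        now = time.monotonic() if now is None else now
        self._deadlines[task_id] = now + self.timeout_s

    def complete(self, task_id: int) -> None:
        """Record that a worker's result for this task has arrived."""
        self._deadlines.pop(task_id, None)

    def overdue(self, now: Optional[float] = None) -> List[int]:
        """Return the IDs of tasks whose deadline has passed with no result."""
        now = time.monotonic() if now is None else now
        return sorted(t for t, due in self._deadlines.items() if due <= now)


if __name__ == "__main__":
    tracker = TaskTracker(timeout_s=5.0)
    for task_id in (323, 324, 325):
        tracker.expect(task_id, now=0.0)
    tracker.complete(323)
    tracker.complete(325)
    print(tracker.overdue(now=5.0))  # prints [324]
```

Each ID returned by `overdue` is a candidate for a "resend task N" message to the ventilator. Passing `now` explicitly is only there to make the logic easy to test; in normal use the monotonic clock is read automatically.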

In this chapter we’ll focus just on request-reply, which is the low-hanging fruit of reliable messaging.

The basic request-reply pattern (a REQ client socket doing a blocking send/receive to a REP server socket) scores low on handling the most common types of failure. If the server crashes while processing the request, the client just hangs forever. Similarly, if the network loses the request or the reply, the client hangs forever.

Request-reply is still much better than TCP, thanks to ØMQ’s ability to reconnect peers silently, to load-balance messages, and so on. But it’s still not good enough for real work. The only case where you can really trust the basic request-reply pattern is between two threads in the same process where there’s no network or separate server process to die.

However, with a little extra work, this humble pattern becomes a good basis for real work across a distributed network, and we get a set of reliable request-reply (RRR) patterns that I like to call the Pirate patterns (you’ll eventually get the joke, I hope).

There are, in my experience, roughly three ways to connect clients to servers. Each needs a specific approach to reliability:

  1. Multiple clients talking directly to a single server. Use case: a single well-known server to which clients need to talk. Types of failure we aim to handle: server crashes and restarts, and network disconnects.

  2. Multiple clients talking to a broker proxy that distributes work to multiple workers. Use case: service-oriented transaction processing. Types of failure we aim to handle: worker crashes and restarts, worker busy looping, worker overload, queue crashes and restarts, and network disconnects.

  3. Multiple clients talking to multiple servers with no intermediary proxies. Use case: distributed services such as name resolution. Types of failure we aim to handle: service crashes and restarts, service busy looping, service overload, and network disconnects.

Each of these approaches has its trade-offs, and often you’ll mix them. We’ll look at all three in detail.
