So, to make things brutally simple, reliability is âkeeping things working properly when code freezes or crashes,â a situation weâll shorten to âwhen code dies.â However, the things we want to keep working properly are more complex than just messages. We need to take each core ÃMQ messaging pattern and see how to make it work (if we can) even when code dies.
Letâs take them one by one:
If the server dies while processing a request, the client can figure that out because it wonât get an answer back. Then it can give up in a huff, wait and try again later, find another server, etc. As for the client dying, we can brush that off as âsomeone elseâs problemâ for now.
If the client dies (having gotten some data), the server wonât know about it. Pub-sub doesnât send any information back from the client to the server. However, the client can contact the server out-of-bandâe.g., via request-replyâand say, âPlease resend everything I missed.â As for the server dying, thatâs outside the scope of this discussion. Subscribers can also self-verify that theyâre not running too slowly, and take action (e.g., warn the operator and die) if they are.
If a worker dies (while working), the ventilator doesnât know about it. Pipelines, like pub-sub and the grinding gears of time, only work in one direction. But the downstream collector can detect that one task didnât get done, and send a message back to the ventilator saying, âHey, resend task 324!â If the ventilator or collector dies, whatever upstream client originally sent the work batch can get tired of waiting and resend the whole lot. Itâs not elegant, but system code should really not die often enough for this to matter.
In this chapter weâll focus just on request-reply, which is the low-hanging fruit of reliable messaging.
The basic request-reply pattern (a REQ client socket doing a blocking send/receive to a REP server socket) scores low on handling the most common types of failure. If the server crashes while processing the request, the client just hangs forever. Similarly, if the network loses the request or the reply, the client hangs forever.
Request-reply is still much better than TCP, thanks to ÃMQâs ability to reconnect peers silently, to load-balance messages, and so on. But itâs still not good enough for real work. The only case where you can really trust the basic request-reply pattern is between two threads in the same process where thereâs no network or separate server process to die.
However, with a little extra work, this humble pattern becomes a good basis for real work across a distributed network, and we get a set of reliable request-reply (RRR) patterns that I like to call the Pirate patterns (youâll eventually get the joke, I hope).
There are, in my experience, roughly three ways to connect clients to servers. Each needs a specific approach to reliability:
Multiple clients talking directly to a single server. Use case: a single well-known server to which clients need to talk. Types of failure we aim to handle: server crashes and restarts, and network disconnects.
Multiple clients talking to a broker proxy that distributes work to multiple workers. Use case: service-oriented transaction processing. Types of failure we aim to handle: worker crashes and restarts, worker busy looping, worker overload, queue crashes and restarts, and network disconnects.
Multiple clients talking to multiple servers with no intermediary proxies. Use case: distributed services such as name resolution. Types of failure we aim to handle: service crashes and restarts, service busy looping, service overload, and network disconnects.