The Clone models we’ve explored up until now have been relatively simple. However, we’re now going to get into unpleasantly complex territory, which has me getting up for another espresso. You should appreciate that implementing “reliable” messaging is complex enough that you always need to ask, “Do we actually need this?” before jumping into it. If you can get away with being unreliable, or with “good enough” reliability, you can win big in terms of cost and complexity. Sure, you may lose some data now and then. It is often a good trade-off. Having said that, and... sips... because the espresso is really good, let’s jump in.
As you play with the last model, you’ll stop and restart the server. It might look like it recovers, but of course it’s applying updates to an empty state instead of the proper current state. Any new client joining the network will only get the latest updates instead of the full historical record.
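To make that failure concrete, here is a minimal sketch in plain Python, with no sockets involved. The names `CloneServer`, `apply_update`, and `snapshot` are hypothetical stand-ins for illustration, not the actual Clone code, and the "restart" is simulated by simply constructing a fresh object. It assumes the server keeps its state only in memory and numbers its updates from zero each time it starts.

```python
class CloneServer:
    """Toy stand-in for a server that holds key-value state in memory only."""

    def __init__(self):
        # A freshly started process knows nothing: empty state, sequence 0.
        self.sequence = 0
        self.kvmap = {}

    def apply_update(self, key, value):
        """Store one update and return the sequence number assigned to it."""
        self.sequence += 1
        self.kvmap[key] = value
        return self.sequence

    def snapshot(self):
        """Return what a newly joining client would receive as its state."""
        return self.sequence, dict(self.kvmap)


# A client that has been connected from the start mirrors every update.
old_client_state = {}

server = CloneServer()
for key, value in [("a", "1"), ("b", "2")]:
    server.apply_update(key, value)
    old_client_state[key] = value

# Simulated crash and restart: the new process starts from an empty state.
server = CloneServer()
server.apply_update("c", "3")
old_client_state["c"] = "3"

# A client joining now asks for a snapshot and gets only post-restart data.
late_sequence, late_state = server.snapshot()
print("old client:", old_client_state)
print("new client:", late_state, "at sequence", late_sequence)
```

The long-running client still holds all three keys, while the late joiner sees only the update made after the restart. In this sketch the two clients now disagree about the shared state, and nothing on the server can reconcile them.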
What we want is a way for the server to recover from being killed or crashing. We also need to provide backup in case the server is out of commission for any length of time. When people ask for “reliability,” ask them to list the failures they want to handle. In our case, these are:
The server process crashes and is automatically or manually restarted. The process loses its state and has to get it back from somewhere.
The server machine dies and is offline for a significant time. Clients have to switch to ...