The Binary Star pattern is as simple as it can be while still working accurately. In fact, the current design is the third complete redesign. We found each of the previous designs too complex, trying to do too much, so we stripped out functionality until we arrived at a design that was understandable, easy to use, and reliable enough to be worth using.
These are our requirements for a high-availability architecture:
Failover is meant to provide insurance against catastrophic system failures, such as hardware breakdown, fire, accident, and so on. There are simpler ways to recover from ordinary server crashes, and we have already covered these.
Failover time should be under 60 seconds, and preferably under 10 seconds.
Failover has to happen automatically, whereas recovery must happen manually. We want applications to switch over to the backup server automatically, but we do not want them to switch back to the primary server until the operators have fixed whatever problem there was and decided that it is a good time to interrupt applications again.
The semantics for client applications should be simple and easy for developers to understand. Ideally, they should be hidden in the client API.
There should be clear instructions for network architects on how to avoid designs that could lead to “split-brain syndrome,” in which both servers in a Binary Star pair think they are the active server.
There should be no dependencies on the order in which the two servers ...