The Recovery Challenge
Proper error handling and recovery is the Achilles’ heel of many applications. When an application fails to perform a particular operation, you should recover from the failure and restore the system (that is, the collection of interacting services and clients) to a consistent state, usually the state the system was in before the operation that caused the error took place. Typically, any operation that can fail consists of multiple, potentially concurrent, smaller steps. Some of those steps can fail while others succeed. The problem with recovery is the sheer number of partial-success and partial-failure permutations that you have to code against. For example, an operation comprising 10 smaller, concurrent steps has more than three million recovery scenarios, because for the recovery logic the order in which the steps fail matters as well, and the factorial of 10 is 3,628,800.
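To make that arithmetic concrete, here is a minimal Python sketch. It assumes the simple counting model used above, in which each distinct order of failure among the steps is a separate recovery scenario. The function names and the three step names are hypothetical, chosen only for illustration.

```python
import math
from itertools import permutations


def count_failure_orderings(step_count: int) -> int:
    """Count the distinct orders in which `step_count` steps can fail.

    Assumes every ordering is a separate recovery scenario, so the
    count is simply the factorial of the number of steps.
    """
    return math.factorial(step_count)


def list_failure_orderings(steps):
    """Enumerate every failure order explicitly.

    Only practical for a handful of steps, because the result grows
    factorially with the number of steps.
    """
    return list(permutations(steps))


if __name__ == "__main__":
    # Ten concurrent steps, as in the example in the text.
    print(count_failure_orderings(10))  # 3628800

    # Three hypothetical steps already yield six orderings to handle.
    for order in list_failure_orderings(["debit", "credit", "log"]):
        print(" -> ".join(order))
```

Even three steps produce six orderings, and each additional step multiplies the total by the new step count, which is why enumerating the cases by hand stops being feasible so quickly.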
Trying to handcraft recovery code in a decent-size application is often futile. It results in fragile code that is very susceptible to any change in the application execution or the business use case, and it incurs both productivity and performance penalties. The productivity penalty results from the sheer effort of handcrafting the recovery logic. The performance penalty is inherent in such an approach because you need to execute large amounts of code after every operation to verify that all is well. In reality, developers tend to deal only with the easy recovery ...