Transient faults are temporary conditions that cause a failure, such as a momentary loss of network connectivity or a service timeout due to overload. Distributed systems, which are composed of multiple services interacting over a network, are much more prone to transient faults than monolithic applications.
Transient faults are, in most cases, self-correcting, so a subsequent retry of the failed operation is likely to succeed. The main challenge with retries, however, is that there is no easy way to distinguish a transient fault from a non-transient one. Retries must therefore be bounded, so that a non-transient fault is not retried indefinitely.
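As a rough illustration of bounding retries, the sketch below retries an operation a fixed number of times and then gives up by re-raising the last error. The function name `call_with_retries`, its parameters, and the choice of `ConnectionError` and `TimeoutError` as retryable errors are assumptions made for this example only, not part of any particular library.

```python
import time


def call_with_retries(operation, max_attempts=3, delay_seconds=0.5,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call operation(), retrying a bounded number of times.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            # We cannot tell here whether the fault is transient, so the
            # attempt cap is what prevents retrying forever.
            if attempt == max_attempts:
                raise
            # Pause briefly before the next attempt.
            sleep(delay_seconds)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_operation():
        # Simulated transient fault: fails on the first two calls only.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("simulated temporary failure")
        return "ok"

    print(call_with_retries(flaky_operation, max_attempts=3, delay_seconds=0.1))
```

Errors outside `retry_on` propagate immediately, and a fault that persists through every attempt surfaces to the caller after `max_attempts` calls instead of looping.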
The following practices, when implemented together, help handle transient faults: ...