Chapter 12. Building Resilience In

Distributed systems mean additional latency and a higher chance of failure as requests go over the network. If you get timeouts and retries wrong, a slow service can be worse than a broken one, because calling threads get tied up waiting for it to respond. Even once the service recovers, the challenges aren’t over, because a thundering herd of requests can bring it back to its knees.
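To make the point about retries concrete, here is a minimal sketch in Python of a retry loop with capped exponential backoff and random jitter. Jitter is one common way to spread retries out so that recovering services are not hit by every caller at once. The names `TransientError`, `backoff_delays`, and `call_with_retries`, and the default values, are assumptions made for illustration. The sketch also assumes that `operation` enforces its own timeout and raises `TransientError` when it gives up.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one jittered delay per retry.

    Each delay is rng() * min(cap, base * 2**n), where n is the retry index,
    so the upper bound doubles on each retry until it reaches the cap.
    """
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on TransientError with jittered backoff.

    The last failure is re-raised once max_attempts calls have been made,
    so the caller never waits indefinitely.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(next(delays))
```

Bounding the number of attempts matters as much as the delay: unbounded retries are one way a slow dependency ends up tying up its callers.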

We need to build microservice-based systems differently. The services should be written to handle problems from the things they depend on, including the shutdown of the hosts they are running on.

The systems should be resilient to failure, with built-in redundancy. Retries, recovery, and remediation should be automated and graceful wherever possible. The microservice promise of a small blast radius on failure only applies if you have made sure the rest of the system can work when an individual service has problems.
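One way to keep the rest of the system working when a single service has problems is to degrade gracefully rather than propagate the failure. The following Python sketch assumes a hypothetical recommendations dependency. `DependencyUnavailable`, `fetch_recommendations`, and the `client` callable are illustrative names, not part of any real library.

```python
class DependencyUnavailable(Exception):
    """Hypothetical error raised when a downstream call fails or times out."""


def fetch_recommendations(user_id, client, default=()):
    """Return (items, degraded) for a user.

    client is any callable that takes a user ID and returns an iterable of
    items, or raises DependencyUnavailable. On failure, the static default
    is returned and degraded is True, so the caller can still render a page.
    """
    try:
        return list(client(user_id)), False
    except DependencyUnavailable:
        return list(default), True
```

The `degraded` flag lets the caller record how often the fallback is used, which keeps a quietly failing dependency visible.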

Later in the chapter, I’m going to talk about how to build resilient services, and then resilient systems. First, though, let’s discuss what resilience means, and especially what the challenges are in building a resilient distributed system.

What Is Resilience?

Simply stated, resilience is the capacity to withstand or recover quickly from difficulties.

Things will go wrong in any production system. A resilient software system will continue to provide an acceptable level of service even if some parts of the system are under stress or have stopped working. It will also ...
