Chapter 8. Resilience

This chapter focuses on application resilience, which is the ability to survive situations that might otherwise lead to failure. Unlike other chapters that focused on services external to the Node.js process, this one mostly looks within the process.

Applications should be resilient to certain types of failure. For example, there are many options available to a downstream service like web-api when it is unable to communicate with an upstream service like recipe-api. Perhaps it should retry the outgoing request, or maybe it should respond to the incoming request with an error. But in any case, crashing isn’t the best option. Similarly, if a connection to a stateful database is lost, the application should probably try to reconnect to it, while replying to incoming requests with an error. On the other hand, if a connection to a caching service is dropped, then the best action might be to reply to the client as usual, albeit in a slower, “degraded” manner.

In many cases it is necessary for an application to crash. If a failure occurs that an engineer doesn’t anticipate—often global to the process and not associated with a single request—then the application can potentially enter a compromised state. In these situations it’s best to log the stack trace, leaving evidence behind for an engineer, and then exit. Due to the ephemeral nature of applications, it’s important that they remain stateless—doing so allows future instances to pick up where ...

Get Distributed Systems with Node.js now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.