Chapter 10. Node Failure Pattern

This pattern focuses on how an application should respond when the compute node on which it is running shuts down or fails.

This pattern reflects the perspective of application code running on a node (virtual machine) that is shut down or suddenly fails due to a software or hardware issue. The application has three responsibilities: prepare in advance so that node failures cause minimal problems, handle node shutdowns gracefully, and recover once a node has failed.

Common reasons for shutdown include an application that has become unresponsive after an application failure, routine maintenance activities managed by the cloud vendor, and auto-scaling activities initiated by the application. Failures might be caused by a hardware fault or by an unhandled exception in your application code.

Although a node can shut down or fail for many reasons, we can treat them all uniformly. An application that handles the various forms of failure also handles every shutdown scenario, so handling failure is sufficient. The pattern takes its name from the more encompassing case, node failure.
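As a minimal sketch of handling a shutdown gracefully, the worker below stops taking new work once a stop has been requested, but always finishes the item it has already started. The names `stop_requested`, `request_stop`, `run_worker`, and `install_signal_handler` are hypothetical, and the sketch assumes the platform sends SIGTERM before reclaiming the node; the actual notification mechanism depends on the cloud vendor.

```python
import signal
import threading

# Hypothetical flag, set when the platform asks this node to stop.
stop_requested = threading.Event()


def request_stop(signum=None, frame=None):
    """Record the stop request and return immediately."""
    stop_requested.set()


def install_signal_handler():
    # Assumption: the platform delivers SIGTERM before reclaiming the node.
    signal.signal(signal.SIGTERM, request_stop)


def run_worker(work_items, handle):
    """Process items until the work runs out or a stop is requested.

    Returns the items that were never started, so that a caller could
    hand them back for another node to pick up.
    """
    pending = list(work_items)
    while pending and not stop_requested.is_set():
        item = pending.pop(0)
        handle(item)  # an item that has started always runs to completion
    return pending
```

A sudden failure gives the application no chance to run code like this, which is why graceful shutdown handling complements, rather than replaces, the ability to recover after a failure.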

Warning

Applications that do not handle node shutdowns and failures will be unreliable.

Context

The Node Failure Pattern is effective in dealing with the following challenges:

  • Your application is using the Queue-Centric Workflow Pattern and requires at-least-once processing for messages sent across tiers

  • Your application is using the Auto-Scaling Pattern and requires graceful shutdown of compute nodes that are being released

  • Your application ...
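To illustrate the at-least-once processing mentioned in the first challenge, here is a minimal sketch using a toy in-memory queue. It assumes a queue that hides a dequeued message for a fixed interval and makes it visible again unless it has been deleted; `InMemoryQueue`, `process_one`, and the explicit `now` parameter are hypothetical and stand in for a real queue service and a real clock.

```python
import itertools


class InMemoryQueue:
    """Toy queue with a visibility timeout (hypothetical, for illustration)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, invisible_until]
        self._ids = itertools.count(1)

    def enqueue(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def dequeue(self, now):
        """Return (id, body) of a visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Acknowledge a message; call only after processing succeeded."""
        self._messages.pop(msg_id, None)


def process_one(queue, handler, now):
    """Dequeue, process, then delete. Returns True if a message completed."""
    got = queue.dequeue(now)
    if got is None:
        return False
    msg_id, body = got
    handler(body)  # if the node dies here, delete never runs
    queue.delete(msg_id)
    return True
```

Because the message is deleted only after the handler returns, a node that fails in the middle of processing leaves the message in the queue, and another node can pick it up once the timeout expires. The same message may therefore be handled more than once, so the handler should be safe to repeat.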
