Fault tolerance

To handle failures gracefully, Mesos implements two features, both enabled by default: checkpointing and slave recovery. Checkpointing is enabled both in the framework and on the slave, and allows certain information about the state of the cluster to be persisted periodically to disk on the Mesos slave server. The checkpointed data includes information on tasks, executors, and status updates. Slave recovery allows the Mesos slave daemon to read that state from disk and reconnect to running executors and tasks should the daemon fail or be restarted. If the Mesos slave daemon fails or is restarted, ...
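
The checkpoint-then-recover pattern described above can be illustrated with a small, self-contained Python sketch. This is not Mesos code and does not reflect Mesos's actual on-disk format or API: the `checkpoint`, `recover`, and `demo` functions, the `agent_state.json` file name, and the shape of the state dictionary are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def checkpoint(state, path):
    """Persist a (hypothetical) view of task state to disk.

    The data is written to a temporary file and then renamed over the
    target, so a crash mid-write does not leave a half-written checkpoint.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def recover(path):
    """Read the last checkpoint back after a restart.

    Returns an empty dict when no checkpoint exists yet.
    """
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def demo():
    """Checkpoint some state, 'restart', and recover it."""
    with tempfile.TemporaryDirectory() as work_dir:
        path = os.path.join(work_dir, "agent_state.json")  # illustrative name

        # State a daemon might track: task -> executor and latest status.
        state = {"task-1": {"executor": "executor-a", "status": "TASK_RUNNING"}}
        checkpoint(state, path)

        # Simulate a daemon restart: in-memory state is lost ...
        del state

        # ... and rebuilt from disk, so running tasks can be re-attached.
        return recover(path)


if __name__ == "__main__":
    print(demo())
```

The atomic rename is a design choice of this sketch rather than something the text above states about Mesos. It is included because a recovery routine is only useful if it never reads a partially written checkpoint.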
