Fault tolerance

To handle failures gracefully, Mesos implements two features, both enabled by default: checkpointing and slave recovery. Checkpointing is enabled on both the framework and the slave, and allows certain information about the state of the cluster to be persisted periodically to disk. This state is written to the disk of the Mesos slave server. The checkpointed data includes task-related information, such as executors and status updates. The second feature, slave recovery, allows the Mesos slave daemon to read that state back from disk and reconnect to running executors and tasks should the daemon fail or be restarted. If the Mesos slave daemon fails or is restarted, ...
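The checkpoint-then-recover pattern described above can be illustrated with a small, generic sketch. This is not Mesos code and does not reflect the on-disk format Mesos uses; the function names `checkpoint` and `recover`, the JSON layout, and the file path are all hypothetical, chosen only to show the idea of persisting state to disk and reading it back after a restart.

```python
import json
import os
import tempfile


def checkpoint(state, path):
    """Persist state to disk atomically (hypothetical helper, not a Mesos API).

    The state is written to a temporary file in the same directory and then
    renamed over the target, so a crash mid-write never leaves a partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def recover(path):
    """Return the last checkpointed state, or an empty state if none exists."""
    if not os.path.exists(path):
        return {"executors": {}, "status_updates": []}
    with open(path) as f:
        return json.load(f)
```

In this sketch, a restarted daemon would call `recover` on startup and use the returned executor records to decide which running executors to reconnect to; the reconnection logic itself is outside the scope of the example.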
