Checkpointing
Abstract
Checkpointing is the process of saving the data a running application needs so that it can be resumed later, either after a system failure or to work around wallclock time limits on a supercomputer. As such, it is a distinctive form of high performance computing input/output (I/O), and it generally has more flexibility in how quickly it must complete than other I/O operations do. Application checkpoint and restart may be performed entirely by the system, without any change to the application code, using one of several available system-level checkpointing tools. Alternatively, application checkpoint and restart may be designed and executed entirely by the application user, with ...
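As a rough illustration of application-level (user-designed) checkpoint and restart, the following is a minimal Python sketch under stated assumptions, not the chapter's own method. It assumes the application state fits in a small JSON-serializable dictionary. The file name `state.ckpt.json` and the functions `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical. The loop saves its state periodically and, when started again, resumes from the last saved checkpoint instead of from the beginning. Writing to a temporary file and then renaming it with `os.replace` is one common way to avoid leaving a half-written checkpoint if the job is killed mid-write.

```python
import json
import os
import tempfile

# Hypothetical checkpoint file name, used only for illustration.
CHECKPOINT_PATH = "state.ckpt.json"


def save_checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
    """Write state to a temporary file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to storage before the rename
    # The rename replaces the previous checkpoint in one step, so a crash
    # during the write leaves the last complete checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path: str = CHECKPOINT_PATH):
    """Return the saved state, or None if no checkpoint exists (fresh start)."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(total_steps: int, checkpoint_every: int = 10, path: str = CHECKPOINT_PATH) -> dict:
    """Hypothetical iterative application that restarts from its last checkpoint."""
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        state["value"] += 0.5 * state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

For example, if a job stops after step 25 with `checkpoint_every=10`, the next run reloads the step-20 checkpoint and recomputes only steps 20 onward. A job facing a wallclock limit could similarly call `save_checkpoint` shortly before its time allocation expires. The checkpoint interval trades I/O cost against the amount of work that must be redone after a failure.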