
350 High Performance Parallel I/O
the risk of incorrect operation due to transients which explore these corner
cases grows. Recent experience justifies concerns about these risks, which are
already showing up as root causes of intermittent errors in large-scale
machines.
One area of research with great potential impact is breaking the reliance
on tightly coupled applications that cannot tolerate a fault in any software
or hardware component they use. This pattern is sometimes described as local
failure causing global failure and restart. Instead, approaches that focus on allowing
portions of a calculation to fail while other portions continue (perhaps at ...