Chapter 5

Resilience for extreme scale computing

R. Gioiosa, Pacific Northwest National Laboratory, Richland, WA, United States

Abstract

Supercomputing systems are essential for progress in science and industry, from quantum mechanics to oil and gas exploration. The ever-increasing demand for computing power has driven the development of extremely large systems that consist of millions of commercial off-the-shelf components. As those systems approach technology and operational cost limits, new challenges arise, especially in the area of resilience, that make supercomputing environments extremely unstable and unreliable.

This chapter reviews the intrinsic characteristics of high-performance applications and how faults occurring ...
