CHAPTER 8

Software Detection and Recovery

8.1 Overview

Software techniques to detect transient faults and correct corresponding errors are gaining popularity. Recently, Marathon Technologies introduced its EverRun servers in which fault tolerance is implemented completely in software [15]. Software fault-tolerance schemes typically degrade performance more than the hardware techniques described in previous chapters because of the inherent overhead incurred by software techniques. Further, in some cases, pure software implementations may not have enough visibility into the hardware to provide adequate fault coverage.

Nevertheless, software fault-tolerance schemes can still be attractive for several reasons. Software schemes are cheaper to implement ...

Get Architecture Design for Soft Errors now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.