Chapter 12. Fault Management

IN THIS CHAPTER

  • Predictive self-healing
  • Fault management overview
  • Fault management commands
  • Using fault management

Computer systems can fail in myriad ways. Hardware is subject to physical limitations and wear and tear that limit its lifetime. From disks to processors to network cards, the question is not whether your hardware will fail but when. Although software doesn't have the same physical limitations as hardware, it has its own share of problems. Bugs in applications, device drivers, file systems, system software, and any other software component can cause many kinds of failures.

Luckily, OpenSolaris provides substantial infrastructure for reliability, availability, and serviceability (RAS) in the presence of these inevitable faults. Predictive self-healing, described in this chapter and Chapter 13, provides a unified approach to fault management and service management in OpenSolaris. The observability tools, presented in Chapter 14, enable administrative monitoring and troubleshooting. In addition, the innovative Dynamic Tracing facility (DTrace), covered in Chapter 15, enables administrators to troubleshoot complex problems on live systems. Finally, the layered Open High Availability Cluster software, described in Chapter 16, enables you to group multiple physical OpenSolaris machines to obtain even higher availability of your system as a whole.

Predictive Self-Healing

Traditionally, UNIX systems handle hardware and software ...
