Defining Fault Management
Detecting and reporting unusual or unacceptable behavior is generally referred to as fault management (or event management). A fault is any behavior that differs from specified or expected behavior, although the term is most often used to refer to the complete failure of a hardware component or software product.
Fault conditions can be characterized in many different ways. Faults can be caused by hardware component failures in the environment, or by the failure of software running on systems within the environment. A computer depends on more than its CPU and memory; for example, power supplies and fans can also fail. Loss of power in the data center, natural disasters, and the failure of air conditioning units are just a few examples ...