Cluster monitoring and health checking
Monitoring is an important task that helps maintain cluster resources and ensures that the cluster remains serviceable to its users. It involves resource control and health check tasks on the various nodes (compute, login, and services), networking devices, and storage systems.
This chapter introduces some tools and resources that can be employed to support monitoring of a high-performance computing (HPC) cluster.
This chapter includes the following topics: