Monitoring is an important activity that supports the maintenance of cluster resources and ensures that the cluster remains serviceable to end users. It consists of resource control and health check tasks on the various nodes (compute, login, services), networking devices, and storage systems.
This topic is vast and deserves an entire book of its own. Rather than covering the subject in detail, this chapter introduces some tools and resources that can be used to support monitoring of a high performance computing (HPC) cluster.
This chapter ...