Noam Palatin, Arie Leizarowitz, Assaf Schuster and Ran Wolff
This chapter describes the Grid Monitoring System (GMS), a system that adopts a distributed data mining approach to detecting misconfigured grid machines. The GMS non-intrusively collects data from sources available throughout the grid system. It converts the raw data into semantically meaningful data and stores them on the machine from which they were obtained, limiting the incurred overhead and allowing scalability. When analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs and is especially suited to grid systems in which machines may be unavailable most of the time or often fail altogether.
Grid systems are notoriously difficult to manage. First, they often suffer more faults than other large systems: their hardware is typically more heterogeneous, as are the applications they execute, and since they often pool resources that belong to several administrative domains, there is no single authority able to enforce maintenance standards (e.g., with respect to software updates). Once problems occur, the huge complexity of the system makes it very hard to track them down and explain them. Maintenance, therefore, is often catastrophe-driven and rarely preventive.
The scale of a typical grid system and the ...