Chapter 2. Introduction to GPFS 47
2.2.6 GPFS internal recovery procedures
In this section we want to answer the following question: How does a failure of a
GPFS management function (CfgMgr, FSMgr, MN) affect I/O operations?
For the three management functions, we investigate the following three scenarios:
1. Metanode (MN) fails, but this node does not act as CfgMgr or FSMgr node
2. File system manager (FSMgr) fails, but this node does not act as CfgMgr
3. Configuration manager (CfgMgr) fails
If a node handles multiple management functions (for example, it is both a
metanode and the FSMgr), the rules of all applicable scenarios apply.
Scenario 1: Metanode fails
If the MN fails, the CfgMgr invokes recovery for each of the file systems
that were mounted on the failed node. The recovery is performed by the FSMgr of
each file system that was mounted on the failed node, and ensures that the
failed node no longer has access to the disks belonging to those file systems.
The recovery restores the metadata that was being modified at the time of the
failure to a consistent state. The possible exception is blocks that were
allocated but are not part of any file; these blocks are effectively lost until
a file system check (mmfsck) is run, either online or offline.
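The Scenario 1 flow described above can be illustrated with a small sketch. It
is a toy model, not GPFS code: the FileSystem class and the functions
fsmgr_recover and cfgmgr_handle_failure are hypothetical names chosen for this
example, and the metadata repair itself is reduced to a status message.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileSystem:
    name: str
    mounted_on: set[str]                            # nodes with this file system mounted
    fenced: set[str] = field(default_factory=set)   # nodes denied access to the disks


def fsmgr_recover(fs: FileSystem, failed_node: str) -> str:
    """Hypothetical FSMgr step: cut off disk access, then report consistent metadata."""
    fs.fenced.add(failed_node)           # the failed node can no longer reach the disks
    fs.mounted_on.discard(failed_node)   # it no longer counts as having a mount
    return f"{fs.name}: fenced {failed_node}, metadata consistent"


def cfgmgr_handle_failure(file_systems: list[FileSystem], failed_node: str) -> list[str]:
    """Hypothetical CfgMgr step: invoke recovery only for file systems the node had mounted."""
    affected = [fs for fs in file_systems if failed_node in fs.mounted_on]
    return [fsmgr_recover(fs, failed_node) for fs in affected]


file_systems = [
    FileSystem("gpfs1", {"node1", "node2"}),
    FileSystem("gpfs2", {"node2", "node3"}),
]
log = cfgmgr_handle_failure(file_systems, "node1")
print(log)
```

In this sketch only gpfs1 is recovered, because node1 did not have gpfs2
mounted. Blocks that were allocated but not attached to a file are outside the
model; as stated above, reclaiming them requires running mmfsck.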
Note: The time intervals shown in the previous table are calculated based on
the assumption that a single event occurs at a time. If multiple events occur,
they will be processed in sequence, therefore the cluster and/or file system
recovery times may vary.
The calculation is valid for GPFS using the disk leasing mechanism, because the
hardware reservation mechanism is no longer a valid option starting with GPFS V2.3.
Attention: Only one failure (state change), such as the loss or initialization of
a node, can be processed at a time, and subsequent changes will be queued.
This means that the entire failure processing must complete before the failed
node can join the group again. All failures are processed first, regardless of
the type of failure, which means that GPFS will handle all failures prior to
completing any recovery.
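The queuing behavior described in this note can be pictured with the following
sketch. It reflects one reading of the text (queued failures are drained one at
a time before any node is allowed to rejoin) and is not taken from the GPFS
implementation; the GroupMembership class and its method names are hypothetical.

```python
from collections import deque


class GroupMembership:
    """Toy model: one state change at a time, failures handled before joins."""

    def __init__(self, members):
        self.members = set(members)
        self.failures = deque()   # queued failure events
        self.joins = deque()      # queued join requests
        self.history = []         # order in which events were actually processed

    def report_failure(self, node):
        self.failures.append(node)

    def request_join(self, node):
        self.joins.append(node)

    def process_all(self):
        # Drain every queued failure first, one event at a time.
        while self.failures:
            node = self.failures.popleft()
            self.members.discard(node)
            self.history.append(("fail", node))
        # Only after all failure processing is complete may nodes rejoin.
        while self.joins:
            node = self.joins.popleft()
            self.members.add(node)
            self.history.append(("join", node))


group = GroupMembership(["node1", "node2", "node3"])
group.report_failure("node1")
group.request_join("node1")      # node1 restarts and asks to rejoin right away
group.report_failure("node2")    # a second failure arrives afterwards
group.process_all()
print(group.history)
```

Although node1 asked to rejoin before node2 failed, the sketch handles both
failures before it admits node1 again, which mirrors the ordering described in
the note.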