Chapter 2. Introduction to GPFS 51
This behavior poses a problem: if multiple disks fail, the file system may be
unmounted due to the lack of FSDesc quorum, even though more than half
(50% + 1) of the disks belonging to that file system are still available.
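To make the quorum arithmetic concrete, the following minimal sketch (illustrative Python, not GPFS code) shows the majority rule that decides whether the file system stays mounted, assuming three FSDesc replicas:

```python
def fsdesc_quorum_holds(total_replicas: int, available_replicas: int) -> bool:
    """Return True while a strict majority of the file system
    descriptor (FSDesc) replicas remains accessible."""
    return available_replicas > total_replicas // 2

# With 3 FSDesc replicas: losing one disk is tolerated,
# losing two breaks quorum and the file system is unmounted.
print(fsdesc_quorum_holds(3, 2))  # True
print(fsdesc_quorum_holds(3, 1))  # False
```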
To solve this problem, GPFS uses disk failure groups. Failure groups have the
following characteristics:
- A failure group is identified by a number assigned to each disk that
  belongs to it.
- GPFS uses failure groups when placing metadata and data on the disks of a
  file system.
- This placement ensures that no two replicas of the same block become
  unavailable due to a single failure.
- Failure groups are assigned by GPFS by default.
- They can also be defined explicitly, either at NSD creation time
  (mmcrnsd/mmcrvsd commands) or later (mmchdisk command). The syntax of the
  disk descriptor file is the following:
  DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup:NSDName
- It is important to set failure groups correctly to obtain effective file
  system replication (of metadata, data, or both).
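As an illustration of the descriptor syntax above, a hypothetical descriptor file (disk, node, and NSD names invented for this example) might map two disk enclosures to failure groups 1 and 2, so that replicas of a block never land in the same enclosure:

```
# Format: DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup:NSDName
# Disks in enclosure A -> failure group 1
hdisk3:node1:node2:dataAndMetadata:1:nsd1
hdisk4:node1:node2:dataAndMetadata:1:nsd2
# Disks in enclosure B -> failure group 2
hdisk5:node2:node1:dataAndMetadata:2:nsd3
hdisk6:node2:node1:dataAndMetadata:2:nsd4
```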
The following paragraphs explain in more detail the considerations for choosing
hardware and failure groups to provide maximum availability for your file
system.
Data replication
The GPFS replication feature allows you to specify how many copies of a file to
maintain. File system replication ensures that the latest updates to critical
data are preserved in the event of hardware failure. During configuration, you
assign a replication factor to indicate the total number of copies you want to
store; currently, only two copies are supported.
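For example, the replication factors can be set when the file system is created (a sketch using standard GPFS administration options; the mount point, device, and descriptor file names are hypothetical):

```shell
# Create a file system with two replicas of both metadata and data.
# -m/-M: default/maximum metadata replicas; -r/-R: default/maximum data replicas.
mmcrfs /gpfs1 /dev/gpfs1 -F diskdesc.txt -m 2 -M 2 -r 2 -R 2
```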
Replication allows you to set different levels of protection for each file, or
one level for an entire file system. Because replication consumes additional
disk capacity and reduces the available write throughput, you might want to
replicate only file systems that are frequently read but seldom written to, or
to replicate only metadata. In addition, only the primary copy is used for
reading (unlike, for example, RAID 1), so replication does not increase read
performance.
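To illustrate the per-file level of protection, the replication factors of an existing file can be changed after creation (a sketch using the standard GPFS commands; the file path is hypothetical):

```shell
# Set two replicas of metadata (-m) and data (-r) for a single file,
# then display its attributes to verify the change.
mmchattr -m 2 -r 2 /gpfs1/critical.dat
mmlsattr /gpfs1/critical.dat
```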
Failure groups
GPFS failover support allows you to organize your hardware into a number of
failure groups to minimize single points of failure. A failure group is a set
of disks that share a common point of failure that could cause them all to
become simultaneously unavailable, such as a disk enclosure, a RAID controller,
or a storage unit such as the DS4500.