usa: Tue Jul 27 08:16:47.433 2010: r16: 0x0000000110D44E70 r17: 0x0000000000000000
usa: Tue Jul 27 08:16:47.441 2010: r18: 0x0000000110000DA0 r19: 0x0000000000000000
usa: Tue Jul 27 08:16:47.449 2010: r20: 0x0000000000000002 r21: 0x0000000110BFF98D
usa: Tue Jul 27 08:16:47.457 2010: r22: 0x000000000000000E r23: 0xFFFFFFFFFFFFFFFF
usa: Tue Jul 27 08:16:47.465 2010: r24: 0x0000000000000000 r25: 0x0000000110FC379B
usa: Tue Jul 27 08:16:47.473 2010: r26: 0x0000000110FC3790 r27: 0x0000000000000001
usa: Tue Jul 27 08:16:47.481 2010: r28: 0x0000000110FCAA78 r29: 0x0000000000000006
usa: Tue Jul 27 08:16:47.493 2010: r30: 0x0000000000000006 r31: 0x0000000000005354
usa: Tue Jul 27 08:16:47.509 2010: iar: 0x0900000000712650 msr: 0xA00000000000D032
usa: Tue Jul 27 08:16:47.521 2010: cr: 0x0000000000002000 link: 0xFFFFFFFFFFFFFFFF
usa: Tue Jul 27 08:16:47.529 2010: ctr: 0xFFFFFFFF006FA300 xer: 0x00000000FFFFFFFF
usa: Tue Jul 27 08:16:47.537 2010: exad: 0x0000000110F528C8
usa: Tue Jul 27 08:16:47.545 2010: 0x900000000712650 pthread_kill() + 0xB0
usa: Tue Jul 27 08:16:47.553 2010: 0x900000000711EC8 _p_raise() + 0x48
usa: Tue Jul 27 08:16:47.561 2010: 0x90000000002BD2C raise() + 0x4C
usa: Tue Jul 27 08:16:47.569 2010: 0x900000000088504 abort() + 0xC4
usa: Tue Jul 27 08:16:47.577 2010: 0x900000000088344 __assert_c99() + 0x2E4
usa: Tue Jul 27 08:16:47.585 2010: 0x100005C54 logAssertFailed() + 0x244
usa: Tue Jul 27 08:16:47.593 2010: 0x10002CD0C runTSTest(int,const char*,int&) + 0xFD8
usa: Tue Jul 27 08:16:47.601 2010: 0x100035E94 runTSDebug(int,int,char**) + 0x380
usa: Tue Jul 27 08:16:47.609 2010: 0x100426B14 RunClientCmd(MessageHeader*,IpAddr,unsigned short,int,int,StripeGroup*,RpcContext*) + 0x8F8
usa: Tue Jul 27 08:16:47.617 2010: 0x100428418 HandleCmdMsg(void*) + 0x67C
usa: Tue Jul 27 08:16:47.625 2010: 0x10003B94C Thread::callBody(Thread*) + 0xDC
usa: Tue Jul 27 08:16:47.633 2010: 0x100003014 Thread::callBodyWrapper(Thread*) + 0xB0
usa: Tue Jul 27 08:16:47.641 2010: 0x9000000006FAD50 _pthread_body() + 0xF0
usa: Tue Jul 27 08:16:47.649 2010: 0xFFFFFFFFFFFFFFFC
6.3 GPFS problem scenarios
Given the data collection and tools described in previous sections, this section provides the
steps to use when collecting documentation for common GPFS problem scenarios.
6.3.1 Considerations
This section provides details about the areas to consider when challenges occur in your
environment.
File system not mounting
Several reasons exist for a file system not mounting. The scope of the problem determines
where to focus problem determination.
Reasons for not mounting are as follows:
GPFS quorum is not met (quorum is defined as one plus half of the explicitly defined
quorum nodes in the GPFS cluster)
Node recovery is in progress
NSDs are down
DMAPI is enabled
If the mount failure is cluster-wide, such as after rebooting the cluster, focus on
determining whether cluster quorum is met. If the quorum semantics are broken, GPFS
performs recovery in an attempt to achieve quorum again.
If the mount failure is limited to one or a few nodes in the cluster, begin problem
determination by looking at the mmfs logs and the waiters from all of the nodes.
Quorum
Quorum problems at cluster startup time show up as nodes that do not join the cluster or
file systems that do not mount. GPFS quorum must be maintained within the cluster for
GPFS to remain active. Use the following commands to check the quorum state:
Use the following command to determine how many and which nodes are quorum nodes:
mmlscluster | grep quorum
Use the following command to determine the state (down, arbitrating, or active) of these
nodes:
mmgetstate -a
The formula for quorum is as follows:
(#quorumNodes/2) +1
If quorum is not met, the cluster does not form and the file systems do not mount.
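As an illustration only, the two checks can be combined into a short shell sequence. The grep count is approximate (it counts every output line that contains the word quorum), and the variable names are arbitrary:

# Count the explicitly defined quorum nodes (approximate count)
NQ=$(mmlscluster | grep -c quorum)
# One plus half of the defined quorum nodes must be active
REQ=$(( NQ / 2 + 1 ))
echo "Defined quorum nodes: $NQ, nodes required for quorum: $REQ"
# Check which nodes are down, arbitrating, or active
mmgetstate -a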
GPFS file system hangs
The general data collection steps when a GPFS file system hang occurs are as follows (a
sample command sequence is shown after the steps):
1. Gather waiters from all nodes (see 6.2.1, “Data collection commands” on page 258).
2. From all nodes with waiters, use mmfsadm dump all.
3. If any remote clusters exist, perform steps 1 and 2 from those clusters.
4. Use gpfs.snap from one node; use gpfs.snap -z from all other nodes.
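The following sequence is a minimal sketch of these steps. It assumes that the mmdsh utility in /usr/lpp/mmfs/bin is available to run a command on every node (a loop of ssh commands over the node names works as well), and the output file names are examples only:

# Step 1: collect the waiters from every node in the cluster
mmdsh -N all /usr/lpp/mmfs/bin/mmfsadm dump waiters > /tmp/waiters.all
# Step 2: on each node that shows waiters, dump the full daemon state
mmfsadm dump all > /tmp/dump.all.$(hostname)
# Step 3: repeat steps 1 and 2 on any remote clusters
# Step 4: take a full snap on one node and a -z snap on all other nodes
gpfs.snap                 # on one node
gpfs.snap -z              # on every other node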
GPFS related command hangs
The general data collection steps when file system related commands or scripts (such as
ls, cp, mv, or mkdir) hang are as follows (a sample trace capture is shown after the steps):
1. Check for long waiters. If waiters exist, treat the problem as a possible hang.
2. If no waiters exist, continue with the following steps to re-create the problem while
capturing a trace.
3. Export the appropriate TRCFILESIZE and TRCBUFSIZE environment variables.
4. Start the mmfs tracing with the mmtrace command.
5. Re-create the error.
6. Stop mmfs tracing.
7. Gather the data with the mmfsadm dump all command.
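A minimal sketch of this trace capture follows. The buffer and file sizes are example values only (in bytes), and the ls command and path stand in for whatever operation is hanging:

# Steps 3 and 4: set the trace sizes (example values) and start mmfs tracing
export TRCFILESIZE=134217728
export TRCBUFSIZE=67108864
mmtrace
# Step 5: re-create the error (hypothetical path)
ls /gpfs/fs1/problem_directory
# Step 6: stop mmfs tracing
mmtrace stop
# Step 7: gather the daemon state
mmfsadm dump all > /tmp/dump.all.$(hostname)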
GPFS administration command hangs
The general data collection steps when a GPFS administration command hangs are as
follows (a sample capture session is shown after the steps):
1. Start capturing the terminal session with the script command, writing to /tmp/mmtrace.out.
2. Re-create the problem so that its output is captured.
3. Set and export DEBUG=1.
4. Re-create the error again, this time with debug output.
5. Exit the script session to end the capture.
6. Run the gpfs.snap script.
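For illustration only, a possible capture session is shown next. It assumes the standard UNIX script utility is used for step 1, and mmdf with the hypothetical file system name gpfs1 stands in for whichever administration command is hanging:

# Step 1: capture the terminal session to /tmp/mmtrace.out
script /tmp/mmtrace.out
# Step 2: re-create the problem once so that its output is captured
mmdf gpfs1
# Step 3: turn on shell-level debugging of the mm administration scripts
export DEBUG=1
# Step 4: re-create the error again, now with debug output
mmdf gpfs1
# Step 5: end the script capture session
exit
# Step 6: collect the snap data
gpfs.snap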
