Chapter 6. Problem determination 207
To issue this command, you only need to log in as hscroot and open a shell
window. After the files are generated, they can be sent to IBM by FTP for
analysis. Be sure to note in your PMR that you have uploaded the files to support.
6.3 AIX troubleshooting
An extensive guide to AIX troubleshooting can be found at:
http://publib.boulder.ibm.com/infocenter/eserver/v1r3s/index.jsp?topic=/iphau/referenceinfiniband.htm
Some of the common problems we have observed are shown in the following
sections.
Cable disconnected
Log in to the Flexible Service Processor’s (FSP) Advanced System Management
Interface (ASMI) from the HMC. In the GUI, select Service Applications →
Service Focal Point → Service Utilities, select the CEC, and then select
Selected → Launch ASM Menu.
From the ASMI, you can look into the Error/Event log and search for a B70069Ex
entry. These entries are “Informational” but will point to a possible InfiniBand
problem. When an InfiniBand cable is disconnected, a B70069E6 entry will
appear. When the cable has been reconnected, a B70069E9 code will appear in
the FSP error/event log. The meaning of these entries is shown in Table 6-1.
Table 6-1 B70069Ex codes
Reference code   Description
B700 69E6        The cable connection to the InfiniBand switch is broken.
                 Verify that the cable is connected and the switch is
                 powered on. After repairing the problem, verify that the
                 connection has been restored by using the InfiniBand
                 Network Manager.
B700 69E9        Initial communication has occurred with the Subnet
                 Manager. No service action required. The CEC and switch
                 firmware are now communicating.
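
When scanning a long error/event log, a small lookup can help keep the two codes apart. The following is a minimal sketch, not an IBM-supplied tool: the function name describe_ib_code is hypothetical, and it covers only the two meanings listed in Table 6-1.

```shell
# Hypothetical helper (not an IBM tool): describe a B70069Ex reference code
# using only the two meanings listed in Table 6-1.
describe_ib_code() {
    # Normalize input so "b700 69e6" and "B70069E6" are treated alike.
    code=$(printf '%s' "$1" | tr -d ' ' | tr '[:lower:]' '[:upper:]')
    case "$code" in
        B70069E6) echo "cable to InfiniBand switch disconnected: check cable and switch power" ;;
        B70069E9) echo "communication with Subnet Manager established: no service action required" ;;
        B70069E?) echo "B70069Ex code not listed in Table 6-1" ;;
        *)        echo "not a B70069Ex code" ;;
    esac
}
```

For example, describe_ib_code "B700 69E6" reports the disconnected-cable meaning. Any other B70069Ex code is reported as not listed rather than guessed at.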
208 Implementing InfiniBand on IBM System p
Interface does not configure and goes to the stopped/defined state
• Check if the HCA driver is available by using the lsdev -Cc adapter command.
  If the HCA is not available, and the adapter is a Galaxy adapter, check the
  HMC partition properties (not the profile) in the hardware tab. Check if the
  HCA tab is there. If it is not there, the partition does not recognize the
  Galaxy HCA.
• For Galaxy adapters, check that the LPAR’s properties on the HMC have a GUID
  index assigned to the adapter.
• Check if the ICM is configured by using lsdev -C | grep icm.
• If the ICM is not configured, run smitty icm.
• Check if the IB interface parameters are within range.
  A well-known P_KEY is 0xFFFF or 0x7FFF.
• Check the port range supported by the adapter (use F4 when possible to see
  the defaults).
• Check that the MTU is within range (32 to 2044).
• Check if AIX is running in 64-bit kernel mode and if it is V5.3 TL5 or later.
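
Several of the checks above can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes that the HCA appears in lsdev output under a logical name such as iba0 and that oslevel -r prints a string such as 5300-05. Each helper works on text or values passed to it, so it can be tried away from the system under test.

```shell
# Hypothetical helpers; none of these are AIX commands.

# Succeeds if stdin (output of `lsdev -Cc adapter`) lists an iba* device
# in the Available state. Assumption: the HCA is named iba0, iba1, ...
hca_available() {
    awk '$1 ~ /^iba[0-9]+$/ && $2 == "Available" { ok = 1 } END { exit !ok }'
}

# Succeeds if the MTU is an integer in the range given above (32 to 2044).
mtu_in_range() {
    case "$1" in ''|*[!0-9]*) return 1 ;; esac
    [ "$1" -ge 32 ] && [ "$1" -le 2044 ]
}

# Succeeds if an `oslevel -r` style string such as "5300-05" is V5.3 TL5 or later.
aix_level_ok() {
    case "$1" in *-*) ;; *) return 1 ;; esac
    rel=${1%%-*}
    tl=${1#*-}; tl=${tl%%-*}; tl=${tl#0}    # "05" becomes "5"
    case "$rel$tl" in ''|*[!0-9]*) return 1 ;; esac
    [ "$rel" -gt 5300 ] || { [ "$rel" -eq 5300 ] && [ "$tl" -ge 5 ]; }
}

# Possible use on the AIX partition:
#   lsdev -Cc adapter | hca_available || echo "HCA is not Available"
#   mtu_in_range 2044 || echo "MTU out of range"
#   aix_level_ok "$(oslevel -r)" || echo "AIX level is below V5.3 TL5"
#   bootinfo -K        # prints 64 when the 64-bit kernel is running
```

The helpers only report a condition. Correcting it still goes through the HMC partition properties or smitty icm, as described in the list.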
Ping does not work and ARP table shows incomplete entries
An ARP request packet was sent to the network, and the interface is waiting for
the ARP reply.
Probable causes:
• The remote system or interface is not UP and RUNNING.
  Run ifconfig ib0 on the remote system to see if it has the UP and RUNNING
  flags set.
• The local or remote interface is not in the broadcast multicast group.
  The switch may have removed the interface from the multicast group if there
  are communication problems. Logging in to the switch shows the multicast
  groups and the ports (GIDs) that belong to each group.
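
The symptom and the first probable cause can be checked from the command line. The following is a minimal sketch with hypothetical function names. It assumes that ifconfig prints its flags in the usual flags=...<UP,...,RUNNING,...> form and that arp marks unresolved entries with the word incomplete. Multicast group membership still has to be checked on the switch itself.

```shell
# Hypothetical helpers; they parse command output supplied on stdin.

# Succeeds if stdin (output of `ifconfig ib0`) shows both UP and RUNNING.
iface_up_running() {
    # Extract the comma-separated flag list between < and > on the flags line.
    flags=$(sed -n 's/.*flags=[^<]*<\([^>]*\)>.*/\1/p' | head -n 1)
    case ",$flags," in *,UP,*) ;; *) return 1 ;; esac
    case ",$flags," in *,RUNNING,*) return 0 ;; *) return 1 ;; esac
}

# Prints the lines of `arp -a` output (stdin) that are still incomplete.
incomplete_arp_entries() {
    grep -i 'incomplete'
}

# Possible use:
#   ifconfig ib0 | iface_up_running || echo "ib0 is not UP and RUNNING"
#   arp -a | incomplete_arp_entries
```

Run the ifconfig check on both the local and the remote system, because either side being down leaves the ARP entry incomplete.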
