Search the Catalog
Managing NFS and NIS, 2nd Edition

Managing NFS and NIS, 2nd Edition

By Hal Stern, Mike Eisler & Ricardo Labiaga
2nd Edition June 2001 (est.)
1-56592-510-6, Order Number: 5106
510 pages (est.), $39.95 (est.)

Chapter 15
Debugging Network Problems

In this chapter:
Duplicate ARP Replies
Renegade NIS Server
Boot Parameter Confusion
Incorrect Directory Content Caching
Incorrect Mountpoint Permissions
Asynchronous NFS Error Messages

This chapter consists of case studies in network problem analysis and debugging, ranging from Ethernet addressing problems to a machine posing as an NIS server in the wrong domain. This chapter is a bridge between the formal discussion of NFS and NIS tools and their use in performance analysis and tuning. The case studies presented here walk through debugging scenarios, but should give you an idea of how the various tools work together.

When debugging a network problem, it's important to think about the potential cause of a problem, and use that to start ruling out other factors. For example, if your attempts to bind to an NIS server are failing, you should know that you could try testing the network using ping; the health of ypserv processes using rpcinfo, and finally the binding itself with ypset. Working your way through the protocol layers ensures that you don't miss a low-level problem that is posing as a higher-level failure. Keeping with that advice, we'll start by looking at a network layer problem.

Duplicate ARP Replies

ARP misinformation was briefly mentioned in , and this story showcases some of the baffling effects it creates. A network of two servers and ten clients suddenly began to run very slowly, with the following symptoms:

The local area network with the problems was joined to the campus-wide backbone via a bridge. An identical network of machines, running the same applications with nearly the same configuration, was operating without problems on the far side of the bridge. We were assured of the health of the physical network by two engineers who had verified physical connections and cable routing.

The very sporadic nature of the problem -- and the fact that it resolved itself over time -- pointed toward a problem with ARP request and reply mismatches. This hypothesis neatly explained the extraordinarily slow loading of the application: a client machine trying to read the application executable would do so by issuing NFSv2 requests over UDP. To send the UDP packets, the client would ARP the server, randomly get the wrong reply, and then be unable to use that entry for several minutes. When the ARP table entry had aged and was deleted, the client would again ARP the server; if the correct ARP response was received then the client could continue reading pages of the executable. Every wrong reply received by the client would add a few minutes to the loading time.

There were several possible sources of the ARP confusion, so to isolate the problem, we forced a client to ARP the server and watched what happened to the ARP table:

# arp -d muskrat 
muskrat (139.50.2.1) deleted 
# ping -s muskrat 
PING muskrat: 56 data bytes 
No further output from ping

By deleting the ARP table entry and then directing the client to send packets to muskrat, we forced an ARP of muskrat from the client. ping timed out without receiving any ICMP echo replies, so we examined the ARP table and found a surprise:

# arp -a | fgrep muskrat 
le0   muskrat               255.255.255.255       08:00:49:05:02:a9

Since muskrat was a Sun workstation, we expected its Ethernet address to begin with 08:00:20 (the prefix assigned to Sun Microsystems), not the 08:00:49 prefix used by Kinetics gateway boxes. The next step was to figure out how the wrong Ethernet address was ending up in the ARP table: was muskrat lying in its ARP replies, or had we found a network imposter?

Using a network analyzer, we repeated the ARP experiment and watched ARP replies returned. We saw two distinct replies: the correct one from muskrat, followed by an invalid reply from the Kinetics FastPath gateway. The root of this problem was that the Kinetics box had been configured using the IP broadcast address 0.0.0.0, allowing it to answer all ARP requests. Reconfiguring the Kinetics box with a non-broadcast IP address solved the problem.

The last update to the ARP table is the one that "sticks," so the wrong Ethernet address was overwriting the correct ARP table entry. The Kinetics FastPath was located on the other side of the bridge, virtually guaranteeing that its replies would be the last to arrive, delayed by their transit over the bridge. When muskrat was heavily loaded, it was slow to reply to the ARP request and its ARP response would be the last to arrive. Reconfiguring the Kinetics FastPath to use a proper IP address and network mask cured the problem.

ARP servers that have out-of-date information create similar problems. This situation arises if an IP address is changed without a corresponding update of the server's published ARP table initialization, or if the IP address in question is re-assigned to a machine that implements the ARP protocol. If an ARP server was employed because muskrat could not answer ARP requests, then we should have seen exactly one ARP reply, coming from the ARP server. However, an ARP server with a published ARP table entry for a machine capable of answering its own ARP requests produces exactly the same duplicate response symptoms described above. With both machines on the same local network, the failures tend to be more intermittent, since there is no obvious time-ordering of the replies.

There's a moral to this story: you should rarely need to know the Ethernet address of a workstation, but it does help to have them recorded in a file or NIS map. This problem was solved with a bit of luck, because the machine generating incorrect replies had a different manufacturer, and therefore a different Ethernet address prefix. If the incorrectly configured machine had been from the same vendor, we would have had to compare the Ethernet addresses in the ARP table with what we believed to be the correct addresses for the machine in question.

Renegade NIS Server

A user on our network reported that he could not log into his workstation. He supplied his username and the same password he'd been using for the past six months, and he consistently was told "Login incorrect." Out of frustration, he rebooted his machine. When attempting to mount NFS filesystems, the workstation was not able to find any of the NFS server hosts in the hosts NIS map, producing errors of the form:

nfs mount: wahoo: : RPC: Unknown host

There were no error messages from ypbind, so it appeared that the workstation had found an NIS server. The culprit looked like the NIS server itself: our guess was that it was a machine masquerading as a valid NIS server, or that it was an NIS server whose maps had been destroyed. Because nobody could log into the machine, we rebooted it in single-user mode, and manually started NIS to see where it bound:

Single-user boot 
# /etc/init.d/inetinit start
NIS domainname is nesales
Starting IPv4 router discovery.
Starting IPv6 neighbor discovery.
Setting default IPv6 interface for multicast: add net ff00::/8: gateway fe80::a00:20ff:fea0:3390
# /etc/init.d/rpc start
starting rpc services: rpcbind keyserv ypbind done.
# ypwhich
131.40.52.25

We manually invoked the /etc/init.d/inetinit startup script to initialize the domain name and configure the routing. We then invoked the /etc/init.d/rpc script to start ypbind. Notice that ypwhich was not able to match the IP address of the NIS server in the hosts NIS map, so it printed the IP address. The IP address belonged to a gateway machine that was not supposed to be a NIS server. It made sense that clients were binding to it, if it was posing as an NIS server, since the gateway was very lightly loaded and was probably the first NIS server to respond to ypbind requests.

We logged into that machine, and verified that it was running ypserv. The domain name used by the gateway was nesales -- it had been brought up in the wrong domain. Removing the /var/yp/nesales subdirectory containing the NIS maps and restarting the NIS daemons took the machine out of service:

# cd /var/yp 
# rm -rf nesales 
# /usr/lib/netsvc/yp/ypstop
# /usr/lib/netsvc/yp/ypstart

We contacted the person responsible for the gateway and had him put the gateway in its own NIS domain (his original intention). Machines in nesales that had bound to the renegade server eventually noticed that their NIS server had gone away, and they rebound to valid servers.

As a variation on this problem, consider an NIS server that has damaged or incomplete maps. Symptoms of this problem are nearly identical to those previously described, but the IP address printed by ypwhich will be that of a familiar NIS server. There may be just a few maps that are damaged, possibly corrupted during an NIS transfer operation, or all of the server's maps may be corrupted or lost. The latter is most probable when someone accidentally removes directories in /var/yp.

To check the consistency of various maps, use ypcat to dump all of the keys known to the server. A few damaged maps can be replaced with explicit yppush operations on the master server. If all of the server's maps are damaged, it is easiest to re-initialize the server. Slave servers are easily rebuilt from a valid master server, but if the master server has lost the DBM files containing the maps, initializing the machine as an NIS master server regenerates only the default set of maps. Before rebuilding the NIS master, save the NIS Makefile, in /var/yp or /etc/yp, if you have made local changes to it. The initialization process builds the default maps, after which you can replace your hand-crafted Makefile and build all site-specific NIS maps.

Boot Parameter Confusion

Different vendors do not always agree on the format of responses to various broadcast requests. Great variation exists in the bootparam RPC service, which supplies diskless nodes with the name of their boot server, and pathname for their root partition. If a diskless client's request for boot parameters returns a packet that it cannot understand, the client produces a rather cryptic error message and then aborts the boot process.

As an example, we saw the following strange behavior when a diskless Sun workstation attempted to boot. The machine would request its Internet address using RARP, and receive the correct reply from its boot server. It then downloaded the boot code using tftp, and sent out a request for boot parameters. At this point, the boot sequence would abort with one of the errors:

null domain name 
invalid reply

Emulating the request for boot parameters using rpcinfo located the source of the invalid reply quickly. Using a machine close to the diskless node, we sent out a request similar to that broadcast during the boot sequence, looking for bootparam servers:

% rpcinfo -b bootparam 1 
192.9.200.14.128.67   clover 
192.9.200.1.128.68    lucy 
192.9.200.4.128.79    bugs

lucy and bugs were boot and root/swap servers for diskless clients, but clover was a machine from a different vendor. It should not have been interested in the request for boot parameters. However, clover was running rpc.bootparamd, which made it listen for boot parameter requests, and it used the NIS bootparams map to glean the boot information. Unfortunately, the format of its reply was not digestible by the diskless Sun node, but its reply was the first to arrive. In this case, the solution merely involved turning off rpc.bootparamd by commenting it out of the startup script on clover.

If clover supported diskless clients of its own, turning off rpc.bootparamd would not have been an acceptable solution. To continue running rpc.bootparamd on clover, we would have had to ensure that it never sent a reply to diskless clients other than its own. The easiest way to do this is to give clover a short list of clients to serve, and to keep clover from using the bootparams NIS map.[1]

Incorrect Directory Content Caching

A user of a Solaris NFS client reported having intermittent problems accessing files mounted from a non-Unix NFS server. The Solaris NFS client tarsus was apparently able to list files that had previously been removed by another NFS client, but was unable to access the contents of the files. The files would eventually disappear. The NFS client that initially removed the files did not experience any problems and the user reported that the files had indeed been removed from the server's directory. He verified this by logging into the NFS server and listing the contents of the exported directory.

We suspected the client tarsus was not invalidating its cached information, and proceeded to try to reproduce the problem while capturing the NFS packets to analyze the network traffic:

[1] tarsus$ ls -l /net/inchun/export/folder
total 8
-rw-rw-rw-   1 labiaga  staff       2883 Apr 10 20:03 data1
-rw-rw-rw-   1 root     other         12 Apr 10 20:01 data2
 
[1] protium$ rm /net/inchun/export/folder/data2
 
[2] tarsus$ ls /net/inchun/export/folder
data1   data2
[3] tarsus$ ls -l /net/inchun/export/folder
/net/inchun/export/folder/data2: Stale NFS file handle
total 6
-rw-rw-rw-   1 labiaga  staff       2883 Apr 10 20:03 data1

The first directory listing on tarsus correctly displayed the contents of the NFS directory /net/inchun/export/folder before anything was removed. The problems began after the NFS client protium removed the file data2. The second directory listing on tarsus continued showing the recently removed data2 file as part of the directory, although the extended directory listing reported a "Stale NFS file handle" for data2.

This was a typical case of inconsistent caching of information by an NFS client. Solaris NFS clients cache the directory content and attribute information in memory at the time the directory contents are first read from the NFS server. Subsequent client accesses to the directory first validate the cached information, comparing the directory's cached modification time to the modification time reported by the server. A match in modification times indicates that the directory has not been modified since the last time the client read it, therefore it can safely use the cached data. On the other hand, if the modification times are different, the NFS client purges its cache, and issues a new NFS Readdir request to the server to obtain the updated directory contents and attributes. Some non-Unix NFS servers are known for not updating the modification time of directories when files are removed, leading to directory caching problems. We used snoop to capture the NFS packets between our client and server while the problem was being reproduced. The analysis of the snoop output should help us determine if we're running into this caching problem.

To facilitate the discussion, we list the snoop packets preceded by the commands that generated them. This shows the correlation between the NFS traffic and the Unix commands that generate the traffic:

[1] tarsus $ ls -l /net/inchun/export/folder
total 8
-rw-rw-rw-   1 labiaga  staff       2883 Apr 10 20:03 data1
-rw-rw-rw-   1 root     other         12 Apr 10 20:01 data2
 
 7   0.00039    tarsus -> inchun    NFS C GETATTR2 FH=FA14
 8   0.00198    inchun -> tarsus    NFS R GETATTR2 OK
 9   0.00031    tarsus -> inchun    NFS C READDIR2 FH=FA14 Cookie=0
10   0.00220    inchun -> tarsus    NFS R READDIR2 OK 4 entries (No more)
11   0.00033    tarsus -> inchun    NFS C LOOKUP2 FH=FA14 data2
12   0.00000    inchun -> tarsus    NFS R LOOKUP2 OK FH=F8CD
13   0.00000    tarsus -> inchun    NFS C GETATTR2 FH=F8CD
14   0.00000    inchun -> tarsus    NFS R GETATTR2 OK
15   0.00035    tarsus -> inchun    NFS C LOOKUP2 FH=FA14 data1
16   0.00211    inchun -> tarsus    NFS R LOOKUP2 OK FH=F66F
17   0.00032    tarsus -> inchun    NFS C GETATTR2 FH=F66F
18   0.00191    inchun -> tarsus    NFS R GETATTR2 OK

Packets 7 and 8 contain the request and reply for attributes for the /net/inchun/export/folder directory. The attributes can be displayed by using the -v directive:

Excerpt of: snoop -i /tmp/capture -p 7,8 -v
ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 8 arrived at 20:45:9.75
...
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 0 (OK)
NFS:  File type = 2 (Directory)
NFS:  Mode = 040777
NFS:   Type = Directory
NFS:   Setuid = 0, Setgid = 0, Sticky = 0
NFS:   Owner's permissions = rwx
NFS:   Group's permissions = rwx
NFS:   Other's permissions = rwx
NFS:  Link count = 2, UID = 0, GID = -2, Rdev = 0x0
NFS:  File size = 512, Block size = 512, No. of blocks = 1
NFS:  File system id = 7111, File id = 161
NFS:  Access time       = 11-Apr-00 12:50:18.000000 GMT
NFS:  Modification time = 11-Apr-00 12:50:18.000000 GMT
NFS:  Inode change time = 31-Jul-96 09:40:56.000000 GMT

Packet 8 shows the /net/inchun/export/folder directory was last modified on April 11, 2000 at 12:50:18.000000 GMT. tarsus caches this timestamp to later determine when the cached directory contents need to be updated. Packet 9 contains the request made by tarsus for the directory listing from inchun. Packet 10 contains inchun's reply with four entries in the directory. A detailed view of the packets shows the four directory entries: ".", "..", "data1", and "data2". The EOF indicator notifies the client that all existing directory entries have been listed, and there is no need to make another NFS Readdir call:

Excerpt from: snoop -i /tmp/capture -p 9,10 -v
ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 10 arrived at 20:45:9.74
...
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 16 (Read from directory)
NFS:  Status = 0 (OK)
NFS:   File id  Cookie Name
NFS:       137   50171 .
NFS:       95    50496 ..
NFS:       199   51032 data1
NFS:       201   51706 data2
NFS:    4 entries
NFS:  EOF = 1
NFS:  

The directory contents are cached by tarsus, so that they may be reused in a future directory listing. The NFS Lookup and NFS Getattr requests, along with their corresponding replies in packets 11 thru 18 result from the long listing of the directory requested by ls -l. An NFS Lookup obtains the file handle of a directory component. The NFS Getattr requests the file attributes of the file identified by the previously obtained file handle[2].

Next, protium, a different NFS client comes into the picture, and removes one file from the directory previously cached by tarsus:

[1] protium $ rm /net/inchun/export/folder/data2
 
22   0.00000   protium -> inchun    NFS C GETATTR2 FH=FA14
23   0.00000    inchun -> protium   NFS R GETATTR2 OK
24   0.00000   protium -> inchun    NFS C REMOVE2 FH=FA14 data2
25   0.00182    inchun -> protium   NFS R REMOVE2 OK

Packets 22 and 23 update the cached attributes of the /net/inchun/export/folder directory on protium. Packet 24 contains the actual NFS Remove request sent to inchun, which in turn acknowledges the successful removal of the file in packet 25.

tarsus then lists the directory in question, but fails to detect that the contents of the directory have changed:

[2] tarsus $ ls /net/inchun/export/folder
data1   data2
 
39   0.00000    tarsus -> inchun    NFS C GETATTR2 FH=FA14
40   0.00101    inchun -> tarsus    NFS R GETATTR2 OK

This is where the problem begins. Notice that two NFS Getattr network packets are generated as a result of the directory listing but no Readdir request. In this case, the client issues the NFS Getattr operation to request the directory's modification time:

Excerpt of: snoop -i /tmp/capture -p 39,40 -v
ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 40 arrived at 20:45:10.88
...
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 0 (OK)
NFS:  File type = 2 (Directory)
NFS:  Mode = 040777
NFS:   Type = Directory
NFS:   Setuid = 0, Setgid = 0, Sticky = 0
NFS:   Owner's permissions = rwx
NFS:   Group's permissions = rwx
NFS:   Other's permissions = rwx
NFS:  Link count = 2, UID = 0, GID = -2, Rdev = 0x0
NFS:  File size = 512, Block size = 512, No. of blocks = 1
NFS:  File system id = 7111, File id = 161
NFS:  Access time       = 11-Apr-00 12:50:18.000000 GMT
NFS:  Modification time = 11-Apr-00 12:50:18.000000 GMT
NFS:  Inode change time = 31-Jul-96 09:40:56.000000 GMT

The modification time of the directory is the same as the modification time before the removal of the file! tarsus compares the cached modification time of the directory with the modification time just obtained from the server, and determines that the cached directory contents are still valid since the modification times are the same. The directory listing is therefore satisfied from the cache instead of forcing the NFS client to read the updated directory contents from the server. This explains why the removed file continues to show up in the directory listing.

[3] tarsus $ ls -l /net/inchun/export/folder
/net/inchun/export/folder/data2: Stale NFS file handle
total 6
-rw-rw-rw-   1 labiaga  staff       2883 Apr 10 20:03 data1
 
44   0.00000    tarsus -> inchun    NFS C GETATTR2 FH=FA14
45   0.00101    inchun -> tarsus    NFS R GETATTR2 OK
46   0.00032    tarsus -> inchun    NFS C GETATTR2 FH=F66F
47   0.00191    inchun -> tarsus    NFS R GETATTR2 OK
48   0.00032    tarsus -> inchun    NFS C GETATTR2 FH=F8CD
49   0.00214    inchun -> tarsus    NFS R GETATTR2 Stale NFS file handle

The directory attributes reported in packet 45 are the same as those seen in packet 40, therefore tarsus assumes that it can safely use the cached file handles associated with the cached entries of this directory. In packet 46, tarsus requests the attributes of file handle F66F, corresponding to the data1file. The server replies with the attributes in packet 47. tarsus then proceeds to request the attributes of file handle F8CD, which corresponds to the data2 file. The server replies with a "Stale NFS file handle" error because there is no file on the server associated with the given file handle. This problem would never have occurred had the server updated the modification time after removing the file causing tarsus to detect that the directory had been changed.

Directory caching works nicely when the NFS server obeys Unix directory semantics. Many non-Unix NFS servers provide such semantics even if they have to submit themselves to interesting contortions. Having said this, there is nothing in the NFS protocol specification that requires the modification time of a directory to be updated when a file is removed. You may therefore need to disable Solaris NFS directory caching if you're running into problems interacting with non-Unix servers. To permanently disable NFS directory caching, add the following line to /etc/system:

set nfs:nfs_disable_rddir_cache = 0x1

The Solaris kernel reads /etc/system at startup and sets the value of nfs_disable_rddir_cache to 0x1 in the nfs kernel module. The change takes effect only after reboot. Use adb to disable caching during the current session, postponing the need to reboot. You still need to set the tunable in /etc/system to make the change permanent through reboots:

aqua# adb -w -k /dev/ksyms /dev/mem
physmem    3ac8
nfs_disable_rddir_cache/W1
nfs_disable_rddir_cache:    0x0        =    0x1

adb is an interactive assembly level debugger that enables you to consult and modify the kernel's memory contents. The -k directive instructs adb to perform kernel memory mapping accessing the kernel's memory via /dev/mem, and obtaining the kernel's symbol table from /dev/ksyms. The -w directive allows you to modify the kernel memory contents. A word of caution, adb is a power tool that will cause serious data corruption and potential system panics when misused.

Incorrect Mountpoint Permissions

Not all problems involving NFS filesystems originate on the network or other file servers. NFS filesystems closely resemble local filesystems, consequently common local system administration concepts and problem solving techniques apply to NFS mounted filesystems as well. A user reported problems resolving the "current directory" when inside an NFS mounted filesystem. The filesystem was automounted using the following direct map:

Excerpt from /etc/auto_direct:
/packages       -ro           aqua:/export

The user was able to cd into the directory and list the directory contents except for the ".." entry. In addition, he was not able to execute the pwd command when inside the NFS directory either:

$ cd /packages
$ ls -la
./..: Permission denied
total 6
drwxr-xr-x   4 root     sys          512 Oct  1 12:16 ./
drwxr-xr-x   2 root     other        512 Oct  1 12:16 pkg1/
drwxr-xr-x   2 root     other        512 Oct  1 12:16 pkg2/
$ pwd
pwd: cannot determine current directory!

He performed the same procedure as superuser and noticed that it worked correctly:

# cd /packages
# ls -la
total 8
drwxr-xr-x   4 root     sys          512 Oct  1 12:16 .
drwxr-xr-x  38 root     root        1024 Oct  1 12:14 ..
drwxr-xr-x   2 root     other        512 Oct  1 12:16 pkg1
drwxr-xr-x   2 root     other        512 Oct  1 12:16 pkg2
# pwd
/packages
# ls -ld /packages
drwxr-xr-x   4 root     sys          512 Oct  1 12:16 /packages

Note that the directory permission bits for /packages are 0755, giving read and execute permission to everyone, in addition to write permission to root, its owner. Since the filesystem permissions were not the problem, he proceeded to analyze the network traffic, suspecting that the NFS server could be returning the "Permission denied" error. snoop reported two network packets when a regular user executed the pwd command:

  1   0.00000      caramba -> aqua         NFS C GETATTR3 FH=0222
  2   0.00050         aqua -> caramba      NFS R GETATTR3 OK

Packet 1 contains caramba's request for attributes for the current directory having file handle FH=0222. Packet 2 contains the reply from the NFS server aqua:

Excerpt of packet 2:
IP:   Source address = 131.40.52.125, aqua
IP:   Destination address = 131.40.52.223, caramba
IP:   No options
IP:   
 
...
 
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 0 (OK)
NFS:    File type = 2 (Directory)
NFS:    Mode = 0755
NFS:     Setuid = 0, Setgid = 0, Sticky = 0
NFS:     Owner's permissions = rwx
NFS:     Group's permissions = r-x
NFS:     Other's permissions = r-x
NFS:    Link count = 4, User ID = 0, Group ID = 3
NFS:    File size = 512, Used = 1024
NFS:    Special: Major = 0, Minor = 0
NFS:    File system id = 584115552256, File id = 74979
NFS:    Last access time      = 03-Oct-00 00:41:55.160003000 GMT
NFS:    Modification time     = 01-Oct-00 19:16:32.399997000 GMT
NFS:    Attribute change time = 01-Oct-00 19:16:32.399997000 GMT
NFS:  
NFS:  

Along with other file attributes, the NFS portion of the packet contains the mode bits for owner, group and other. These mode bits were the same as those reported by the ls -la command, so the problem was not caused by the NFS server either.

Because this was an automounted filesystem, we suggested rebooting caramba in single-user mode to look at the mountpoint itself, before the automounter had a chance to cover it with an autofs filesystem. At this point, we were able to uncover the source of the problem:

Single-user boot
# ls -ld /packages
drwx------   2 root     staff        512 Oct  1 12:14 /packages

The mountpoint had been created with 0700 permissions, refusing access to anyone but the superuser. The 0755 directory permission bits previously reported in multi-user mode were those of the NFS filesystem mounted on the /packages mountpoint. The NFS filesystem mount was literally covering up the problem.

In Solaris, a lookup of ".." in the root of a filesystem results in a lookup of ".." in the mountpoint sitting under the filesystem. This explains why users other than the superuser were unable to access the ".." directory - they did not have permission to open the directory to read and traverse it. The pwd command failed as well when it tried to open the ".." directory in order to read the contents of the parent directory on its way to the top of the root filesystem. The misconstrued permissions of the mountpoint were the cause of the problem, not the permissions of the NFS filesystem covering the mountpoint. Changing the permissions of the mountpoint to 0755 fixed the problem.

Asynchronous NFS Error Messages

This final section provides an in-depth look at how an NFS client does write-behind, and what happens if one of the write operations fails on the remote server. It is intended as an introduction to the more complex issues of performance analysis and tuning, many of which revolve around similar subtleties in the implementation of NFS.

When an application calls read( ) or write( ) on a local, or UNIX filesystem (UFS) file, the kernel uses inode and indirect block pointers to translate the offset in the file into a physical block number on the disk. A low-level physical I/O operation, such as "write this buffer of 1024 bytes to physical blocks 5678 and 5679" is then passed to the disk device driver. The actual disk operation is scheduled, and when the disk interrupts, the driver interrupt routine notes the completion of the current operation and schedules the next. The block device driver queues the requests for the disk, possibly re-ordering them to minimize disk head movement.

Once the disk device driver has a read or write request, only a media failure causes the operation to return an error status. Any other failures, such as a permission problem, or the filesystem running out of space, are detected by the filesystem management routines before the disk driver gets the request. From the point of view of the read( ) and write( ) system calls, everything from the filesystem write routine down is a black box: the application isn't necessarily concerned with how the data makes it to or from the disk, as long as it does so reliably. The actual write operation occurs asynchronously to the application calling write( ). If a media error occurs -- for example, the disk has a bad sector brewing -- then the media-level error will be reported back to the application during the next write( ) call or during the close( ) of the file containing the bad block. When the driver notices the error returned by the disk controller, it prints a media failure message on the console.

A similar mechanism is used by NFS to report errors on the "virtual media" of the remote file server. When write( ) is called on an NFS-mounted file, the data buffer and offset into the file are handed to the NFS write routine, just as a UFS write calls the lower-level disk driver write routine. Like the disk device driver, NFS has a driver routine for scheduling write requests: each new request is put into the page cache. When a full page has been written, it is handed to an NFS async thread that performs the RPC call to the remote server and returns a result code. Once the request has been written into the local page cache, the write( ) system call returns to the application -- just as if the application was writing to a local disk. The actual NFS write is synchronous to the NFS async thread, allowing these threads to perform write-behind. A similar process occurs for reads, where the NFS async thread performs some read-ahead by fetching NFS buffers in anticipation of future read( ) system calls. See , for details on the operation of the NFS async threads.

Occasionally, an NFS async thread detects an error when attempting to write to a remote server, and the error is printed (by the NFS async thread) on the client's console. The scenario is identical to that of a failing disk: the write( ) system call has already returned, so the error must be reported on the console in the next similar system call.

The format of these error messages is:

NFS write error on host mahimahi: No space left on device.
(file handle: 800006 2 a0000 3ef 12e09b14 a0000 2 4beac395)

The number of potential failures when writing to an NFS-mounted disk exceed the few media-related errors that would cause a UFS write to fail. Table 15-1 gives some examples

Table 15-1: NFS-related Errors

Error

Typical Cause

Permission Denied

Superuser cannot write to remote filesystem.

No space left on device

Remote disk is full.

Stale File Handle

File or directory has been removed on the server without the client's knowledge.

Both the "Permission Denied" and the "No space left on device" errors would have been detected on a local filesystem, but the NFS client has no way to determine if a write operation will succeed at some future time (when the NFS async thread eventually sends it to the server). For example, if a client writes out 1-Kbyte buffers, then its NFS async threads write out 8-Kbyte buffers to the server on every 8th call to write( ). Several seconds may go by between the time the first write( ) system call returns to the application and the time that the 8th call forces the NFS async thread to perform an RPC to the NFS server. In this interval, another process may have filled up the server's disk with some huge write requests, so the NFS async thread's attempt to write its 8-Kbyte buffer will fail.

If you are consistently seeing NFS writes fail due to full filesystems or permission problems, you can usually chase down the user or process that is performing the writes by identifying the file being written. Unfortunately, Solaris does not provide any utility to correlate the file handles printed in the error messages with the pathname of the file on the remote server. File handles are generated by the NFS server and handed opaquely to the NFS client. The NFS client can not make any assumptions as to the structure or contents of the file handle, enabling servers to change the way they generate the file handle at any time. In practice, the structure of a Solaris NFS file handle has changed little over time. The following script takes as input the file handle printed by the NFS client and generates the corresponding server file name[3]:

  1. #!/bin/sh
  2.  
  3. if [ $# -ne 8 ]; then
  4. echo "Usage: fhfind <filehandle> e.g."
  5. echo
  6. echo "fhfind 1540002 2 a0000 4d 48df4455 a0000 2 25d1121d"
  7. exit 1
  8. fi
  9.  
  10. FSID=$1
  11. INUMHEX='echo $4 | tr [a-z] [A-Z]'
  12.  
  13. ENTRY='grep ${FSID} /etc/mnttab | grep -v lofs'
  14. if [ "${ENTRY}" = "" ] ; then
  15. echo "Cannot find filesystem for devid ${FSID}"
  16. exit 1
  17. fi
  18. set - ${ENTRY}
  19. MNTPNT=$2
  20.  
  21. INUM='echo "ibase=16;${INUMHEX}" | bc'
  22.  
  23. echo "Searching ${MNTPNT} for inode number ${INUM} ..."
  24. echo
  25.  
  26. find ${MNTPNT} -mount -inum ${INUM} -print 2>/dev/null

The script takes the expanded file handle string from the NFS write error and maps it to the full pathname of the file on the server. The script is to be executed on the NFS server:

mahimahi# fhfind 800006 2 a0000 3ef 12e09b14 a0000 2 4beac395
Searching /spare for inode number 1007 ...
 
/spare/test/info/data

The eight values on the command line are the eight hex digits in the file handle reported in the NFS error message. The script makes strict assumptions about the contents of the Solaris server file handle. As mentioned before, the OS vendor is free to change the structure of the file handle at any time, so there's no guarantee this script will work on your particular configuration. The script takes advantage of the fact that the file handle contains the inode number of the file in question, as well as the device id of the filesystem in which the file resides. The script uses the device id in the file handle (FSID in line 10) to obtain the filesystem entry from /etc/mnttab (line 13). In line 11, the script obtains the inode number of the file (in hex) from the file handle, and applies the tr utility to convert all lower case characters into upper case characters for use with the bc calculator. Line 18 and 19 extract the mountpoint from the filesystem entry, to later use it as the starting point of the search. Line 21 takes the hexadecimal inode number obtained from the file handle, and converts it to its decimal equivalent for use by find. In line 26, we finally commence the search for the file matching the inode number. Although find uses the mountpoint as the starting point of the search, a scan of a large filesystem may take a long time. Since there's no way to terminate the find upon finding the file, you may want to kill the process after it prints the path.

Throughout this chapter we used tools presented in previous chapters to debug network and local problems. Once you determine the source of the problem, you should be able to take steps to correct and avoid it. For example, you can avoid delayed client write problems by having a good idea of what your clients are doing and how heavily loaded your NFS servers are. Determining your NFS workload and optimizing your clients and servers to make the best use of available resources requires tuning the network, the clients and the servers. The next few chapters present NFS tuning and benchmarking techniques.


1. Solaris uses the Name Switch to specify the name service used by rpc.bootparamd. Remove NIS from the bootparamsentry in /etc/nsswitch.confand remove the "+"entry from /etc/bootparamsto avoid using NIS. Once bootparamdis restarted, it will no longer use the bootparamsNIS map.

2. NFS v2 file handles are 32 bytes long. Instead of displaying a long and cryptic 32 byte number, snoop generates a short-hand version of the file handle and displays it when invoked in summary mode. This helps you associate file handles with file objects more easily. You can obtain the exact file handle by displaying the network packet in verbose mode by using the -v option. The packet 7 file handle FH=FA14 is really:

Excerpt of: snoop -i /tmp/capture -p 7 -v
NFS: ----- Sun NFS -----
NFS:
NFS:  Proc = 1 (Get file attributes)
NFS:  File handle = [FA14]
NFS:   0204564F4C32000000000000000000000000A10000001C4DFF20A00000000000

3. Thanks to Brent Callaghan for providing the basis for this script.

Back to: Managing NFS and NIS, 2nd Edition


O'Reilly Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies

© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com