Debugging System Faults

Even if you’ve used all the monitoring and debugging techniques, sometimes bugs remain in the driver, and the system faults when the driver is executed. When this happens it’s important to be able to collect as much information as possible to solve the problem.

Note that “fault” doesn’t mean “panic.” The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process while the system goes on working. The system can panic, and it may if a fault happens outside of a process’s context, or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden death of the process unlucky enough to be using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated to the process’s context is lost; for instance, dynamic lists allocated by the driver through kmalloc might be lost. However, since the kernel calls the close operation for any open device when a process dies, your driver can release what was allocated by the open method.
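
As a reminder of how that open/release pairing works, here is a minimal sketch in the style of this book's examples; the sample_dev structure and the method names are hypothetical:

#include <linux/fs.h>
#include <linux/slab.h>

struct sample_dev {
    char *data;                  /* whatever the driver needs per open file */
};

int sample_open(struct inode *inode, struct file *filp)
{
    struct sample_dev *dev = kmalloc(sizeof(*dev), GFP_KERNEL);
    if (!dev)
        return -ENOMEM;
    filp->private_data = dev;    /* remembered for release */
    return 0;
}

int sample_release(struct inode *inode, struct file *filp)
{
    /* invoked by the kernel even when the process dies unexpectedly,
       so memory allocated at open time is not lost */
    kfree(filp->private_data);
    return 0;
}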

We’ve already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.

Oops Messages

Most bugs show themselves in NULL pointer dereferences or by the use of other incorrect pointer values. The usual outcome of such bugs is an oops message.

Any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of so-called page tables (see Section 13.1.4 in Chapter 13). When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to “page in” the missing address; it generates an oops if this happens while the processor is in supervisor mode.

It’s worth noting that the first enhancement introduced after version 2.0 was automatic handling of invalid address faults when moving data to and from user space. Linus chose to let the hardware catch erroneous memory references, so that the normal case (where the addresses are correct) is handled more efficiently.

An oops displays the processor status at the time of the fault, including the contents of the CPU registers, the location of page descriptor tables, and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c) and is dispatched as described earlier, in Section 4.1.1.

Let’s look at one such message. Here’s what results from dereferencing a NULL pointer on a PC running version 2.4 of the kernel. The most relevant information here is the instruction pointer (EIP), the address of the faulty instruction.

Unable to handle kernel NULL pointer dereference at virtual address \
     00000000
 printing eip:
c48370c3
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c48370c3>]
EFLAGS: 00010286
eax: ffffffea   ebx: c2281a20   ecx: c48370c0   edx: c2281a40
esi: 4000c000   edi: 4000c000   ebp: c38adf8c   esp: c38adf8c
ds: 0018   es: 0018   ss: 0018
Process ls (pid: 23171, stackpage=c38ad000)
Stack: 0000010e c01356e6 c2281a20 4000c000 0000010e c2281a40 c38ac000 \
            0000010e 
       4000c000 bffffc1c 00000000 00000000 c38adfc4 c010b860 00000001 \
            4000c000 
       0000010e 0000010e 4000c000 bffffc1c 00000004 0000002b 0000002b \
            00000004 
Call Trace: [<c01356e6>] [<c010b860>] 
Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00

This message was generated by writing to a device owned by the faulty module, a module built deliberately to demonstrate failures. The implementation of the write method of faulty.c is trivial:

ssize_t faulty_write (struct file *filp, const char *buf, size_t count,
		loff_t *pos)
{
    /* make a simple fault by dereferencing a NULL pointer */
    *(int *)0 = 0;
    return 0;
}

As you can see, what we do here is dereference a NULL pointer. Since 0 is never a valid pointer value, a fault occurs, which the kernel turns into the oops message shown earlier. The calling process is then killed.

The faulty module has more interesting fault conditions in its read implementation:

char faulty_buf[1024];

ssize_t faulty_read (struct file *filp, char *buf, size_t count, 
                     loff_t *pos)
{
    int ret, ret2;
    char stack_buf[4];

    printk(KERN_DEBUG "read: buf %p, count %li\n", buf, (long)count);
    /* the next line oopses with 2.0, but not with 2.2 and later */
    ret = copy_to_user(buf, faulty_buf, count);
    if (!ret) return count; /* we survived */

    printk(KERN_DEBUG "didn't fail: retry\n");
    /* For 2.2 and 2.4, let's try a buffer overflow  */
    sprintf(stack_buf, "1234567\n");
    if (count > 8) count = 8; /* copy 8 bytes to the user */
    ret2 = copy_to_user(buf, stack_buf, count);
    if (!ret2) return count;
    return ret2;
}

It first reads from a global buffer without checking the size of the data, and then performs a buffer overrun by writing to a local buffer. The first situation results in an oops only in version 2.0 of the kernel, because later kernels handle invalid-address faults within the user-copy functions automatically, making copy_to_user fail gracefully instead of oopsing. The buffer overflow results in an oops with all kernel versions; however, since the return instruction brings the instruction pointer to nowhere land, this kind of fault is much harder to trace, and you can get something like the following:

EIP:    0010:[<00000000>]
[...]
Call Trace: [<c010b860>] 
Code:  Bad EIP value.
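
For contrast, here is a hedged sketch (not part of faulty.c) of how a well-behaved read method would perform the first copy: clamp the count to the size of the source buffer and report a failed copy with -EFAULT:

ssize_t careful_read(struct file *filp, char *buf, size_t count,
                     loff_t *pos)
{
    if (count > sizeof(faulty_buf))      /* never read past the buffer */
        count = sizeof(faulty_buf);
    if (copy_to_user(buf, faulty_buf, count))
        return -EFAULT;                  /* copy failed; tell the caller */
    return count;
}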

The main problem users have in dealing with oops messages is that the hexadecimal values carry little intrinsic meaning; to be meaningful to the programmer they need to be resolved to symbols. A couple of utilities are available to perform this resolution for developers: klogd and ksymoops. The former performs symbol decoding by itself whenever it is running; the latter must be deliberately invoked by the user. In the following discussion we use the data generated in our first oops example by dereferencing a NULL pointer.

Using klogd

The klogd daemon can decode oops messages before they reach the log files. In many situations, klogd can provide all the information a developer needs to track down a problem, though sometimes the developer must give it a little help.

A dump of the oops for faulty, as it reaches the system log, looks like this (note the decoded symbols on the EIP line and in the stack trace):

Unable to handle kernel NULL pointer dereference at virtual address \
     00000000 
 printing eip: 
c48370c3 
*pde = 00000000 
Oops: 0002 
CPU:    0 
EIP:    0010:[faulty:faulty_write+3/576] 
EFLAGS: 00010286 
eax: ffffffea   ebx: c2c55ae0   ecx: c48370c0   edx: c2c55b00 
esi: 0804d038   edi: 0804d038   ebp: c2337f8c   esp: c2337f8c 
ds: 0018   es: 0018   ss: 0018 
Process cat (pid: 23413, stackpage=c2337000) 
Stack: 00000001 c01356e6 c2c55ae0 0804d038 00000001 c2c55b00 c2336000 \
            00000001
       0804d038 bffffbd4 00000000 00000000 bffffbd4 c010b860 00000001 \
            0804d038
       00000001 00000001 0804d038 bffffbd4 00000004 0000002b 0000002b \
            00000004
Call Trace: [sys_write+214/256] [system_call+52/56]  
Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00

klogd provides most of the necessary information to track down the problem. In this case we see that the instruction pointer (EIP) was executing in the function faulty_write, so we know where to start looking. The 3/576 string tells us that the processor was at byte 3 of a function that appears to be 576 bytes long. Note that the values are decimal, not hex.

The developer must exercise some care, however, to get useful information for errors that occur within loadable modules. klogd loads all of the available symbol information when it starts, and uses those symbols thereafter. If you load a module after klogd has initialized itself (usually at system boot), klogd will not have your module’s symbol information. To force klogd to go out and get that information, send the klogd process a SIGUSR1 signal after your module has been loaded (or reloaded), and before you do anything that could cause it to oops.
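
If your system provides the pidof utility, the signal can be delivered with a one-liner such as:

kill -USR1 `pidof klogd`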

It is also possible to run klogd with the -p (“paranoid”) option, which will cause it to reread symbol information anytime it sees an oops message. The klogd manpage recommends against this mode of operation, however, since it makes klogd query the kernel for information after the problem has occurred. Information obtained after an error could be plain wrong.

For klogd to work properly, it must have a current copy of the System.map symbol table file. Normally this file is found in /boot; if you have built and installed a kernel from a nonstandard location you may have to copy System.map into /boot, or tell klogd to look elsewhere. klogd refuses to decode symbols if the symbol table doesn’t match the current kernel. If a symbol is decoded on the system log, you can be reasonably sure it is decoded correctly.
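
If the map lives somewhere else, the -k option tells klogd which file to use as the source of kernel symbol information; the path below is just an example:

klogd -k /usr/src/linux/System.map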

Using ksymoops

At times klogd may not be enough for your tracing purposes. Usually, you need both the hexadecimal address and the associated symbol, and you often need offsets printed as hex numbers. You may also need more information than address decoding. Moreover, it is common for klogd to get killed during the fault. In such situations, a stronger oops analyzer may be called for; ksymoops is such a tool.

Prior to the 2.3 development series, ksymoops was distributed with the kernel source, in the scripts directory. It now lives on its own FTP site and is maintained independently of the kernel. Even if you are working with an older kernel, you probably should go to ftp://ftp.ocs.com.au/pub/ksymoops and get an updated version of the tool.

To operate at its best, ksymoops needs a lot of information in addition to the error message; you can use command-line options to tell it where to find the various items. The program needs the following items:

A System.map file

This map must correspond to the kernel that was running at the time the oops occurred. The default is /usr/src/linux/System.map.

A list of modules

ksymoops needs to know what modules were loaded when the oops occurred, in order to extract symbolic information from them. If you do not supply this list, ksymoops will look at /proc/modules.

A list of kernel symbols defined when the oops occurred

The default is to get this list from /proc/ksyms.

A copy of the kernel image that was running

Note that ksymoops needs a straight kernel image, not the compressed version (vmlinuz, zImage, or bzImage) that most systems boot. The default is to use no kernel image because most people don’t keep it. If you have the exact image handy, you should tell the program where it is by using the -v option.

The locations of the object files for any kernel modules that were loaded

ksymoops will look in the standard directories for modules, but during development you will almost certainly have to tell it where your module lives using the -o option.

Although ksymoops will go to files in /proc for some of its needed information, the results can be unreliable. The system, of course, will almost certainly have been rebooted between the time the oops occurs and when ksymoops is run, and the information from /proc may not match the state of affairs when the failure occurred. When possible, it is better to save copies of /proc/modules and /proc/ksyms prior to causing the oops to happen.

We urge driver developers to read the manual page for ksymoops because it is a very informative document.

The last argument on the tool’s command line is the location of the oops message; if it is missing, the tool will read stdin in the best Unix tradition. The message can be recovered from the system logs with luck; in the case of a very bad crash you may end up writing it down off the screen and typing it back in (unless you were using a serial console, a nice tool for kernel developers).
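
Putting the pieces together, a session might look like the following sketch; the paths are examples, and you should check your version's manpage for the exact option spellings:

cat /proc/ksyms > ksyms.snapshot
cat /proc/modules > modules.snapshot
# ... load the module, trigger the oops, save the message in oops.txt ...
ksymoops -m /boot/System.map -k ksyms.snapshot -l modules.snapshot \
         -o /home/user/drivers oops.txt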

Note that ksymoops will be confused by an oops message that has already been processed by klogd. If you are running klogd, and your system is still running after an oops occurs, a clean oops message can often be obtained by invoking the dmesg command.

If you do not provide all of the listed information explicitly, ksymoops will issue warnings. It will also issue warnings about things like loaded modules that define no symbols. A warning-free run of ksymoops is rare.

Output from ksymoops tends to look like the following:

>>EIP; c48370c3 <[faulty]faulty_write+3/20>   <=====
Trace; c01356e6 <sys_write+d6/100>
Trace; c010b860 <system_call+34/38>
Code;  c48370c3 <[faulty]faulty_write+3/20>
00000000 <_EIP>:
Code;  c48370c3 <[faulty]faulty_write+3/20>   <=====
   0:   c7 05 00 00 00    movl   $0x0,0x0   <=====
Code;  c48370c8 <[faulty]faulty_write+8/20>
   5:   00 00 00 00 00 
Code;  c48370cd <[faulty]faulty_write+d/20>
   a:   31 c0             xorl   %eax,%eax
Code;  c48370cf <[faulty]faulty_write+f/20>
   c:   89 ec             movl   %ebp,%esp
Code;  c48370d1 <[faulty]faulty_write+11/20>
   e:   5d                popl   %ebp
Code;  c48370d2 <[faulty]faulty_write+12/20>
   f:   c3                ret    
Code;  c48370d3 <[faulty]faulty_write+13/20>
  10:   8d b6 00 00 00    leal   0x0(%esi),%esi
Code;  c48370d8 <[faulty]faulty_write+18/20>
  15:   00

As you can see, ksymoops provides EIP and kernel stack information much like klogd does, but more precisely and in hexadecimal. You’ll note that the faulty_write function is correctly reported to be 0x20 bytes long. This is because ksymoops reads the object file of your module and extracts all available information.

In this case, moreover, you also get an assembly language dump of the code where the fault occurred. This information can often be used to figure out exactly what was happening; here it’s clearly an instruction that writes a 0 to address 0.

One interesting feature of ksymoops is that it is ported to nearly all the platforms where Linux runs and exploits the bfd (Binary File Descriptor) library in order to support several computer architectures at the same time. To step outside of the PC world, let’s see how the same oops message appears on the SPARC64 platform (several lines have been broken for typographical needs):

Unable to handle kernel NULL pointer dereference
tsk->mm->context = 0000000000000734
tsk->mm->pgd = fffff80003499000
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
ls(16740): Oops
TSTATE: 0000004400009601 TPC: 0000000001000128 TNPC: 0000000000457fbc \
Y: 00800000
g0: 000000007002ea88 g1: 0000000000000004 g2: 0000000070029fb0 \
g3: 0000000000000018
g4: fffff80000000000 g5: 0000000000000001 g6: fffff8000119c000 \
g7: 0000000000000001
o0: 0000000000000000 o1: 000000007001a000 o2: 0000000000000178 \
o3: fffff8001224f168
o4: 0000000001000120 o5: 0000000000000000 sp: fffff8000119f621 \
ret_pc: 0000000000457fb4
l0: fffff800122376c0 l1: ffffffffffffffea l2: 000000000002c400 \
l3: 000000000002c400
l4: 0000000000000000 l5: 0000000000000000 l6: 0000000000019c00 \
l7: 0000000070028cbc
i0: fffff8001224f140 i1: 000000007001a000 i2: 0000000000000178 \
i3: 000000000002c400
i4: 000000000002c400 i5: 000000000002c000 i6: fffff8000119f6e1 \
i7: 0000000000410114
Caller[0000000000410114]
Caller[000000007007cba4]
Instruction DUMP: 01000000 90102000 81c3e008 <c0202000> \
30680005 01000000 01000000 01000000 01000000

Note how the instruction dump doesn’t start from the instruction that caused the fault but three instructions earlier: that’s because the RISC platforms execute several instructions in parallel and may generate deferred exceptions, so one must be able to look back at the last few instructions.

This is what ksymoops prints when fed with input data starting at the TSTATE line:

>>TPC; 0000000001000128 <[faulty].text.start+88/a0>   <=====
>>O7;  0000000000457fb4 <sys_write+114/160>
>>I7;  0000000000410114 <linux_sparc_syscall+34/40>
Trace; 0000000000410114 <linux_sparc_syscall+34/40>
Trace; 000000007007cba4 <END_OF_CODE+6f07c40d/????>
Code;  000000000100011c <[faulty].text.start+7c/a0>
0000000000000000 <_TPC>:
Code;  000000000100011c <[faulty].text.start+7c/a0>
   0:   01 00 00 00       nop 
Code;  0000000001000120 <[faulty].text.start+80/a0>
   4:   90 10 20 00       clr  %o0     ! 0 <_TPC>
Code;  0000000001000124 <[faulty].text.start+84/a0>
   8:   81 c3 e0 08       retl 
Code;  0000000001000128 <[faulty].text.start+88/a0>   <=====
   c:   c0 20 20 00       clr  [ %g0 ]   <=====
Code;  000000000100012c <[faulty].text.start+8c/a0>
  10:   30 68 00 05       b,a   %xcc, 24 <_TPC+0x24> \
                        0000000001000140 <[faulty]faulty_write+0/20>
Code;  0000000001000130 <[faulty].text.start+90/a0>
  14:   01 00 00 00       nop 
Code;  0000000001000134 <[faulty].text.start+94/a0>
  18:   01 00 00 00       nop 
Code;  0000000001000138 <[faulty].text.start+98/a0>
  1c:   01 00 00 00       nop 
Code;  000000000100013c <[faulty].text.start+9c/a0>
  20:   01 00 00 00       nop

To print the disassembled code shown we had to tell ksymoops the target file format and architecture (this is needed because the native architecture for SPARC64 user space is 32 bit). In this case, the options -t elf64-sparc -a sparc:v9 did the job.

You may complain that this call trace doesn’t carry any interesting information; however, the SPARC processors don’t save all the call trace on the stack: the O7 and I7 registers hold the instruction pointers of the last two calling functions, which is why they are shown near the call trace. In this case, the faulty instruction was in a function invoked by sys_write.

Note that, whatever the platform/architecture pair, the format used to show disassembled code is the same as that used by the objdump program. objdump is a powerful utility; if you want to look at the whole function that failed, you can invoke the command objdump -d faulty.o (once again, on SPARC64, you need special options: --target elf64-sparc --architecture sparc:v9). For more information on objdump and its command-line options, see the manpage for the command.

Learning to decode an oops message requires some practice and an understanding of the target processor you are using, as well as of the conventions used to represent assembly language, but it’s worth doing. The time spent learning will be quickly repaid. Even if you have previous expertise with the PC assembly language under non-Unix operating systems, you may need to devote some time to learning, because the Unix syntax is different from Intel syntax. (A good description of the differences is in the Info documentation file for as, in the chapter called “i386-specific.”)

System Hangs

Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling, and the system doesn’t respond to any action, including the magic CTRL-ALT-DEL combination. You have two choices for dealing with system hangs—either prevent them beforehand or be able to debug them after the fact.

You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and thus allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule calls enable you to kill the process, after tracing what is happening.

You should be aware, of course, that any call to schedule may create an additional source of reentrant calls to your driver, since it allows other processes to run. This reentrancy should not normally be a problem, assuming that you have used suitable locking in your driver. Be sure, however, not to call schedule any time that your driver is holding a spinlock.
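
The idea can be sketched in a few lines; the device structure and the status check here are hypothetical:

/* a potentially endless polling loop, kept survivable by yielding
   the processor on every pass; never call schedule while holding
   a spinlock */
static void sample_wait_for_ready(struct sample_dev *dev)
{
    while (!sample_ready(dev))    /* hypothetical status check */
        schedule();               /* other processes can run; the
                                     looping process stays killable */
}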

If your driver really hangs the system, and you don’t know where to insert schedule calls, the best way to go is to add some print messages and write them to the console (by changing the console_loglevel value).

Sometimes the system may appear to be hung, but it isn’t. This can happen, for example, if the keyboard remains locked in some strange way. These false hangs can be detected by looking at the output of a program you keep running for just this purpose. A clock or system load meter on your display is a good status monitor; as long as it continues to update, the scheduler is working. If you are not using a graphic display, you can check the scheduler by running a program that flashes the keyboard LEDs, turns on the floppy motor every now and then, or ticks the speaker—conventional beeps are quite annoying and should be avoided; look for the KDMKTONE ioctl command instead. A sample program (misc-progs/heartbeat.c) that flashes a keyboard LED in a heartbeat fashion is available in the sources on the O’Reilly FTP site.
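
A speaker tick along those lines can be sketched in a few lines of user-space code. KDMKTONE takes the duration in milliseconds in the high 16 bits of its argument and the timer divisor (1193180 divided by the frequency) in the low 16 bits; treat this as a sketch, not a polished monitor:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kd.h>

int main(void)
{
    int fd = open("/dev/console", O_WRONLY);   /* needs console access */
    if (fd < 0)
        return 1;
    for (;;) {
        /* one short, quiet click per second: high word is duration
           in ms, low word is 1193180/frequency */
        ioctl(fd, KDMKTONE, (30 << 16) | (1193180 / 440));
        sleep(1);
    }
}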

If the keyboard isn’t accepting input, the best thing to do is log into the system through your network and kill any offending processes, or reset the keyboard (with kbd_mode -a). However, discovering that the hang is only a keyboard lockup is of little use if you don’t have a network available to help you recover. If this is the case, you could set up alternative input devices to be able at least to reboot the system cleanly. A shutdown and reboot cycle is easier on your computer than hitting the so-called big red button, and it saves you from the lengthy fsck scanning of your disks.

Such an alternative input device can be, for example, the mouse. Version 1.10 or newer of the gpm mouse server features a command-line option to enable a similar capability, but it works only in text mode. If you don’t have a network connection and run in graphics mode, we suggest running some custom solution, like a switch connected to the DCD pin of the serial line and a script that polls for status change.
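
Polling the DCD pin is a matter of the standard modem-control ioctl; the following user-space sketch watches the carrier-detect bit once per second and merely prints on a change, where a real rescue script would initiate a clean shutdown instead:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main(void)
{
    int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);
    int status, old = -1;

    if (fd < 0)
        return 1;
    for (;;) {
        if (ioctl(fd, TIOCMGET, &status) < 0)
            return 1;
        status &= TIOCM_CAR;          /* keep the carrier-detect bit */
        if (old >= 0 && status != old)
            printf("DCD changed\n");  /* a real script would act here */
        old = status;
        sleep(1);
    }
}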

An indispensable tool for these situations is the “magic SysRq key,” which is available on more architectures in 2.2 and later kernels. Magic SysRq is invoked with the combination of the ALT and SysRq keys on the PC keyboard, or with the ALT and Stop keys on SPARC keyboards. A third key, pressed along with these two, performs one of a number of useful actions, as follows:

r

Turns off keyboard raw mode in situations where you cannot run kbd_mode.

k

Invokes the “secure attention” (SAK) function. SAK will kill all processes running on the current console, leaving you with a clean terminal.

s

Performs an emergency synchronization of all disks.

u

Attempts to remount all disks in a read-only mode. This operation, usually invoked immediately after s, can save a lot of filesystem checking time in cases where the system is in serious trouble.

b

Immediately reboots the system. Be sure to synchronize and remount the disks first.

p

Prints the current register information.

t

Prints the current task list.

m

Prints memory information.

Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list. Note that magic SysRq must be explicitly enabled in the kernel configuration, and that most distributions do not enable it, for obvious security reasons. For a system used to develop drivers, however, enabling magic SysRq is by itself worth the trouble of building a new kernel. Magic SysRq must also be enabled at runtime with a command like the following:

echo 1 > /proc/sys/kernel/sysrq

Another precaution to use when reproducing system hangs is to mount all your disks as read-only (or unmount them). If the disks are read-only or unmounted, there’s no risk of damaging the filesystem or leaving it in an inconsistent state. Another possibility is using a computer that mounts all of its filesystems via NFS, the network file system. The “NFS-Root” capability must be enabled in the kernel, and special parameters must be passed at boot time. In this case you’ll avoid any filesystem corruption without even resorting to SysRq, because filesystem coherence is managed by the NFS server, which is not brought down by your device driver.
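
For reference, the boot parameters for an NFS-root setup look something like the following line; the server address and export path are examples, ip=dhcp may be bootp or rarp on older kernels, and the exact syntax for your kernel version is described in Documentation/nfsroot.txt:

root=/dev/nfs nfsroot=192.168.0.1:/exports/rootfs ip=dhcp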
