Debugging System Faults

Even if you’ve used all the monitoring and debugging techniques, sometimes bugs remain in the driver, and the system faults when the driver is executed. When this happens, it’s important to be able to collect as much information as possible to solve the problem.

Note that ``fault'' doesn’t mean ``panic.'' The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process, but the system goes on working. The system can panic, and it may if a fault happens outside of a process’s context, or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden destruction of the faulty process--the one using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated to the process’s context can be lost; for instance, dynamic lists allocated by the driver through kmalloc might be lost. However, since the kernel calls the close operation for any open device anyway, your driver can release what was allocated by the open method.

We’ve already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.

Oops Messages

Most bugs show themselves in NULL pointer dereferences or by the use of other incorrect pointer values. The usual outcome of such bugs is an oops message.

Any address used by the processor is a ``virtual'' address and is mapped to physical addresses through a complex structure of so-called page tables (see Section 13.1.1 in Chapter 13). When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address and the processor signals a ``page fault'' to the operating system. If the address is not valid, the kernel is not able to ``page-in'' the missing address; it generates an ``oops'' if this happens while the processor is in supervisor mode. It’s interesting to note that the way the kernel deals with faults changed in version 2.1, so that it can handle references to invalid addresses while in supervisor mode. The new implementation is described in Section 17.10 in Chapter 17.

An oops displays the processor status at the time of the fault, including the contents of the CPU registers, the location of page descriptor tables, and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c) and is dispatched as described earlier in Section 4.1.1.

Let’s look at one such message. Here’s how an oops appears on a conventional personal computer (an x86 platform), running Linux 2.0 or newer--version 1.2 has a slightly different layout.

Unable to handle kernel paging request at virtual address c202e000
current->tss.cr3 = 012c0000, %cr3 = 012c0000
*pde = 00001067
*pte = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<0202d079>]
EFLAGS: 00010216
eax: 00000041   ebx: 00001000   ecx: 0000004b   edx: 0156b018
esi: 0202e000   edi: 0800aed4   ebp: 01106f90   esp: 01106f6c
ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
Process cat (pid: 597, process nr: 31, stackpage=01106000)
Stack: 02020018 0202d004 00f992e8 01257440 0800a000 00001000 01257440
       00001000 00f992e8 00120e7a 00f992e8 01257440 0800a000 00001000
       0800a000 00001000 bffffbd8 0010a602 00000003 0800a000 00001000
Call Trace: [<02020018>] [<0202d004>] [<00120e7a>] [<0010a602>]
Code: f3 a5 83 e3 03 89 d9 f3 a4 07 8b 45 14 8d 65 f4 5b 5e 5f 89

The message above was generated by running cat on a faulty module, built deliberately to demonstrate the error. faulty.c includes the following code:

char faulty_buf[1024];

read_write_t faulty_read (struct inode *inode, struct file *filp,
                          char *buf, count_t count)
{
  printk(KERN_DEBUG "read: inode %p, file %p, buf %p, count %li\n",
         inode, filp, buf, (long)count);
  memcpy_tofs(buf, faulty_buf, count);  /* copies count bytes from
                                           faulty_buf, even when count
                                           exceeds the 1KB buffer */
  return count;
}

Since read copies data to user space from its small buffer (faulty_buf), we can expect reading the file in small pieces to work. Reading more than one kilobyte at a time, on the other hand, copies past the end of the buffer and might cross a page boundary; the read then fails if it touches an unmapped page. Indeed, the oops shown earlier happened during a read call asking for 4 kilobytes, as shown by the line appearing before the oops message in /var/log/messages (the default file where syslogd stores kernel messages):

read: inode 00f992e8, file 01257440, buf 0800a000, count 4096

The same cat command doesn’t generate an oops on an Alpha because reading 4 kilobytes from faulty_buf doesn’t cross the page boundary (pages are 8KB on the Alpha and the buffer sits near the beginning of the page). If reading faulty doesn’t generate an oops message on your system, try using wc (word count) instead, or specify an explicit block size to dd.
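If you want to force a large transfer explicitly, a dd invocation along these lines does the trick (the device node name /dev/faulty is an assumption; use whatever name you created for the module on your system):

dd if=/dev/faulty of=/dev/null bs=4k count=1

A single 4KB read starts the copy inside faulty_buf and runs off the end of it, which should reproduce the fault on an x86 system like the one shown above; clamping count to sizeof(faulty_buf) in faulty_read would, of course, make the module safe.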

Using ksymoops

The main problem with oops messages is that the hex values are not meaningful for the programmer; they need to be resolved to symbols.

The kernel sources help the developer by including the ksymoops utility--but note that the program was missing from version 1.2 of the sources. The tool resolves numeric addresses found in the oops message to kernel symbols, but only for oops messages generated by PCs. Each architecture has its own message format, because the available information is processor-dependent.

ksymoops gets the oops message on standard input and the name of the kernel symbol table on the command line. The symbol table is usually /usr/src/linux/System.map. The program prints the call trace and program code in a more readable format than the original oops message.
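For example, if the oops text has been cut from the system log and saved to a file (oops.txt is just an illustrative name), the invocation might look like this:

ksymoops /usr/src/linux/System.map < oops.txt

The following snapshot was produced by feeding ksymoops the oops message shown in the previous section: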

Trace: 2020018
Trace: 202d004
Trace: 120e7a <sys_read+8a/b0>
Trace: 10a602 <system_call+52/80>

Code: repz movsl %ds:(%esi),%es:(%edi)
Code: andl   $0x3,%ebx
Code: movl   %ebx,%ecx
Code: repz movsb %ds:(%esi),%es:(%edi)
Code: popl   %es
Code: movl   0x14(%ebp),%eax
Code: leal   0xfffffff4(%ebp),%esp
Code: popl   %ebx
Code: popl   %esi
Code: popl   %edi
Code: movl   %eax,(%eax)

The code disassembly produced by ksymoops shows the instruction that failed and the following ones. It’s apparent here--for those who know a little assembler--that the repz movsl instruction (REPeat till cx is Zero, MOVe a String of Longs) hit an unmapped page with the source index (esi, shown as 0x202e000). The command ksyms -m, used to retrieve module information, shows that the module is mapped to a single page at 0x0202dxxx, thus confirming that esi is out of range.

The decoded call trace still includes two numeric addresses, because the memory area occupied by the faulty module isn’t described in the system map. The values can be supplemented manually, either by inspecting the output of the ksyms command or by grepping for the module name in /proc/ksyms.
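For instance, either of the following commands (grepping for faulty, the module at fault in this example) should show the addresses you need:

ksyms -m | grep -i faulty
grep faulty /proc/ksyms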

For this particular fault, however, the two addresses don’t correspond to code addresses. If you look in arch/i386/kernel/traps.c, you’ll find that the call trace is extracted from the whole stack dump by using some heuristics to look at the memory layout and distinguish between data values (local variables and function arguments) and return addresses. Only addresses that refer to kernel code and the ones that might refer to modules are shown in the call trace. Since module pages contain both code and data, extraneous stack frames can slip through the heuristics, and this is exactly what happened for the two 0x202xxxx addresses above.

If you’d rather not look for module addresses by hand, the following pipeline can be used to create a new symbol table that encompasses both kernel and module symbols. Whenever you reload the module, you must recreate this symbol table.

cat /proc/ksyms /usr/src/linux/System.map | sed 's/ . / /' |\
    awk '{print $1,"T",$2}' | sort -u > /tmp/System.map

The pipeline combines the complete system map and the public kernel symbols from /proc/ksyms; the latter file lists module symbols in the current kernel, in addition to kernel symbols. Such addresses are shown as they appear after insmod has relocated the code. Since the two files have different formats, sed and awk are used to convert all the lines into a suitable format. The map is then sorted, removing duplicates, so that ksymoops can use it.
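To see why sed and awk are needed, consider the two input formats. A System.map line carries a type letter between the address and the symbol name:

00120df0 T sys_read

while a /proc/ksyms line has just the address and the symbol (the addresses shown here are derived from sys_read+8a and faulty_read+45 in this chapter’s oops; the exact layout of /proc/ksyms varies slightly across kernel versions):

0202d034 faulty_read

After the pipeline runs, both collapse to the single format that ksymoops expects:

00120df0 T sys_read
0202d034 T faulty_read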

If we rerun ksymoops, it extracts the following information from the new symbol table:

>>EIP: 202d079 <faulty_read+45/60>
Trace: 2020018 <M_wacom_proto+1fb8/d8e4>
Trace: 202d004 <kmouse_wait+15d8/1608>
Trace: 120e7a <sys_read+8a/b0>
Trace: 10a602 <system_call+52/80>

Code: 202d079 <faulty_read+45/60> repz movsl %ds:(%esi),%es:(%edi)
Code: 202d07b <faulty_read+47/60> andl   $0x3,%ebx
Code: 202d07e <faulty_read+4a/60> movl   %ebx,%ecx
Code: 202d080 <faulty_read+4c/60> repz movsb %ds:(%esi),%es:(%edi)
Code: 202d082 <faulty_read+4e/60> popl   %es
Code: 202d083 <faulty_read+4f/60> movl   0x14(%ebp),%eax
Code: 202d086 <faulty_read+52/60> leal   0xfffffff4(%ebp),%esp
Code: 202d089 <faulty_read+55/60> popl   %ebx
Code: 202d08a <faulty_read+56/60> popl   %esi
Code: 202d08b <faulty_read+57/60> popl   %edi
Code: 202d08c <faulty_read+58/60> movl   %eax,(%eax)

As you can see, creating a modified system map is quite helpful when tracing oops messages related to modules: ksymoops can now decode the instruction pointer and the complete call trace. Note also that the format used to show disassembled code is the same as that used by the objdump program. objdump is a powerful utility; if you want to look at instructions before the one that failed, you can invoke the command objdump -d faulty.o.

The notation faulty_read+45/60 means that the faulting instruction lies at offset 0x45 within faulty_read, which is 0x60 bytes long; that offset is what to look for in the disassembled listing of the object file.
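For example, a pipeline like the following (the context count passed to grep is arbitrary) prints the beginning of faulty_read’s disassembly, from which you can count forward to offset 0x45:

objdump -d faulty.o | grep -A 20 '<faulty_read>:'

For more information on objdump and its command-line options, see the man page for the command.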

Even if you build your own modified symbol table, the concern mentioned above regarding the call trace applies: the 0x202xxxx pointers have been decoded, but are still spurious.

Learning to decode an oops message requires some practice, but it’s worth doing. The time spent learning will be quickly repaid. The only issue is where to get the relevant documentation about assembly language, because the Unix syntax for machine instructions (the so-called AT&T syntax) is different from the Intel syntax; even if you know PC assembly language, your experience has probably been with Intel-syntax programs.

Using oops

Using ksymoops is somewhat burdensome. You need the C++ compiler to build it, you must build your own symbol table to fully exploit the capabilities of the program, and you have to merge the original message and the ksymoops output to have all the information handy.

If you don’t want to go to all that trouble, you can use the oops program. oops is provided in the source files for this book on the O’Reilly FTP site. It is derived from the original ksymoops tool, which is no longer maintained by its author. oops is written in C and looks in /proc/ksyms without requiring the user to build a new symbol table every time a module is loaded.

The program tries to decode all the processor registers and the stack trace to symbolic values. Its disadvantage is that it is more verbose than ksymoops, but usually the more information you have available, the sooner you find the bug. Like the original ksymoops, the program is able to decode oops messages generated on Intel platforms only. Porting to other platforms shouldn’t be difficult. The program is released under the GPL like the kernel sources.

Output generated by oops is similar to the ksymoops output, but more complete. Here is the beginning of its output for the oops shown above--I don’t think it’s worth showing the entire stack trace because the stack holds nothing interesting in this particular oops message:

EIP:    0010:0202d079 <faulty_read+45/60>
EFLAGS: 00010216
eax: 00000041
ebx: 00001000
ecx: 0000004b
edx: 0156b018
esi: 0202e000
edi: 0800aed4
ebp: 01106f90 <%esp+24>
esp: 01106f6c <%esp+0>
ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
Process cat (pid: 597, process nr: 31, stackpage=01106000)
esp+00: 02020018 <M_wacom_proto+1fb8/d8e4>
esp+04: 0202d004
esp+08: 00f992e8

Having the registers and the stack decoded is helpful when you are debugging ``real'' modules (faulty is too short for the difference to be meaningful), and is particularly useful if all the symbols of the module being debugged are exported. It’s not unusual for processor registers to point to symbols in the module at the time of the fault, and you can identify them from the output only if the symbol table is exported to /proc/ksyms.

We can make a complete symbol table available by taking the following steps. First, we avoid declaring symbols in the module as static, since static symbols aren’t visible to insmod and so can never be exported. And second, we conditionally compile the call to register_symtab so that it is skipped when SCULL_DEBUG (or a similar macro) is defined, as shown in the code below, extracted from scull’s init_module function.

#ifndef SCULL_DEBUG
    register_symtab(NULL); /* otherwise, leave global symbols visible */
#endif

We saw in Section 2.3.1 in Chapter 2 that if the module doesn’t register a symbol table, all the global symbols are exported. Although this feature is exploited only if SCULL_DEBUG is active, all global symbols should be correctly prefixed to avoid namespace pollution in the kernel (see Section 2.1 in Chapter 2).
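To put the fragment in context, a condensed init_module might look more or less like the following sketch (it is hypothetical, not scull’s actual code; scull_major and scull_fops stand for the module’s usual names):

int init_module(void)
{
    int result = register_chrdev(scull_major, "scull", &scull_fops);
    if (result < 0)
        return result;              /* can't get a major number */

    /* ... the rest of the initialization is omitted here ... */

#ifndef SCULL_DEBUG
    register_symtab(NULL);          /* non-debug build: export nothing */
#endif
    /* when SCULL_DEBUG is defined the call is skipped, so every global
       symbol in the module stays visible in /proc/ksyms */
    return 0;
}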

Using klogd

Recent versions of the klogd daemon can decode oops messages before they reach the log files. Decoding is performed only by version 1.3 or newer of the daemon and only if -k /usr/src/linux/System.map is passed as a command-line option to the daemon. (You can replace System.map with another map file.)
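For instance, the command that starts the daemon (usually buried in one of the syslog init scripts; the exact place varies between distributions) might read:

klogd -k /usr/src/linux/System.map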

A dump of the oops for faulty, produced by the new klogd and written to the system log, looks like this (note the decoded symbols in the stack trace):

EIP:    0010:[<0202d079>]
EFLAGS: 00010216
eax: 00000041   ebx: 00001000   ecx: 0000004b   edx: 00ee2414
esi: 0202e000   edi: 0800aed4   ebp: 0032ff90   esp: 0032ff6c
ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
Process cat (pid: 861, process nr: 10, stackpage=0032f000)
Stack: 02020018 0202d004 00f992e8 01257c40 0800a000 00001000 01257c40
       00001000 00f992e8 00120e7a 00f992e8 01257c40 0800a000 00001000
       0800a000 00001000 bffffbd8 0010a602 00000003 0800a000 00001000
Call Trace: [<02020018>] [<0202d004>] [sys_read+138/176]
            [system_call+82/128]
Code: f3 a5 83 e3 03 89 d9 f3 a4 07 8b 45 14 8d 65 f4 5b 5e 5f 89

I consider a decoding klogd a great help for the average Linux installation when debugging the kernel, but, as we’ll see, it is of less use for debugging modules. The message decoded by klogd includes most ksymoops features and doesn’t force the user to compile additional tools or merge two outputs in order to submit a complete bug report should something go wrong with the system. The daemon also correctly decodes the instruction pointer when the oops happens in the kernel proper. It doesn’t disassemble the program code, but this is not a problem when the message accompanies a bug report because the binary data is still there, and disassembled code can be generated offline.

Another great feature of the daemon is that it refuses to decode symbols if the symbol table doesn’t match the current kernel. If a symbol is decoded on the system log, you can be reasonably sure it is decoded correctly.

However, the tool, despite its usefulness for Linux users, is not that helpful when debugging modules. I personally don’t use the decoding options on the computers where I develop my software. The problem with klogd is that it doesn’t decode symbols in modules; even reading /proc/ksyms doesn’t help, because the daemon is run before the programmer loads the module. The presence of decoded symbols in the log file, then, confuses both oops and ksymoops, and it’s hard to perform additional decoding.

If you want to use klogd for debugging your modules, specific support is being added to the newest versions of the daemon, but as I’m writing this, it needs a small kernel patch to be effective.

System Hangs

Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling, and the system doesn’t respond to any action, including the magic Ctrl-Alt-Del combination. You have two choices to deal with system hangs--either you need to prevent them beforehand, or you need to be able to debug them after the fact.

You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and thus allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, you will be able to kill the process, after tracing what is happening.

Inserting schedule calls in a driver function creates a new ``problem'' for the programmer: the function, and all the functions in its call trace, must be reentrant. Under normal circumstances, the driver as a whole is reentrant because different processes can access the device concurrently, but it’s not necessary for each function to be reentrant, because the Linux kernel is not preemptible. But if a driver function allows the scheduler to interrupt the current process, a different process might enter the same function. The reentrancy issue is not really important if the schedule calls are enabled only during debugging, because you can avoid accessing the driver from two concurrent processes if you know you aren’t allowed to. When blocking operations are introduced (in Section 5.2.2 in Chapter 5), the reentrancy problem will be dealt with in more detail.
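As a minimal sketch of the technique (FAULTY_DEBUG, faulty_wait_for_data, Faulty_Dev, and dev->ready are all invented names used only for illustration), a debug-only schedule call might be wrapped like this:

#include <linux/sched.h>                /* declares schedule() */

#ifdef FAULTY_DEBUG
#  define DEBUG_SCHEDULE() schedule()   /* let other processes run */
#else
#  define DEBUG_SCHEDULE()              /* compiled away otherwise */
#endif

static void faulty_wait_for_data(Faulty_Dev *dev)
{
    while (!dev->ready) {               /* a loop that might never end */
        DEBUG_SCHEDULE();               /* keeps the system scheduling,
                                           so the process can be killed */
    }
}

The macro expands to nothing unless FAULTY_DEBUG is defined, which keeps the reentrancy issue confined to debugging builds, as noted above.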

To debug infinite loops, we can make use of the special functions of the Linux keyboard. By default, the PrScr key (keycode 70), if pressed with a modifier key, prints to the current console useful information about the machine’s status. This works on both x86 and Alpha systems. The Sparc port of Linux features the same capability, but uses the key marked ``Break/Scroll Lock'' (keycode 30).

Each of the special functions has a name and is associated with a keypress event, as shown in the following list. The function name appears in parentheses after the key combination.

Shift-PrScr (Show_Memory)

Prints several lines of information about memory usage, particularly the use of the buffer cache.

Control-PrScr (Show_State)

Prints one line for each process in the system, with information about the internal process tree. The current process is marked as such.

RightAlt-PrScr (Show_Registers)

This is the most important key when the system hangs, because it dumps the contents of the processor registers at the time the key is hit. Looking at the instruction pointer and how it changes over time can be extremely useful in understanding where the code is looping, provided there exists a system map for the current kernel.

The name of each function can be passed to loadkeys in order to remap the binding to a different key. The keyboard map can be modified at will (it is ``policy-free'').
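For example, a one-line keymap fragment like the following (the keycode is the PrScr value quoted above and the modifier pair is arbitrary; check your current bindings with dumpkeys before changing anything) can be saved to a small file and loaded with loadkeys:

shift control keycode 70 = Show_Registers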

The messages printed by these functions appear on the console if console_loglevel is high enough. The default level should be high enough, unless you run an old klogd and a new kernel. If the messages don’t appear, you can raise the loglevel as explained earlier. The definition of ``high enough'' depends on the kernel version you use. It’s 5 for Linux 2.0 and later.

It’s important to be sure the loglevel is high enough, because the messages will display on the console even when the computer is hung. The messages are generated at the time of the interrupt and therefore can slip through even if a faulty process is executing a tight loop without releasing the CPU--that is, unless interrupts are disabled, which is both unlikely and unlucky.

Sometimes the system may appear to be hung, but it isn’t. This can happen, for example, if the keyboard remains locked in some strange way. These false hangs can be detected by looking at the output of a program you keep running for just this purpose. I have a program that updates the clock on an LED display, and I discovered that the program is also useful as evidence that the scheduler is still working. You can check the scheduler without using external hardware, by implementing a program that flashes the keyboard LEDs, turns on the floppy motor every now and then, or ticks the speaker--conventional beeps are quite annoying and should be avoided, in my opinion. Look for the KDMKTONE ioctl command instead. A sample program (misc-progs/heartbeat.c) that flashes a keyboard LED in a heartbeat fashion is available in the sources on the O’Reilly FTP site.
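If you’d rather roll your own quick check, a rough user-space sketch along the following lines blinks the Scroll Lock LED until it is interrupted (it is not the book’s heartbeat.c; it assumes you are root on the console and that /dev/console accepts the KDSETLED ioctl):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kd.h>

int main(void)
{
    int fd = open("/dev/console", O_WRONLY);
    if (fd < 0) {
        perror("/dev/console");
        return 1;
    }
    for (;;) {
        ioctl(fd, KDSETLED, LED_SCR);   /* Scroll Lock LED on */
        usleep(100000);
        ioctl(fd, KDSETLED, 0);         /* all LEDs off */
        usleep(900000);
    }
    /* never reached; a polite program would restore normal LED
       behavior by calling ioctl(fd, KDSETLED, 0xff) before exiting */
    return 0;
}

As long as the LED keeps pulsing, the scheduler is still running and the apparent hang is probably just a locked keyboard.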

If the keyboard isn’t accepting input, the best thing to do is log into the system through your network in order to kill any offending processes, or reset the keyboard (with kbd_mode -a). However, discovering that the hang is only a keyboard lockup is of little use if you don’t have a network available to help you recover. If this is the case, you should set up alternative input devices to be able at least to reboot the system cleanly. A shutdown and reboot cycle is easier on your computer than hitting the so-called ``big red button,'' and it saves you from the lengthy fsck scanning of your filesystems.

Such alternative input devices can be a joystick or the mouse. There is a joystick-reboot daemon on sunsite.unc.edu, and the gpm-1.10 or newer mouse server features a command-line option to enable a similar capability. If the keyboard is erroneously in ``raw'' mode instead of being locked, you can resort to the tricks described in the documentation of the kbd package. I suggest that you read the documentation before the problem arises and it’s too late. Another possibility is to configure the gpm-root menus to have a ``reboot'' or ``reset keyboard'' entry; gpm-root is a daemon that responds to control-mouse events in order to draw menus on the screen and perform configurable actions.

Finally, you can hit the ``Secure Attention Key'' (SAK), a special key meant to recover the system to a usable state. The current Linux versions don’t have an entry for the key in the default keyboard map because the implementation is not guaranteed always to succeed. You can nonetheless map SAK to your keyboard by using loadkeys. You should also look at the implementation of SAK in the drivers/char directory. The comments in the code explain why the key doesn’t always work with Linux 2.0, so I won’t say any more about it.

If you run 2.1.9 or newer, on the other hand, you’ll enjoy having a reliable Secure Attention Key. Moreover, 2.1.43 and newer kernels have a compile-time option to enable a ``Magic System Request Key''; I urge you to look in drivers/char/sysrq.c and enjoy the new technology.

If your driver really hangs the system, and you don’t know where to insert schedule calls, the best way to go is to add some print messages and write them to the console (by changing the console_loglevel value). It’s also wise to mount all your disks read-only (or unmount them) before reproducing the hang. If the disks are read-only or unmounted, there’s no risk of damaging the filesystem or leaving it in an inconsistent state. At least you’ll avoid the fsck pass after resetting the computer. Another possibility is using an NFS-root computer to test new modules. In this case you’ll avoid any filesystem corruption, as filesystem coherence is managed by the NFS server, which is not brought down by your device driver.
