Debugging by Querying

The previous section described how printk works and how it can be used. What it didn’t talk about are its disadvantages.

Heavy use of printk can slow down the system noticeably, because syslogd keeps syncing its output files; every line that is printed therefore causes a disk operation. From syslogd’s perspective this is the right implementation: it tries to write everything to disk in case the system crashes right after printing the message. You don’t, however, want to slow down your system just for the sake of debugging messages. This problem can be solved by prefixing the name of your log file, as it appears in /etc/syslog.conf, with a minus sign.[22] The problem with changing the configuration file is that the modification will likely remain there after you are done debugging, even though during normal system operation you do want messages to be flushed to disk as soon as possible. An alternative to such a permanent change is running a program other than klogd (such as cat /proc/kmsg, as suggested earlier), but this may not provide a suitable environment for normal system operation.
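As an illustration of the minus prefix, a syslog.conf line for kernel messages might look like the following fragment. The log file path is a hypothetical choice; the selector and the leading minus follow the syntax described in syslog.conf(5).

```
# Kernel messages go to a file that is not synced after every line.
# The path below is only an example.
kern.*          -/var/log/kernel-debug.log
```

Removing the minus once debugging is over restores the default behavior of syncing after each message.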

More often than not, the best way to get relevant information is to query the system when you need the information, instead of continually producing data. In fact, every Unix system provides many tools for obtaining system information: ps, netstat, vmstat, and so on.

Two main techniques are available to driver developers for querying the system: creating a file in the /proc filesystem and using the ioctl driver method. You may use devfs as an alternative to /proc, but /proc is an easier tool to use for information retrieval.

Using the /proc Filesystem

The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file’s “contents” on the fly when the file is read. We have already seen some of these files in action; /proc/modules, for example, always returns a list of the currently loaded modules.

/proc is heavily used in the Linux system. Many utilities on a modern Linux distribution, such as ps, top, and uptime, get their information from /proc. Some device drivers also export information via /proc, and yours can do so as well. The /proc filesystem is dynamic, so your module can add or remove entries at any time.

Fully featured /proc entries can be complicated beasts; among other things, they can be written to as well as read from. Most of the time, however, /proc entries are read-only files. This section will concern itself with the simple read-only case. Those who are interested in implementing something more complicated can look here for the basics; the kernel source may then be consulted for the full picture.

All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.

To create a read-only /proc file, your driver must implement a function to produce the data when the file is read. When some process reads the file (using the read system call), the request will reach your module by means of one of two different interfaces, according to what you registered. We’ll leave registration for later in this section and jump directly to the description of the reading interfaces.

In both cases the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space.

The recommended interface is read_proc, but an older interface named get_info also exists.

int (*read_proc)(char *page, char **start, off_t offset, int count, int *eof, void *data);

The page pointer is the buffer where you’ll write your data; start is used by the function to say where the interesting data has been written in page (more on this later); offset and count have the same meaning as in the read implementation. The eof argument points to an integer that must be set by the driver to signal that it has no more data to return, while data is a driver-specific data pointer you can use for internal bookkeeping.[23] The function is available in version 2.4 of the kernel, and 2.2 as well if you use our sysdep.h header.

int (*get_info)(char *page, char **start, off_t offset, int count);

get_info is an older interface used to read from a /proc file. The arguments all have the same meaning as for read_proc. What it lacks is the pointer to report end-of-file and the object-oriented flavor brought in by the data pointer. The function is available in all the kernel versions we are interested in (although it had an extra unused argument in its 2.0 implementation).

Both functions should return the number of bytes of data actually placed in the page buffer, just like the read implementation does for other files. Other output values are *eof and *start. eof is a simple flag, but the use of the start value is somewhat more complicated.
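The contract just described can be tried out in user space. The following is a minimal sketch, not kernel code: sketch_read_proc and SKETCH_PAGE_SIZE are names invented for this illustration, and the data pointer is assumed to refer to a plain int. The function has the read_proc argument list, fills the page, sets *eof, and returns the byte count.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>   /* off_t */

#define SKETCH_PAGE_SIZE 4096   /* stand-in for the kernel's PAGE_SIZE */

/*
 * Hypothetical read_proc-style function, compiled in user space.
 * The whole "file" is one short line, so start and offset are ignored.
 */
int sketch_read_proc(char *page, char **start, off_t offset,
                     int count, int *eof, void *data)
{
    const int *value = data;    /* driver-specific pointer; here, just an int */
    int len;

    (void)start;                /* left untouched: the page holds the whole file */
    (void)offset;
    assert(count >= 32);        /* the one line below is far shorter than this */

    len = sprintf(page, "value: %d\n", *value);
    *eof = 1;                   /* nothing more to return */
    return len;                 /* bytes actually placed in page */
}
```

A caller would pass a buffer of SKETCH_PAGE_SIZE bytes, a char * initialized to NULL for start, and the address of an int for eof; after the call, the return value is the length of the text and *eof is 1.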

The main problem with the original implementation of user extensions to the /proc filesystem was use of a single memory page for data transfer. This limited the total size of a user file to 4 KB (or whatever was appropriate for the host platform). The start argument is there to implement large data files, but it can be ignored.

If your read_proc function does not set the *start pointer (it starts out NULL), the kernel assumes that the offset parameter has been ignored and that the data page contains the whole file you want to return to user space. If, on the other hand, you need to build a bigger file from pieces, you can set *start to be equal to page so that the caller knows your new data is placed at the beginning of the buffer. You should then, of course, skip the first offset bytes of data, which will have already been returned in a previous call.
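To make the piecewise case concrete, here is a user-space sketch under stated assumptions. The "file" is 1000 fixed-width lines (about 10 KB), sketch_read_big is a hypothetical read_proc-style function that sets *start to page and skips the first offset bytes, and sketch_read_all is a simplified stand-in for the caller's loop, not the kernel's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* off_t */

#define SKETCH_PAGE_SIZE 4096   /* stand-in for the kernel's PAGE_SIZE */
#define SKETCH_NLINES    1000
#define SKETCH_LINE_LEN  10     /* strlen("line 0000\n") */

/* Produce the bytes [offset, offset + count) of the file at the start of page. */
int sketch_read_big(char *page, char **start, off_t offset,
                    int count, int *eof, void *data)
{
    const off_t total = (off_t)SKETCH_NLINES * SKETCH_LINE_LEN;
    int len = 0;

    (void)data;
    assert(offset >= 0 && count > 0);

    while (len < count && offset + len < total) {
        off_t pos = offset + len;
        int line = (int)(pos / SKETCH_LINE_LEN);    /* which line pos falls in */
        int within = (int)(pos % SKETCH_LINE_LEN);  /* bytes of it already sent */
        int chunk = SKETCH_LINE_LEN - within;
        char tmp[32];

        if (chunk > count - len)
            chunk = count - len;                    /* page is full mid-line */
        snprintf(tmp, sizeof tmp, "line %04d\n", line);
        memcpy(page + len, tmp + within, (size_t)chunk);
        len += chunk;
    }

    *start = page;              /* new data begins at the start of the buffer */
    if (offset + len >= total)
        *eof = 1;
    return len;
}

/* Simplified caller: one page per call, offset advanced by the bytes returned. */
size_t sketch_read_all(char *out, size_t outsz)
{
    char page[SKETCH_PAGE_SIZE];
    off_t offset = 0;
    size_t copied = 0;
    int eof = 0;

    while (!eof) {
        char *start = NULL;
        int n = sketch_read_big(page, &start, offset, (int)sizeof page,
                                &eof, NULL);

        if (n <= 0 || copied + (size_t)n > outsz)
            break;
        memcpy(out + copied, start, (size_t)n);
        copied += (size_t)n;
        offset += n;
    }
    return copied;
}
```

With these sizes the caller needs three calls, and the second one begins in the middle of a line; that is why the function recomputes its position from offset rather than remembering state between calls.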

There has long been another major issue with /proc files, which start is meant to solve as well. Sometimes the ASCII representation of kernel data structures changes between successive calls to read, so the reader process could find inconsistent data from one call to the next. If *start is set to a small integer value, the caller will use it to increment filp->f_pos independently of the amount of data you return, thus making f_pos an internal record number of your read_proc or get_info procedure. If, for example, your read_proc function is returning information from a big array of structures, and five of those structures were returned in the first call, start could be set to 5. The next call will provide that same value as the offset; the driver then knows to start returning data from the sixth structure in the array. This is defined as a “hack” by its authors and can be seen in fs/proc/generic.c.
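The record-number convention can also be sketched in user space. Everything below is hypothetical: the table of eight records, the names sketch_read_records and sketch_dump_records, and the caller loop, which only mimics the behavior described above (advance the position by the small integer stored in *start, not by the byte count).

```c
#include <assert.h>
#include <stdint.h>      /* uintptr_t */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* off_t */

#define SKETCH_NREC    8
#define SKETCH_REC_MAX 16   /* upper bound on one formatted record, NUL included */

/* offset is a record number here; *start reports how many records were written. */
int sketch_read_records(char *page, char **start, off_t offset,
                        int count, int *eof, void *data)
{
    int first = (int)offset;
    int done = 0, len = 0;

    (void)data;
    while (first + done < SKETCH_NREC && len + SKETCH_REC_MAX <= count) {
        int id = first + done;

        len += sprintf(page + len, "rec %d: %d\n", id, id * 10);
        done++;
    }
    if (first + done >= SKETCH_NREC)
        *eof = 1;
    *start = (char *)(uintptr_t)done;   /* small integer, not a real pointer */
    return len;
}

/* Simplified caller: copies n bytes from page, then advances by *start records. */
size_t sketch_dump_records(char *out, size_t outsz, int count)
{
    char page[256];
    off_t pos = 0;
    size_t copied = 0;
    int eof = 0;

    assert(count > 0 && (size_t)count <= sizeof page);
    while (!eof) {
        char *start = NULL;
        int n = sketch_read_records(page, &start, pos, count, &eof, NULL);

        if (n <= 0 || copied + (size_t)n >= outsz)
            break;
        memcpy(out + copied, page, (size_t)n);
        copied += (size_t)n;
        pos += (off_t)(uintptr_t)start;
    }
    out[copied] = '\0';
    return copied;
}
```

Calling sketch_dump_records with a count of 32 forces two records per call, so the eight records arrive in four calls. Because each call restarts on a record boundary, a record is never split even if its text representation changes length between calls.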

Time for an example. Here is a simple read_proc implementation for the scull device:

int scull_read_procmem(char *buf, char **start, off_t offset,
                   int count, int *eof, void *data)
{
    int i, j, len = 0;
    int limit = count - 80; /* Don't print more than this */

    for (i = 0; i < scull_nr_devs && len <= limit; i++) {
        Scull_Dev *d = &scull_devices[i];
        if (down_interruptible(&d->sem))
                return -ERESTARTSYS;
        len += sprintf(buf+len,"\nDevice %i: qset %i, q %i, sz %li\n",
                       i, d->qset, d->quantum, d->size);
        for (; d && len <= limit; d = d->next) { /* scan the list */
            len += sprintf(buf+len, "  item at %p, qset at %p\n", d, 
                                    d->data);
            if (d->data && !d->next) /* dump only the last item - save space */
                for (j = 0; j < d->qset && len <= limit; j++) {
                    if (d->data[j])
                        len += sprintf(buf+len,"    % 4i: %8p\n",
                                                    j,d->data[j]);
                }
        }
        up(&scull_devices[i].sem);
    }
    *eof = 1;
    return len;
}

This is a fairly typical read_proc implementation. It assumes that there will never be a need to generate more than one page of data, and so ignores the start and offset values. It is, however, careful not to overrun its buffer, just in case.
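The guard against overrunning the buffer rests on a simple invariant: if no single line is longer than 80 bytes (terminating NUL included) and a line is written only while len <= count - 80, then len can never exceed count. The following user-space sketch isolates that check; sketch_fill and SKETCH_MAX_LINE are names made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_MAX_LINE 80   /* assumed upper bound on one formatted line */

/* Write up to nlines lines into buf without ever passing count bytes. */
int sketch_fill(char *buf, int count, int nlines)
{
    int limit = count - SKETCH_MAX_LINE;   /* don't start a line past this */
    int len = 0, i;

    assert(count >= SKETCH_MAX_LINE);
    for (i = 0; i < nlines && len <= limit; i++)
        len += sprintf(buf + len, "entry %d\n", i);
    return len;
}
```

The price of the simple test is that up to 80 bytes at the end of the buffer may go unused, which is rarely a concern for debugging output.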

A /proc function using the get_info interface would look very similar to the one just shown, with the exception that the last two arguments would be missing. The end-of-file condition, in this case, is signaled by returning less data than the caller expects (i.e., less than count).

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. There are two ways of setting up this connection, depending on what versions of the kernel you wish to support. The easiest method, only available in the 2.4 kernel (and 2.2 too if you use our sysdep.h header), is to simply call create_proc_read_entry. Here is the call used by scull to make its /proc function available as /proc/scullmem:

create_proc_read_entry("scullmem", 
                       0    /* default mode */,
                       NULL /* parent dir */, 
                       scull_read_procmem, 
                       NULL /* client data */);

The arguments to this function are, as shown, the name of the /proc entry, the file permissions to apply to the entry (the value 0 is treated as a special case and is turned to a default, world-readable mask), the proc_dir_entry pointer to the parent directory for this file (we use NULL to make the driver appear directly under /proc), the pointer to the read_proc function, and the data pointer that will be passed back to the read_proc function.

The directory entry pointer can be used to create entire directory hierarchies under /proc. Note, however, that an entry may be more easily placed in a subdirectory of /proc simply by giving the directory name as part of the name of the entry—as long as the directory itself already exists. For example, an emerging convention says that /proc entries associated with device drivers should go in the subdirectory driver/; scull could place its entry there simply by giving its name as driver/scullmem.

Entries in /proc, of course, should be removed when the module is unloaded. remove_proc_entry is the function that undoes what create_proc_read_entry did:

remove_proc_entry("scullmem", NULL /* parent dir */);

The alternative method for creating a /proc entry is to create and initialize a proc_dir_entry structure and pass it to proc_register_dynamic (version 2.0) or proc_register (version 2.2, which assumes a dynamic file if the inode number in the structure is 0). As an example, consider the following code that scull uses when compiled against 2.0 headers:

static int scull_get_info(char *buf, char **start, off_t offset,
                int len, int unused)
{
    int eof = 0;
    return scull_read_procmem (buf, start, offset, len, &eof, NULL);
}

struct proc_dir_entry scull_proc_entry = {
        namelen:    8,
        name:       "scullmem",
        mode:       S_IFREG | S_IRUGO,
        nlink:      1,
        get_info:   scull_get_info,
};

static void scull_create_proc(void)
{
    proc_register_dynamic(&proc_root, &scull_proc_entry);
}

static void scull_remove_proc(void)
{
    proc_unregister(&proc_root, scull_proc_entry.low_ino);
}

The code declares a function using the get_info interface and fills in a proc_dir_entry structure that is registered with the filesystem.

This code provides compatibility across the 2.0 and 2.2 kernels, with a little support from macro definitions in sysdep.h. It uses the get_info interface because the 2.0 kernel did not support read_proc. Some more work with #ifdef could have made it use read_proc with Linux 2.2, but the benefits would be minor.

The ioctl Method

ioctl, which we show you how to use in the next chapter, is a system call that acts on a file descriptor; it receives a number that identifies a command to be performed and (optionally) another argument, usually a pointer.

As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging. These commands can copy relevant data structures from the driver to user space, where you can examine them.

Using ioctl this way to get information is somewhat more difficult than using /proc, because you need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you’re testing. On the other hand, the driver’s code is easier than what is needed to implement a /proc file.
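The user-space half of such a debugging command might look like the sketch below. The structure layout, the command number SCULL_IOCGDEBUG (including the 'k' and 64 used to build it), and the function names are all assumptions made for this example; a real program would take these definitions from a header shared with the driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical layout; it must match the driver's definition exactly. */
struct scull_debug_info {
    int quantum;
    int qset;
    long size;
};

/* Hypothetical command number; 'k' and 64 are placeholders. */
#define SCULL_IOCGDEBUG _IOR('k', 64, struct scull_debug_info)

/* Open the device, ask the driver to fill *info, and close it again. */
int scull_query_debug(const char *path, struct scull_debug_info *info)
{
    int fd, ret;

    assert(path != NULL && info != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = ioctl(fd, SCULL_IOCGDEBUG, info);
    close(fd);
    return ret < 0 ? -1 : 0;
}

/* Turn the binary structure into text only when it is time to display it. */
int scull_format_debug(char *buf, size_t n,
                       const struct scull_debug_info *info)
{
    return snprintf(buf, n, "quantum %d, qset %d, size %ld\n",
                    info->quantum, info->qset, info->size);
}
```

Keeping the query and the formatting apart reflects the split described above: the driver only copies a structure to user space, and all presentation work stays in the helper program.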

There are times when ioctl is the best way to get information, because it runs faster than reading /proc. If some work must be performed on the data before it’s written to the screen, retrieving the data in binary form is more efficient than reading a text file. In addition, ioctl doesn’t require splitting data into fragments smaller than a page.

Another interesting advantage of the ioctl approach is that information-retrieval commands can be left in the driver even when debugging would otherwise be disabled. Unlike a /proc file, which is visible to anyone who looks in the directory (and too many people are likely to wonder “what that strange file is”), undocumented ioctl commands are likely to remain unnoticed. In addition, they will still be there should something weird happen to the driver. The only drawback is that the module will be slightly bigger.



[22] The minus is a “magic” marker to prevent syslogd from flushing the file to disk at every new message, documented in syslog.conf(5), a manual page worth reading.

[23] We’ll find several of these pointers throughout the book; they represent the “object” involved in this action and correspond somewhat to this in C++.
