The kiobuf Interface

As of version 2.3.12, the Linux kernel supports an I/O abstraction called the kernel I/O buffer, or kiobuf. The kiobuf interface is intended to hide much of the complexity of the virtual memory system from device drivers (and other parts of the system that do I/O). Many features are planned for kiobufs, but their primary use in the 2.4 kernel is to facilitate the mapping of user-space buffers into the kernel.

The kiobuf Structure

Any code that works with kiobufs must include <linux/iobuf.h>. This file defines struct kiobuf, which is the heart of the kiobuf interface. This structure describes an array of pages that make up an I/O operation; its fields include the following:

int nr_pages;

The number of pages in this kiobuf

int length;

The number of bytes of data in the buffer

int offset;

The offset to the first valid byte in the buffer

struct page **maplist;

An array of page structures, one for each page of data in the kiobuf

The key to the kiobuf interface is the maplist array. Functions that operate on pages stored in a kiobuf deal directly with the page structures—all of the virtual memory system overhead has been moved out of the way. This implementation allows drivers to function independent of the complexities of memory management, and in general simplifies life greatly.

Prior to use, a kiobuf must be initialized. It is rare to initialize a single kiobuf in isolation, but, if need be, this initialization can be performed with kiobuf_init:

void kiobuf_init(struct kiobuf *iobuf);

Usually kiobufs are allocated in groups as part of a kernel I/O vector, or kiovec. A kiovec can be allocated and initialized in one step with a call to alloc_kiovec:

int alloc_kiovec(int nr, struct kiobuf **iovec);

The return value is 0 or an error code, as usual. When your code has finished with the kiovec structure, it should, of course, return it to the system:

void free_kiovec(int nr, struct kiobuf **);

The kernel provides a pair of functions for locking and unlocking the pages mapped in a kiovec:

int lock_kiovec(int nr, struct kiobuf *iovec[], int wait);
int unlock_kiovec(int nr, struct kiobuf *iovec[]);

Locking a kiovec in this manner is unnecessary, however, for most applications of kiobufs seen in device drivers.

Mapping User-Space Buffers and Raw I/O

Unix systems have long provided a “raw” interface to some devices—block devices in particular—which performs I/O directly from a user-space buffer and avoids copying data through the kernel. In some cases much improved performance can be had in this manner, especially if the data being transferred will not be used again in the near future. For example, disk backups typically read a great deal of data from the disk exactly once, then forget about it. Running the backup via a raw interface will avoid filling the system buffer cache with useless data.

The Linux kernel has traditionally not provided a raw interface, for a number of reasons. As the system gains in popularity, however, more applications that expect to be able to do raw I/O (such as large database management systems) are being ported. So the 2.3 development series finally added raw I/O; the driving force behind the kiobuf interface was the need to provide this capability.

Raw I/O is not always the great performance boost that some people think it should be, and driver writers should not rush out to add the capability just because they can. The overhead of setting up a raw transfer can be significant, and the advantages of buffering data in the kernel are lost. For example, note that raw I/O operations almost always must be synchronous—the write system call cannot return until the operation is complete. Linux currently lacks the mechanisms that user programs need to be able to safely perform asynchronous raw I/O on a user buffer.

In this section, we add a raw I/O capability to the sbull sample block driver. When kiobufs are available, sbull actually registers two devices. The block sbull device was examined in detail in Chapter 12. What we didn’t see in that chapter was a second, char device (called sbullr), which provides raw access to the RAM-disk device. Thus, /dev/sbull0 and /dev/sbullr0 access the same memory; the former using the traditional, buffered mode and the second providing raw access via the kiobuf mechanism.

It is worth noting that in Linux systems, there is no need for block drivers to provide this sort of interface. The raw device, in drivers/char/raw.c, provides this capability in an elegant, general way for all block devices. The block drivers need not even know they are doing raw I/O. The raw I/O code in sbull is essentially a simplification of the raw device code for demonstration purposes.

Raw I/O to a block device must always be sector aligned, and its length must be a multiple of the sector size. Other kinds of devices, such as tape drives, may not have the same constraints. sbullr behaves like a block device and enforces the alignment and length requirements. To that end, it defines a few symbols:

#  define SBULLR_SECTOR 512  /* insist on this */
#  define SBULLR_SECTOR_MASK (SBULLR_SECTOR - 1)
#  define SBULLR_SECTOR_SHIFT 9

The sbullr raw device will be registered only if the hard-sector size is equal to SBULLR_SECTOR. There is no real reason why a larger hard-sector size could not be supported, but it would complicate the sample code unnecessarily.

The sbullr implementation adds little to the existing sbull code. In particular, the open and close methods from sbull are used without modification. Since sbullr is a char device, however, it needs read and write methods. Both are defined to use a single transfer function as follows:

ssize_t sbullr_read(struct file *filp, char *buf, size_t size, 
                    loff_t *off)
{
    Sbull_Dev *dev = sbull_devices + 
                    MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, buf, size, off, READ);
}

ssize_t sbullr_write(struct file *filp, const char *buf, size_t size,
                loff_t *off)
{
    Sbull_Dev *dev = sbull_devices + 
                    MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, (char *) buf, size, off, WRITE);
}

The sbullr_transfer function handles all of the setup and teardown work, while passing off the actual transfer of data to yet another function. It is written as follows:

static int sbullr_transfer (Sbull_Dev *dev, char *buf, size_t count,
                loff_t *offset, int rw)
{
    struct kiobuf *iobuf;       
    int result;
    
    /* Only block alignment and size allowed */
    if ((*offset & SBULLR_SECTOR_MASK) || (count & SBULLR_SECTOR_MASK))
        return -EINVAL;
    if ((unsigned long) buf & SBULLR_SECTOR_MASK)
        return -EINVAL;

    /* Allocate an I/O vector */
    result = alloc_kiovec(1, &iobuf);
    if (result)
        return result;

    /* Map the user I/O buffer and do the I/O. */
    result = map_user_kiobuf(rw, iobuf, (unsigned long) buf, count);
    if (result) {
        free_kiovec(1, &iobuf);
        return result;
    }
    spin_lock(&dev->lock);
    result = sbullr_rw_iovec(dev, iobuf, rw, 
                    *offset >> SBULLR_SECTOR_SHIFT,
                    count >> SBULLR_SECTOR_SHIFT);
    spin_unlock(&dev->lock);

    /* Clean up and return. */
    unmap_kiobuf(iobuf);
    free_kiovec(1, &iobuf);
    if (result > 0)
        *offset += result << SBULLR_SECTOR_SHIFT;
    return result << SBULLR_SECTOR_SHIFT;
}

After doing a couple of sanity checks, the code creates a kiovec (containing a single kiobuf) with alloc_kiovec. It then uses that kiovec to map in the user buffer by calling map_user_kiobuf:

int map_user_kiobuf(int rw, struct kiobuf *iobuf, 
                    unsigned long address, size_t len);

The result of this call, if all goes well, is that the buffer at the given (user virtual) address with length len is mapped into the given iobuf. This operation can sleep, since it is possible that part of the user buffer will need to be faulted into memory.

A kiobuf that has been mapped in this manner must eventually be unmapped, of course, to keep the reference counts on the pages straight. This unmapping is accomplished, as can be seen in the code, by passing the kiobuf to unmap_kiobuf.

So far, we have seen how to prepare a kiobuf for I/O, but not how to actually perform that I/O. The last step involves going through each page in the kiobuf and doing the required transfers; in sbullr, this task is handled by sbullr_rw_iovec. Essentially, this function passes through each page, breaks it up into sector-sized pieces, and passes them to sbull_transfer via a fake request structure:

static int sbullr_rw_iovec(Sbull_Dev *dev, struct kiobuf *iobuf, int rw,
                int sector, int nsectors)
{
    struct request fakereq;
    struct page *page;
    int offset = iobuf->offset, ndone = 0, pageno, result;

    /* Perform I/O on each sector */
    fakereq.sector = sector;
    fakereq.current_nr_sectors = 1;
    fakereq.cmd = rw;
    
    for (pageno = 0; pageno < iobuf->nr_pages; pageno++) {
        page = iobuf->maplist[pageno];
        while (ndone < nsectors) {
            /* Fake up a request structure for the operation */
            fakereq.buffer = (void *) (kmap(page) + offset);
            result = sbull_transfer(dev, &fakereq);
	    kunmap(page);
            if (result == 0)
                return ndone;
            /* Move on to the next one */
            ndone++;
            fakereq.sector++;
            offset += SBULLR_SECTOR;
            if (offset >= PAGE_SIZE) {
                offset = 0;
                break;
            }
        }
    }
    return ndone;
}

Here, the nr_pages member of the kiobuf structure tells us how many pages need to be transferred, and the maplist array gives us access to each page. Thus it is just a matter of stepping through them all. Note, however, that kmap is used to get a kernel virtual address for each page; in this way, the function will work even if the user buffer is in high memory.

Some quick tests copying data show that a copy to or from an sbullr device takes roughly two-thirds the system time as the same copy to the block sbull device. The savings is gained by avoiding the extra copy through the buffer cache. Note that if the same data is read several times over, that savings will evaporate—especially for a real hardware device. Raw device access is often not the best approach, but for some applications it can be a major improvement.

Although kiobufs remain controversial in the kernel development community, there is interest in using them in a wider range of contexts. There is, for example, a patch that implements Unix pipes with kiobufs—data is copied directly from one process’s address space to the other with no buffering in the kernel at all. A patch also exists that makes it easy to use a kiobuf to map kernel virtual memory into a process’s address space, thus eliminating the need for a nopage implementation as shown earlier.

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.