As of version 2.3.12, the Linux kernel supports an I/O abstraction
called the kernel I/O buffer, or kiobuf. The kiobuf interface is intended to hide
much of the complexity of the virtual memory system from device
drivers (and other parts of the system that do I/O). Many features
are planned for kiobufs, but their primary use in the 2.4 kernel is to
facilitate the mapping of user-space buffers into the kernel.
Any code that works with kiobufs must include
<linux/iobuf.h>. This file defines
struct kiobuf, which is the heart of the kiobuf
interface. This structure describes an array of pages that make up an
I/O operation; its fields include nr_pages (the
number of pages in the kiobuf), length (the number
of bytes of data in the buffer), offset (the
offset to the first valid byte in the buffer), and
maplist (an array of page
structure pointers, one for each page of data in the kiobuf).
The key to the kiobuf interface is the maplist
array. Functions that operate on pages stored in a kiobuf deal
directly with the page
structures—all of the
virtual memory system overhead has been moved out of the way. This
implementation allows drivers to function independently of the
complexities of memory management, and in general simplifies life
greatly.
Prior to use, a kiobuf must be initialized. It is rare to initialize a single kiobuf in isolation, but, if need be, this initialization can be performed with kiobuf_init:
void kiobuf_init(struct kiobuf *iobuf);
Usually kiobufs are allocated in groups as part of a kernel I/O vector, or kiovec. A kiovec can be allocated and initialized in one step with a call to alloc_kiovec:
int alloc_kiovec(int nr, struct kiobuf **iovec);
The return value is 0 or an error code, as usual. When your code has finished with the kiovec structure, it should, of course, return it to the system:
void free_kiovec(int nr, struct kiobuf **iovec);
The kernel provides a pair of functions for locking and unlocking the pages mapped in a kiovec:
int lock_kiovec(int nr, struct kiobuf *iovec[], int wait);
int unlock_kiovec(int nr, struct kiobuf *iovec[]);
Locking a kiovec in this manner is unnecessary, however, for most applications of kiobufs seen in device drivers.
Unix systems have long provided a “raw” interface to some devices—block devices in particular—which performs I/O directly from a user-space buffer and avoids copying data through the kernel. In some cases much improved performance can be had in this manner, especially if the data being transferred will not be used again in the near future. For example, disk backups typically read a great deal of data from the disk exactly once, then forget about it. Running the backup via a raw interface will avoid filling the system buffer cache with useless data.
The Linux kernel has traditionally not provided a raw interface, for a number of reasons. As the system gains in popularity, however, more applications that expect to be able to do raw I/O (such as large database management systems) are being ported. So the 2.3 development series finally added raw I/O; the driving force behind the kiobuf interface was the need to provide this capability.
Raw I/O is not always the great performance boost that some people think it should be, and driver writers should not rush out to add the capability just because they can. The overhead of setting up a raw transfer can be significant, and the advantages of buffering data in the kernel are lost. For example, note that raw I/O operations almost always must be synchronous—the write system call cannot return until the operation is complete. Linux currently lacks the mechanisms that user programs need to be able to safely perform asynchronous raw I/O on a user buffer.
In this section, we add a raw I/O capability to the
sbull sample block driver. When kiobufs
are available, sbull actually registers two
devices. The block sbull device was
examined in detail in Chapter 12. What we didn't see in
that chapter was a second, char device (called
sbullr), which provides raw access to the
RAM-disk device. Thus, /dev/sbull0 and
/dev/sbullr0 access the same memory; the former
uses the traditional, buffered mode, while the latter provides raw
access via the kiobuf mechanism.
It is worth noting that in Linux systems, there is no need for block
drivers to provide this sort of interface. The raw device, in
drivers/char/raw.c
, provides this capability in
an elegant, general way for all block devices. The block drivers need
not even know they are doing raw I/O. The raw I/O code in
sbull is essentially a simplification of
the raw device code for demonstration purposes.
Raw I/O to a block device must always be sector aligned, and its length must be a multiple of the sector size. Other kinds of devices, such as tape drives, may not have the same constraints. sbullr behaves like a block device and enforces the alignment and length requirements. To that end, it defines a few symbols:
#define SBULLR_SECTOR 512    /* insist on this */
#define SBULLR_SECTOR_MASK   (SBULLR_SECTOR - 1)
#define SBULLR_SECTOR_SHIFT  9
The sbullr raw device will be registered
only if the hard-sector size is equal to
SBULLR_SECTOR
. There is no real reason why a
larger hard-sector size could not be supported, but it would
complicate the sample code unnecessarily.
The sbullr implementation adds little to the existing sbull code. In particular, the open and close methods from sbull are used without modification. Since sbullr is a char device, however, it needs read and write methods. Both are defined to use a single transfer function as follows:
ssize_t sbullr_read(struct file *filp, char *buf, size_t size, loff_t *off)
{
    Sbull_Dev *dev = sbull_devices +
                MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, buf, size, off, READ);
}

ssize_t sbullr_write(struct file *filp, const char *buf, size_t size,
                loff_t *off)
{
    Sbull_Dev *dev = sbull_devices +
                MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, (char *) buf, size, off, WRITE);
}
The sbullr_transfer function handles all of the setup and teardown work, while passing off the actual transfer of data to yet another function. It is written as follows:
static int sbullr_transfer (Sbull_Dev *dev, char *buf, size_t count,
                loff_t *offset, int rw)
{
    struct kiobuf *iobuf;
    int result;

    /* Only block alignment and size allowed */
    if ((*offset & SBULLR_SECTOR_MASK) || (count & SBULLR_SECTOR_MASK))
        return -EINVAL;
    if ((unsigned long) buf & SBULLR_SECTOR_MASK)
        return -EINVAL;

    /* Allocate an I/O vector */
    result = alloc_kiovec(1, &iobuf);
    if (result)
        return result;

    /* Map the user I/O buffer and do the I/O. */
    result = map_user_kiobuf(rw, iobuf, (unsigned long) buf, count);
    if (result) {
        free_kiovec(1, &iobuf);
        return result;
    }
    spin_lock(&dev->lock);
    result = sbullr_rw_iovec(dev, iobuf, rw,
                    *offset >> SBULLR_SECTOR_SHIFT,
                    count >> SBULLR_SECTOR_SHIFT);
    spin_unlock(&dev->lock);

    /* Clean up and return. */
    unmap_kiobuf(iobuf);
    free_kiovec(1, &iobuf);
    if (result > 0)
        *offset += result << SBULLR_SECTOR_SHIFT;
    return result << SBULLR_SECTOR_SHIFT;
}
After doing a couple of sanity checks, the code creates a kiovec (containing a single kiobuf) with alloc_kiovec. It then uses that kiovec to map in the user buffer by calling map_user_kiobuf:
int map_user_kiobuf(int rw, struct kiobuf *iobuf, unsigned long address, size_t len);
The result of this call, if all goes well, is that the buffer at the
given (user virtual) address with length
len is mapped into the given
iobuf. This operation can sleep, since it is
possible that part of the user buffer will need to be faulted into
memory.
A kiobuf that has been mapped in this manner must eventually be unmapped, of course, to keep the reference counts on the pages straight. This unmapping is accomplished, as can be seen in the code, by passing the kiobuf to unmap_kiobuf.
So far, we have seen how to prepare a kiobuf for I/O, but not how to
actually perform that I/O. The last step involves going through each
page in the kiobuf and doing the required transfers; in
sbullr, this task is handled by
sbullr_rw_iovec. Essentially, this function
passes through each page, breaks it up into sector-sized pieces, and
passes them to sbull_transfer via a fake
request
structure:
static int sbullr_rw_iovec(Sbull_Dev *dev, struct kiobuf *iobuf, int rw,
                int sector, int nsectors)
{
    struct request fakereq;
    struct page *page;
    int offset = iobuf->offset, ndone = 0, pageno, result;

    /* Perform I/O on each sector */
    fakereq.sector = sector;
    fakereq.current_nr_sectors = 1;
    fakereq.cmd = rw;
    for (pageno = 0; pageno < iobuf->nr_pages; pageno++) {
        page = iobuf->maplist[pageno];
        while (ndone < nsectors) {
            /* Fake up a request structure for the operation */
            fakereq.buffer = (void *) (kmap(page) + offset);
            result = sbull_transfer(dev, &fakereq);
            kunmap(page);
            if (result == 0)
                return ndone;
            /* Move on to the next one */
            ndone++;
            fakereq.sector++;
            offset += SBULLR_SECTOR;
            if (offset >= PAGE_SIZE) {
                offset = 0;
                break;
            }
        }
    }
    return ndone;
}
Here, the nr_pages member of the
kiobuf structure tells us how many pages need to be
transferred, and the maplist array gives us access
to each page. Thus it is just a matter of stepping through them all.
Note, however, that kmap is used to get a kernel
virtual address for each page; in this way, the function will work
even if the user buffer is in high memory.
Some quick tests copying data show that a copy to or from an sbullr device takes roughly two-thirds of the system time required by the same copy to the block sbull device. The savings comes from avoiding the extra copy through the buffer cache. Note that if the same data is read several times over, that savings will evaporate—especially for a real hardware device. Raw device access is often not the best approach, but for some applications it can be a major improvement.
Although kiobufs remain controversial in the kernel development community, there is interest in using them in a wider range of contexts. There is, for example, a patch that implements Unix pipes with kiobufs—data is copied directly from one process’s address space to the other with no buffering in the kernel at all. A patch also exists that makes it easy to use a kiobuf to map kernel virtual memory into a process’s address space, thus eliminating the need for a nopage implementation as shown earlier.