Blocking I/O

One problem that might arise with read is what to do when there’s no data yet, but we’re not at end-of-file.

The default answer is “go to sleep waiting for data.” This section shows how a process is put to sleep, how it is awakened, and how an application can ask if there is data without just blindly issuing a read call and blocking. We then apply the same concepts to write.

As usual, before we show actual code, we’ll explain a few concepts.

Going to Sleep and Awakening

Whenever a process must wait for an event (such as the arrival of data or the termination of a process), it should go to sleep. Sleeping causes the process to suspend execution, freeing the processor for other uses. At some future time, when the event being waited for occurs, the process will be woken up and will continue with its job. This section discusses the 2.4 machinery for putting a process to sleep and waking it up. Earlier versions are discussed in Section 5.7 later in this chapter.

There are several ways of handling sleeping and waking up in Linux, each suited to different needs. All, however, work with the same basic data type, a wait queue (wait_queue_head_t). A wait queue is exactly that—a queue of processes that are waiting for an event. Wait queues are declared and initialized as follows:

 wait_queue_head_t my_queue;
 init_waitqueue_head (&my_queue);

When a wait queue is declared statically (i.e., not as an automatic variable of a procedure or as part of a dynamically-allocated data structure), it is also possible to initialize the queue at compile time:

 DECLARE_WAIT_QUEUE_HEAD (my_queue);

It is a common mistake to neglect to initialize a wait queue (especially since earlier versions of the kernel did not require this initialization); if you forget, the results will usually not be what you intended.

Once the wait queue is declared and initialized, a process may use it to go to sleep. Sleeping is accomplished by calling one of the variants of sleep_on, depending on how deep a sleep is called for.

sleep_on(wait_queue_head_t *queue);

Puts the process to sleep on this queue. sleep_on has the disadvantage of not being interruptible; as a result, the process can end up being stuck (and un-killable) if the event it’s waiting for never happens.

interruptible_sleep_on(wait_queue_head_t *queue);

The interruptible variant works just like sleep_on, except that the sleep can be interrupted by a signal. This is the form that device driver writers have been using for a long time, before wait_event_interruptible (described later) appeared.

sleep_on_timeout(wait_queue_head_t *queue, long timeout);
interruptible_sleep_on_timeout(wait_queue_head_t *queue, long timeout);

These two functions behave like the previous two, with the exception that the sleep will last no longer than the given timeout period. The timeout is specified in “jiffies,” which are covered in Chapter 6.

void wait_event(wait_queue_head_t queue, int condition);
int wait_event_interruptible(wait_queue_head_t queue, int condition);

These macros are the preferred way to sleep on an event. They combine waiting for an event and testing for its arrival in a way that avoids race conditions. They will sleep until the condition, which may be any boolean C expression, evaluates true. The macros expand to a while loop, and the condition is reevaluated over time—the behavior is different from that of a function call or a simple macro, where the arguments are evaluated only at call time. The latter macro is implemented as an expression that evaluates to 0 in case of success and -ERESTARTSYS if the loop is interrupted by a signal.

It is worth repeating that driver writers should almost always use the interruptible instances of these functions/macros. The noninterruptible version exists for the small number of situations in which signals cannot be dealt with, for example, when waiting for a data page to be retrieved from swap space. Most drivers do not present such special situations.
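
As a minimal sketch of how these macros are used (the queue data_wq and the flag data_ready are names we have made up for the example; the code that sets the flag appears after the wake_up functions below):

static DECLARE_WAIT_QUEUE_HEAD(data_wq);  /* hypothetical queue */
static int data_ready;                    /* hypothetical condition flag */

ssize_t sample_read(struct file *filp, char *buf, size_t count, loff_t *pos)
{
  /* sleep until data_ready becomes nonzero; a signal interrupts the wait */
  if (wait_event_interruptible(data_wq, data_ready))
    return -ERESTARTSYS; /* let the VFS layer handle the signal */
  /* ... data is available here; copy it to user space and return ... */
  return count;
}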

Of course, sleeping is only half of the problem; something, somewhere will have to wake the process up again. When a device driver sleeps directly, there is usually code in another part of the driver that performs the wakeup, once it knows that the event has occurred. Typically a driver will wake up sleepers in its interrupt handler once new data has arrived. Other scenarios are possible, however.

Just as there is more than one way to sleep, so there is also more than one way to wake up. The high-level functions provided by the kernel to wake up processes are as follows:

wake_up(wait_queue_head_t *queue);

This function will wake up all processes that are waiting on this event queue.

wake_up_interruptible(wait_queue_head_t *queue);

wake_up_interruptible wakes up only the processes that are in interruptible sleeps. Any process that sleeps on the wait queue using a noninterruptible function or macro will continue to sleep.

wake_up_sync(wait_queue_head_t *queue);
wake_up_interruptible_sync(wait_queue_head_t *queue);

Normally, a wake_up call can cause an immediate reschedule, meaning that other processes might run before wake_up returns. The “synchronous” variants instead make any awakened processes runnable but do not reschedule the CPU. They are used when the current process is known to be about to go to sleep anyway, since the sleep itself will force a reschedule soon enough. Note that awakened processes could run immediately on a different processor, so these functions should not be expected to provide mutual exclusion.

If your driver is using interruptible_sleep_on, there is little difference between wake_up and wake_up_interruptible. Calling the latter is a common convention, however, to preserve consistency between the two calls.
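
To complete the sketch begun above, the producer side of the driver (often an interrupt handler) makes the condition true before issuing the wakeup; again, the names are ours:

/* producer side of the earlier sketch: set the condition first */
data_ready = 1;
wake_up_interruptible(&data_wq); /* then wake any interruptible sleepers */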

As an example of wait queue usage, imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device. The following code does just that:

DECLARE_WAIT_QUEUE_HEAD(wq);

ssize_t sleepy_read (struct file *filp, char *buf, size_t count, 
   loff_t *pos)
{
  printk(KERN_DEBUG "process %i (%s) going to sleep\n",
      current->pid, current->comm);
  interruptible_sleep_on(&wq);
  printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
  return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char *buf, size_t count,
		loff_t *pos)
{
  printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
     current->pid, current->comm);
  wake_up_interruptible(&wq);
  return count; /* succeed, to avoid retrial */
}

The code for this device is available as sleepy in the example programs and can be tested using cat and input/output redirection, as usual.

An important thing to remember with wait queues is that being woken up does not guarantee that the event you were waiting for has occurred; a process can be woken for other reasons, mainly because it received a signal. Any code that sleeps should do so in a loop that tests the condition after returning from the sleep, as discussed in Section 5.2.5 later in this chapter.
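
The scullpipe driver presented later in this chapter does exactly that; as a preview sketch (using the device fields introduced there, with the semaphore handling of the real code omitted):

while (dev->rp == dev->wp) { /* still nothing to read */
  if (wait_event_interruptible(dev->inq, dev->rp != dev->wp))
    return -ERESTARTSYS; /* awakened by a signal instead */
  /* awakened normally: loop and retest, since another process
     may have consumed the data first */
}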

A Deeper Look at Wait Queues

The previous discussion is all that most driver writers will need to know to get their job done. Some, however, will want to dig deeper. This section attempts to get the curious started; everybody else can skip to the next section without missing much that is important.

The wait_queue_head_t type is a fairly simple structure, defined in <linux/wait.h>. It contains only a lock variable and a linked list of sleeping processes. The individual data items in the list are of type wait_queue_t, and the list is the generic list defined in <linux/list.h> and described in Section 10.5 in Chapter 10. Normally the wait_queue_t structures are allocated on the stack by functions like interruptible_sleep_on; the structures end up in the stack because they are simply declared as automatic variables in the relevant functions. In general, the programmer need not deal with them.
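
In 2.4, with the optional debugging fields omitted, the definition is essentially the following:

struct __wait_queue_head {
  wq_lock_t lock;               /* protects the list */
  struct list_head task_list;   /* the sleeping processes */
};
typedef struct __wait_queue_head wait_queue_head_t;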

Some advanced applications, however, can require dealing with wait_queue_t variables directly. For these, it’s worth a quick look at what actually goes on inside a function like interruptible_sleep_on. The following is a simplified version of the implementation of interruptible_sleep_on to put a process to sleep:

 void simplified_sleep_on(wait_queue_head_t *queue)
 {
   wait_queue_t wait;

   init_waitqueue_entry(&wait, current);
   current->state = TASK_INTERRUPTIBLE;

   add_wait_queue(queue, &wait);
   schedule();
   remove_wait_queue (queue, &wait);
 }

The code here creates a new wait_queue_t variable (wait, which gets allocated on the stack) and initializes it. The state of the task is set to TASK_INTERRUPTIBLE, meaning that it is in an interruptible sleep. The wait queue entry is then added to the queue (the wait_queue_head_t * argument). Then schedule is called, which relinquishes the processor to somebody else. schedule returns only when somebody else has woken up the process and set its state to TASK_RUNNING. At that point, the wait queue entry is removed from the queue, and the sleep is done.

Figure 5-1 shows the internals of the data structures involved in wait queues and how they are used by processes.

Figure 5-1. Wait queues in Linux 2.4

A quick look through the kernel shows that a great many procedures do their sleeping “manually” with code that looks like the previous example. Most of those implementations date back to kernels prior to 2.2.3, before wait_event was introduced. As suggested, wait_event is now the preferred way to sleep on an event, because interruptible_sleep_on is subject to unpleasant race conditions. A full description of how that can happen will have to wait until Section 9.8.4 in Chapter 9; the short version is that things can change in the time between when your driver decides to sleep and when it actually gets around to calling interruptible_sleep_on.

One other reason for calling the scheduler explicitly, however, is to do exclusive waits. There can be situations in which several processes are waiting on an event; when wake_up is called, all of those processes will try to execute. Suppose that the event signifies the arrival of an atomic piece of data. Only one process will be able to read that data; all the rest will simply wake up, see that no data is available, and go back to sleep.

This situation is sometimes referred to as the “thundering herd problem.” In high-performance situations, thundering herds can waste resources in a big way. The creation of a large number of runnable processes that can do no useful work generates a large number of context switches and processor overhead, all for nothing. Things would work better if those processes simply remained asleep.

For this reason, the 2.3 development series added the concept of an exclusive sleep. If processes sleep in an exclusive mode, they are telling the kernel to wake only one of them. The result is improved performance in some situations.

The code to perform an exclusive sleep looks very similar to that for a regular sleep:

 void simplified_sleep_exclusive(wait_queue_head_t *queue)
 {
   wait_queue_t wait;

   init_waitqueue_entry(&wait, current);
   current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;

   add_wait_queue_exclusive(queue, &wait);
   schedule();
   remove_wait_queue (queue, &wait);
 }

Adding the TASK_EXCLUSIVE flag to the task state indicates that the process is in an exclusive wait. The call to add_wait_queue_exclusive is also necessary, however. That function adds the process to the end of the wait queue, behind all others. The purpose is to leave any processes in nonexclusive sleeps at the beginning, where they will always be awakened. As soon as wake_up hits the first exclusive sleeper, it knows it can stop.

The attentive reader may have noticed another reason to manipulate wait queues and the scheduler explicitly. Whereas functions like sleep_on will block a process on exactly one wait queue, working with the queues directly allows sleeping on multiple queues simultaneously. Most drivers need not sleep on more than one queue; if yours is the exception, you will need to use code like what we’ve shown.
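
A sketch of sleeping on two queues at once, under the same simplifications as simplified_sleep_on, could look like this; a wake_up on either queue makes the process runnable again:

void simplified_sleep_on_two(wait_queue_head_t *q1, wait_queue_head_t *q2)
{
  wait_queue_t wait1, wait2;

  init_waitqueue_entry(&wait1, current);
  init_waitqueue_entry(&wait2, current);
  current->state = TASK_INTERRUPTIBLE;

  add_wait_queue(q1, &wait1);
  add_wait_queue(q2, &wait2);
  schedule(); /* returns after a wake_up on either queue */
  remove_wait_queue(q1, &wait1);
  remove_wait_queue(q2, &wait2);
}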

Those wanting to dig even deeper into the wait queue code can look at <linux/sched.h> and kernel/sched.c.

Writing Reentrant Code

When a process is put to sleep, the driver is still alive and can be called by another process. Let’s consider the console driver as an example. While an application is waiting for keyboard input on tty1, the user switches to tty2 and spawns a new shell. Now both shells are waiting for keyboard input within the console driver, although they sleep on different wait queues: one on the queue associated with tty1 and the other on the queue associated with tty2. Each process is blocked within the interruptible_sleep_on function, but the driver can still receive and answer requests from other ttys.

Of course, on SMP systems, multiple simultaneous calls to your driver can happen even when you do not sleep.

Such situations can be handled painlessly by writing reentrant code. Reentrant code is code that doesn’t keep status information in global variables and thus is able to manage interwoven invocations without mixing anything up. If all the status information is process specific, no interference will ever happen.

If status information is needed, it can either be kept in local variables within the driver function (each process has a different stack page in kernel space where local variables are stored), or it can reside in private_data within the filp accessing the file. Using local variables is preferred because sometimes the same filp can be shared between two processes (usually parent and child).

If you need to save large amounts of status data, you can keep the pointer in a local variable and use kmalloc to retrieve the actual storage space. In this case you must remember to kfree the data, because there’s no equivalent to “everything is released at process termination” when you’re working in kernel space. Using local variables for large items is not good practice, because the data may not fit the single page of memory allocated for stack space.
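
A sketch of that technique follows; the function name and the buffer size are invented for the example (kmalloc and kfree are declared in <linux/slab.h>):

ssize_t sample_read_large(struct file *filp, char *buf, size_t count,
        loff_t *pos)
{
  char *scratch = kmalloc(4096, GFP_KERNEL); /* too big for the stack */

  if (!scratch)
    return -ENOMEM;
  /* ... possibly blocking work that uses scratch ... */
  kfree(scratch); /* nothing frees it for us at process termination */
  return 0;
}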

Any function that meets either of two conditions must be reentrant: it calls schedule, possibly indirectly through sleep_on or wake_up, or it copies data to or from user space, because access to user space can page-fault, putting the process to sleep while the kernel deals with the missing page. Every function that calls such a function must be reentrant as well. For example, if sample_read calls sample_getdata, which in turn can block, then sample_read must be reentrant along with sample_getdata, because nothing prevents another process from calling it while it is already executing on behalf of a process that went to sleep.

Finally, of course, code that sleeps should always keep in mind that the state of the system can change in almost any way while a process is sleeping. The driver should be careful to check any aspect of its environment that might have changed while it wasn’t paying attention.

Blocking and Nonblocking Operations

Another point we need to touch on before we look at the implementation of full-featured read and write methods is the role of the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h>.

The flag gets its name from “open-nonblock,” because it can be specified at open time (and originally could only be specified there). If you browse the source code, you’ll find some references to an O_NDELAY flag; this is an alternate name for O_NONBLOCK, accepted for compatibility with System V code. The flag is cleared by default, because the normal behavior of a process waiting for data is just to sleep. In the case of a blocking operation, which is the default, the following behavior should be implemented in order to adhere to the standard semantics:

  • If a process calls read but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.

  • If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.

Both these statements assume that there are both input and output buffers; in practice, almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading. In contrast, data can’t be lost on write, because if the system call doesn’t accept data bytes, they remain in the user-space buffer. Even so, the output buffer is almost always useful for squeezing more performance out of the hardware.

The performance gain of implementing an output buffer in the driver results from the reduced number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that’s one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks, and the loop continues. If the output buffer is big enough, the write call succeeds on the first attempt—the buffered data will be pushed out to the device later, at interrupt time—without control needing to go back to user space for a second or third write call. The choice of a suitable size for the output buffer is clearly device specific.

We didn’t use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer was used, because data is simply copied to the memory area associated with the device. Essentially, the device is a buffer, so the implementation of additional buffers would be superfluous. We’ll see the use of buffers in Chapter 9, in Section 9.7.

The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN if a process calls read when no data is available or if it calls write when there’s no space in the buffer.

As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions while dealing with nonblocking files, because they can easily mistake a nonblocking return for EOF. They always have to check errno.
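
In code, a user-space reader has to look at errno to tell the two cases apart; a minimal sketch:

#include <errno.h>
#include <unistd.h>

/* returns 1 at true end-of-file, 0 if data arrived or none is ready yet */
int read_or_eof(int fd, char *buf, size_t len)
{
  ssize_t n = read(fd, buf, len);

  if (n < 0 && errno == EAGAIN)
    return 0;    /* no data yet: neither an error nor end-of-file */
  return n == 0; /* only a zero return really means end-of-file */
}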

Naturally, O_NONBLOCK is meaningful in the open method also. This happens when the call can actually block for a long time; for example, when opening a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to support O_NONBLOCK in your open method by returning immediately with -EAGAIN (“try it again”) if the flag is set, after initiating device initialization. The driver may also implement a blocking open to support access policies in a way similar to file locks. We’ll see one such implementation in Section 5.6.4 later in this chapter.
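
A sketch of such an open method follows; the flag device_ready, the queue init_wq, and the helper start_device_init are all invented for the example:

int sample_open(struct inode *inode, struct file *filp)
{
  if (!device_ready) {      /* hypothetical "initialization done" flag */
    start_device_init();    /* hypothetical: kick off the slow setup */
    if (filp->f_flags & O_NONBLOCK)
      return -EAGAIN;       /* the caller asked not to wait */
    if (wait_event_interruptible(init_wq, device_ready))
      return -ERESTARTSYS;
  }
  return 0; /* success */
}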

Some drivers may also implement special semantics for O_NONBLOCK; for example, an open of a tape device usually blocks until a tape has been inserted. If the tape drive is opened with O_NONBLOCK, the open succeeds immediately regardless of whether the media is present or not.

Only the read, write, and open file operations are affected by the nonblocking flag.

A Sample Implementation: scullpipe

The /dev/scullpipe devices (there are four of them by default) are part of the scull module and are used to show how blocking I/O is implemented.

Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens waiting processes as part of handling the interrupt. The scull driver works differently, so that it can be run without requiring any particular hardware or an interrupt handler. We chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake sleeping writer processes. The resulting implementation is similar to that of a FIFO (or named pipe) filesystem node, whence the name.

The device driver uses a device structure that embeds two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or runtime).

typedef struct Scull_Pipe {
  wait_queue_head_t inq, outq;  /* read and write queues */
  char *buffer, *end;       /* begin of buf, end of buf */
  int buffersize;         /* used in pointer arithmetic */
  char *rp, *wp;         /* where to read, where to write */
  int nreaders, nwriters;     /* number of openings for r/w */
  struct fasync_struct *async_queue; /* asynchronous readers */
  struct semaphore sem;      /* mutual exclusion semaphore */
  devfs_handle_t handle;     /* only used if devfs is there */
} Scull_Pipe;
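
Before the device can be used, the queues and the semaphore must be initialized and the buffer allocated. The following setup sketch shows the idea; the function name and explicit size parameter are ours, and error handling is omitted:

void scull_p_setup(Scull_Pipe *dev, int size)
{
  init_waitqueue_head(&dev->inq);
  init_waitqueue_head(&dev->outq);
  sema_init(&dev->sem, 1);         /* an unlocked mutex */
  dev->buffer = kmalloc(size, GFP_KERNEL);
  dev->end = dev->buffer + size;
  dev->buffersize = size;
  dev->rp = dev->wp = dev->buffer; /* rp == wp means the pipe is empty */
  dev->nreaders = dev->nwriters = 0;
}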

The read implementation manages both blocking and nonblocking input and looks like this (the puzzling first line of the function is explained later, in Section 5.5):

ssize_t scull_p_read (struct file *filp, char *buf, size_t count,
        loff_t *f_pos)
{
  Scull_Pipe *dev = filp->private_data;

  if (f_pos != &filp->f_pos) return -ESPIPE;

  if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;
  while (dev->rp == dev->wp) { /* nothing to read */
    up(&dev->sem); /* release the lock */
    if (filp->f_flags & O_NONBLOCK)
      return -EAGAIN;
    PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
    if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
      return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
    /* otherwise loop, but first reacquire the lock */
    if (down_interruptible(&dev->sem))
      return -ERESTARTSYS;
  }
  /* ok, data is there, return something */
  if (dev->wp > dev->rp)
    count = min(count, dev->wp - dev->rp);
  else /* the write pointer has wrapped, return data up to dev->end */
    count = min(count, dev->end - dev->rp);
  if (copy_to_user(buf, dev->rp, count)) {
    up (&dev->sem);
    return -EFAULT;
  }
  dev->rp += count;
  if (dev->rp == dev->end)
    dev->rp = dev->buffer; /* wrapped */
  up (&dev->sem);

  /* finally, awaken any writers and return */
  wake_up_interruptible(&dev->outq);
  PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
  return count;
}

As you can see, we left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.

Note also, once again, the use of semaphores to protect critical regions of the code. The scull code has to be careful to avoid going to sleep when it holds a semaphore—otherwise, writers would never be able to add data, and the whole thing would deadlock. This code uses wait_event_interruptible to wait for data if need be; it has to check for available data again after the wait, though. Somebody else could grab the data between when we wake up and when we get the semaphore back.

It’s worth repeating that a process can go to sleep both when it calls schedule, either directly or indirectly, and when it copies data to or from user space. In the latter case the process may sleep if the user array is not currently present in main memory. If scull sleeps while copying data between kernel and user space, it will sleep with the device semaphore held. Holding the semaphore in this case is justified since it will not deadlock the system, and since it is important that the device memory array not change while the driver sleeps.

The if statement that wraps the call to wait_event_interruptible takes care of signal handling. This statement ensures the proper and expected reaction to signals, which could have been responsible for waking up the process (since we were in an interruptible sleep). If a signal has arrived and it has not been blocked by the process, the proper behavior is to let upper layers of the kernel handle the event. To this end, the driver returns -ERESTARTSYS to the caller; this value is used internally by the virtual filesystem (VFS) layer, which either restarts the system call or returns -EINTR to user space. We’ll use the same statement to deal with signal handling for every read and write implementation. Because signal_pending was introduced only in version 2.1.57 of the kernel, sysdep.h defines it for earlier kernels to preserve portability of source code.

The implementation for write is quite similar to that for read (and, again, its first line will be explained later). Its only “peculiar” feature is that it never completely fills the buffer, always leaving a hole of at least one byte. Thus, when the buffer is empty, wp and rp are equal; when there is data there, they are always different.

static inline int spacefree(Scull_Pipe *dev)
{
  if (dev->rp == dev->wp)
    return dev->buffersize - 1;
  return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}
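
To see the arithmetic at work, consider a 32-byte buffer with rp at offset 5 and wp at offset 30: spacefree returns ((5 + 32 - 30) % 32) - 1, or 6 bytes. Of the seven free cells between wp and the wrapped-around rp, one is reserved as the hole that distinguishes a full buffer from an empty one.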

ssize_t scull_p_write(struct file *filp, const char *buf, size_t count,
        loff_t *f_pos)
{
  Scull_Pipe *dev = filp->private_data;
  
  if (f_pos != &filp->f_pos) return -ESPIPE;

  if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;
  
  /* Make sure there's space to write */
  while (spacefree(dev) == 0) { /* full */
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK)
      return -EAGAIN;
    PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
    if (wait_event_interruptible(dev->outq, spacefree(dev) > 0))
      return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
    if (down_interruptible(&dev->sem))
      return -ERESTARTSYS;
  }
  /* ok, space is there, accept something */
  count = min(count, spacefree(dev));
  if (dev->wp >= dev->rp)
    count = min(count, dev->end - dev->wp); /* up to end-of-buffer */
  else /* the write pointer has wrapped, fill up to rp-1 */
    count = min(count, dev->rp - dev->wp - 1);
  PDEBUG("Going to accept %li bytes to %p from %p\n",
      (long)count, dev->wp, buf);
  if (copy_from_user(dev->wp, buf, count)) {
    up (&dev->sem);
    return -EFAULT;
  }
  dev->wp += count;
  if (dev->wp == dev->end)
    dev->wp = dev->buffer; /* wrapped */
  up(&dev->sem);

  /* finally, awaken any reader */
  wake_up_interruptible(&dev->inq); /* blocked in read() and select() */

  /* and signal asynchronous readers, explained later in Chapter 5 */
  if (dev->async_queue)
    kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
  PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
  return count;
}

The device, as we conceived it, doesn’t implement blocking open and is simpler than a real FIFO. If you want to look at the real thing, you can find it in fs/pipe.c, in the kernel sources.

To test the blocking operation of the scullpipe device, you can run some programs on it, using input/output redirection as usual. Testing nonblocking activity is trickier, because the conventional programs don’t perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retrials. The delay time is passed on the command line and is one second by default.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char buffer[4096];

int main(int argc, char **argv)
{
  int delay=1, n, m=0;

  if (argc>1) delay=atoi(argv[1]);
  fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* stdin */
  fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* stdout */

  while (1) {
    n=read(0, buffer, 4096);
    if (n>=0)
      m=write(1, buffer, n);
    if ((n<0 || m<0) && (errno != EAGAIN))
      break;
    sleep(delay);
  }
  perror(n<0 ? "stdin" : "stdout");
  exit(1);
}
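
As a quick test, and assuming the device nodes are named /dev/scullpipe0 through /dev/scullpipe3, running ./nbtest 2 < /dev/scullpipe0 polls the device every two seconds and copies to the terminal whatever a writer in another shell (for example, cat > /dev/scullpipe0) pushes into the device.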
