Race Conditions

We have already seen race conditions come up a number of times in the previous chapters. Whereas race conditions can happen at any time on SMP systems, uniprocessor systems, to this point, have had to worry about them rather less.[38] Interrupts, however, can bring with them a whole new set of race conditions, even on uniprocessor systems. Since an interrupt can happen at any time, it can cause the interrupt handler to be executed in the middle of an arbitrary piece of driver code. Thus, any device driver that is working with interrupts—and that is most of them—must be very concerned with race conditions. For this reason, we look more closely at race conditions and their prevention in this chapter.

Dealing with race conditions is one of the trickiest aspects of programming, because the related bugs are subtle and very difficult to reproduce, and it’s hard to tell when there is a race condition between interrupt code and the driver methods. The programmer must take great care to avoid corruption of data or metadata.

Different techniques can be employed to prevent data corruption, and we will introduce the most common ones. We won’t show complete code because the best code for each situation depends on the operating mode of the device being driven, and on the programmer’s taste. All of the drivers in this book, however, protect themselves against race conditions, so examples can be found in the sample code.

The most common ways of protecting data from concurrent access are as follows:

  • Using a circular buffer and avoiding shared variables

  • Using spinlocks to enforce mutual exclusion

  • Using lock variables that are atomically incremented and decremented

Note that semaphores are not listed here. Because locking a semaphore can put a process to sleep, semaphores may not be used in interrupt handlers.

Whatever approach you choose, you still need to decide what to do when accessing a variable that can be modified at interrupt time. In simple cases, such a variable can simply be declared as volatile to prevent the compiler from optimizing access to its value (for example, it prevents the compiler from holding the value in a register for the whole duration of a function). However, the compiler generates suboptimal code whenever volatile variables are involved, so you might choose to resort to some sort of locking instead. In more complicated situations, there is no choice but to use some sort of locking.

Using Circular Buffers

Using a circular buffer is an effective way of handling concurrent-access problems; the best way to deal with concurrent access is to perform no concurrent access whatsoever.

The circular buffer uses an algorithm called “producer and consumer”: one player pushes data in and the other pulls data out. Concurrent access is avoided if there is exactly one producer and exactly one consumer. There are two examples of producer and consumer in short. In one case, the reading process is waiting to consume data that is produced at interrupt time; in the other, the bottom half consumes data produced by the top half.

Two pointers are used to address a circular buffer: head and tail. head is the point at which data is being written and is updated only by the producer of the data. Data is being read from tail, which is updated only by the consumer. As mentioned earlier, if data is written at interrupt time, you must be careful when accessing head multiple times. You should either declare it as volatile or use some sort of locking.

The circular buffer runs smoothly, except when it fills up. If that happens, things become hairy, and you can choose among different possible solutions. The short implementation just loses data; there’s no check for overflow, and if head goes beyond tail, a whole buffer of data is lost. Some alternative implementations are to drop the last item; to overwrite the buffer tail, as printk does (see Section 4.1.2 in Chapter 4); to hold up the producer, as scullpipe does; or to allocate a temporary extra buffer to back up the main buffer. The best solution depends on the importance of your data and other situation-specific questions, so we won’t cover it here.

Although the circular buffer appears to solve the problem of concurrent access, there is still the possibility of a race condition when the read function goes to sleep. This code shows where the problem appears in short:

while (short_head == short_tail) {
    /* ... */

When executing this statement, it is possible that new data will arrive after the while condition is evaluated as true and before the process goes to sleep. Information carried in by the interrupt won’t be read by the process; the process goes to sleep even though head != tail, and it isn’t awakened until the next data item arrives.

We didn’t implement correct locking for short because the source of short_read is included in Section 8.3.2 in Chapter 8, and at that point this discussion was not worth introducing. Also, the data involved is not worth the effort.

Although the data that short collects is not vital, and the likelihood of getting an interrupt in the time lapse between two successive instructions is often negligible, sometimes you just can’t take the risk of going to sleep when data is pending. This problem is general enough to deserve special treatment and is delayed to Section 9.8.4 later in this chapter, where we’ll discuss it in detail.

It’s interesting to note that only a producer-and-consumer situation can be addressed with a circular buffer. A programmer must often deal with more complex data structures to solve the concurrent-access problem. The producer/consumer situation is actually the simplest class of these problems; other structures, such as linked lists, simply don’t lend themselves to a circular buffer implementation.

Using Spinlocks

We have seen spinlocks before, for example, in the scull driver. The discussion thus far has looked only at a few uses of spinlocks; in this section we cover them in rather more detail.

A spinlock, remember, works through a shared variable. A function may acquire the lock by setting the variable to a specific value. Any other function needing the lock will query it and, seeing that it is not available, will “spin” in a busy-wait loop until it is available. Spinlocks thus need to be used with care. A function that holds a spinlock for too long can waste much time because other CPUs are forced to wait.

Spinlocks are represented by the type spinlock_t, which, along with the various spinlock functions, is declared in <asm/spinlock.h>. Normally, a spinlock is declared and initialized to the unlocked state with a line like:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

If, instead, it is necessary to initialize a spinlock at runtime, use spin_lock_init:


There are a number of functions (actually macros) that work with spinlocks:

spin_lock(spinlock_t *lock);

Acquire the given lock, spinning if necessary until it is available. On return from spin_lock, the calling function owns the lock.

spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

This version also acquires the lock; in addition, it disables interrupts on the local processor and stores the current interrupt state in flags. Note that all of the spinlock primitives are defined as macros, and that the flags argument is passed directly, not as a pointer.

spin_lock_irq(spinlock_t *lock);

This function acts like spin_lock_irqsave, except that it does not save the current interrupt state. This version is slightly more efficient than spin_lock_irqsave, but it should only be used in situations in which you know that interrupts will not have already been disabled.

spin_lock_bh(spinlock_t *lock);

Obtains the given lock and prevents the execution of bottom halves.

spin_unlock(spinlock_t *lock); , spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags); , spin_unlock_irq(spinlock_t *lock); , spin_unlock_bh(spinlock_t *lock);

These functions are the counterparts of the various locking primitives described previously. spin_unlock unlocks the given lock and nothing else. spin_unlock_irqrestore possibly enables interrupts, depending on the flags value (which should have come from spin_lock_irqsave). spin_unlock_irq enables interrupts unconditionally, and spin_unlock_bh reenables bottom-half processing. In each case, your function should be in possession of the lock before calling one of the unlocking primitives, or serious disorder will result.

spin_is_locked(spinlock_t *lock); , spin_trylock(spinlock_t *lock) , spin_unlock_wait(spinlock_t *lock);

spin_is_locked queries the state of a spinlock without changing it. It returns nonzero if the lock is currently busy. To attempt to acquire a lock without waiting, use spin_trylock, which returns nonzero if the operation failed (the lock was busy). spin_unlock_wait waits until the lock becomes free, but does not take possession of it.

Many users of spinlocks stick to spin_lock and spin_unlock. If you are using spinlocks in interrupt handlers, however, you must use the IRQ-disabling versions (usually spin_lock_irqsave and spin_unlock_irqsave) in the noninterrupt code. To do otherwise is to invite a deadlock situation.

It is worth considering an example here. Assume that your driver is running in its read method, and it obtains a lock with spin_lock. While the read method is holding the lock, your device interrupts, and your interrupt handler is executed on the same processor. If it attempts to use the same lock, it will go into a busy-wait loop, since your read method already holds the lock. But, since the interrupt routine has preempted that method, the lock will never be released and the processor deadlocks, which is probably not what you wanted.

This problem can be avoided by using spin_lock_irqsave to disable interrupts on the local processor while the lock is held. When in doubt, use the _irqsave versions of the primitives and you will not need to worry about deadlocks. Remember, though, that the flags value from spin_lock_irqsave must not be passed to other functions.

Regular spinlocks work well for most situations encountered by device driver writers. In some cases, however, there is a particular pattern of access to critical data that is worth treating specially. If you have a situation in which numerous threads (processes, interrupt handlers, bottom-half routines) need to access critical data in a read-only mode, you may be worried about the overhead of using spinlocks. Numerous readers cannot interfere with each other; only a writer can create problems. In such situations, it is far more efficient to allow all readers to access the data simultaneously.

Linux has a different type of spinlock, called a reader-writer spinlock for this case. These locks have a type of rwlock_t and should be initialized to RW_LOCK_UNLOCKED. Any number of threads can hold the lock for reading at the same time. When a writer comes along, however, it waits until it can get exclusive access.

The functions for working with reader-writer locks are as follows:

read_lock(rwlock_t *lock); , read_lock_irqsave(rwlock_t *lock, unsigned long flags); , read_lock_irq(rwlock_t *lock); , read_lock_bh(rwlock_t *lock);

function in the same way as regular spinlocks.

read_unlock(rwlock_t *lock); , read_unlock_irqrestore(rwlock_t *lock, unsigned long flags); , read_unlock_irq(rwlock_t *lock); , read_unlock_bh(rwlock_t *lock);

These are the various ways of releasing a read lock.

write_lock(rwlock_t *lock); , write_lock_irqsave(rwlock_t *lock, unsigned long flags); , write_lock_irq(rwlock_t *lock); , write_lock_bh(rwlock_t *lock);

Acquire a lock as a writer.

write_unlock(rwlock_t *lock); , write_unlock_irqrestore(rwlock_t *lock, unsigned long flags); , write_unlock_irq(rwlock_t *lock); , write_unlock_bh(rwlock_t *lock);

Release a lock that was acquired as a writer.

If your interrupt handler uses read locks only, then all of your code may acquire read locks with read_lock and not disable interrupts. Any write locks must be acquired with write_lock_irqsave, however, to avoid deadlocks.

It is worth noting that in kernels built for uniprocessor systems, the spinlock functions expand to nothing. They thus have no overhead (other than possibly disabling interrupts) on those systems, where they are not needed.

Using Lock Variables

The kernel provides a set of functions that may be used to provide atomic (noninterruptible) access to variables. Use of these functions can occasionally eliminate the need for a more complicated locking scheme, when the operations to be performed are very simple. The atomic operations may also be used to provide a sort of “poor person’s spinlock” by manually testing and looping. It is usually better, however, to use spinlocks directly, since they have been optimized for this purpose.

The Linux kernel exports two sets of functions to deal with locks: bit operations and access to the “atomic” data type.

Bit operations

It’s quite common to have single-bit lock variables or to update device status flags at interrupt time—while a process may be accessing them. The kernel offers a set of functions that modify or test single bits atomically. Because the whole operation happens in a single step, no interrupt (or other processor) can interfere.

Atomic bit operations are very fast, since they perform the operation using a single machine instruction without disabling interrupts whenever the underlying platform can do that. The functions are architecture dependent and are declared in <asm/bitops.h>. They are guaranteed to be atomic even on SMP computers and are useful to keep coherence across processors.

Unfortunately, data typing in these functions is architecture dependent as well. The nr argument is mostly defined as int but is unsigned long for a few architectures. Here is the list of bit operations as they appear in 2.1.37 and later:

void set_bit(nr, void *addr);

This function sets bit number nr in the data item pointed to by addr. The function acts on an unsigned long, even though addr is a pointer to void.

void clear_bit(nr, void *addr);

The function clears the specified bit in the unsigned long datum that lives at addr. Its semantics are otherwise the same as set_bit.

void change_bit(nr, void *addr);

This function toggles the bit.

test_bit(nr, void *addr);

This function is the only bit operation that doesn’t need to be atomic; it simply returns the current value of the bit.

int test_and_set_bit(nr, void *addr); , int test_and_clear_bit(nr, void *addr); , int test_and_change_bit(nr, void *addr);

These functions behave atomically like those listed previously, except that they also return the previous value of the bit.

When these functions are used to access and modify a shared flag, you don’t have to do anything except call them. Using bit operations to manage a lock variable that controls access to a shared variable, on the other hand, is more complicated and deserves an example. Most modern code will not use bit operations in this way, but code like the following still exists in the kernel.

A code segment that needs to access a shared data item tries to atomically acquire a lock using either test_and_set_bit or test_and_clear_bit. The usual implementation is shown here; it assumes that the lock lives at bit nr of address addr. It also assumes that the bit is either 0 when the lock is free or nonzero when the lock is busy.

/* try to set lock */
while (test_and_set_bit(nr, addr) != 0)

/* do your work */

/* release lock, and check... */
if (test_and_clear_bit(nr, addr) == 0)
    something_went_wrong(); /* already released: error */

If you read through the kernel source, you will find code that works like this example. As mentioned before, however, it is better to use spinlocks in new code, unless you need to perform useful work while waiting for the lock to be released (e.g., in the wait_for_a_while() instruction of this listing).

Atomic integer operations

Kernel programmers often need to share an integer variable between an interrupt handler and other functions. A separate set of functions has been provided to facilitate this sort of sharing; they are defined in <asm/atomic.h>.

The facility offered by atomic.h is much stronger than the bit operations just described. atomic.h defines a new data type, atomic_t, which can be accessed only through atomic operations. An atomic_t holds an int value on all supported architectures. Because of the way this type works on some processors, however, the full integer range may not be available; thus, you should not count on an atomic_t holding more than 24 bits. The following operations are defined for the type and are guaranteed to be atomic with respect to all processors of an SMP computer. The operations are very fast because they compile to a single machine instruction whenever possible.

void atomic_set(atomic_t *v, int i);

Set the atomic variable v to the integer value i.

int atomic_read(atomic_t *v);

Return the current value of v.

void atomic_add(int i, atomic_t *v);

Add i to the atomic variable pointed to by v. The return value is void, because most of the time there’s no need to know the new value. This function is used by the networking code to update statistics about memory usage in sockets.

void atomic_sub(int i, atomic_t *v);

Subtract i from *v.

void atomic_inc(atomic_t *v); , void atomic_dec(atomic_t *v);

Increment or decrement an atomic variable.

int atomic_inc_and_test(atomic_t *v); , int atomic_dec_and_test(atomic_t *v); , int atomic_add_and_test(int i, atomic_t *v); , int atomic_sub_and_test(int i, atomic_t *v);

These functions behave like their counterparts listed earlier, but they also return the previous value of the atomic data type.

As stated earlier, atomic_t data items must be accessed only through these functions. If you pass an atomic item to a function that expects an integer argument, you’ll get a compiler error.

Going to Sleep Without Races

The one race condition that has been omitted so far in this discussion is the problem of going to sleep. Generally stated, things can happen in the time between when your driver decides to sleep and when the sleep_on call is actually performed. Occasionally, the condition you are sleeping for may come about before you actually go to sleep, leading to a longer sleep than expected. It is a problem far more general than interrupt-driven I/O, and an efficient solution requires a little knowledge of the internals of sleep_on.

As an example, consider again the following code from the short driver:

while (short_head == short_tail) {
    /* ... */

In this case, the value of short_head could change between the test in the while statement and the call to interruptible_sleep_on. In that case, the driver will sleep even though new data is available; this condition leads to delays in the best case, and a lockup of the device in the worst.

The way to solve this problem is to go halfway to sleep before performing the test. The idea is that the process can add itself to the wait queue, declare itself to be sleeping, and then perform its tests. This is the typical implementation:

wait_queue_t wait;
init_waitqueue_entry(&wait, current);

add_wait_queue(&short_queue, &wait);
while (1) {
    if (short_head != short_tail) /* whatever test your driver needs */
remove_wait_queue(&short_queue, &wait);

This code is somewhat like an unrolling of the internals of sleep_on; we’ll step through it here.

The code starts by declaring a wait_queue_t variable, initializing it, and adding it to the driver’s wait queue (which, as you may remember, is of type wait_queue_head_t). Once these steps have been performed, a call to wake_up on short_queue will wake this process.

The process is not yet asleep, however. It gets closer to that state with the call to set_current_state, which sets the process’s state to TASK_INTERRUPTIBLE. The rest of the system now thinks that the process is asleep, and the scheduler will not try to run it. This is an important step in the “going to sleep” process, but things still are not done.

What happens now is that the code tests for the condition for which it is waiting, namely, that there is data in the buffer. If no data is present, a call to schedule is made, causing some other process to run and truly putting the current process to sleep. Once the process is woken up, it will test for the condition again, and possibly exit from the loop.

Beyond the loop, there is just a bit of cleaning up to do. The current state is set to TASK_RUNNING to reflect the fact that we are no longer asleep; this is necessary because if we exited the loop without ever sleeping, we may still be in TASK_INTERRUPTIBLE. Then remove_wait_queue is used to take the process off the wait queue.

So why is this code free of race conditions? When new data comes in, the interrupt handler will call wake_up on short_queue, which has the effect of setting the state of every sleeping process on the queue to TASK_RUNNING. If the wake_up call happens after the buffer has been tested, the state of the task will be changed and schedule will cause the current process to continue running—after a short delay, if not immediately.

This sort of “test while half asleep” pattern is so common in the kernel source that a pair of macros was added during 2.1 development to make life easier:

wait_event(wq, condition); , wait_event_interruptible(wq, condition);

Both of these macros implement the code just discussed, testing the condition (which, since this is a macro, is evaluated at each iteration of the loop) in the middle of the “going to sleep” process.

[38] Note, however, that the kernel developers are seriously considering making all kernel code preemptable at almost any time, making locking mandatory even on uniprocessor systems.

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.