Interrupt Handling

As we explained earlier, most exceptions are handled simply by sending a Unix signal to the process that caused the exception. The action to be taken is thus deferred until the process receives the signal; as a result, the kernel is able to process the exception quickly.

This approach does not hold for interrupts because they frequently arrive long after the process to which they are related (for instance, a process that requested a data transfer) has been suspended and a completely unrelated process is running. So it would make no sense to send a Unix signal to the current process.

Interrupt handling depends on the type of interrupt. For our purposes, we’ll distinguish three main classes of interrupts:

I/O interrupts: An I/O device requires attention; the corresponding interrupt handler must query the device to determine the proper course of action. We cover this type of interrupt in Section 4.6.1 later in this chapter.
Timer interrupts: Some timer, either a local APIC timer or an external timer, has issued an interrupt; this kind of interrupt tells the kernel that a fixed-time interval has elapsed. These interrupts are handled mostly as I/O interrupts; we discuss the peculiar characteristics of timer interrupts in Chapter 6.
Interprocessor interrupts: A CPU issued an interrupt to another CPU of a multiprocessor system. We cover such interrupts in Section 4.6.2 later in this chapter.

I/O Interrupt Handling

In general, an I/O interrupt handler must be flexible enough to service several devices at the same time. In the PCI bus architecture, for instance, several devices may share the same IRQ line. This means that the interrupt vector alone does not tell the whole story. In the example shown in Table 4-3, the same vector 43 is assigned to the USB port and to the sound card. However, some hardware devices found in older PC architectures (like ISA) do not reliably operate if their IRQ line is shared with other devices.

Interrupt handler flexibility is achieved in two distinct ways, as discussed in the following list.

IRQ sharing: The interrupt handler executes several interrupt service routines (ISRs). Each ISR is a function related to a single device sharing the IRQ line. Since it is not possible to know in advance which particular device issued the IRQ, each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt.
IRQ dynamic allocation: An IRQ line is associated with a device at the last possible moment; for instance, the IRQ line of the floppy device is allocated only when a user accesses the floppy disk device. In this way, the same IRQ vector may be used by several hardware devices even if they cannot share the IRQ line, although not at the same time.
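
To make IRQ sharing concrete, here is a minimal user-space sketch rather than kernel code. The names struct device, struct isr, generic_isr, and run_shared_irq are hypothetical, and a device that "needs attention" is modeled as a plain flag instead of a hardware status register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device model: 'pending' stands in for a status register bit. */
struct device {
    int pending;   /* nonzero if this device raised the shared IRQ */
    int serviced;  /* how many times its ISR did real work */
};

/* One entry per device sharing the IRQ line (illustrative only). */
struct isr {
    int (*handler)(struct device *dev);
    struct device *dev;
    struct isr *next;
};

/* A generic ISR: check whether our device interrupted; if so, service it. */
int generic_isr(struct device *dev)
{
    if (!dev->pending)
        return 0;          /* not our device: nothing to do */
    dev->pending = 0;      /* "acknowledge" the device */
    dev->serviced++;
    return 1;
}

/* Run every ISR on the line; return how many devices needed attention. */
int run_shared_irq(struct isr *list)
{
    int handled = 0;
    for (struct isr *p = list; p != NULL; p = p->next) {
        assert(p->handler != NULL);
        handled += p->handler(p->dev);
    }
    return handled;
}
```

Every ISR on the line runs, because the vector alone does not say which device raised the interrupt; only the ISR whose device is actually pending does any work.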

Not all actions to be performed when an interrupt occurs have the same urgency. In fact, the interrupt handler itself is not a suitable place for all kinds of actions. Long noncritical operations should be deferred, since while an interrupt handler is running, the signals on the corresponding IRQ line are temporarily ignored. Most important, the process on behalf of which an interrupt handler is executed must always stay in the TASK_RUNNING state, or a system freeze can occur. Therefore, interrupt handlers cannot perform any blocking procedure such as an I/O disk operation. Linux divides the actions to be performed following an interrupt into three classes:

Critical: Actions such as acknowledging an interrupt to the PIC, reprogramming the PIC or the device controller, or updating data structures accessed by both the device and the processor. These can be executed quickly and are critical because they must be performed as soon as possible. Critical actions are executed within the interrupt handler immediately, with maskable interrupts disabled.
Noncritical: Actions such as updating data structures that are accessed only by the processor (for instance, reading the scan code after a keyboard key has been pushed). These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.
Noncritical deferrable: Actions such as copying a buffer’s contents into the address space of some process (for instance, sending the keyboard line buffer to the terminal handler process). These may be delayed for a long time interval without affecting the kernel operations; the interested process will just keep waiting for the data. Noncritical deferrable actions are performed by means of separate functions that are discussed in Section 4.7 later in this chapter.

Regardless of the kind of circuit that caused the interrupt, all I/O interrupt handlers perform the same four basic actions:

Save the IRQ value and the register contents on the Kernel Mode stack.
Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue further interrupts.
Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
Terminate by jumping to the ret_from_intr( ) address.

Several descriptors are needed to represent both the state of the IRQ lines and the functions to be executed when an interrupt occurs. Figure 4-3 represents in a schematic way the hardware circuits and the software functions used to handle an interrupt. These functions are discussed in the following sections.

Figure 4-3. I/O interrupt handling

Interrupt vectors

As illustrated in Table 4-2, physical IRQs may be assigned any vector in the range 32-238. However, Linux uses vector 128 to implement system calls.

The IBM-compatible PC architecture requires that some devices be statically connected to specific IRQ lines. In particular:

The interval timer device must be connected to the IRQ0 line (see Chapter 6).
The slave 8259A PIC must be connected to the IRQ2 line (although more advanced PICs are now being used, Linux still supports 8259A-style PICs).
The external mathematical coprocessor must be connected to the IRQ13 line (although recent 80x86 processors no longer use such a device, Linux continues to support the hardy 80386 model).
In general, an I/O device can be connected to a limited number of IRQ lines. (As a matter of fact, when playing with an old PC where IRQ sharing is not possible, you might not succeed in installing a new card because of IRQ conflicts with other already present hardware devices.)

Table 4-2. Interrupt vectors in Linux

Vector range	Use
0-19 `(0x0-0x13)`	Nonmaskable interrupts and exceptions
20-31 `(0x14-0x1f)`	Intel-reserved
32-127 `(0x20-0x7f)`	External interrupts (IRQs)
128 `(0x80)`	Programmed exception for system calls (see Chapter 9)
129-238 `(0x81-0xee)`	External interrupts (IRQs)
239 `(0xef)`	Local APIC timer interrupt (see Chapter 6)
240-250 `(0xf0-0xfa)`	Reserved by Linux for future use
251-255 `(0xfb-0xff)`	Interprocessor interrupts (see Section 4.6.2 later in this chapter)
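
The ranges of Table 4-2 can be restated as a small lookup function. This is only an illustrative sketch: the enum vector_use type and the classify_vector name are hypothetical and do not appear in the kernel.

```c
#include <assert.h>

/* Hypothetical labels for the rows of Table 4-2. */
enum vector_use {
    VEC_EXCEPTION,      /* 0-19: nonmaskable interrupts and exceptions */
    VEC_INTEL_RESERVED, /* 20-31 */
    VEC_EXTERNAL_IRQ,   /* 32-127 and 129-238 */
    VEC_SYSCALL,        /* 128 */
    VEC_APIC_TIMER,     /* 239 */
    VEC_LINUX_RESERVED, /* 240-250 */
    VEC_IPI             /* 251-255 */
};

/* Map an interrupt vector (0-255) to its use according to Table 4-2. */
enum vector_use classify_vector(int vector)
{
    assert(vector >= 0 && vector <= 255);
    if (vector <= 19)  return VEC_EXCEPTION;
    if (vector <= 31)  return VEC_INTEL_RESERVED;
    if (vector == 128) return VEC_SYSCALL;      /* carved out of the IRQ range */
    if (vector <= 238) return VEC_EXTERNAL_IRQ;
    if (vector == 239) return VEC_APIC_TIMER;
    if (vector <= 250) return VEC_LINUX_RESERVED;
    return VEC_IPI;
}
```

Note how vector 128 splits the external interrupt range in two, which is why the table lists 32-127 and 129-238 separately.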

There are three ways to select a line for an IRQ-configurable device:

By setting some hardware jumpers (only on very old device cards).
By a utility program shipped with the device and executed when installing it. Such a program may either ask the user to select an available IRQ number or probe the system to determine an available number by itself.
By a hardware protocol executed at system startup. Peripheral devices declare which interrupt lines they are ready to use; the final values are then negotiated to reduce conflicts as much as possible. Once this is done, each interrupt handler can read the assigned IRQ by using a function that accesses some I/O ports of the device. For instance, drivers for devices that comply with the Peripheral Component Interconnect (PCI) standard use a group of functions such as pci_read_config_byte( ) to access the device configuration space.

Table 4-3 shows a fairly arbitrary arrangement of devices and IRQs, such as those that might be found on one particular PC.

Table 4-3. An example of IRQ assignment to I/O devices

IRQ	INT	Hardware Device
0	32	Timer
1	33	Keyboard
2	34	PIC cascading
3	35	Second serial port
4	36	First serial port
6	38	Floppy disk
8	40	System clock
10	42	Network interface
11	43	USB port, sound card
12	44	PS/2 mouse
13	45	Mathematical coprocessor
14	46	EIDE disk controller’s first chain
15	47	EIDE disk controller’s second chain

The kernel must discover the correspondence between the IRQ number and the I/O device before enabling interrupts. Otherwise, how could the kernel handle a signal from, for example, a SCSI disk without knowing which vector corresponds to the device? The correspondence is established while initializing each device driver (see Chapter 13).

IRQ data structures

As always, when discussing complicated operations involving state transitions, it helps to understand first where key data is stored. Thus, this section explains the data structures that support interrupt handling and how they are laid out in various descriptors. Figure 4-4 illustrates schematically the relationships between the main descriptors that represent the state of the IRQ lines. (The figure does not illustrate the data structures needed to handle softirqs, tasklets, and bottom halves; they are discussed later in this chapter.)

Figure 4-4. IRQ descriptors

An irq_desc array groups together NR_IRQS (usually 224) irq_desc_t descriptors, which include the following fields:

status: A set of flags describing the IRQ line status (see Table 4-4).

Table 4-4. Flags describing the IRQ line status

Flag name	Description
`IRQ_INPROGRESS`	A handler for the IRQ is being executed.
`IRQ_DISABLED`	The IRQ line has been deliberately disabled by a device driver.
`IRQ_PENDING`	An IRQ has occurred on the line; its occurrence has been acknowledged to the PIC, but it has not yet been serviced by the kernel.
`IRQ_REPLAY`	The IRQ line has been disabled but the previous IRQ occurrence has not yet been acknowledged to the PIC.
`IRQ_AUTODETECT`	The kernel uses the IRQ line while performing a hardware device probe.
`IRQ_WAITING`	The kernel uses the IRQ line while performing a hardware device probe; moreover, the corresponding interrupt has not been raised.
`IRQ_LEVEL`	Not used on the 80x86 architecture.
`IRQ_MASKED`	Not used.
`IRQ_PER_CPU`	Not used on the 80x86 architecture.

handler: Points to the hw_interrupt_type descriptor that identifies the PIC circuit servicing the IRQ line.
action: Identifies the interrupt service routines to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ. The irqaction descriptor is described later in the chapter.
depth: Shows 0 if the IRQ line is enabled and a positive value if it has been disabled at least once. Every time the disable_irq( ) or disable_irq_nosync( ) function is invoked, the field is incremented; if depth was equal to 0, the function disables the IRQ line and sets its IRQ_DISABLED flag.^[28] Conversely, each invocation of the enable_irq( ) function decrements the field; if depth becomes 0, the function enables the IRQ line and clears its IRQ_DISABLED flag.
lock: A spin lock used to serialize the accesses to the IRQ descriptor (see Chapter 5).
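
The nesting behavior of the depth field can be sketched in a few lines. This is a simplified user-space model: struct irq_line, its hw_enabled member, the sketch_ function names, and the bit value chosen for IRQ_DISABLED are assumptions made for illustration, and the spin lock is omitted.

```c
#include <assert.h>

#define IRQ_DISABLED 0x2u   /* illustrative bit value */

/* Hypothetical cut-down descriptor. */
struct irq_line {
    unsigned int status;
    unsigned int depth;
    int hw_enabled;   /* models the state of the line in the PIC */
};

void sketch_disable_irq(struct irq_line *l)
{
    if (l->depth++ == 0) {        /* first disable: really mask the line */
        l->status |= IRQ_DISABLED;
        l->hw_enabled = 0;
    }
}

void sketch_enable_irq(struct irq_line *l)
{
    assert(l->depth > 0);         /* an unbalanced enable is a caller bug */
    if (--l->depth == 0) {        /* last enable: really unmask the line */
        l->status &= ~IRQ_DISABLED;
        l->hw_enabled = 1;
    }
}
```

Two nested disable calls therefore require two enable calls before the line is actually unmasked.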

During system initialization, the init_IRQ( ) function sets the status field of each IRQ main descriptor to IRQ_DISABLED. Moreover, init_IRQ( ) updates the IDT by replacing the provisional interrupt gates with new ones. This is accomplished through the following statements:

for (i = 0; i < NR_IRQS; i++)
   if (i+32 != 128)
       set_intr_gate(i+32,interrupt[i]);

This code looks in the interrupt array to find the interrupt handler addresses that it uses to set up the interrupt gates. The interrupt handler for IRQn is named IRQn_interrupt( ) (see Section 4.6.1.4 later in this chapter).
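
To see which vectors the loop actually touches, consider the following user-space model. It is a sketch under stated assumptions: idt_sketch and interrupt_sketch are plain arrays standing in for the IDT and the interrupt array, stub_handler replaces the real entry points, and the constant names FIRST_EXTERNAL and SYSCALL_VECTOR are chosen here for readability.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS          224
#define FIRST_EXTERNAL   32    /* vector assigned to IRQ0 */
#define SYSCALL_VECTOR   128

typedef void (*gate_fn)(void);

gate_fn idt_sketch[256];             /* stand-in for the IDT */
gate_fn interrupt_sketch[NR_IRQS];   /* stand-in for the interrupt[] array */

void stub_handler(void) { }          /* placeholder for IRQn_interrupt() */

/* Mirrors the loop above; returns the number of gates written. */
int init_gates_sketch(void)
{
    int written = 0;
    for (int i = 0; i < NR_IRQS; i++) {
        int vector = i + FIRST_EXTERNAL;
        assert(vector < 256);
        interrupt_sketch[i] = stub_handler;
        if (vector != SYSCALL_VECTOR) {      /* leave the system call gate alone */
            idt_sketch[vector] = interrupt_sketch[i];
            written++;
        }
    }
    return written;
}
```

With NR_IRQS equal to 224, the loop covers vectors 32 through 255 and skips only vector 128, so 223 gates are written.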

Some of the interrupt gates will never be used; others will be used only in multiprocessor systems; finally, some of them are always used. Thus, some of the interrupt gates are set to their final values, while others aren’t. More precisely:

The gates of the first 16 IRQs (vectors 32-47) are set to their final values.
In multiprocessor systems, the gates of the interprocessor interrupts and the gate of the local APIC timer interrupt are also set properly (see Section 4.6.1.7 later in this chapter).
Vector 128 is left untouched, since it is used for the system call’s programmed exception.
All remaining gates are reserved for interrupts issued from devices connected to a PCI bus. In this case, the handler field of the irq_desc element is initialized to the no_irq_type null handler.

In addition to the 8259A chip that was mentioned near the beginning of this chapter, Linux supports several other PIC circuits such as the SMP IO-APIC, PIIX4’s internal 8259 PIC, and SGI’s Visual Workstation Cobalt (IO-)APIC. To handle all such devices in a uniform way, Linux uses a “PIC object,” consisting of the PIC name and seven PIC standard methods. The advantage of this object-oriented approach is that drivers need not be aware of the kind of PIC installed in the system. Each driver-visible interrupt source is transparently wired to the appropriate controller. The data structure that defines a PIC object is called hw_interrupt_type (also called hw_irq_controller).

For the sake of concreteness, let’s assume that our computer is a uniprocessor with two 8259A PICs, which provide 16 standard IRQs. In this case, the handler field in each of the 16 irq_desc_t descriptors points to the i8259A_irq_type variable, which describes the 8259A PIC. This variable is initialized as follows:

struct hw_interrupt_type i8259A_irq_type = { 
    "XT-PIC", 
    startup_8259A_irq, 
    shutdown_8259A_irq, 
    enable_8259A_irq, 
    disable_8259A_irq,
    mask_and_ack_8259A,
    end_8259A_irq,
    NULL
};

The first field in this structure, "XT-PIC", is the PIC name. Next come the pointers to six different functions used to program the PIC. The first two functions start up and shut down an IRQ line of the chip, respectively. But in the case of the 8259A chip, these functions coincide with the third and fourth functions, which enable and disable the line. The mask_and_ack_8259A( ) function acknowledges the IRQ received by sending the proper bytes to the 8259A I/O ports. The end_8259A_irq( ) function is invoked when the interrupt handler for the IRQ line terminates. The last set_affinity method is set to NULL: it is used in multiprocessor systems to declare the “affinity” of CPUs for specified IRQs — that is, which CPUs are enabled to handle specific IRQs.
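
The “PIC object” idea is simply a table of function pointers. The sketch below models it in user space with an imaginary controller; struct pic_ops, the fake_ functions, and service_line are hypothetical names, and the method signatures are assumptions inferred from the initializer above rather than copied from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Field order follows the initializer shown above. */
struct pic_ops {
    const char *name;
    unsigned int (*startup)(unsigned int irq);
    void (*shutdown)(unsigned int irq);
    void (*enable)(unsigned int irq);
    void (*disable)(unsigned int irq);
    void (*ack)(unsigned int irq);
    void (*end)(unsigned int irq);
    void (*set_affinity)(unsigned int irq, unsigned long mask);
};

/* Imaginary 16-line controller: a set bit means the line is masked. */
unsigned int fake_mask = 0xffff;

void fake_enable(unsigned int irq)  { fake_mask &= ~(1u << irq); }
void fake_disable(unsigned int irq) { fake_mask |= 1u << irq; }
/* As with the 8259A, startup/shutdown coincide with enable/disable. */
unsigned int fake_startup(unsigned int irq) { fake_enable(irq); return 0; }
void fake_shutdown(unsigned int irq) { fake_disable(irq); }
void fake_ack(unsigned int irq) { fake_disable(irq); }  /* mask while handling */
void fake_end(unsigned int irq) { fake_enable(irq); }   /* unmask when done */

struct pic_ops fake_pic = {
    .name = "FAKE-PIC",
    .startup = fake_startup,
    .shutdown = fake_shutdown,
    .enable = fake_enable,
    .disable = fake_disable,
    .ack = fake_ack,
    .end = fake_end,
    .set_affinity = NULL,   /* uniprocessor: no affinity method */
};

unsigned int fake_line_masked(unsigned int irq) { return (fake_mask >> irq) & 1u; }

/* Generic code sees only the method table, never the chip behind it. */
void service_line(const struct pic_ops *pic, unsigned int irq)
{
    assert(irq < 16);
    pic->ack(irq);
    /* ... the interrupt service routines would run here ... */
    pic->end(irq);
}
```

Swapping in a different controller means supplying another table with the same shape; service_line( ) does not change.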

As described earlier, multiple devices can share a single IRQ. Therefore, the kernel maintains irqaction descriptors, each of which refers to a specific hardware device and a specific interrupt. The descriptor includes the following fields:

handler: Points to the interrupt service routine for an I/O device. This is the key field that allows many devices to share the same IRQ.
flags: Describes the relationships between the IRQ line and the I/O device (see Table 4-5).

Table 4-5. Flags of the irqaction descriptor

Flag name	Description
`SA_INTERRUPT`	The handler must execute with interrupts disabled.
`SA_SHIRQ`	The device permits its IRQ line to be shared with other devices.
`SA_SAMPLE_RANDOM`	The device may be considered a source of events that occur randomly; it can thus be used by the kernel random number generator. (Users can access this feature by taking random numbers from the `/dev/random` and `/dev/urandom` device files.)

name: The name of the I/O device (shown when listing the serviced IRQs by reading the /proc/interrupts file).
dev_id: A private field for the I/O device. Typically, it identifies the I/O device itself (for instance, it could be equal to its major and minor numbers; see Section 13.2), or it points to a device driver’s data.
next: Points to the next element of a list of irqaction descriptors. The elements in the list refer to hardware devices that share the same IRQ.

Finally, the irq_stat array includes NR_CPUS entries, one for each CPU in the system. Each entry is of type irq_cpustat_t, and includes a few counters and flags used by the kernel to keep track of what any CPU is currently doing. The most important fields are usually accessed through some macros having as a parameter the CPU logical number (that is, the index of the array).

In particular, the local_irq_count(n) macro selects the __local_irq_count field of the nth entry of the array. The field is a counter of how many interrupt handlers are stacked in the CPU, that is, how many interrupt handlers have been started and are not yet terminated.
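
A per-CPU nesting counter of this kind can be modeled as follows. This is a user-space sketch only: the nested_irqs member is a hypothetical stand-in for the __local_irq_count field, the _sketch suffixes mark names invented here, and NR_CPUS is given an arbitrary value.

```c
#include <assert.h>

#define NR_CPUS 4   /* illustrative value */

/* Cut-down stand-in for irq_cpustat_t. */
struct cpu_stat {
    unsigned int nested_irqs;   /* interrupt handlers started and not yet finished */
};

struct cpu_stat irq_stat_sketch[NR_CPUS];

/* Accessor macro taking the logical CPU number, in the style of local_irq_count(n). */
#define local_irq_count_sketch(cpu) (irq_stat_sketch[(cpu)].nested_irqs)

void irq_enter_sketch(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    local_irq_count_sketch(cpu)++;
}

void irq_exit_sketch(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(local_irq_count_sketch(cpu) > 0);   /* exits must match enters */
    local_irq_count_sketch(cpu)--;
}

/* Nonzero while the CPU is inside at least one interrupt handler. */
int in_irq_sketch(int cpu)
{
    return local_irq_count_sketch(cpu) != 0;
}
```

Each CPU has its own counter, so a handler running on one CPU does not affect the value seen on another.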

IRQ distribution in multiprocessor systems

Linux sticks to the Symmetric Multiprocessing model (SMP); this means, essentially, that the kernel should not have any bias toward one CPU with respect to the others. As a consequence, the kernel tries to distribute the IRQ signals coming from the hardware devices in a round-robin fashion among all the CPUs. Therefore, all the CPUs spend approximately the same fraction of their execution time servicing I/O interrupts.

In Section 4.2.1.1 earlier in this chapter, we said that the multi-APIC system has sophisticated mechanisms to dynamically distribute the IRQ signals among the CPUs. Therefore, the Linux kernel has to do very little to enforce the round-robin distribution scheme.

During system bootstrap, the booting CPU executes the setup_IO_APIC_irqs( ) function to initialize the I/O APIC chip. The 24 entries of the Interrupt Redirection Table of the chip are filled so that all IRQ signals from the I/O hardware devices can be routed to each CPU in the system according to the “lowest priority” scheme. During system bootstrap, moreover, all CPUs execute the setup_local_APIC( ) function, which takes care of initializing the local APICs. In particular, the task priority register (TPR) of each chip is initialized to a fixed value, meaning that the CPU is willing to handle any kind of IRQ signal, regardless of its priority. The Linux kernel never modifies this value after its initialization.

Since all task priority registers contain the same value, all CPUs always have the same priority. To break a tie, the multi-APIC system uses the values in the arbitration priority registers of local APICs, as explained earlier. Since such values are automatically changed after every interrupt, the IRQ signals are fairly distributed among all CPUs.^[29]

In short, when a hardware device raises an IRQ signal, the multi-APIC system selects one of the CPUs and delivers the signal to the corresponding local APIC, which in turn interrupts its CPU. All other CPUs are not notified of the event. All this is magically done by the hardware, so it is of no concern for the kernel after multi-APIC system initialization.

Saving the registers for the interrupt handler

When a CPU receives an interrupt, it starts executing the code at the address found in the corresponding gate of the IDT (see Section 4.2.4 earlier in this chapter).

As with other context switches, the need to save registers leaves the kernel developer with a somewhat messy coding job because the registers have to be saved and restored using assembly language code. However, within those operations, the processor is expected to call and return from a C function. In this section, we describe the assembly language task of handling registers; in the next, we show some of the acrobatics required in the C function that is subsequently invoked.

Saving registers is the first task of the interrupt handler. As already mentioned, the interrupt handler for IRQn is named IRQn_interrupt, and its address is included in the interrupt gate stored in the proper IDT entry.

In uniprocessor systems, the same BUILD_IRQ macro is duplicated 16 times, once for each IRQ number, in order to yield 16 different interrupt handler entry points. In multiprocessor systems, the macro is duplicated 14 × 16 times for a grand total of 224 interrupt handler entry points. Each macro occurrence expands to the following assembly language fragment:

IRQn_interrupt:
    pushl $n-256
    jmp common_interrupt

The result is to save on the stack the IRQ number associated with the interrupt minus 256.^[30]

The same code for all interrupt handlers can then be executed while referring to this number. The common code can be found in the BUILD_COMMON_IRQ macro, which expands to the following assembly language fragment:

common_interrupt: 
    SAVE_ALL 
    call do_IRQ
    jmp ret_from_intr

The SAVE_ALL macro, in turn, expands to the following fragment:

cld 
push %es 
push %ds 
pushl %eax 
pushl %ebp 
pushl %edi 
pushl %esi 
pushl %edx 
pushl %ecx 
pushl %ebx 
movl $__KERNEL_DS,%edx 
movl %edx,%ds 
movl %edx,%es

SAVE_ALL saves all the CPU registers that may be used by the interrupt handler on the stack, except for eflags, cs, eip, ss, and esp, which are already saved automatically by the control unit (see Section 4.2.4 earlier in this chapter). The macro then loads the selector of the kernel data segment into ds and es.

After saving the registers, BUILD_COMMON_IRQ invokes the do_IRQ( ) function. When do_IRQ( ) terminates and its ret instruction is executed, control returns to the jmp instruction that follows the call, which in turn transfers it to ret_from_intr( ) (see Section 4.8 later in this chapter).

The do_IRQ( ) function

The do_IRQ( ) function is invoked to execute all interrupt service routines associated with an interrupt. When it starts, the kernel stack contains, from the top down:

The do_IRQ( )’s return address (the address of the jmp instruction that follows the call and leads to ret_from_intr( ))
The group of register values pushed on by SAVE_ALL
The encoding of the IRQ number
The registers saved automatically by the control unit when it recognized the interrupt

Since the C compiler places all the parameters on top of the stack, the do_IRQ( ) function is declared as follows:

unsigned int do_IRQ(struct pt_regs regs)

where the pt_regs structure consists of 15 fields:

The first nine fields are the register values pushed by SAVE_ALL.
The tenth field, referenced through a field called orig_eax, encodes the IRQ number.
The remaining fields correspond to the register values pushed on automatically by the control unit.^[31]
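
The stack layout just described can be written down as a structure whose first field sits at the top of the stack. This is a sketch for a 32-bit layout: the structure name pt_regs_sketch, the field names chosen for the segment registers (xds, xes, xcs, xss), and the orig_eax_slot helper are assumptions made here; the order of the fields is derived from the push sequence of SAVE_ALL shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lowest address (top of stack) first. */
struct pt_regs_sketch {
    /* Pushed by SAVE_ALL; the last register pushed (ebx) comes first. */
    int32_t ebx, ecx, edx, esi, edi, ebp, eax;
    int32_t xds, xes;
    /* Pushed by IRQn_interrupt: the value n - 256. */
    int32_t orig_eax;
    /* Pushed automatically by the control unit. */
    int32_t eip, xcs, eflags, esp, xss;
};

/* Index of the IRQ-number slot, counted in 32-bit words from the top of the stack. */
size_t orig_eax_slot(void)
{
    /* 9 + 1 + 5 = 15 fields, each one 32-bit stack slot. */
    assert(sizeof(struct pt_regs_sketch) == 15 * sizeof(int32_t));
    return offsetof(struct pt_regs_sketch, orig_eax) / sizeof(int32_t);
}
```

Since nine registers are pushed after the IRQ encoding, orig_eax ends up in the tenth slot, which matches the description of the tenth field given above.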

The do_IRQ( ) function is equivalent to the following code fragment. Don’t be scared by this function — we are going to explain the code line by line.

int irq = regs.orig_eax & 0xff;  
spin_lock(&(irq_desc[irq].lock));
irq_desc[irq].handler->ack(irq);
irq_desc[irq].status &= ~(IRQ_REPLAY | IRQ_WAITING);
irq_desc[irq].status |= IRQ_PENDING;
if (!(irq_desc[irq].status & (IRQ_DISABLED | IRQ_INPROGRESS)) && irq_desc[irq].action) {
    irq_desc[irq].status |= IRQ_INPROGRESS;
    do {
        irq_desc[irq].status &= ~IRQ_PENDING;
        spin_unlock(&(irq_desc[irq].lock));
        handle_IRQ_event(irq, &regs, irq_desc[irq].action);
        spin_lock(&(irq_desc[irq].lock));
    } while (irq_desc[irq].status & IRQ_PENDING);
    irq_desc[irq].status &= ~IRQ_INPROGRESS;
} 
irq_desc[irq].handler->end(irq);
spin_unlock(&(irq_desc[irq].lock));  
if (softirq_pending(smp_processor_id( )))
    do_softirq( );

First of all, the do_IRQ( ) function gets the IRQ vector passed as a parameter on the stack and puts it in the irq local variable. This value is used as an index to access the proper element of the irq_desc array (the IRQ main descriptor).
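
The arithmetic behind the first statement is worth checking once. The sketch below pairs the value stored by the pushl $n-256 instruction with the mask applied by do_IRQ( ); the function names encode_irq and decode_irq are hypothetical, and the code assumes the usual two's complement representation of negative integers.

```c
#include <assert.h>

#define NR_IRQS 224

/* What "pushl $n-256" stores in the orig_eax slot. */
int encode_irq(int n)
{
    assert(n >= 0 && n < NR_IRQS);
    return n - 256;          /* always negative for a valid IRQ number */
}

/* What "regs.orig_eax & 0xff" recovers in do_IRQ(). */
int decode_irq(int orig_eax)
{
    return orig_eax & 0xff;  /* keep only the low-order byte */
}
```

For example, IRQ 11 is stored as -245, whose low-order byte is 0x0b, so the mask yields 11 again.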

Before accessing the main IRQ descriptor, the kernel acquires the corresponding spin lock. We’ll see in Chapter 5 that the spin lock protects against concurrent accesses by different CPUs (in a uniprocessor system, the spin_lock( ) function does nothing). This spin lock is necessary in a multiprocessor system because other interrupts of the same kind may be raised, and other CPUs might take care of the new interrupt occurrences. Without the spin lock, the main IRQ descriptor would be accessed concurrently by several CPUs. As we’ll see, this situation must be absolutely avoided.

After acquiring the spin lock, the function invokes the ack method of the main IRQ descriptor. In a uniprocessor system, the corresponding mask_and_ack_8259A( ) function acknowledges the interrupt on the PIC and also disables the IRQ line. Masking the IRQ line ensures that the CPU does not accept further occurrences of this type of interrupt until the handler terminates. Remember that the do_IRQ( ) function runs with local interrupts disabled; in fact, the CPU control unit automatically clears the IF flag of the eflags register because the interrupt handler is invoked through an IDT’s interrupt gate. However, we’ll see shortly that the kernel might re-enable local interrupts before executing the interrupt service routines of this interrupt.

In a multiprocessor system, however, things are much more complicated. Depending on the type of interrupt, acknowledging the interrupt could either be done by the ack method or delayed until the interrupt handler terminates (that is, acknowledgement could be done by the end method). In either case, we can take for granted that the local APIC doesn’t accept further interrupts of this type until the handler terminates, although further occurrences of this type of interrupt may be accepted by other CPUs (main IRQ descriptor’s spin lock comes to the rescue!).

The do_IRQ( ) function then initializes a few flags of the main IRQ descriptor. It sets the IRQ_PENDING flag because the interrupt has been acknowledged (well, sort of), but not yet really serviced; it also clears the IRQ_WAITING and IRQ_REPLAY flags (but we don’t have to care about them now).

Now do_IRQ( ) checks whether it must really handle the interrupt. There are three cases in which nothing has to be done. These are discussed in the following list.

IRQ_DISABLED is set: A CPU might execute the do_IRQ( ) function even if the corresponding IRQ line is disabled; you’ll find an explanation for this nonintuitive case in Section 4.6.1.6 later in this chapter. Moreover, buggy motherboards may generate spurious interrupts even when the IRQ line is disabled in the PIC.
IRQ_INPROGRESS is set: In a multiprocessor system, another CPU might be handling a previous occurrence of the same interrupt. Why not defer the handling of this occurrence to that CPU? This is exactly what is done by Linux. This leads to a simpler kernel architecture because device drivers’ interrupt service routines need not be reentrant (their execution is serialized). Moreover, the freed CPU can quickly return to what it was doing, without dirtying its hardware cache; this is beneficial to system performance. The IRQ_INPROGRESS flag is set whenever a CPU is committed to execute the interrupt service routines of the interrupt; therefore, the do_IRQ( ) function checks it before starting the real work.
irq_desc[irq].action is NULL: This case occurs when there are no interrupt service routines associated with the interrupt. Normally, this happens only when the kernel is probing a hardware device.

Let’s suppose that none of the three cases holds, so the interrupt has to be serviced. do_IRQ( ) sets the IRQ_INPROGRESS flag and starts a loop. In each iteration, the function clears the IRQ_PENDING flag, releases the interrupt spin lock, and executes the interrupt service routines by invoking handle_IRQ_event( ) (described in Section 4.6.1.7 later in this chapter). When the latter function terminates, do_IRQ( ) acquires the spin lock again and checks the value of the IRQ_PENDING flag. If it is clear, no further occurrence of the interrupt has been delivered to another CPU, so the loop ends. Conversely, if IRQ_PENDING is set, another CPU has executed the do_IRQ( ) function for this type of interrupt while this CPU was executing handle_IRQ_event( ). Therefore, do_IRQ( ) performs another iteration of the loop, servicing the new occurrence of the interrupt.^[32]

Our do_IRQ( ) function is now going to terminate, either because it has already executed the interrupt service routines or because it had nothing to do. The function invokes the end method of the main IRQ descriptor. On uniprocessor systems, the corresponding end_8259A_irq( ) function re-enables the IRQ line (unless the interrupt occurrence was spurious). On multiprocessor systems, the end method acknowledges the interrupt (if not already done by the ack method).

Finally, do_IRQ( ) releases the spin lock: the hard work is finished! Before returning, however, the function checks whether deferrable kernel functions are waiting to be executed (see Section 4.7 later in this chapter). In the affirmative case, it invokes the do_softirq( ) function. When do_IRQ( ) terminates, the control is transferred to the ret_from_intr( ) function.
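
The interplay of the IRQ_PENDING and IRQ_INPROGRESS flags is easier to follow in a toy model. The sketch below is single-threaded user-space code, not the kernel function: the spin lock and the ack and end methods are left out, the flag bit values are illustrative, struct line and its members are hypothetical, and the “other CPU” is imitated by re-entering do_irq_sketch( ) from inside the stand-in for handle_IRQ_event( ).

```c
#include <assert.h>

#define IRQ_INPROGRESS 1u   /* illustrative bit values */
#define IRQ_DISABLED   2u
#define IRQ_PENDING    4u

struct line {
    unsigned int status;
    int has_action;      /* nonzero if at least one ISR is registered */
    int isr_runs;        /* how many times the ISR chain was executed */
    int extra_arrivals;  /* occurrences a second CPU delivers mid-handling */
};

void do_irq_sketch(struct line *l);

/* Stand-in for handle_IRQ_event(): runs the ISRs and lets a simulated
 * second CPU deliver one more occurrence while the lock would be released. */
static void run_isrs(struct line *l)
{
    assert(l->status & IRQ_INPROGRESS);  /* only the committed CPU gets here */
    l->isr_runs++;
    if (l->extra_arrivals > 0) {
        l->extra_arrivals--;
        do_irq_sketch(l);   /* "other CPU": sets IRQ_PENDING and backs off */
    }
}

void do_irq_sketch(struct line *l)
{
    l->status |= IRQ_PENDING;
    if (!(l->status & (IRQ_DISABLED | IRQ_INPROGRESS)) && l->has_action) {
        l->status |= IRQ_INPROGRESS;
        do {
            l->status &= ~IRQ_PENDING;
            run_isrs(l);
        } while (l->status & IRQ_PENDING);  /* someone left work for us */
        l->status &= ~IRQ_INPROGRESS;
    }
}
```

When one extra occurrence arrives during handling, the ISR chain runs twice on the same CPU and never concurrently. When the line is disabled, the function returns with IRQ_PENDING still set and no ISR executed.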

Reviving a lost interrupt

The do_IRQ( ) function is small and simple, yet it works properly in most cases. Indeed, the IRQ_PENDING, IRQ_INPROGRESS, and IRQ_DISABLED flags ensure that interrupts are correctly handled even when the hardware is misbehaving. However, things may not work so smoothly in a multiprocessor system.

Suppose that a CPU has an IRQ line enabled. A hardware device raises the IRQ line, and the multi-APIC system selects our CPU for handling the interrupt. Before the CPU acknowledges the interrupt, the IRQ line is masked out by another CPU; as a consequence, the IRQ_DISABLED flag is set. Right afterwards, our CPU starts handling the pending interrupt; therefore, the do_IRQ( ) function acknowledges the interrupt and then returns without executing the interrupt service routines because it finds the IRQ_DISABLED flag set. Therefore, the interrupt occurred before IRQ line disabling, yet it got lost.

To cope with this scenario, when the enable_irq( ) function re-enables the IRQ line, it forces the hardware to generate a new occurrence of the lost interrupt:

spin_lock_irqsave(&(irq_desc[irq].lock), flags);
if (--irq_desc[irq].depth == 0) {
    irq_desc[irq].status &= ~IRQ_DISABLED;
    if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
        irq_desc[irq].status |= IRQ_REPLAY;
        send_IPI_self(irq+32);
    }
    irq_desc[irq].handler->enable(irq);
} 
spin_unlock_irqrestore(&(irq_desc[irq].lock), flags);

The function detects that an interrupt was lost by checking the value of the IRQ_PENDING flag. The flag is always cleared when leaving the interrupt handler; therefore, if the IRQ line is disabled and the flag is set, then an interrupt occurrence has been acknowledged but not yet serviced. In this case it is necessary to issue a new interrupt. This is obtained by forcing the local APIC to generate a self-interrupt (see Section 4.6.2 later in this chapter). The role of the IRQ_REPLAY flag is to ensure that exactly one self-interrupt is generated. Remember that the do_IRQ( ) function clears that flag when it starts handling the interrupt.
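
The role of IRQ_REPLAY can be isolated in a small predicate. In this sketch, the flag bit values are illustrative, struct line, needs_replay, and replay_if_lost are hypothetical names, and a counter stands in for the self-interrupt that the kernel would send.

```c
#include <assert.h>

#define IRQ_PENDING 4u   /* illustrative bit values */
#define IRQ_REPLAY  8u

struct line {
    unsigned int status;
    int self_ipis;       /* counts simulated self-interrupts */
};

/* Nonzero when an occurrence was acknowledged but never serviced
 * and no replay has been requested for it yet. */
int needs_replay(unsigned int status)
{
    return (status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING;
}

void replay_if_lost(struct line *l)
{
    if (needs_replay(l->status)) {
        l->status |= IRQ_REPLAY;
        l->self_ipis++;              /* stands in for send_IPI_self() */
    }
    assert(!needs_replay(l->status)); /* at most one replay per lost occurrence */
}
```

Calling replay_if_lost( ) twice on a line with a lost interrupt generates a single self-interrupt, because the second call finds IRQ_REPLAY already set.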

Interrupt service routines

As mentioned previously, an interrupt service routine implements a device-specific operation. When an interrupt handler must execute the ISRs, it invokes the handle_IRQ_event( ) function. This function essentially performs the steps shown in the following list.

Invokes the irq_enter( ) function to increment the __local_irq_count field of the irq_stat entry of the executing CPU (to learn how many interrupt handlers are stacked in the CPU, see Section 4.6.1.2 earlier in this chapter). As we shall see in Chapter 5, this function also checks that interrupts are not globally disabled.
Enables the local interrupts with the sti assembly language instruction if the SA_INTERRUPT flag is clear.
Executes each interrupt service routine of the interrupt through the following code:
```
do {
    action->handler(irq, action->dev_id, regs);
    action = action->next;
} while (action);
```
At the start of the loop, action points to the start of a list of irqaction data structures that indicate the actions to be taken upon receiving the interrupt (see Figure 4-4 earlier in this chapter).
Disables the local interrupts with the cli assembly language instruction.
Invokes irq_exit( ) to decrement the __local_irq_count field of the irq_stat entry of the executing CPU.
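The steps above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel's handle_IRQ_event( ): the irqaction structure is reduced to three fields, local_irq_count is a plain global standing in for the per-CPU counter, count_isr( ) is a hypothetical ISR, and the sti and cli steps are omitted because they have no user-space equivalent.

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs;  /* opaque here; the model never inspects saved registers */

/* Simplified stand-in for the kernel's irqaction descriptor. */
struct irqaction {
    void (*handler)(int irq, void *dev_id, struct pt_regs *regs);
    void *dev_id;
    struct irqaction *next;
};

static int local_irq_count;  /* stand-in for the per-CPU __local_irq_count */

static void model_handle_irq_event(int irq, struct pt_regs *regs,
                                   struct irqaction *action)
{
    assert(action != NULL);   /* the do/while loop needs at least one ISR */
    local_irq_count++;        /* what irq_enter() does to the counter */
    do {
        action->handler(irq, action->dev_id, regs);
        action = action->next;
    } while (action);
    local_irq_count--;        /* what irq_exit() does to the counter */
}

/* Hypothetical ISR: counts its invocations in the int dev_id points to. */
static void count_isr(int irq, void *dev_id, struct pt_regs *regs)
{
    (void)irq;
    (void)regs;
    ++*(int *)dev_id;
}
```

Registering the same count_isr( ) twice with different dev_id pointers shows how every ISR on a shared line runs once per interrupt, each with its own device identifier.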

All interrupt service routines act on the same parameters:

irq: The IRQ number
dev_id: The device identifier
regs: A pointer to the Kernel Mode stack area containing the registers saved right after the interrupt occurred

The first parameter allows a single ISR to handle several IRQ lines, the second one allows a single ISR to take care of several devices of the same type, and the last one allows the ISR to access the execution context of the interrupted kernel control path. In practice, most ISRs do not use these parameters.

The SA_INTERRUPT flag of the main IRQ descriptor determines whether interrupts must be enabled or disabled when the do_IRQ( ) function invokes an ISR. An ISR that has been invoked with the interrupts in one state is allowed to put them in the opposite state. In a uniprocessor system, this can be achieved by means of the cli (disable interrupts) and sti (enable interrupts) assembly language instructions. Globally enabling or disabling interrupts in a multiprocessor system is a much more complicated task; we’ll deal with it in Chapter 5.

The structure of an ISR depends on the characteristics of the device handled. We’ll give a few examples of ISRs in Chapter 6, Chapter 13, and Chapter 18.

Dynamic allocation of IRQ lines

As noted in Section 4.6.1.1, a few vectors are reserved for specific devices, while the remaining ones are dynamically handled. There is, therefore, a way in which the same IRQ line can be used by several hardware devices even if they do not allow IRQ sharing. The trick is to serialize the activation of the hardware devices so that just one owns the IRQ line at a time.

Before activating a device that is going to use an IRQ line, the corresponding driver invokes request_irq( ). This function creates a new irqaction descriptor and initializes it with the parameter values; it then invokes the setup_irq( ) function to insert the descriptor in the proper IRQ list. The device driver aborts the operation if setup_irq( ) returns an error code, which means that the IRQ line is already in use by another device that does not allow interrupt sharing. When the device operation is concluded, the driver invokes the free_irq( ) function to remove the descriptor from the IRQ list and release the memory area.

Let’s see how this scheme works on a simple example. Assume a program wants to address the /dev/fd0 device file, which corresponds to the first floppy disk on the system.^[33]

The program can do this either by directly accessing /dev/fd0 or by mounting a filesystem on it. Floppy disk controllers are usually assigned IRQ 6; given this, the floppy driver issues the following request:

request_irq(6, floppy_interrupt, 
            SA_INTERRUPT|SA_SAMPLE_RANDOM, "floppy", NULL);

As can be observed, the floppy_interrupt( ) interrupt service routine must execute with the interrupts disabled (SA_INTERRUPT set) and no sharing of the IRQ (SA_SHIRQ flag cleared). The SA_SAMPLE_RANDOM flag set means that accesses to the floppy disk are a good source of random events to be used for the kernel random number generator. When the operation on the floppy disk is concluded (either the I/O operation on /dev/fd0 terminates or the filesystem is unmounted), the driver releases IRQ 6:

free_irq(6, NULL);

To insert an irqaction descriptor in the proper list, the kernel invokes the setup_irq( ) function, passing to it the parameters irq_nr (the IRQ number) and new (the address of a previously allocated irqaction descriptor). This function:

Checks whether another device is already using the irq_nr IRQ and, if so, whether the SA_SHIRQ flags in the irqaction descriptors of both devices specify that the IRQ line can be shared. Returns an error code if the IRQ line cannot be used.
Adds *new (the new irqaction descriptor pointed to by new) at the end of the list to which irq_desc[irq_nr].action points.
If no other device is sharing the same IRQ, clears the IRQ_DISABLED, IRQ_AUTODETECT, and IRQ_INPROGRESS flags in the status field of irq_desc[irq_nr] and invokes the startup method of the irq_desc[irq_nr].handler PIC object to make sure that IRQ signals are enabled.
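The sharing check and the append-at-tail step can be sketched in user space as follows. This is a simplified model, not the kernel's setup_irq( ): the irq_list array, the model_* names, NR_MODEL_IRQS, and the SA_SHIRQ value are assumptions made for illustration, the descriptor is identified by its address rather than by dev_id when it is removed, and the status flags and the startup method are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SA_SHIRQ      0x1u  /* hypothetical value; marks "sharing allowed" */
#define NR_MODEL_IRQS 16

struct irqaction {
    unsigned int flags;
    const char *name;
    struct irqaction *next;
};

static struct irqaction *irq_list[NR_MODEL_IRQS];  /* list head per IRQ */

/* Returns 0 on success, -EBUSY if the line is in use and cannot be shared. */
static int model_setup_irq(unsigned int irq_nr, struct irqaction *new_action)
{
    struct irqaction **p;

    assert(irq_nr < NR_MODEL_IRQS && new_action != NULL);
    p = &irq_list[irq_nr];
    if (*p != NULL) {
        /* Both the current user and the newcomer must allow sharing. */
        if (!((*p)->flags & new_action->flags & SA_SHIRQ))
            return -EBUSY;
        while (*p != NULL)          /* walk to the tail of the list */
            p = &(*p)->next;
    }
    new_action->next = NULL;
    *p = new_action;
    return 0;
}

/* Unlinks old_action from the list of irq_nr, if it is there. */
static void model_free_irq(unsigned int irq_nr, struct irqaction *old_action)
{
    struct irqaction **p;

    assert(irq_nr < NR_MODEL_IRQS);
    for (p = &irq_list[irq_nr]; *p != NULL; p = &(*p)->next) {
        if (*p == old_action) {
            *p = old_action->next;
            return;
        }
    }
}
```

Checking only the first descriptor on the list is enough in this model: a list can grow beyond one entry only if every descriptor already on it was registered with SA_SHIRQ set.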

Here is an example of how setup_irq( ) is used, drawn from system initialization. The kernel initializes the irq0 descriptor of the interval timer device by executing the following instructions in the time_init( ) function (see Chapter 6):

struct irqaction irq0  = 
    {timer_interrupt, SA_INTERRUPT, 0, "timer", NULL,}; 
setup_irq(0, &irq0);

First, the irq0 variable of type irqaction is initialized: the handler field is set to the address of the timer_interrupt( ) function, the flags field is set to SA_INTERRUPT, the name field is set to "timer", and the last field is set to NULL to show that no dev_id value is used. Next, the kernel invokes setup_irq( ) to insert irq0 in the list of irqaction descriptors associated with IRQ0.
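Because the initializer is positional, it may help to see each value next to its field name. The sketch below assumes a field order of handler, flags, mask, name, dev_id, next for irqaction, which is what the positional initializer implies; the SA_INTERRUPT value, the stub timer_interrupt( ), and the same_registration( ) helper are included only so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct pt_regs;  /* opaque in this sketch */

/* Assumed field order, matching the positional initializer in the text. */
struct irqaction {
    void (*handler)(int, void *, struct pt_regs *);
    unsigned long flags;
    unsigned long mask;
    const char *name;
    void *dev_id;
    struct irqaction *next;
};

#define SA_INTERRUPT 0x20000000UL  /* value assumed for illustration */

/* Stub standing in for the real timer ISR. */
static void timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    (void)irq;
    (void)dev_id;
    (void)regs;
}

/* Positional form, as in the text. */
static struct irqaction irq0_positional =
    { timer_interrupt, SA_INTERRUPT, 0, "timer", NULL, };

/* Equivalent designated form (C99); fields not named default to zero. */
static struct irqaction irq0_designated = {
    .handler = timer_interrupt,
    .flags   = SA_INTERRUPT,
    .name    = "timer",
};

/* Hypothetical helper: returns 1 if both descriptors hold the same values. */
static int same_registration(const struct irqaction *a,
                             const struct irqaction *b)
{
    assert(a != NULL && b != NULL);
    return a->handler == b->handler && a->flags == b->flags &&
           a->mask == b->mask && strcmp(a->name, b->name) == 0 &&
           a->dev_id == b->dev_id && a->next == b->next;
}
```

The third positional value, 0, lands in the mask field, and the omitted trailing field (next) is zero-initialized, so the descriptor starts out unlinked.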

Interprocessor Interrupt Handling

On multiprocessor systems, Linux defines the following five kinds of interprocessor interrupts (see also Table 4-2):

CALL_FUNCTION_VECTOR (vector 0xfb): Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding interrupt handler is named call_function_interrupt( ). The function passed as a parameter may, for instance, force all other CPUs to stop, or may force them to set the contents of the Memory Type Range Registers (MTRRs).^[34] The interrupt is usually sent by means of the smp_call_function( ) facility function.
RESCHEDULE_VECTOR (vector 0xfc): When a CPU receives this type of interrupt, the corresponding handler, named reschedule_interrupt( ), limits itself to acknowledging the interrupt. All the rescheduling is done automatically when returning from the interrupt (see Section 4.8 later in this chapter).
INVALIDATE_TLB_VECTOR (vector 0xfd): Sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers. The corresponding handler, named invalidate_interrupt( ), flushes some TLB entries of the processor as described in Section 2.5.7.
ERROR_APIC_VECTOR (vector 0xfe): This interrupt should never occur.
SPURIOUS_APIC_VECTOR (vector 0xff): This interrupt should never occur.

Thanks to the following group of functions, issuing interprocessor interrupts (IPIs) becomes an easy task:

send_IPI_all( ): Sends an IPI to all CPUs (including the sender)
send_IPI_allbutself( ): Sends an IPI to all CPUs except the sender
send_IPI_self( ): Sends an IPI to the sender CPU
send_IPI_mask( ): Sends an IPI to a group of CPUs specified by a bit mask
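The four functions differ only in which CPUs they target. The following sketch models those destination sets as bit masks, one bit per CPU; it says nothing about how the local APIC actually delivers the interrupt. The cpumask_model_t type and the targets_* helpers are hypothetical, and restricting send_IPI_mask( ) to online CPUs is an assumption of the model.

```c
#include <assert.h>

typedef unsigned long cpumask_model_t;  /* one bit per CPU; hypothetical */

/* send_IPI_all(): every online CPU, sender included. */
static cpumask_model_t targets_all(cpumask_model_t online, unsigned int self)
{
    (void)self;
    return online;
}

/* send_IPI_allbutself(): every online CPU except the sender. */
static cpumask_model_t targets_allbutself(cpumask_model_t online,
                                          unsigned int self)
{
    return online & ~(1UL << self);
}

/* send_IPI_self(): the sender only. */
static cpumask_model_t targets_self(cpumask_model_t online, unsigned int self)
{
    assert(online & (1UL << self));  /* the sender must itself be online */
    return 1UL << self;
}

/* send_IPI_mask(): the CPUs named by the caller's mask (model assumption:
 * bits for CPUs that are not online are ignored). */
static cpumask_model_t targets_mask(cpumask_model_t online,
                                    cpumask_model_t mask)
{
    return online & mask;
}
```

With four online CPUs (mask 0xf) and CPU 1 as the sender, for example, the model yields 0xf, 0xd, and 0x2 for the first three helpers.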

The assembly language code of the interprocessor interrupt handlers is generated by the BUILD_SMP_INTERRUPT macro; the code is almost identical to the code generated by the BUILD_IRQ macro (see Section 4.6.1.4 earlier in this chapter).

Each interprocessor interrupt has a different high-level handler, which has the same name as the low-level handler preceded by smp_. For instance, the high-level handler of the RESCHEDULE_VECTOR interprocessor interrupt that is invoked by the low-level reschedule_interrupt( ) handler is named smp_reschedule_interrupt( ). Each high-level handler acknowledges the interprocessor interrupt on the local APIC and then performs the specific action triggered by the interrupt.

^[28]Contrary to disable_irq_nosync( ), disable_irq(n) waits until all interrupt handlers for IRQn that are running on other CPUs have completed before returning.

^[29]There is an exception, though. Linux usually sets up the local APICs in such a way as to honor the focus processor. When an IRQ signal is raised, the focus processor for that IRQ is the CPU to which a previous occurrence of the same IRQ has already been sent; moreover, either the interrupt is still pending (waiting to be handled) or the CPU is still servicing the corresponding interrupt handler. When focus mode is enabled, an interrupt is always sent to its focus processor, if it exists. However, Intel has dropped support for focus processors in the Pentium 4 model.

^[30]Subtracting 256 from an IRQ number yields a negative number. Positive numbers are reserved to identify system calls (see Chapter 9).

^[31]The ret_from_intr( ) return address is missing from the pt_regs structure because the C compiler expects a return address on top of the stack. It takes this into account when generating the instructions to address parameters.

^[32]Because IRQ_PENDING is a flag and not a counter, only the second occurrence of the interrupt can be recognized. Further occurrences in each iteration of the loop in do_IRQ( ) are simply lost.

^[33]Floppy disks are “old” devices that do not usually allow IRQ sharing.

^[34]Starting with the Pentium Pro model, Intel microprocessors include these additional registers to easily customize cache operations. For instance, Linux may use these registers to disable the hardware cache for the addresses mapping the frame buffer of a PCI/AGP graphic card while maintaining the “write combining” mode of operation: the paging unit combines write transfers into larger chunks before copying them into the frame buffer.


Understanding the Linux Kernel, Second Edition by