Unix kernels provide an execution environment in which applications may run. Therefore, the kernel must implement a set of services and corresponding interfaces. Applications use those interfaces and do not usually interact directly with hardware resources.
As already mentioned, a CPU can run in either User Mode or Kernel Mode. Actually, some CPUs can have more than two execution states. For instance, the 80 × 86 microprocessors have four different execution states. But all standard Unix kernels use only Kernel Mode and User Mode.
When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program usually executes in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program’s request, it puts the program back in User Mode.
Processes are dynamic entities that usually have a limited life span within the system. The task of creating, eliminating, and synchronizing the existing processes is delegated to a group of routines in the kernel.
The kernel itself is not a process but a process manager. The process/kernel model assumes that processes that require a kernel service use specific programming constructs called system calls . Each system call sets up the group of parameters that identifies the process request and then executes the hardware-dependent CPU instruction to switch from User Mode to Kernel Mode.
They run in Kernel Mode in the kernel address space.
They do not interact with users, and thus do not require terminal devices.
They are usually created during system startup and remain alive until the system is shut down.
On a uniprocessor system, only one process is running at a time and it may run either in User or in Kernel Mode. If it runs in Kernel Mode, the processor is executing some kernel routine. Figure 1-3 illustrates examples of transitions between User and Kernel Mode. Process 1 in User Mode issues a system call, after which the process switches to Kernel Mode and the system call is serviced. Process 1 then resumes execution in User Mode until a timer interrupt occurs and the scheduler is activated in Kernel Mode. A process switch takes place and Process 2 starts its execution in User Mode until a hardware device raises an interrupt. As a consequence of the interrupt, Process 2 switches to Kernel Mode and services the interrupt.
Unix kernels do much more than handle system calls; in fact, kernel routines can be activated in several ways:
A process invokes a system call.
A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention, a status change, or the completion of an I/O operation. Each interrupt signal is dealt by a kernel program called an interrupt handler . Since peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
A kernel thread is executed. Since it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor. These include:
The program counter (PC) and stack pointer (SP) registers
The general purpose registers
The floating point registers
The processor control registers (Processor Status Word) containing information about the CPU state
The memory management registers used to keep track of the RAM accessed by the process
When the kernel decides to resume executing a process, it uses the proper process descriptor fields to load the CPU registers. Since the stored value of the program counter points to the instruction following the last instruction executed, the process resumes execution at the point where it was stopped.
When a process is not executing on the CPU, it is waiting for some event. Unix kernels distinguish many wait states, which are usually implemented by queues of process descriptors; each (possibly empty) queue corresponds to the set of processes waiting for a specific event.
All Unix kernels are reentrant . This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of some process, the kernel lets the disk controller handle it, and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.
One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions . But a reentrant kernel is not limited just to such reentrant functions (although that is how some real-time kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time. Every process in Kernel Mode acts on its own set of memory locations and cannot interfere with the others.
If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, since it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
Now let’s look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.
In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths:
A process executing in User Mode invokes a system call, and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
The CPU detects an exception—for example, access to a page not present in RAM—while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure can allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled. The first kernel control path is left unfinished and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case, the two kernel control paths run in the execution context of the same process, and the total elapsed system time is accounted to it. However, the interrupt handler doesn’t necessarily operate on behalf of the process.
Figure 1-4 illustrates a few examples of noninterleaved and interleaved kernel control paths. Three different CPU states are considered:
Running a process in User Mode (User)
Running an exception or a system call handler (Excp)
Running an interrupt handler (Intr)
Each process runs in its private address space. A process running in User Mode refers to private stack, data, and code areas. When running in Kernel Mode, the process addresses the kernel data and code area and uses another stack.
Since the kernel is reentrant, several kernel control paths—each related to a different process—may be executed in turn. In this case, each kernel control path refers to its own private kernel stack.
While it appears to each process that it has access to a private address space, there are times when part of the address space is shared among processes. In some cases, this sharing is explicitly requested by processes; in others, it is done automatically by the kernel to reduce memory usage.
If the same program, say an editor, is needed simultaneously by several users, the program is loaded into memory only once, and its instructions can be shared by all of the users who need it. Its data, of course, must not be shared because each user will have separate data. This kind of shared address space is done automatically by the kernel to save memory.
Processes can also share parts of their address space as a kind of interprocess communication, using the “shared memory” technique introduced in System V and supported by Linux.
Finally, Linux supports the
mmap( ) system call,
which allows part of a file or the memory residing on a device to be
mapped into a part of a process address space.
mapping can provide an alternative to normal reads and writes for
transferring data. If the same file is shared by several processes,
its memory mapping is included in the address space of each of the
processes that share it.
Implementing a reentrant kernel requires the use of synchronization. If a kernel control path is suspended while acting on a kernel data structure, no other kernel control path should be allowed to act on the same data structure unless it has been reset to a consistent state. Otherwise, the interaction of the two control paths could corrupt the stored information.
For example, suppose a global variable V contains the number of available items of some system resource. The first kernel control path, A, reads the variable and determines that there is just one available item. At this point, another kernel control path, B, is activated and reads the same variable, which still contains the value 1. Thus, B decrements V and starts using the resource item. Then A resumes the execution; because it has already read the value of V, it assumes that it can decrement V and take the resource item, which B already uses. As a final result, V contains -1, and two kernel control paths use the same resource item with potentially disastrous effects.
In general, safe access to a global variable is ensured by using atomic operations . In the previous example, data corruption is not possible if the two control paths read and decrement V with a single, noninterruptible operation. However, kernels contain many data structures that cannot be accessed with a single operation. For example, it usually isn’t possible to remove an element from a linked list with a single operation because the kernel needs to access at least two pointers at once. Any section of code that should be finished by each process that begins it before another process can enter it is called a critical region .
These problems occur not only among kernel control paths, but also among processes sharing common data. Several synchronization techniques have been adopted. The following section concentrates on how to synchronize kernel control paths.
In search of a drastically simple solution to synchronization problems, most traditional Unix kernels are nonpreemptive: when a process executes in Kernel Mode, it cannot be arbitrarily suspended and substituted with another process. Therefore, on a uniprocessor system, all kernel data structures that are not updated by interrupts or exception handlers are safe for the kernel to access.
Of course, a process in Kernel Mode can voluntarily relinquish the CPU, but in this case, it must ensure that all data structures are left in a consistent state. Moreover, when it resumes its execution, it must recheck the value of any previously accessed data structures that could be changed.
Another synchronization mechanism for uniprocessor systems consists of disabling all hardware interrupts before entering a critical region and reenabling them right after leaving it. This mechanism, while simple, is far from optimal. If the critical region is large, interrupts can remain disabled for a relatively long time, potentially causing all hardware activities to freeze.
Moreover, on a multiprocessor system, this mechanism doesn’t work at all. There is no way to ensure that no other CPU can access the same data structures that are updated in the protected critical region.
A widely used mechanism, effective in both uniprocessor and multiprocessor systems, relies on the use of semaphores . A semaphore is simply a counter associated with a data structure; it is checked by all kernel threads before they try to access the data structure. Each semaphore may be viewed as an object composed of:
An integer variable
A list of waiting processes
Two atomic methods:
down( ) method decrements the value of the
semaphore. If the new value is less than 0, the method adds the
running process to the semaphore list and then blocks (i.e., invokes
the scheduler). The
up( ) method increments the
value of the semaphore and, if its new value is less than or equal
to 0, reactivates one or more processes in the semaphore list.
Each data structure to be protected has its own semaphore, which is
initialized to 1. When a kernel control path wishes to access the
data structure, it executes the
down( ) method on
the proper semaphore. If the value of the new semaphore
isn’t negative, access to the data structure is
granted. Otherwise, the process that is executing the kernel control
path is added to the semaphore list and blocked. When another process
up( ) method on that semaphore, one
of the processes in the semaphore list is allowed to proceed.
In multiprocessor systems, semaphores are not always the best solution to the synchronization problems. Some kernel data structures should be protected from being concurrently accessed by kernel control paths that run on different CPUs. In this case, if the time required to update the data structure is short, a semaphore could be very inefficient. To check a semaphore, the kernel must insert a process in the semaphore list and then suspend it. Since both operations are relatively expensive, in the time it takes to complete them, the other kernel control path could have already released the semaphore.
In these cases, multiprocessor operating systems use spin locks . A spin lock is very similar to a semaphore, but it has no process list; when a process finds the lock closed by another process, it “spins” around repeatedly, executing a tight instruction loop until the lock becomes open.
Of course, spin locks are useless in a uniprocessor environment. When a kernel control path tries to access a locked data structure, it starts an endless loop. Therefore, the kernel control path that is updating the protected data structure would not have a chance to continue the execution and release the spin lock. The final result would be that the system hangs.
Processes or kernel control paths that synchronize with other control paths may easily enter a deadlocked state. The simplest case of deadlock occurs when process p1 gains access to data structure a and process p2 gains access to b, but p1 then waits for b and p2 waits for a. Other more complex cyclic waits among groups of processes may also occur. Of course, a deadlock condition causes a complete freeze of the affected processes or kernel control paths.
As far as kernel design is concerned, deadlocks become an issue when the number of kernel semaphores used is high. In this case, it may be quite difficult to ensure that no deadlock state will ever be reached for all possible ways to interleave kernel control paths. Several operating systems, including Linux, avoid this problem by introducing a very limited number of semaphores and requesting semaphores in an ascending order.
a mechanism for notifying processes of system events. Each event has
its own signal number, which is usually referred to by a symbolic
constant such as
SIGTERM. There are two kinds of
- Asynchronous notifications
For instance, a user can send the interrupt signal
SIGINTto a foreground process by pressing the interrupt keycode (usually CTRL-C) at the terminal.
- Synchronous errors or exceptions
For instance, the kernel sends the signal
SIGSEGVto a process when it accesses a memory location at an illegal address.
The POSIX standard defines about 20 different signals, two of which are user-definable and may be used as a primitive mechanism for communication and synchronization among processes in User Mode. In general, a process may react to a signal delivery in two possible ways:
Ignore the signal.
Asynchronously execute a specified procedure (the signal handler).
Kernel signal handling is rather elaborate since the POSIX semantics
allows processes to temporarily block signals. Moreover, the
cannot be directly handled by the process or ignored.
AT&T’s Unix System V introduced other kinds of interprocess communication among processes in User Mode, which have been adopted by many Unix kernels: semaphores , message queues , and shared memory . They are collectively known as System V IPC .
The kernel implements these constructs as IPC resources
. A process acquires a resource by
semget( ), or
msgget( ) system call. Just like
files, IPC resources are persistent: they must be explicitly
deallocated by the creator process, by the current owner, or by a
Semaphores are similar to those described in Section 1.6.5, earlier in this chapter,
except that they are reserved for processes in User Mode. Message
queues allow processes to exchange messages by using the
msgsnd( ) and
msgget( ) system
calls, which insert a message into a specific message queue and
extract a message from it, respectively.
Shared memory provides the fastest way for processes to exchange and
share data. A process starts by issuing a
shmget( ) system call to create a new shared memory having a
required size. After obtaining the IPC resource identifier, the
process invokes the
shmat( ) system call, which
returns the starting address of the new region within the process
address space. When the process wishes to detach the shared memory
from its address space, it invokes the
system call. The implementation of shared memory depends on how the
kernel implements process address spaces.
Unix makes a neat distinction
between the process and the program it is executing. To that end, the
fork( ) and
_exit( ) system
calls are used respectively to create a new process and to terminate
it, while an
exec( )-like system call is invoked
to load a new program. After such a system call is executed, the
process resumes execution with a brand new address space containing
the loaded program.
The process that invokes a
fork( ) is the
while the new process is its
Parents and children can find one another because the data structure
describing each process includes a pointer to its immediate parent
and pointers to all its immediate children.
A naive implementation of the
fork( ) would
require both the parent’s data and the
parent’s code to be duplicated and assign the copies
to the child. This would be quite time consuming. Current kernels
that can rely on hardware paging units follow the Copy-On-Write
approach, which defers page duplication until the last moment (i.e.,
until the parent or the child is required to write into a page). We
shall describe how Linux implements this technique in
_exit( ) system call terminates a process. The
kernel handles this system call by releasing the resources owned by
the process and sending the parent process a
SIGCHLD signal, which is ignored by default.
can a parent process inquire about termination of its children? The
wait( ) system call allows a process to wait until
one of its children terminates; it returns the process ID (PID) of
the terminated child.
When executing this system call, the kernel checks whether a child
has already terminated. A special
process state is introduced to
represent terminated processes: a process remains in that state until
its parent process executes a
wait( ) system call
on it. The system call handler extracts data about resource usage
from the process descriptor fields; the process descriptor may be
released once the data is collected. If no child process has already
terminated when the
wait( ) system call is
executed, the kernel usually puts the process in a wait state until a
Many kernels also implement a
waitpid( ) system
call, which allows a process to wait for a specific child process.
Other variants of
wait( ) system calls are also
It’s good practice for the kernel to keep around
information on a child process until the parent issues its
wait( ) call, but suppose the parent process
terminates without issuing that call? The information takes up
valuable memory slots that could be used to serve living processes.
For example, many shells allow the user to start a command in the
background and then log out. The process that is running the command
shell terminates, but its children continue their execution.
The solution lies in a special system process called
, which is created during system
initialization. When a process terminates, the kernel changes the
appropriate process descriptor pointers of all the existing children
of the terminated process to make them become children of
init. This process monitors the execution of all
its children and routinely issues
wait( ) system
calls, whose side effect is to get rid of all zombies.
$ ls | sort | more
a shell that supports process groups, such as
bash, creates a new group for the three processes
more. In this way, the shell acts on the three
processes as if they were a single entity (the job, to be precise).
Each process descriptor includes a process group ID
field. Each group of processes may have a
, which is the process whose PID
coincides with the process group ID. A newly created process is
initially inserted into the process group of its parent.
Modern Unix kernels also introduce login sessions
. Informally, a login session contains
all processes that are descendants of the process that has started a
working session on a specific terminal—usually, the first
command shell process created for the user. All processes in a
process group must be in the same login session. A login session may
have several process groups active simultaneously; one of these
process groups is always in the foreground, which means that it has
access to the terminal. The other active process groups are in the
background. When a background process tries to access the terminal,
it receives a
SIGTTOUT signal. In many command shells, the
be used to put a process group in either the background or the
Memory management is by far the most complex activity in a Unix kernel. More than a third of this book is dedicated just to describing how Linux does it. This section illustrates some of the main issues related to memory management.
All recent Unix systems provide a useful abstraction called virtual memory . Virtual memory acts as a logical layer between the application memory requests and the hardware Memory Management Unit (MMU). Virtual memory has many purposes and advantages:
Several processes can be executed concurrently.
It is possible to run applications whose memory needs are larger than the available physical memory.
Processes can execute a program whose code is only partially loaded in memory.
Each process is allowed to access a subset of the available physical memory.
Processes can share a single memory image of a library or program.
Programs can be relocatable — that is, they can be placed anywhere in physical memory.
Programmers can write machine-independent code, since they do not need to be concerned about physical memory organization.
The main ingredient of a virtual memory subsystem is the notion of virtual address space . The set of memory references that a process can use is different from physical memory addresses. When a process uses a virtual address, the kernel and the MMU cooperate to locate the actual physical location of the requested memory item.
Today’s CPUs include hardware circuits that automatically translate the virtual addresses into physical ones. To that end, the available RAM is partitioned into page frames 4 or 8 KB in length, and a set of Page Tables is introduced to specify how virtual addresses correspond to physical addresses. These circuits make memory allocation simpler, since a request for a block of contiguous virtual addresses can be satisfied by allocating a group of page frames having noncontiguous physical addresses.
All Unix operating systems clearly distinguish between two portions of the random access memory (RAM). A few megabytes are dedicated to storing the kernel image (i.e., the kernel code and the kernel static data structures). The remaining portion of RAM is usually handled by the virtual memory system and is used in three possible ways:
To satisfy kernel requests for buffers, descriptors, and other dynamic kernel data structures
To satisfy process requests for generic memory areas and for memory mapping of files
To get better performance from disks and other buffered devices by means of caches
Each request type is valuable. On the other hand, since the available RAM is limited, some balancing among request types must be done, particularly when little available memory is left. Moreover, when some critical threshold of available memory is reached and a page-frame-reclaiming algorithm is invoked to free additional memory, which are the page frames most suitable for reclaiming? As we shall see in Chapter 16, there is no simple answer to this question and very little support from theory. The only available solution lies in developing carefully tuned empirical algorithms.
One major problem that must be solved by the virtual memory system is memory fragmentation . Ideally, a memory request should fail only when the number of free page frames is too small. However, the kernel is often forced to use physically contiguous memory areas, hence the memory request could fail even if there is enough memory available but it is not available as one contiguous chunk.
The Kernel Memory Allocator (KMA) is a subsystem that tries to satisfy the requests for memory areas from all parts of the system. Some of these requests come from other kernel subsystems needing memory for kernel use, and some requests come via system calls from user programs to increase their processes’ address spaces. A good KMA should have the following features:
It must be fast. Actually, this is the most crucial attribute, since it is invoked by all kernel subsystems (including the interrupt handlers).
It should minimize the amount of wasted memory.
It should try to reduce the memory fragmentation problem.
It should be able to cooperate with the other memory management subsystems to borrow and release page frames from them.
Several proposed KMAs, which are based on a variety of different algorithmic techniques, include:
Resource map allocator
Power-of-two free lists
Mach’s Zone allocator
Solaris’s Slab allocator
As we shall see in Chapter 7, Linux’s KMA uses a Slab allocator on top of a buddy system.
The address space of a process contains all the virtual memory
addresses that the process is allowed to reference. The kernel
usually stores a process virtual address space as a list of
memory area descriptors
. For example, when a process starts
the execution of some program via an
system call, the kernel assigns to the process a virtual address
space that comprises memory areas for:
The executable code of the program
The initialized data of the program
The uninitialized data of the program
The initial program stack (i.e., the User Mode stack)
The executable code and data of needed shared libraries
The heap (the memory dynamically requested by the program)
All recent Unix operating systems adopt a memory allocation strategy
called demand paging
. With demand paging, a process can start
program execution with none of its pages in physical memory. As it
accesses a nonpresent page, the MMU generates an exception; the
exception handler finds the affected memory region, allocates a free
page, and initializes it with the appropriate data. In a similar
fashion, when the process dynamically requires memory by using
malloc( ) or the
brk( ) system
call (which is invoked internally by
the kernel just updates the size of the heap memory region of the
process. A page frame is assigned to the process only when it
generates an exception by trying to refer its virtual memory
Virtual address spaces also allow other efficient strategies, such as the Copy-On-Write strategy mentioned earlier. For example, when a new process is created, the kernel just assigns the parent’s page frames to the child address space, but marks them read-only. An exception is raised as soon the parent or the child tries to modify the contents of a page. The exception handler assigns a new page frame to the affected process and initializes it with the contents of the original page.
To extend the size of the virtual address space usable by the processes, the Unix operating system uses swap areas on disk. The virtual memory system regards the contents of a page frame as the basic unit for swapping. Whenever a process refers to a swapped-out page, the MMU raises an exception. The exception handler then allocates a new page frame and initializes the page frame with its old contents saved on disk.
On the other hand, physical memory is also used as cache for hard
disks and other block devices. This is because hard drives are very
slow: a disk access requires several milliseconds, which is a very
long time compared with the RAM access time. Therefore, disks are
often the bottleneck in system performance. As a general rule, one of
the policies already implemented in the earliest Unix system is to
defer writing to disk as long as possible by loading into RAM a set
of disk buffers that correspond to blocks read from disk. The
sync( ) system call forces disk synchronization by
writing all of the “dirty” buffers
(i.e., all the buffers whose contents differ from that of the
corresponding disk blocks) into disk. To avoid data loss, all
operating systems take care to periodically write dirty buffers back
The kernel interacts with I/O devices by means of device drivers . Device drivers are included in the kernel and consist of data structures and functions that control one or more devices, such as hard disks, keyboards, mouses, monitors, network interfaces, and devices connected to a SCSI bus. Each driver interacts with the remaining part of the kernel (even with other drivers) through a specific interface. This approach has the following advantages:
Device-specific code can be encapsulated in a specific module.
Vendors can add new devices without knowing the kernel source code; only the interface specifications must be known.
The kernel deals with all devices in a uniform way and accesses them through the same interface.
It is possible to write a device driver as a module that can be dynamically loaded in the kernel without requiring the system to be rebooted. It is also possible to dynamically unload a module that is no longer needed, therefore minimizing the size of the kernel image stored in RAM.
Figure 1-5 illustrates how device drivers interface with the rest of the kernel and with the processes.
Some user programs (P) wish to operate on hardware devices. They make
requests to the kernel using the usual file-related system calls and
the device files normally found in the
directory. Actually, the device files are the user-visible portion of
the device driver interface. Each device file refers to a specific
device driver, which is invoked by the kernel to perform the
requested operation on the hardware component.
At the time Unix was introduced, graphical terminals were uncommon and expensive, so only alphanumeric terminals were handled directly by Unix kernels. When graphical terminals became widespread, ad hoc applications such as the X Window System were introduced that ran as standard processes and accessed the I/O ports of the graphics interface and the RAM video area directly. Some recent Unix kernels, such as Linux 2.4, provide an abstraction for the frame buffer of the graphic card and allow application software to access them without needing to know anything about the I/O ports of the graphics interface (see Section 13.3.1.)
 Synchronization problems have been fully described in other works; we refer the interested reader to books on the Unix operating systems (see the bibliography).