This chapter delves into the area of Linux memory management, with an emphasis on techniques that are useful to the device driver writer. The material in this chapter is somewhat advanced, and not everybody will need a grasp of it. Nonetheless, many tasks can only be done through digging more deeply into the memory management subsystem; it also provides an interesting look into how an important part of the kernel works.
The material in this chapter is divided into three sections. The
first covers the implementation of the mmap
system call, which allows the mapping of device memory directly into a
user process’s address space. We then cover the kernel
kiobuf mechanism, which provides direct access to
user memory from kernel space. The kiobuf mechanism
may be used to implement “raw I/O” for certain kinds of devices.
The final section covers direct memory access (DMA) I/O operations,
which essentially provide peripherals with direct access to system
memory.
Of course, all of these techniques require an understanding of how Linux memory management works, so we start with an overview of that subsystem.
Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation of the theory. Although you do not need to be a Linux virtual memory guru to implement mmap, a basic overview of how things work is useful. What follows is a fairly lengthy description of the data structures used by the kernel to manage memory. Once the necessary background has been covered, we can get into working with these structures.
Linux is, of course, a virtual memory system, meaning that the addresses seen by user programs do not directly correspond to the physical addresses used by the hardware. Virtual memory introduces a layer of indirection, which allows a number of nice things. With virtual memory, programs running on the system can allocate far more memory than is physically available; indeed, even a single process can have a virtual address space larger than the system’s physical memory. Virtual memory also allows playing a number of tricks with the process’s address space, including mapping in device memory.
Thus far, we have talked about virtual and physical addresses, but a number of the details have been glossed over. The Linux system deals with several types of addresses, each with its own semantics. Unfortunately, the kernel code is not always very clear on exactly which type of address is being used in each situation, so the programmer must be careful.
The following is a list of address types used in Linux. Figure 13-1 shows how these address types relate to physical memory.
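User virtual addresses
These are the regular addresses seen by user-space programs. User addresses are either 32 or 64 bits in length, depending on the underlying hardware architecture, and each process has its own virtual address space.
Physical addresses
The addresses used between the processor and the system’s memory. Physical addresses are 32- or 64-bit quantities; even 32-bit systems can use 64-bit physical addresses in some situations.
Bus addresses
The addresses used between peripheral buses and memory. Often they are the same as the physical addresses used by the processor, but that is not necessarily the case. Bus addresses are highly architecture dependent.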
Kernel logical addresses
These make up the normal address space of the kernel. These addresses
map most or all of main memory, and are often treated as if they were
physical addresses. On most architectures, logical addresses and their
associated physical addresses differ only by a constant
offset. Logical addresses use the hardware’s native pointer size, and
thus may be unable to address all of physical memory on heavily
equipped 32-bit systems. Logical addresses are usually stored in
variables of type
unsigned long or
void *. Memory returned from kmalloc has a
logical address.
Kernel virtual addresses
These differ from logical addresses in that they do not necessarily have a direct mapping to physical addresses. All logical addresses are kernel virtual addresses; memory allocated by vmalloc also has a virtual address (but no direct physical mapping). The function kmap, described later in this chapter, also returns virtual addresses. Virtual addresses are usually stored in pointer variables.
If you have a logical address, the macro
__pa() (defined in
<asm/page.h>) will return its associated
physical address. Physical addresses can be mapped back to logical
addresses with __va(), but only for
low-memory pages.
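To make the constant-offset relationship concrete, here is a minimal user-space sketch of the two conversions. It is only a model: the offset value 0xC0000000 is the common i386 default and is assumed here, and the names model_pa and model_va are hypothetical stand-ins for the real macros, which can be used only in kernel code.

```c
/* Assumed i386-style offset; the real constant is architecture specific. */
#define MODEL_PAGE_OFFSET 0xC0000000UL

/* Logical address -> physical address: subtract the constant offset. */
unsigned long model_pa(unsigned long logical)
{
    return logical - MODEL_PAGE_OFFSET;
}

/* Physical address -> logical address; meaningful only for low memory. */
unsigned long model_va(unsigned long phys)
{
    return phys + MODEL_PAGE_OFFSET;
}
```

Under these assumptions, the logical address 0xC0100000 corresponds to physical address 0x00100000, and applying model_va to that result gives back the original logical address.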
Different kernel functions require different types of addresses. It would be nice if there were different C types defined so that the required address type were explicit, but we have no such luck. In this chapter, we will be clear on which types of addresses are used where.
The difference between logical and kernel virtual addresses is highlighted on 32-bit systems that are equipped with large amounts of memory. With 32 bits, it is possible to address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to substantially less memory than that, however, because of the way it sets up the virtual address space. The system was unable to handle more memory than it could set up logical addresses for, since it needed directly mapped kernel addresses for all memory.
Recent developments have eliminated the limitations on memory, and 32-bit systems can now work with well over 4 GB of system memory (assuming, of course, that the processor itself can address that much memory). The limitation on how much memory can be directly mapped with logical addresses remains, however. Only the lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel configuration) has logical addresses; the rest (high memory) does not. High memory can require 64-bit physical addresses, and the kernel must set up explicit virtual address mappings to manipulate it. Thus, many kernel functions are limited to low memory only; high memory tends to be reserved for user-space process pages.
The term “high memory” can be confusing to some, especially since it has other meanings in the PC world. So, to make things clear, we’ll define the terms here:
Low memory
Memory for which logical addresses exist in kernel space. On almost every system you will likely encounter, all memory is low memory.
High memory
Memory for which logical addresses do not exist, because the system contains more physical memory than can be addressed with 32 bits.
On i386 systems, the boundary between low and high memory is usually set at just under 1 GB. This boundary is not related in any way to the old 640 KB limit found on the original PC. It is, instead, a limit set by the kernel itself as it splits the 32-bit address space between kernel and user space.
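The “just under 1 GB” figure can be reproduced with a little arithmetic. The sketch below assumes the customary i386 split of 3 GB for user space and 1 GB for the kernel, and a 128 MB slice of the kernel’s share set aside for vmalloc and other dynamic mappings; both numbers are typical defaults rather than fixed rules, and all of the names are hypothetical.

```c
/* Sizes are in megabytes so the arithmetic fits any integer width. */
#define MODEL_ADDRESS_SPACE_MB   4096UL  /* 32-bit virtual address space */
#define MODEL_USER_SPACE_MB      3072UL  /* assumed 3 GB user / 1 GB kernel split */
#define MODEL_KERNEL_RESERVE_MB   128UL  /* assumed reserve for vmalloc and friends */

/* The kernel's share of the virtual address space. */
unsigned long model_kernel_space_mb(void)
{
    return MODEL_ADDRESS_SPACE_MB - MODEL_USER_SPACE_MB;
}

/* The most RAM that can be given logical addresses. */
unsigned long model_low_memory_limit_mb(void)
{
    return model_kernel_space_mb() - MODEL_KERNEL_RESERVE_MB;
}
```

With these values the kernel owns 1024 MB of address space, of which 896 MB is left for directly mapped (low) memory.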
We will point out high-memory limitations as we come to them in this chapter.
Historically, the kernel has used logical addresses to refer to
explicit pages of memory. The addition of high-memory support,
however, has exposed an obvious problem with that
approach—logical addresses are not available for high memory.
Thus kernel functions that deal with memory are increasingly using
struct page instead. This data
structure is used to keep track of just about everything the kernel
needs to know about physical memory; there is one
struct page for each physical page on the system. Some of the
fields of this structure include the following:
atomic_t count;
The number of references there are to this page. When the count drops to zero, the page is returned to the free list.
wait_queue_head_t wait;
A list of processes waiting on this page. Processes can wait on a page when a kernel function has locked it for some reason; drivers need not normally worry about waiting on pages, though.
void *virtual;
The kernel virtual address of the page, if it is mapped;
NULL, otherwise. Low-memory pages are always
mapped; high-memory pages usually are not.
unsigned long flags;
A set of bit flags describing the status of the page. These include
PG_locked, which indicates that the page has been
locked in memory, and
PG_reserved, which prevents
the memory management system from working with the page at all.
There is much more information within
struct page,
but it is part of the deeper black magic of memory management and is
not of concern to driver writers.
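The reference count and the flag bits are simple to picture with a reduced user-space model. Everything in this sketch is hypothetical: the structure keeps only two fields, the bit numbers are invented for illustration (the kernel’s real values vary between versions), and a plain int stands in for the kernel’s atomic counter.

```c
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel's real values vary between versions. */
#define MODEL_PG_LOCKED    0
#define MODEL_PG_RESERVED  1

/* A drastically reduced stand-in for struct page. */
struct model_page {
    int count;            /* reference count (atomic in the kernel) */
    unsigned long flags;  /* status bits */
};

/* Test one status bit. */
bool model_page_flag(const struct model_page *page, int bit)
{
    return (page->flags >> bit) & 1UL;
}

/* Take a reference to the page. */
void model_get_page(struct model_page *page)
{
    page->count++;
}

/* Drop a reference; true means the page would return to the free list. */
bool model_put_page(struct model_page *page)
{
    return --page->count == 0;
}
```

A page that starts with one reference, gains a second, and then loses both is reported as free only by the second call to model_put_page.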
The kernel maintains one or more arrays of
struct page entries, which track all of the physical memory on the
system. On most systems, there is a single array, called
mem_map. On some systems, however, the situation
is more complicated. Nonuniform memory access (NUMA) systems and
those with widely discontiguous physical memory may have more than one
memory map array, so code that is meant to be portable should avoid
direct access to the array whenever possible. Fortunately, it is
usually quite easy to just work with
struct page
pointers without worrying about where they come from.
Some functions and macros are defined for translating between
struct page pointers and virtual addresses:
struct page *virt_to_page(void *kaddr);
This macro, defined in
<asm/page.h>, takes a
kernel logical address and returns its associated
struct page pointer. Since it requires a logical address, it will
not work with memory from vmalloc or high memory.
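void *page_address(struct page *page);
This macro returns the kernel virtual address of this page, if such an address exists. For high memory, that address exists only if the page has been mapped.
void *kmap(struct page *page);
void kunmap(struct page *page);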
kmap returns a kernel virtual address for any page in the system. For low-memory pages, it just returns the logical address of the page; for high-memory pages, kmap creates a special mapping. Mappings created with kmap should always be freed with kunmap; a limited number of such mappings is available, so it is better not to hold on to them for too long. kmap calls are additive, so if two or more functions both call kmap on the same page the right thing happens. Note also that kmap can sleep if no mappings are available.
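For low memory on a system with a single memory map, the translation between logical addresses and struct page pointers amounts to array indexing. The following user-space sketch models that case only; the 4 KB page size, the 0xC0000000 offset, the 16-page “system,” and every model_ name are assumptions made for illustration.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT   12             /* assumed 4 KB pages */
#define MODEL_PAGE_OFFSET  0xC0000000UL   /* assumed i386-style offset */
#define MODEL_NR_PAGES     16             /* a tiny pretend system */

struct model_page {
    int count;
};

/* One entry per physical page frame, in the spirit of mem_map. */
struct model_page model_mem_map[MODEL_NR_PAGES];

/* Logical address -> page entry: strip the offset, divide by the page size. */
struct model_page *model_virt_to_page(unsigned long kaddr)
{
    unsigned long pfn = (kaddr - MODEL_PAGE_OFFSET) >> MODEL_PAGE_SHIFT;

    return pfn < MODEL_NR_PAGES ? &model_mem_map[pfn] : NULL;
}

/* Page entry -> logical address of the start of the page. */
unsigned long model_page_address(const struct model_page *page)
{
    unsigned long pfn = (unsigned long)(page - model_mem_map);

    return (pfn << MODEL_PAGE_SHIFT) + MODEL_PAGE_OFFSET;
}
```

In this model, any address inside the fourth page (0xC0003000 through 0xC0003fff) maps to entry 3 of the array, and the reverse lookup yields 0xC0003000. Addresses outside the modeled memory return NULL, loosely mirroring the fact that the real macro has nothing sensible to say about vmalloc or high-memory addresses.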
When a program looks up a virtual address, the CPU must convert the address to a physical address in order to access physical memory. The step is usually performed by splitting the address into bitfields. Each bitfield is used as an index into an array, called a page table, to retrieve either the address of the next table or the address of the physical page that holds the virtual address.
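As a small illustration of the bitfield split, the sketch below uses the classic two-level i386 layout: 10 bits of directory index, 10 bits of table index, and a 12-bit offset within the page. The layout is an assumption chosen for the example (other processors divide the address differently), and the function names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT      12    /* 4 KB pages: low 12 bits are the offset */
#define MODEL_PGDIR_SHIFT     22    /* top 10 bits index the page directory */
#define MODEL_PTRS_PER_TABLE  1024  /* 10 bits of index per table */

/* Index into the top-level table. */
uint32_t model_pgd_index(uint32_t addr)
{
    return addr >> MODEL_PGDIR_SHIFT;
}

/* Index into the lowest-level table. */
uint32_t model_pte_index(uint32_t addr)
{
    return (addr >> MODEL_PAGE_SHIFT) & (MODEL_PTRS_PER_TABLE - 1);
}

/* Byte offset within the page. */
uint32_t model_page_offset(uint32_t addr)
{
    return addr & ((1U << MODEL_PAGE_SHIFT) - 1);
}
```

For the address 0x08048123, this split gives directory index 32, table index 72 (0x48), and offset 0x123.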
The Linux kernel manages three levels of page tables in order to map virtual addresses to physical addresses. The multiple levels allow the memory range to be sparsely populated; modern systems will spread a process out across a large range of virtual memory. It makes sense to do things that way; it allows for runtime flexibility in how things are laid out.
Note that Linux uses a three-level system even on hardware that only
supports two levels of page tables or hardware that uses a different
way to map virtual addresses to physical ones. The use of three
levels in a processor-independent implementation allows Linux to
support both two-level and three-level processors without clobbering
the code with a lot of
#ifdef statements. This
kind of conservative coding doesn’t lead to additional overhead when
the kernel runs on two-level processors, because the compiler actually
optimizes out the unused level.
It is time to take a look at the data structures used to implement the paging system. The following list summarizes the implementation of the three levels in Linux, and Figure 13-2 depicts them.
Page Directory (PGD)
The top-level page table. The PGD is an array of
pgd_t items, each of which points to a
second-level page table. Each process has its own page directory, and
there is one for kernel space as well. You can think of the page
directory as a page-aligned array of
pgd_ts.
Page mid-level Directory (PMD)
The second-level table. The PMD is a
page-aligned array of
pmd_t items. A
pmd_t is a pointer to the third-level page
table. Two-level processors have no physical PMD; they declare their
PMD as an array with a single element, whose value is the PMD
itself. We’ll see in a while how this is handled in C and how the
compiler optimizes this level away.
Page Table
A page-aligned array of items, each of which is called a
Page Table Entry. The kernel uses the
pte_t type
for the items. A
pte_t contains the physical
address of the data page.
The kernel doesn’t need to worry about doing page-table lookups during normal program execution, because they are done by the hardware. Nonetheless, the kernel must arrange things so that the hardware can do its work. It must build the page tables and look them up whenever the processor reports a page fault, that is, whenever the page associated with a virtual address needed by the processor is not present in memory. Device drivers, too, must be able to build page tables and handle faults when implementing mmap.
It’s interesting to note how software memory management exploits the same page tables that are used by the CPU itself. Whenever a CPU doesn’t implement page tables, the difference is only hidden in the lowest levels of architecture-specific code. In Linux memory management, therefore, you always talk about three-level page tables irrespective of whether they are known to the hardware or not. An example of a CPU family that doesn’t use page tables is the PowerPC. PowerPC designers implemented a hash algorithm that maps virtual addresses into a one-level page table. When accessing a page that is already in memory but whose physical address has expired from the CPU caches, the CPU needs to read memory only once, as opposed to the two or three accesses required by a multilevel page table approach. The hash algorithm, like multilevel tables, makes it possible to reduce use of memory in mapping virtual addresses to physical ones.
Irrespective of the mechanisms used by the CPU, the Linux software
implementation is based on three-level page tables, and the following
symbols are used to access them. Both
<asm/page.h> and
<asm/pgtable.h> must be included for all of
them to be accessible.
unsigned pgd_val(pgd_t pgd),
unsigned pmd_val(pmd_t pmd),
unsigned pte_val(pte_t pte)
These three macros are used to retrieve the
unsigned value from the typed data item. The actual
type used varies depending on the underlying architecture and kernel
configuration options; it is usually either
unsigned long or, on 32-bit processors supporting high memory,
unsigned long long. SPARC64 processors use
unsigned int. The macros help in using strict data
typing in source code without introducing computational overhead.
pgd_t * pgd_offset(struct mm_struct * mm, unsigned long address),
pmd_t * pmd_offset(pgd_t * dir, unsigned long address),
pte_t * pte_offset(pmd_t * dir, unsigned long address)
These inline functions are used to retrieve the
pgd, pmd, and
pte entries associated with
address. Page-table lookup begins with a
pointer to struct mm_struct. The pointer associated
with the memory map of the current process is
current->mm, while the pointer to kernel space
is described by
&init_mm. Two-level processors define
pmd_offset(dir, address) as
(pmd_t *)dir, thus folding the
pmd over the
pgd. Functions that scan page tables are always
inline, and the compiler optimizes out
any pmd lookup.
struct page *pte_page(pte_t pte)
This function returns a pointer to the
struct page
entry for the page in this page-table entry. Code that deals with
page-tables will generally want to use pte_page
rather than pte_val, since
pte_page deals with the processor-dependent
format of the page-table entry and returns the
struct page pointer, which is usually what’s needed.
pte_present(pte_t pte)
This macro returns a boolean value that indicates whether the data
page is currently in memory. This is the most used of several
functions that access the low bits in the
pte—the bits that are discarded by
pte_page. Pages may be absent, of course, if the
kernel has swapped them to disk (or if they have never been loaded).
The page tables themselves, however, are always present in the current
Linux implementation. Keeping page tables in memory simplifies the
kernel code because pgd_offset and friends never
fail; on the other hand, even a process with a “resident storage
size” of zero keeps its page tables in real RAM, wasting some memory
that might be better used elsewhere.
Each process in the system has a
struct mm_struct
structure, which contains its page tables and a great many other
things. It also contains a spinlock called
page_table_lock, which should be held while
traversing or modifying the page tables.
Just seeing the list of these functions is not enough for you to
be proficient in the Linux memory management algorithms; real memory
management is much more complex and must deal with other
complications, like cache coherence. The previous list should
nonetheless be sufficient to give you a feel for how page management
is implemented; it is also about all that you will need to know, as a
device driver writer, to work occasionally with page tables. You can
get more information from the
include/asm and
mm subtrees of the kernel source.
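To see how the typed wrappers and the folded middle level fit together, consider this user-space model of a page-table walk. It is a sketch under several assumptions: a two-level, i386-like layout; a single hypothetical “present” bit; page tables stored as ordinary arrays; and model_ functions that merely imitate the shape of the kernel’s lookup helpers. It shares no code with the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT   12
#define MODEL_PGDIR_SHIFT  22
#define MODEL_PTRS         1024
#define MODEL_PRESENT      0x1UL   /* hypothetical "present" bit */

/* Struct wrappers give strict typing at no runtime cost. */
typedef struct { uintptr_t pgd; } pgd_t;
typedef struct { uintptr_t pmd; } pmd_t;
typedef struct { uintptr_t pte; } pte_t;

#define pmd_val(x) ((x).pmd)
#define pte_val(x) ((x).pte)

struct model_mm {
    pgd_t *pgd;   /* top-level table of this address space */
};

pgd_t *model_pgd_offset(struct model_mm *mm, unsigned long addr)
{
    return mm->pgd + ((addr >> MODEL_PGDIR_SHIFT) & (MODEL_PTRS - 1));
}

/* Two-level hardware: the middle level is folded onto the PGD entry. */
pmd_t *model_pmd_offset(pgd_t *dir, unsigned long addr)
{
    (void)addr;
    return (pmd_t *)dir;
}

pte_t *model_pte_offset(pmd_t *dir, unsigned long addr)
{
    pte_t *table = (pte_t *)pmd_val(*dir);

    return table + ((addr >> MODEL_PAGE_SHIFT) & (MODEL_PTRS - 1));
}

int model_pte_present(pte_t pte)
{
    return (pte_val(pte) & MODEL_PRESENT) != 0;
}

/* Walk the tables; returns the physical address, or 0 if nothing is mapped. */
unsigned long model_translate(struct model_mm *mm, unsigned long addr)
{
    pmd_t *pmd = model_pmd_offset(model_pgd_offset(mm, addr), addr);

    if (pmd_val(*pmd) == 0)
        return 0;

    pte_t *pte = model_pte_offset(pmd, addr);

    if (!model_pte_present(*pte))
        return 0;

    return (pte_val(*pte) & ~0xfffUL) | (addr & 0xfffUL);
}
```

Because model_pmd_offset does nothing but cast its argument, an optimizing compiler can drop the middle level entirely. Real kernel code would also hold the page_table_lock spinlock around such a walk; the model leaves locking out.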
Although paging sits at the lowest level of memory management, something more is necessary before you can use the computer’s resources efficiently. The kernel needs a higher-level mechanism to handle the way a process sees its memory. This mechanism is implemented in Linux by means of virtual memory areas, which are typically referred to as areas or VMAs.
An area is a homogeneous region in the virtual memory of a process, a contiguous range of addresses with the same permission flags. It corresponds loosely to the concept of a “segment,” although it is better described as “a memory object with its own properties.” The memory map of a process is made up of the following:
An area for the program’s executable code (often called text).
One area each for data, including initialized data (that which has an explicitly assigned value at the beginning of execution), uninitialized data (BSS), and the program stack.
One area for each active memory mapping.
The memory areas of a process can be seen by looking in
/proc/pid/maps (where
pid, of course, is replaced by a
process ID).
/proc/self is a special case of
/proc/pid, because it always
refers to the current process. As an example, here are a couple of
memory maps, to which we have added short comments after a sharp sign:
cat /proc/1/maps   # look at init
08048000-0804e000 r-xp 00000000 08:01 51297      /sbin/init  # text
0804e000-08050000 rw-p 00005000 08:01 51297      /sbin/init  # data
08050000-08054000 rwxp 00000000 00:00 0          # zero-mapped bss
40000000-40013000 r-xp 00000000 08:01 39003      /lib/ld-2.1.3.so # text
40013000-40014000 rw-p 00012000 08:01 39003      /lib/ld-2.1.3.so # data
40014000-40015000 rw-p 00000000 00:00 0          # bss for ld.so
4001b000-40108000 r-xp 00000000 08:01 39006      /lib/libc-2.1.3.so # text
40108000-4010c000 rw-p 000ec000 08:01 39006      /lib/libc-2.1.3.so # data
4010c000-40110000 rw-p 00000000 00:00 0          # bss for libc.so
bfffe000-c0000000 rwxp fffff000 00:00 0          # zero-mapped stack

morgana.root# rsh wolf head /proc/self/maps  #### alpha-axp: static ecoff
000000011fffe000-0000000120000000 rwxp 0000000000000000 00:00 0     # stack
0000000120000000-0000000120014000 r-xp 0000000000000000 08:03 2844  # text
0000000140000000-0000000140002000 rwxp 0000000000014000 08:03 2844  # data
0000000140002000-0000000140008000 rwxp 0000000000000000 00:00 0     # bss
The fields in each line are as follows:
start-end perm offset major:minor inode image.
Each field in
/proc/*/maps (except the image
name) corresponds to a field in
struct vm_area_struct, and is described in the following list.
start
end
The beginning and ending virtual addresses for this memory area.
perm
A bit mask with the memory area’s read, write, and execute
permissions. This field describes what the process is allowed to do
with pages belonging to the area. The last character in the field is
either p for “private” or
s for “shared.”
offset
Where the memory area begins in the file that it is mapped to. An offset of zero, of course, means that the first page of the memory area corresponds to the first page of the file.
major
minor
The major and minor numbers of the device holding the file that has been mapped. Confusingly, for device mappings, the major and minor numbers refer to the disk partition holding the device special file that was opened by the user, and not the device itself.
inode
The inode number of the mapped file.
image
The name of the file (usually an executable image) that has been mapped.
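The field layout is regular enough that a line can be picked apart with a single sscanf call. The following user-space sketch does so; the structure and function names are hypothetical, the fixed buffer sizes are arbitrary, and trailing comments such as the ones we added to the listings are not handled.

```c
#include <stdio.h>

/* One parsed line of /proc/pid/maps. */
struct model_map_entry {
    unsigned long start, end;    /* virtual address range */
    char perm[5];                /* e.g. "r-xp" */
    unsigned long offset;        /* offset into the mapped file */
    unsigned int major, minor;   /* device numbers, printed in hex */
    unsigned long inode;
    char image[256];             /* empty for anonymous mappings */
};

/* Returns 1 if the line was parsed, 0 otherwise. */
int model_parse_maps_line(const char *line, struct model_map_entry *e)
{
    int n;

    e->image[0] = '\0';
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
               &e->start, &e->end, e->perm, &e->offset,
               &e->major, &e->minor, &e->inode, e->image);

    /* The image name is optional, so seven converted fields are enough. */
    return n >= 7;
}
```

Fed the first line of the init listing, the function reports a range of 0x08048000 to 0x0804e000, permissions r-xp, offset 0, device 8:1, inode 51297, and the image /sbin/init.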
A driver that implements the mmap method needs to fill a VMA structure in the address space of the process mapping the device. The driver writer should therefore have at least a minimal understanding of VMAs in order to use them.
Let’s look at the most important fields in
struct vm_area_struct (defined in
<linux/mm.h>). These fields may be used by
device drivers in their mmap implementation. Note
that the kernel maintains lists and trees of VMAs to optimize area
lookup, and several fields of
struct vm_area_struct are
used to maintain this organization. VMAs thus can’t be created at
will by a driver, or the structures will break. The main fields of
VMAs are as follows (note the similarity between these fields and the
/proc output we just saw):
unsigned long vm_start;
unsigned long vm_end;
The virtual address range covered by this VMA. These fields are the
first two fields shown in
/proc/*/maps.
struct file *vm_file;
A pointer to the
struct file structure associated
with this area (if any).
unsigned long vm_pgoff;
The offset of the area in the file, in pages. When a file or device is mapped, this is the file position of the first page mapped in this area.
unsigned long vm_flags;
A set of flags describing this area. The flags of the most interest
to device driver writers are
VM_IO and
VM_RESERVED.
VM_IO marks a VMA
as being a memory-mapped I/O region. Among other things, the
VM_IO flag will prevent the region from being
included in process core dumps.
VM_RESERVED tells
the memory management system not to attempt to swap out this VMA; it
should be set in most device mappings.
struct vm_operations_struct *vm_ops;
A set of functions that the kernel may invoke to operate on this
memory area. Its presence indicates that the memory area is a kernel
“object” like the
struct file we have been using
throughout the book.
void *vm_private_data;
A field that may be used by the driver to store its own information.
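A driver’s mmap code typically derives two numbers from these fields: the length of the mapping and the byte offset it starts at. The user-space sketch below models that arithmetic with a cut-down structure; the 4 KB page size and all model_ names are assumptions, and the real structure holds far more than these three fields.

```c
#define MODEL_PAGE_SHIFT 12   /* assumed 4 KB pages */

/* Only three of the fields discussed above. */
struct model_vma {
    unsigned long vm_start;   /* first address of the area */
    unsigned long vm_end;     /* first address past the area */
    unsigned long vm_pgoff;   /* offset into the file, in pages */
};

/* Length of the mapping in bytes. */
unsigned long model_vma_size(const struct model_vma *vma)
{
    return vma->vm_end - vma->vm_start;
}

/* Byte offset, within the file or device, of the first mapped page. */
unsigned long model_vma_offset(const struct model_vma *vma)
{
    return vma->vm_pgoff << MODEL_PAGE_SHIFT;
}

/* File offset corresponding to a virtual address inside the area. */
unsigned long model_vma_addr_to_offset(const struct model_vma *vma,
                                       unsigned long addr)
{
    return model_vma_offset(vma) + (addr - vma->vm_start);
}
```

Taking the data area of /sbin/init from the earlier listing (0804e000-08050000, offset 00005000), the model yields a size of 0x2000 bytes and a page offset of 5, and it places virtual address 0x0804f010 at file offset 0x6010.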
Like struct vm_area_struct, the
vm_operations_struct is defined in
<linux/mm.h>; it includes the operations
listed next. These operations are the only ones needed to handle the
process’s memory needs, and they are listed in the order they are
declared. Later in this chapter, some of these functions will be
implemented; they will be described more completely at that point.
void (*open)(struct vm_area_struct *vma);
The open method is called by the kernel to allow the subsystem implementing the VMA to initialize the area, adjust reference counts, and so forth. This method will be invoked any time that a new reference to the VMA is made (when a process forks, for example). The one exception happens when the VMA is first created by mmap; in this case, the driver’s mmap method is called instead.
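void (*close)(struct vm_area_struct *vma);
When an area is destroyed, the kernel calls its close operation.
void (*unmap)(struct vm_area_struct *vma, unsigned long addr, size_t len);
The kernel calls this method to “unmap” part or all of an area.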
void (*protect)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int newprot);
This method is intended to change the protection on a memory area, but is currently not used. Memory protection is handled by the page tables, and the kernel sets up the page-table entries separately.
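int (*sync)(struct vm_area_struct *vma, unsigned long, size_t, unsigned int flags);
This method is called by the msync system call to save a dirty memory region to the storage medium.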
struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int write_access);
When a process tries to access a page that belongs to a valid VMA, but
that is currently not in memory, the nopage
method is called (if it is defined) for the related area. The method
returns the
struct page pointer for the physical
page, after, perhaps, having read it in from secondary storage. If
the nopage method isn’t defined for the area, an
empty page is allocated by the kernel. The third argument,
write_access, counts as “no-share”: a nonzero
value means the page must be owned by the current process, whereas
0 means that sharing is possible.
struct page *(*wppage)(struct vm_area_struct *vma, unsigned long address, struct page *page);
This method handles write-protected page faults but is currently
unused. The kernel handles attempts to write over a protected page
without invoking the area-specific callback. Write-protect faults are
used to implement copy-on-write. A private page can be shared across
processes until one process writes to it. When that happens, the page
is cloned, and the process writes on its own copy of the page. If the
whole area is marked as read-only, a
SIGSEGV is
sent to the process, and the copy-on-write is not performed.
int (*swapout)(struct page *page, struct file *file);
This method is called when a page is selected to be swapped out. A
return value of 0 signals success; any other value signals an
error. In case of error, the process owning the page is sent a
SIGBUS. It is highly unlikely that a driver will
ever need to implement swapout; device mappings
are not something that the kernel can just write to disk.
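The interplay between open, close, and the driver’s mmap method is easier to follow in a small model. The user-space sketch below imitates a driver that counts its live mappings. Because the kernel does not invoke open when the area is first created, the model’s mmap method calls it explicitly, which is one common way to keep such a count balanced. All names are hypothetical, and the “kernel side” functions are simplified stand-ins for what happens on fork and on area destruction.

```c
struct model_vma;

/* A two-method subset of the operations structure. */
struct model_vm_ops {
    void (*open)(struct model_vma *vma);
    void (*close)(struct model_vma *vma);
};

struct model_vma {
    const struct model_vm_ops *vm_ops;
    void *vm_private_data;
};

/* Hypothetical driver state: number of live mappings of the device. */
int model_mappings;

void model_vma_open(struct model_vma *vma)
{
    (void)vma;
    model_mappings++;
}

void model_vma_close(struct model_vma *vma)
{
    (void)vma;
    model_mappings--;
}

const struct model_vm_ops model_ops = { model_vma_open, model_vma_close };

/* Driver side: the mmap method installs the operations and, since open
 * is not called for the first reference, invokes it by hand. */
int model_mmap(struct model_vma *vma)
{
    vma->vm_ops = &model_ops;
    vma->vm_ops->open(vma);
    return 0;
}

/* Kernel side, simplified: fork copies the area and opens the copy. */
void model_fork_vma(const struct model_vma *parent, struct model_vma *child)
{
    *child = *parent;
    if (child->vm_ops && child->vm_ops->open)
        child->vm_ops->open(child);
}

/* Kernel side, simplified: destroying an area closes it. */
void model_destroy_vma(struct model_vma *vma)
{
    if (vma->vm_ops && vma->vm_ops->close)
        vma->vm_ops->close(vma);
}
```

Mapping the device once and then forking brings the count to two; destroying both areas brings it back to zero.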
On 32-bit SPARC processors, the
functions that scan page tables are not
inline but rather real
extern functions, which are not exported to
modularized code. Therefore you won’t be able to use these functions
in a module running on the SPARC, but you won’t usually need
to do that anyway.
 The name BSS is a historical relic, from an old assembly operator meaning “Block started by symbol.” The BSS segment of executable files isn’t stored on disk, and the kernel maps the zero page to the BSS address range.