Understanding the Linux Kernel, 3rd Edition

by Daniel P. Bovet, Marco Cesati

November 2005

Beginner

942 pages

31h 13m

English

O'Reilly Media, Inc.

Understanding the Linux Kernel, 3rd Edition
Preface
The Audience for This Book
Organization of the Material
Level of Description
Overview of the Book
Background Information
Conventions in This Book
How to Contact Us
Safari® Enabled

Acknowledgments
1. Introduction
1.1. Linux Versus Other Unix-Like Kernels
1.2. Hardware Dependency
1.3. Linux Versions
1.4. Basic Operating System Concepts
1.4.1. Multiuser Systems
1.4.2. Users and Groups
1.4.3. Processes
1.4.4. Kernel Architecture
1.5. An Overview of the Unix Filesystem
1.5.1. Files
1.5.2. Hard and Soft Links
1.5.3. File Types
1.5.4. File Descriptor and Inode
1.5.5. Access Rights and File Mode
1.5.6. File-Handling System Calls
1.5.6.1. Opening a file
1.5.6.2. Accessing an opened file
1.5.6.3. Closing a file
1.5.6.4. Renaming and deleting a file
1.6. An Overview of Unix Kernels
1.6.1. The Process/Kernel Model
1.6.2. Process Implementation
1.6.3. Reentrant Kernels
1.6.4. Process Address Space
1.6.5. Synchronization and Critical Regions
1.6.5.1. Kernel preemption disabling
1.6.5.2. Interrupt disabling
1.6.5.3. Semaphores
1.6.5.4. Spin locks
1.6.5.5. Avoiding deadlocks
1.6.6. Signals and Interprocess Communication
1.6.7. Process Management
1.6.7.1. Zombie processes
1.6.7.2. Process groups and login sessions
1.6.8. Memory Management
1.6.8.1. Virtual memory
1.6.8.2. Random access memory usage
1.6.8.3. Kernel Memory Allocator
1.6.8.4. Process virtual address space handling
1.6.8.5. Caching
1.6.9. Device Drivers
2. Memory Addressing
2.1. Memory Addresses
2.2. Segmentation in Hardware
2.2.1. Segment Selectors and Segmentation Registers
2.2.2. Segment Descriptors
2.2.3. Fast Access to Segment Descriptors
2.2.4. Segmentation Unit
2.3. Segmentation in Linux
2.3.1. The Linux GDT
2.3.2. The Linux LDTs
2.4. Paging in Hardware
2.4.1. Regular Paging
2.4.2. Extended Paging
2.4.3. Hardware Protection Scheme
2.4.4. An Example of Regular Paging
2.4.5. The Physical Address Extension (PAE) Paging Mechanism
2.4.6. Paging for 64-bit Architectures
2.4.7. Hardware Cache
2.4.8. Translation Lookaside Buffers (TLB)
2.5. Paging in Linux
2.5.1. The Linear Address Fields
2.5.2. Page Table Handling
2.5.3. Physical Memory Layout
2.5.4. Process Page Tables
2.5.5. Kernel Page Tables
2.5.5.1. Provisional kernel Page Tables
2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB
2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB
2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB
2.5.6. Fix-Mapped Linear Addresses
2.5.7. Handling the Hardware Cache and the TLB
2.5.7.1. Handling the hardware cache
2.5.7.2. Handling the TLB
3. Processes
3.1. Processes, Lightweight Processes, and Threads
3.2. Process Descriptor
3.2.1. Process State
3.2.2. Identifying a Process
3.2.2.1. Process descriptors handling
3.2.2.2. Identifying the current process
3.2.2.3. Doubly linked lists
3.2.2.4. The process list
3.2.2.5. The lists of TASK_RUNNING processes
3.2.3. Relationships Among Processes
3.2.3.1. The pidhash table and chained lists
3.2.4. How Processes Are Organized
3.2.4.1. Wait queues
3.2.4.2. Handling wait queues
3.2.5. Process Resource Limits
3.3. Process Switch
3.3.1. Hardware Context
3.3.2. Task State Segment
3.3.2.1. The thread field
3.3.3. Performing the Process Switch
3.3.3.1. The switch_to macro
3.3.3.2. The __switch_to( ) function
3.3.4. Saving and Loading the FPU, MMX, and XMM Registers
3.3.4.1. Saving the FPU registers
3.3.4.2. Loading the FPU registers
3.3.4.3. Using the FPU, MMX, and SSE/SSE2 units in Kernel Mode
3.4. Creating Processes
3.4.1. The clone( ), fork( ), and vfork( ) System Calls
3.4.1.1. The do_fork( ) function
3.4.1.2. The copy_process( ) function
3.4.2. Kernel Threads
3.4.2.1. Creating a kernel thread
3.4.2.2. Process 0
3.4.2.3. Process 1
3.4.2.4. Other kernel threads
3.5. Destroying Processes
3.5.1. Process Termination
3.5.1.1. The do_group_exit( ) function
3.5.1.2. The do_exit( ) function
3.5.2. Process Removal
4. Interrupts and Exceptions
4.1. The Role of Interrupt Signals
4.2. Interrupts and Exceptions
4.2.1. IRQs and Interrupts
4.2.1.1. The Advanced Programmable Interrupt Controller (APIC)
4.2.2. Exceptions
4.2.3. Interrupt Descriptor Table
4.2.4. Hardware Handling of Interrupts and Exceptions
4.3. Nested Execution of Exception and Interrupt Handlers
4.4. Initializing the Interrupt Descriptor Table
4.4.1. Interrupt, Trap, and System Gates
4.4.2. Preliminary Initialization of the IDT
4.5. Exception Handling
4.5.1. Saving the Registers for the Exception Handler
4.5.2. Entering and Leaving the Exception Handler
4.6. Interrupt Handling
4.6.1. I/O Interrupt Handling
4.6.1.1. Interrupt vectors
4.6.1.2. IRQ data structures
4.6.1.3. IRQ distribution in multiprocessor systems
4.6.1.4. Multiple Kernel Mode stacks
4.6.1.5. Saving the registers for the interrupt handler
4.6.1.6. The do_IRQ( ) function
4.6.1.7. The __do_IRQ( ) function
4.6.1.8. Reviving a lost interrupt
4.6.1.9. Interrupt service routines
4.6.1.10. Dynamic allocation of IRQ lines
4.6.2. Interprocessor Interrupt Handling
4.7. Softirqs and Tasklets
4.7.1. Softirqs
4.7.1.1. Data structures used for softirqs
4.7.1.2. Handling softirqs
4.7.1.3. The do_softirq( ) function
4.7.1.4. The __do_softirq( ) function
4.7.1.5. The ksoftirqd kernel threads
4.7.2. Tasklets
4.8. Work Queues
4.8.1.1. Work queue data structures
4.8.1.2. Work queue functions
4.8.1.3. The predefined work queue
4.9. Returning from Interrupts and Exceptions
4.9.1.1. The entry points
4.9.1.2. Resuming a kernel control path
4.9.1.3. Checking for kernel preemption
4.9.1.4. Resuming a User Mode program
4.9.1.5. Checking for rescheduling
4.9.1.6. Handling pending signals, virtual-8086 mode, and single stepping
5. Kernel Synchronization
5.1. How the Kernel Services Requests
5.1.1. Kernel Preemption
5.1.2. When Synchronization Is Necessary
5.1.3. When Synchronization Is Not Necessary
5.2. Synchronization Primitives
5.2.1. Per-CPU Variables
5.2.2. Atomic Operations
5.2.3. Optimization and Memory Barriers
5.2.4. Spin Locks
5.2.4.1. The spin_lock macro with kernel preemption
5.2.4.2. The spin_lock macro without kernel preemption
5.2.4.3. The spin_unlock macro
5.2.5. Read/Write Spin Locks
5.2.5.1. Getting and releasing a lock for reading
5.2.5.2. Getting and releasing a lock for writing
5.2.6. Seqlocks
5.2.7. Read-Copy Update (RCU)
5.2.8. Semaphores
5.2.8.1. Getting and releasing semaphores
5.2.9. Read/Write Semaphores
5.2.10. Completions
5.2.11. Local Interrupt Disabling
5.2.12. Disabling and Enabling Deferrable Functions
5.3. Synchronizing Accesses to Kernel Data Structures
5.3.1. Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
5.3.1.1. Protecting a data structure accessed by exceptions
5.3.1.2. Protecting a data structure accessed by interrupts
5.3.1.3. Protecting a data structure accessed by deferrable functions
5.3.1.4. Protecting a data structure accessed by exceptions and interrupts
5.3.1.5. Protecting a data structure accessed by exceptions and deferrable functions
5.3.1.6. Protecting a data structure accessed by interrupts and deferrable functions
5.3.1.7. Protecting a data structure accessed by exceptions, interrupts, and deferrable functions
5.4. Examples of Race Condition Prevention
5.4.1. Reference Counters
5.4.2. The Big Kernel Lock
5.4.3. Memory Descriptor Read/Write Semaphore
5.4.4. Slab Cache List Semaphore
5.4.5. Inode Semaphore
6. Timing Measurements
6.1. Clock and Timer Circuits
6.1.1. Real Time Clock (RTC)
6.1.2. Time Stamp Counter (TSC)
6.1.3. Programmable Interval Timer (PIT)
6.1.4. CPU Local Timer
6.1.5. High Precision Event Timer (HPET)
6.1.6. ACPI Power Management Timer
6.2. The Linux Timekeeping Architecture
6.2.1. Data Structures of the Timekeeping Architecture
6.2.1.1. The timer object
6.2.1.2. The jiffies variable
6.2.1.3. The xtime variable
6.2.2. Timekeeping Architecture in Uniprocessor Systems
6.2.2.1. Initialization phase
6.2.2.2. The timer interrupt handler
6.2.3. Timekeeping Architecture in Multiprocessor Systems
6.2.3.1. Initialization phase
6.2.3.2. The global timer interrupt handler
6.2.3.3. The local timer interrupt handler
6.3. Updating the Time and Date
6.4. Updating System Statistics
6.4.1. Updating Local CPU Statistics
6.4.2. Keeping Track of System Load
6.4.3. Profiling the Kernel Code
6.4.4. Checking the NMI Watchdogs
6.5. Software Timers and Delay Functions
6.5.1. Dynamic Timers
6.5.1.1. Dynamic timers and race conditions
6.5.1.2. Data structures for dynamic timers
6.5.1.3. Dynamic timer handling
6.5.2. An Application of Dynamic Timers: the nanosleep( ) System Call
6.5.3. Delay Functions
6.6. System Calls Related to Timing Measurements
6.6.1. The time( ) and gettimeofday( ) System Calls
6.6.2. The adjtimex( ) System Call
6.6.3. The setitimer( ) and alarm( ) System Calls
6.6.4. System Calls for POSIX Timers
7. Process Scheduling
7.1. Scheduling Policy
7.1.1. Process Preemption
7.1.2. How Long Must a Quantum Last?
7.2. The Scheduling Algorithm
7.2.1. Scheduling of Conventional Processes
7.2.1.1. Base time quantum
7.2.1.2. Dynamic priority and average sleep time
7.2.1.3. Active and expired processes
7.2.2. Scheduling of Real-Time Processes
7.3. Data Structures Used by the Scheduler
7.3.1. The runqueue Data Structure
7.3.2. Process Descriptor
7.4. Functions Used by the Scheduler
7.4.1. The scheduler_tick( ) Function
7.4.1.1. Updating the time slice of a real-time process
7.4.1.2. Updating the time slice of a conventional process
7.4.2. The try_to_wake_up( ) Function
7.4.3. The recalc_task_prio( ) Function
7.4.4. The schedule( ) Function
7.4.4.1. Direct invocation
7.4.4.2. Lazy invocation
7.4.4.3. Actions performed by schedule( ) before a process switch
7.4.4.4. Actions performed by schedule( ) to make the process switch
7.4.4.5. Actions performed by schedule( ) after a process switch
7.5. Runqueue Balancing in Multiprocessor Systems
7.5.1. Scheduling Domains
7.5.2. The rebalance_tick( ) Function
7.5.3. The load_balance( ) Function
7.5.4. The move_tasks( ) Function
7.6. System Calls Related to Scheduling
7.6.1. The nice( ) System Call
7.6.2. The getpriority( ) and setpriority( ) System Calls
7.6.3. The sched_getaffinity( ) and sched_setaffinity( ) System Calls
7.6.4. System Calls Related to Real-Time Processes
7.6.4.1. The sched_getscheduler( ) and sched_setscheduler( ) system calls
7.6.4.2. The sched_getparam( ) and sched_setparam( ) system calls
7.6.4.3. The sched_yield( ) system call
7.6.4.4. The sched_get_priority_min( ) and sched_get_priority_max( ) system calls
7.6.4.5. The sched_rr_get_interval( ) system call
8. Memory Management
8.1. Page Frame Management
8.1.1. Page Descriptors
8.1.2. Non-Uniform Memory Access (NUMA)
8.1.3. Memory Zones
8.1.4. The Pool of Reserved Page Frames
8.1.5. The Zoned Page Frame Allocator
8.1.5.1. Requesting and releasing page frames
8.1.6. Kernel Mappings of High-Memory Page Frames
8.1.6.1. Permanent kernel mappings
8.1.6.2. Temporary kernel mappings
8.1.7. The Buddy System Algorithm
8.1.7.1. Data structures
8.1.7.2. Allocating a block
8.1.7.3. Freeing a block
8.1.8. The Per-CPU Page Frame Cache
8.1.8.1. Allocating page frames through the per-CPU page frame caches
8.1.8.2. Releasing page frames to the per-CPU page frame caches
8.1.9. The Zone Allocator
8.1.9.1. Releasing a group of page frames
8.2. Memory Area Management
8.2.1. The Slab Allocator
8.2.2. Cache Descriptor
8.2.3. Slab Descriptor
8.2.4. General and Specific Caches
8.2.5. Interfacing the Slab Allocator with the Zoned Page Frame Allocator
8.2.6. Allocating a Slab to a Cache
8.2.7. Releasing a Slab from a Cache
8.2.8. Object Descriptor
8.2.9. Aligning Objects in Memory
8.2.10. Slab Coloring
8.2.11. Local Caches of Free Slab Objects
8.2.12. Allocating a Slab Object
8.2.13. Freeing a Slab Object
8.2.14. General Purpose Objects
8.2.15. Memory Pools
8.3. Noncontiguous Memory Area Management
8.3.1. Linear Addresses of Noncontiguous Memory Areas
8.3.2. Descriptors of Noncontiguous Memory Areas
8.3.3. Allocating a Noncontiguous Memory Area
8.3.4. Releasing a Noncontiguous Memory Area
9. Process Address Space
9.1. The Process’s Address Space
9.2. The Memory Descriptor
9.2.1. Memory Descriptor of Kernel Threads
9.3. Memory Regions
9.3.1. Memory Region Data Structures
9.3.2. Memory Region Access Rights
9.3.3. Memory Region Handling
9.3.3.1. Finding the closest region to a given address: find_vma( )
9.3.3.2. Finding a region that overlaps a given interval: find_vma_intersection( )
9.3.3.3. Finding a free interval: get_unmapped_area( )
9.3.3.4. Inserting a region in the memory descriptor list: insert_vm_struct( )
9.3.4. Allocating a Linear Address Interval
9.3.5. Releasing a Linear Address Interval
9.3.5.1. The do_munmap( ) function
9.3.5.2. The split_vma( ) function
9.3.5.3. The unmap_region( ) function
9.4. Page Fault Exception Handler
9.4.1. Handling a Faulty Address Outside the Address Space
9.4.2. Handling a Faulty Address Inside the Address Space
9.4.3. Demand Paging
9.4.4. Copy On Write
9.4.5. Handling Noncontiguous Memory Area Accesses
9.5. Creating and Deleting a Process Address Space
9.5.1. Creating a Process Address Space
9.5.2. Deleting a Process Address Space
9.6. Managing the Heap
10. System Calls
10.1. POSIX APIs and System Calls
10.2. System Call Handler and Service Routines
10.3. Entering and Exiting a System Call
10.3.1. Issuing a System Call via the int $0x80 Instruction
10.3.1.1. The system_call( ) function
10.3.1.2. Exiting from the system call
10.3.2. Issuing a System Call via the sysenter Instruction
10.3.2.1. The sysenter instruction
10.3.2.2. The vsyscall page
10.3.2.3. Entering the system call
10.3.2.4. Exiting from the system call
10.3.2.5. The sysexit instruction
10.3.2.6. The SYSENTER_RETURN code
10.4. Parameter Passing
10.4.1. Verifying the Parameters
10.4.2. Accessing the Process Address Space
10.4.3. Dynamic Address Checking: The Fix-up Code
10.4.4. The Exception Tables
10.4.5. Generating the Exception Tables and the Fixup Code
10.5. Kernel Wrapper Routines
11. Signals
11.1. The Role of Signals
11.1.1. Actions Performed upon Delivering a Signal
11.1.2. POSIX Signals and Multithreaded Applications
11.1.3. Data Structures Associated with Signals
11.1.3.1. The signal descriptor and the signal handler descriptor
11.1.3.2. The sigaction data structure
11.1.3.3. The pending signal queues
11.1.4. Operations on Signal Data Structures
11.2. Generating a Signal
11.2.1. The specific_send_sig_info( ) Function
11.2.2. The send_signal( ) Function
11.2.3. The group_send_sig_info( ) Function
11.3. Delivering a Signal
11.3.1. Executing the Default Action for the Signal
11.3.2. Catching the Signal
11.3.2.1. Setting up the frame
11.3.2.2. Evaluating the signal flags
11.3.2.3. Starting the signal handler
11.3.2.4. Terminating the signal handler
11.3.3. Reexecution of System Calls
11.3.3.1. Restarting a system call interrupted by a non-caught signal
11.3.3.2. Restarting a system call for a caught signal
11.4. System Calls Related to Signal Handling
11.4.1. The kill( ) System Call
11.4.2. The tkill( ) and tgkill( ) System Calls
11.4.3. Changing a Signal Action
11.4.4. Examining the Pending Blocked Signals
11.4.5. Modifying the Set of Blocked Signals
11.4.6. Suspending the Process
11.4.7. System Calls for Real-Time Signals
12. The Virtual Filesystem
12.1. The Role of the Virtual Filesystem (VFS)
12.1.1. The Common File Model
12.1.2. System Calls Handled by the VFS
12.2. VFS Data Structures
12.2.1. Superblock Objects
12.2.2. Inode Objects
12.2.3. File Objects
12.2.4. dentry Objects
12.2.5. The dentry Cache
12.2.6. Files Associated with a Process
12.3. Filesystem Types
12.3.1. Special Filesystems
12.3.2. Filesystem Type Registration
12.4. Filesystem Handling
12.4.1. Namespaces
12.4.2. Filesystem Mounting
12.4.3. Mounting a Generic Filesystem
12.4.3.1. The do_kern_mount( ) function
12.4.3.2. Allocating a superblock object
12.4.4. Mounting the Root Filesystem
12.4.4.1. Phase 1: Mounting the rootfs filesystem
12.4.4.2. Phase 2: Mounting the real root filesystem
12.4.5. Unmounting a Filesystem
12.5. Pathname Lookup
12.5.1. Standard Pathname Lookup
12.5.2. Parent Pathname Lookup
12.5.3. Lookup of Symbolic Links
12.6. Implementations of VFS System Calls
12.6.1. The open( ) System Call
12.6.2. The read( ) and write( ) System Calls
12.6.3. The close( ) System Call
12.7. File Locking
12.7.1. Linux File Locking
12.7.2. File-Locking Data Structures
12.7.3. FL_FLOCK Locks
12.7.4. FL_POSIX Locks
13. I/O Architecture and Device Drivers
13.1. I/O Architecture
13.1.1. I/O Ports
13.1.1.1. Accessing I/O ports
13.1.2. I/O Interfaces
13.1.2.1. Custom I/O interfaces
13.1.2.2. General-purpose I/O interfaces
13.1.3. Device Controllers
13.2. The Device Driver Model
13.2.1. The sysfs Filesystem
13.2.2. Kobjects
13.2.2.1. Kobjects, ksets, and subsystems
13.2.2.2. Registering kobjects, ksets, and subsystems
13.2.3. Components of the Device Driver Model
13.2.3.1. Devices
13.2.3.2. Drivers
13.2.3.3. Buses
13.2.3.4. Classes
13.3. Device Files
13.3.1. User Mode Handling of Device Files
13.3.1.1. Dynamic device number assignment
13.3.1.2. Dynamic device file creation
13.3.2. VFS Handling of Device Files
13.4. Device Drivers
13.4.1. Device Driver Registration
13.4.2. Device Driver Initialization
13.4.3. Monitoring I/O Operations
13.4.3.1. Polling mode
13.4.3.2. Interrupt mode
13.4.4. Accessing the I/O Shared Memory
13.4.5. Direct Memory Access (DMA)
13.4.5.1. Synchronous and asynchronous DMA
13.4.5.2. Helper functions for DMA transfers
13.4.5.3. Bus addresses
13.4.5.4. Cache coherency
13.4.5.5. Helper functions for coherent DMA mappings
13.4.5.6. Helper functions for streaming DMA mappings
13.4.6. Levels of Kernel Support
13.5. Character Device Drivers
13.5.1. Assigning Device Numbers
13.5.1.1. The register_chrdev_region( ) and alloc_chrdev_region( ) functions
13.5.1.2. The register_chrdev( ) function
13.5.2. Accessing a Character Device Driver
13.5.3. Buffering Strategies for Character Devices
14. Block Device Drivers
14.1. Block Devices Handling
14.1.1. Sectors
14.1.2. Blocks
14.1.3. Segments
14.2. The Generic Block Layer
14.2.1. The Bio Structure
14.2.2. Representing Disks and Disk Partitions
14.2.3. Submitting a Request
14.3. The I/O Scheduler
14.3.1. Request Queue Descriptors
14.3.2. Request Descriptors
14.3.2.1. Managing the allocation of request descriptors
14.3.2.2. Avoiding request queue congestion
14.3.3. Activating the Block Device Driver
14.3.4. I/O Scheduling Algorithms
14.3.4.1. The “Noop” elevator
14.3.4.2. The “CFQ” elevator
14.3.4.3. The “Deadline” elevator
14.3.4.4. The “Anticipatory” elevator
14.3.5. Issuing a Request to the I/O Scheduler
14.3.5.1. The blk_queue_bounce( ) function
14.4. Block Device Drivers
14.4.1. Block Devices
14.4.1.1. Accessing a block device
14.4.2. Device Driver Registration and Initialization
14.4.2.1. Defining a custom driver descriptor
14.4.2.2. Initializing the custom descriptor
14.4.2.3. Initializing the gendisk descriptor
14.4.2.4. Initializing the table of block device methods
14.4.2.5. Allocating and initializing a request queue
14.4.2.6. Setting up the interrupt handler
14.4.2.7. Registering the disk
14.4.3. The Strategy Routine
14.4.4. The Interrupt Handler
14.5. Opening a Block Device File
15. The Page Cache
15.1. The Page Cache
15.1.1. The address_space Object
15.1.2. The Radix Tree
15.1.3. Page Cache Handling Functions
15.1.3.1. Finding a page
15.1.3.2. Adding a page
15.1.3.3. Removing a page
15.1.3.4. Updating a page
15.1.4. The Tags of the Radix Tree
15.2. Storing Blocks in the Page Cache
15.2.1. Block Buffers and Buffer Heads
15.2.2. Managing the Buffer Heads
15.2.3. Buffer Pages
15.2.4. Allocating Block Device Buffer Pages
15.2.5. Releasing Block Device Buffer Pages
15.2.6. Searching Blocks in the Page Cache
15.2.6.1. The __find_get_block( ) function
15.2.6.2. The __getblk( ) function
15.2.6.3. The __bread( ) function
15.2.7. Submitting Buffer Heads to the Generic Block Layer
15.2.7.1. The submit_bh( ) function
15.2.7.2. The ll_rw_block( ) function
15.3. Writing Dirty Pages to Disk
15.3.1. The pdflush Kernel Threads
15.3.2. Looking for Dirty Pages To Be Flushed
15.3.3. Retrieving Old Dirty Pages
15.4. The sync( ), fsync( ), and fdatasync( ) System Calls
15.4.1. The sync( ) System Call
15.4.2. The fsync( ) and fdatasync( ) System Calls
16. Accessing Files
16.1. Reading and Writing a File
16.1.1. Reading from a File
16.1.1.1. The readpage method for regular files
16.1.1.2. The readpage method for block device files
16.1.2. Read-Ahead of Files
16.1.2.1. The page_cache_readahead( ) function
16.1.2.2. The handle_ra_miss( ) function
16.1.3. Writing to a File
16.1.3.1. The prepare_write and commit_write methods for regular files
16.1.3.2. The prepare_write and commit_write methods for block device files
16.1.4. Writing Dirty Pages to Disk
16.2. Memory Mapping
16.2.1. Memory Mapping Data Structures
16.2.2. Creating a Memory Mapping
16.2.3. Destroying a Memory Mapping
16.2.4. Demand Paging for Memory Mapping
16.2.5. Flushing Dirty Memory Mapping Pages to Disk
16.2.6. Non-Linear Memory Mappings
16.3. Direct I/O Transfers
16.4. Asynchronous I/O
16.4.1. Asynchronous I/O in Linux 2.6
16.4.1.1. The asynchronous I/O context
16.4.1.2. Submitting the asynchronous I/O operations
17. Page Frame Reclaiming
17.1. The Page Frame Reclaiming Algorithm
17.1.1. Selecting a Target Page
17.1.2. Design of the PFRA
17.2. Reverse Mapping
17.2.1. Reverse Mapping for Anonymous Pages
17.2.1.1. The try_to_unmap_anon( ) function
17.2.1.2. The try_to_unmap_one( ) function
17.2.2. Reverse Mapping for Mapped Pages
17.2.2.1. The priority search tree
17.2.2.2. The try_to_unmap_file( ) function
17.3. Implementing the PFRA
17.3.1. The Least Recently Used (LRU) Lists
17.3.1.1. Moving pages across the LRU lists
17.3.1.2. The mark_page_accessed( ) function
17.3.1.3. The page_referenced( ) function
17.3.1.4. The refill_inactive_zone( ) function
17.3.2. Low On Memory Reclaiming
17.3.2.1. The free_more_memory( ) function
17.3.2.2. The try_to_free_pages( ) function
17.3.2.3. The shrink_caches( ) function
17.3.2.4. The shrink_zone( ) function
17.3.2.5. The shrink_cache( ) function
17.3.2.6. The shrink_list( ) function
17.3.2.7. The pageout( ) function
17.3.3. Reclaiming Pages of Shrinkable Disk Caches
17.3.3.1. Reclaiming page frames from the dentry cache
17.3.3.2. Reclaiming page frames from the inode cache
17.3.4. Periodic Reclaiming
17.3.4.1. The kswapd kernel threads
17.3.4.2. The cache_reap( ) function
17.3.5. The Out of Memory Killer
17.3.6. The Swap Token
17.4. Swapping
17.4.1. Swap Area
17.4.1.1. Creating and activating a swap area
17.4.1.2. How to distribute pages in the swap areas
17.4.2. Swap Area Descriptor
17.4.3. Swapped-Out Page Identifier
17.4.4. Activating and Deactivating a Swap Area
17.4.4.1. The sys_swapon( ) service routine
17.4.4.2. The sys_swapoff( ) service routine
17.4.4.3. The try_to_unuse( ) function
17.4.5. Allocating and Releasing a Page Slot
17.4.5.1. The scan_swap_map( ) function
17.4.5.2. The get_swap_page( ) function
17.4.5.3. The swap_free( ) function
17.4.6. The Swap Cache
17.4.6.1. Swap cache implementation
17.4.6.2. Swap cache helper functions
17.4.7. Swapping Out Pages
17.4.7.1. Inserting the page frame in the swap cache
17.4.7.2. Updating the Page Table entries
17.4.7.3. Writing the page into the swap area
17.4.7.4. Removing the page frame from the swap cache
17.4.8. Swapping in Pages
17.4.8.1. The do_swap_page( ) function
17.4.8.2. The read_swap_cache_async( ) function
18. The Ext2 and Ext3 Filesystems
18.1. General Characteristics of Ext2
18.2. Ext2 Disk Data Structures
18.2.1. Superblock
18.2.2. Group Descriptor and Bitmap
18.2.3. Inode Table
18.2.4. Extended Attributes of an Inode
18.2.5. Access Control Lists
18.2.6. How Various File Types Use Disk Blocks
18.2.6.1. Regular file
18.2.6.2. Directory
18.2.6.3. Symbolic link
18.2.6.4. Device file, pipe, and socket
18.3. Ext2 Memory Data Structures
18.3.1. The Ext2 Superblock Object
18.3.2. The Ext2 inode Object
18.4. Creating the Ext2 Filesystem
18.5. Ext2 Methods
18.5.1. Ext2 Superblock Operations
18.5.2. Ext2 inode Operations
18.5.3. Ext2 File Operations
18.6. Managing Ext2 Disk Space
18.6.1. Creating inodes
18.6.2. Deleting inodes
18.6.3. Data Blocks Addressing
18.6.4. File Holes
18.6.5. Allocating a Data Block
18.6.6. Releasing a Data Block
18.7. The Ext3 Filesystem
18.7.1. Journaling Filesystems
18.7.2. The Ext3 Journaling Filesystem
18.7.3. The Journaling Block Device Layer
18.7.3.1. Log records
18.7.3.2. Atomic operation handles
18.7.3.3. Transactions
18.7.4. How Journaling Works
19. Process Communication
19.1. Pipes
19.1.1. Using a Pipe
19.1.2. Pipe Data Structures
19.1.2.1. The pipefs special filesystem
19.1.3. Creating and Destroying a Pipe
19.1.4. Reading from a Pipe
19.1.5. Writing into a Pipe
19.2. FIFOs
19.2.1. Creating and Opening a FIFO
19.3. System V IPC
19.3.1. Using an IPC Resource
19.3.2. The ipc( ) System Call
19.3.3. IPC Semaphores
19.3.3.1. Undoable semaphore operations
19.3.3.2. The queue of pending requests
19.3.4. IPC Messages
19.3.5. IPC Shared Memory
19.3.5.1. Swapping out pages of IPC shared memory regions
19.3.5.2. Demand paging for IPC shared memory regions
19.4. POSIX Message Queues
20. Program Execution
20.1. Executable Files
20.1.1. Process Credentials and Capabilities
20.1.1.1. Process capabilities
20.1.1.2. The Linux Security Modules framework
20.1.2. Command-Line Arguments and Shell Environment
20.1.3. Libraries
20.1.4. Program Segments and Process Memory Regions
20.1.4.1. Flexible memory region layout
20.1.5. Execution Tracing
20.2. Executable Formats
20.3. Execution Domains
20.4. The exec Functions
A. System Startup
A.1. Prehistoric Age: the BIOS
A.2. Ancient Age: the Boot Loader
A.2.1. Booting Linux from a Disk
A.3. Middle Ages: the setup( ) Function
A.4. Renaissance: the startup_32( ) Functions
A.5. Modern Age: the start_kernel( ) Function
B. Modules
B.1. To Be (a Module) or Not to Be?
B.1.1. Module Licenses
B.2. Module Implementation
B.2.1. Module Usage Counters
B.2.2. Exporting Symbols
B.2.3. Module Dependency
B.3. Linking and Unlinking Modules
B.4. Linking Modules on Demand
B.4.1. The modprobe Program
B.4.2. The request_module( ) Function
C. Bibliography
Books on Unix Kernels
Books on the Linux Kernel
Books on PC Architecture and Technical Manuals on Intel Microprocessors
Other Online Documentation Sources
Research Papers Related to Linux Development
About the Authors
Colophon
Copyright

Content preview from Understanding the Linux Kernel, 3rd Edition

Work Queues

Work queues were introduced in Linux 2.6 and replace a similar construct called the “task queue” used in Linux 2.4. They allow kernel functions to be activated (much like deferrable functions) and later executed by special kernel threads called worker threads.

Despite their similarities, deferrable functions and work queues are quite different. The main difference is that deferrable functions run in interrupt context while functions in work queues run in process context. Running in process context is the only way to execute functions that can block (for instance, functions that need to access some block of data on disk) because, as already observed in the section "Nested Execution of Exception and Interrupt Handlers" earlier in this chapter, no process switch can take place in interrupt context. Neither deferrable functions nor functions in a work queue can access the User Mode address space of a process. In fact, a deferrable function cannot make any assumption about the process that is currently running when it is executed. On the other hand, a function in a work queue is executed by a kernel thread, so there is no User Mode address space to access.
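The activate-now, execute-later behavior described above can be sketched in user space. The following is a minimal single-threaded model, not kernel code: every name in it (`model_work`, `model_workqueue`, `model_queue_work`, `model_run_worker`, `model_count`) is hypothetical and chosen for illustration only. It shows only the ordering of events: queuing a function does not run it, and the function runs when the worker later drains the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; these are NOT the kernel's definitions. */
struct model_work {
    void (*func)(void *data);   /* function to run later */
    void *data;                 /* argument passed to func */
    struct model_work *next;    /* next pending item, in FIFO order */
    int pending;                /* nonzero while the item sits in the queue */
};

struct model_workqueue {
    struct model_work *head;    /* oldest pending item */
    struct model_work *tail;    /* newest pending item */
};

/* Activate a function: link it into the queue without running it.
 * Returns 1 if the item was queued, 0 if it was already pending. */
static int model_queue_work(struct model_workqueue *wq, struct model_work *w)
{
    assert(wq != NULL && w != NULL && w->func != NULL);
    if (w->pending)
        return 0;
    w->pending = 1;
    w->next = NULL;
    if (wq->tail)
        wq->tail->next = w;
    else
        wq->head = w;
    wq->tail = w;
    return 1;
}

/* What a worker would do once it runs: drain the queue in order.
 * Returns the number of functions executed. */
static int model_run_worker(struct model_workqueue *wq)
{
    int executed = 0;

    while (wq->head) {
        struct model_work *w = wq->head;

        wq->head = w->next;
        if (!wq->head)
            wq->tail = NULL;
        w->pending = 0;         /* the item may be queued again from now on */
        w->func(w->data);       /* in this model the function is free to block */
        executed++;
    }
    return executed;
}

/* Example deferred function: increments the counter it is given. */
static void model_count(void *data)
{
    int *counter = data;

    (*counter)++;
}
```

A caller would queue a `model_work` item whose `func` is `model_count` and observe that the counter changes only after `model_run_worker()` is called. In the kernel the draining is done by a separate worker thread running in process context, which is what allows the queued function to block; this sketch leaves threading out entirely.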

Work queue data structures

The main data structure associated with a work queue is a descriptor called workqueue_struct, which contains, among other things, an array of NR_CPUS elements, where NR_CPUS is the maximum number of CPUs in the system. Each element is a descriptor of type cpu_workqueue_struct, whose fields are shown in ...
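The layout described here, one descriptor holding a fixed array with one element per possible CPU, can be sketched as follows. This is a user-space illustration under stated assumptions: `MODEL_NR_CPUS`, the `model_` types, and every field shown are invented for the example, since the preview does not list the real fields of `workqueue_struct` or `cpu_workqueue_struct`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NR_CPUS; the value 4 is arbitrary. */
#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU element; fields are illustrative only. */
struct model_cpu_workqueue {
    int cpu;        /* index of the CPU this element belongs to */
    int pending;    /* number of functions waiting on this CPU */
};

/* Hypothetical work queue descriptor: one element per possible CPU. */
struct model_workqueue_desc {
    const char *name;
    struct model_cpu_workqueue cpu_wq[MODEL_NR_CPUS];
};

/* Initialize every per-CPU element of the descriptor. */
static void model_workqueue_init(struct model_workqueue_desc *wq,
                                 const char *name)
{
    int cpu;

    wq->name = name;
    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        wq->cpu_wq[cpu].cpu = cpu;
        wq->cpu_wq[cpu].pending = 0;
    }
}

/* Return the element that serves the given CPU. */
static struct model_cpu_workqueue *
model_cpu_queue(struct model_workqueue_desc *wq, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return &wq->cpu_wq[cpu];
}
```

Indexing by CPU number means each CPU touches only its own element, so looking up the right queue is a constant-time array access.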

Publisher Resources

ISBN: 0596005652
