Writing a C Compiler

Book description

Compilers are at the heart of everything programmers do, yet even experienced developers find them intimidating. For those eager to truly grasp how compilers work, Writing a C Compiler dispels the mystery. This book guides you through a fun and engaging project where you’ll learn what it takes to compile a real-world programming language to actual assembly code.

Writing a C Compiler will take you step by step through the process of building your own compiler for a significant subset of C—no prior experience with compiler construction or assembly code needed. Once you’ve built a working compiler for the simplest C program, you’ll add new features chapter by chapter. The algorithms in the book are all in pseudocode, so you can implement your compiler in whatever language you like. Along the way, you’ll explore key concepts like:

  • Lexing and parsing: Learn how to write a lexer and recursive descent parser that transform C code into an abstract syntax tree.
  • Program analysis: Discover how to analyze a program to understand its behavior and detect errors.
  • Code generation: Learn how to translate C language constructs like arithmetic operations, function calls, and control-flow statements into x64 assembly code.
  • Optimization techniques: Improve performance with methods like constant folding, dead store elimination, and register allocation.

Compilers aren’t terrifying beasts—and with help from this hands-on, accessible guide, you might even turn them into your friends for life.

Publisher resources

View/Submit Errata

Table of contents

  1. Title Page
  2. Copyright
  3. Dedication
  4. About the Author and the Technical Reviewer
  5. Acknowledgments
  6. Introduction
    1. Who This Book Is For
    2. Why Write a C Compiler?
    3. Compilation from 10,000 Feet
    4. What You’ll Build
    5. How to Use This Book
      1. The Test Suite
      2. Extra Credit Features
    6. Some Advice on Choosing an Implementation Language
    7. System Requirements
      1. Installing GCC and GDB on Linux
      2. Installing the Command Line Developer Tools on macOS
      3. Running on Apple Silicon
      4. Validating Your Setup
    8. Additional Resources
    9. Let’s Go!
  7. Part I: The Basics
    1. 1. A Minimal Compiler
      1. The Four Compiler Passes
      2. Hello, Assembly!
      3. The Compiler Driver
      4. The Lexer
      5. The Parser
        1. An Example Abstract Syntax Tree
        2. The AST Definition
        3. The Formal Grammar
        4. Recursive Descent Parsing
      6. Assembly Generation
      7. Code Emission
      8. Summary
      9. Additional Resources
    2. 2. Unary Operators
      1. Negation and Bitwise Complement in Assembly
      2. The Stack
      3. The Lexer
      4. The Parser
      5. TACKY: A New Intermediate Representation
        1. Defining TACKY
        2. Generating TACKY
        3. Generating Names for Temporary Variables
        4. Updating the Compiler Driver
      6. Assembly Generation
        1. Converting TACKY to Assembly
        2. Replacing Pseudoregisters
        3. Fixing Up Instructions
      7. Code Emission
      8. Summary
      9. Additional Resources
    3. 3. Binary Operators
      1. The Lexer
      2. The Parser
        1. The Trouble with Recursive Descent Parsing
        2. The Adequate Solution: Refactoring the Grammar
        3. The Better Solution: Precedence Climbing
        4. Precedence Climbing in Action
      3. TACKY Generation
      4. Assembly Generation
        1. Doing Arithmetic in Assembly
        2. Converting Binary Operations to Assembly
        3. Replacing Pseudoregisters
        4. Fixing Up the idiv, add, sub, and imul Instructions
      5. Code Emission
      6. Extra Credit: Bitwise Operators
      7. Summary
      8. Additional Resources
    4. 4. Logical and Relational Operators
      1. Short-Circuiting Operators
      2. The Lexer
      3. The Parser
      4. TACKY Generation
        1. Adding Jumps, Copies, and Comparisons to the TACKY IR
        2. Converting Short-Circuiting Operators to TACKY
        3. Generating Labels
      5. Comparisons and Jumps in Assembly
        1. Comparisons and Status Flags
        2. Conditional Set Instructions
        3. Jump Instructions
      6. Assembly Generation
        1. Replacing Pseudoregisters
        2. Fixing Up the cmp Instruction
      7. Code Emission
      8. Summary
      9. Additional Resources
    5. 5. Local Variables
      1. Variables, Declarations, and Assignment
      2. The Lexer
      3. The Parser
        1. The Updated AST and Grammar
        2. An Improved Precedence Climbing Algorithm
      4. Semantic Analysis
        1. Variable Resolution
        2. The --validate Option
      5. TACKY Generation
        1. Variable and Assignment Expressions
        2. Declarations, Statements, and Function Bodies
        3. Functions with No return Statement
      6. Extra Credit: Compound Assignment, Increment, and Decrement
      7. Summary
    6. 6. If Statements and Conditional Expressions
      1. The Lexer
      2. The Parser
        1. Parsing if Statements
        2. Parsing Conditional Expressions
      3. Variable Resolution
      4. TACKY Generation
        1. Converting if Statements to TACKY
        2. Converting Conditional Expressions to TACKY
      5. Extra Credit: Labeled Statements and goto
      6. Summary
    7. 7. Compound Statements
      1. The Scoop on Scopes
      2. The Parser
      3. Variable Resolution
        1. Resolving Variables in Multiple Scopes
        2. Updating the Variable Resolution Pseudocode
      4. TACKY Generation
      5. Summary
    8. 8. Loops
      1. Loops and How to Escape Them
      2. The Lexer
      3. The Parser
      4. Semantic Analysis
        1. Extending Variable Resolution
        2. Loop Labeling
        3. Implementing Loop Labeling
      5. TACKY Generation
        1. break and continue Statements
        2. do Loops
        3. while Loops
        4. for Loops
      6. Extra Credit: switch Statements
      7. Summary
    9. 9. Functions
      1. Declaring, Defining, and Calling Functions
        1. Declarations and Definitions
        2. Function Calls
        3. Identifier Linkage
      2. Compiling Libraries
      3. The Lexer
      4. The Parser
      5. Semantic Analysis
        1. Extending Identifier Resolution
        2. Writing the Type Checker
      6. TACKY Generation
      7. Assembly Generation
        1. Understanding Calling Conventions
        2. Calling Functions with the System V ABI
        3. Converting Function Calls and Definitions to Assembly
        4. Replacing Pseudoregisters
        5. Allocating Stack Space During Instruction Fix-Up
      8. Code Emission
      9. Calling Library Functions
      10. Summary
    10. 10. File Scope Variable Declarations and Storage-Class Specifiers
      1. All About Declarations
        1. Scope
        2. Linkage
        3. Storage Duration
        4. Definitions vs. Declarations
        5. Error Cases
      2. Linkage and Storage Duration in Assembly
      3. The Lexer
      4. The Parser
        1. Parsing Type and Storage-Class Specifiers
        2. Distinguishing Between Function and Variable Declarations
      5. Semantic Analysis
        1. Identifier Resolution: Resolving External Variables
        2. Type Checking: Tracking Static Functions and Variables
      6. TACKY Generation
      7. Assembly Generation
        1. Generating Assembly for Variable Definitions
        2. Replacing Pseudoregisters According to Their Storage Duration
        3. Fixing Up Instructions
      8. Code Emission
      9. Summary
  8. Part II: Types Beyond Int
    1. 11. Long Integers
      1. Long Integers in Assembly
        1. Type Conversions
        2. Static Long Variables
      2. The Lexer
      3. The Parser
      4. Semantic Analysis
        1. Adding Type Information to the AST
        2. Type Checking Expressions
        3. Type Checking return Statements
        4. Type Checking Declarations and Updating the Symbol Table
      5. TACKY Generation
        1. Tracking the Types of Temporary Variables
        2. Generating Extra Return Instructions
      6. Assembly Generation
        1. Tracking Assembly Types in the Backend Symbol Table
        2. Replacing Longword and Quadword Pseudoregisters
        3. Fixing Up Instructions
      7. Code Emission
      8. Summary
    2. 12. Unsigned Integers
      1. Type Conversions, Again
        1. Converting Between Signed and Unsigned Types of the Same Size
        2. Converting unsigned int to a Larger Type
        3. Converting signed int to a Larger Type
        4. Converting from Larger to Smaller Types
      2. The Lexer
      3. The Parser
      4. The Type Checker
      5. TACKY Generation
      6. Unsigned Integer Operations in Assembly
        1. Unsigned Comparisons
        2. Unsigned Division
        3. Zero Extension
      7. Assembly Generation
        1. Replacing Pseudoregisters
        2. Fixing Up the Div and MovZeroExtend Instructions
      8. Code Emission
      9. Summary
    3. 13. Floating-Point Numbers
      1. IEEE 754, What Is It Good For?
      2. The IEEE 754 Double-Precision Format
      3. Rounding Behavior
        1. Rounding Modes
        2. Rounding Constants
        3. Rounding Type Conversions
        4. Rounding Arithmetic Operations
      4. Linking Shared Libraries
      5. The Lexer
        1. Recognizing Floating-Point Constant Tokens
        2. Matching the End of a Constant
      6. The Parser
      7. The Type Checker
      8. TACKY Generation
      9. Floating-Point Operations in Assembly
        1. Working with SSE Instructions
        2. Using Floating-Point Values in the System V Calling Convention
        3. Doing Arithmetic with SSE Instructions
        4. Comparing Floating-Point Numbers
        5. Converting Between Floating-Point and Integer Types
      10. Assembly Generation
        1. Floating-Point Constants
        2. Unary Instructions, Binary Instructions, and Conditional Jumps
        3. Type Conversions
        4. Function Calls
        5. Return Instructions
        6. The Complete Conversion from TACKY to Assembly
        7. Pseudoregister Replacement
        8. Instruction Fix-Up
      11. Code Emission
        1. Formatting Floating-Point Numbers
        2. Labeling Floating-Point Constants
        3. Storing Constants in the Read-Only Data Section
        4. Initializing Static Variables to 0.0 or –0.0
        5. Putting It All Together
      12. Extra Credit: NaN
      13. Summary
      14. Additional Resources
    4. 14. Pointers
      1. Objects and Values
      2. Operations on Pointers
        1. Address and Dereference Operations
        2. Null Pointers and Type Conversions
        3. Pointer Comparisons
        4. & Operations on Dereferenced Pointers
      3. The Lexer
      4. The Parser
        1. Parsing Declarations
        2. Parsing Type Names
        3. Putting It All Together
      5. Semantic Analysis
        1. Type Checking Pointer Expressions
        2. Tracking Static Pointer Initializers in the Symbol Table
      6. TACKY Generation
        1. Pointer Operations in TACKY
        2. A Strategy for TACKY Conversion
      7. Assembly Generation
        1. Replacing Pseudoregisters with Memory Operands
        2. Fixing Up the lea and push Instructions
      8. Code Emission
      9. Summary
    5. 15. Arrays and Pointer Arithmetic
      1. Arrays and Pointer Arithmetic
        1. Array Declarations and Initializers
        2. Memory Layout of Arrays
        3. Array-to-Pointer Decay
        4. Pointer Arithmetic to Access Array Elements
        5. Even More Pointer Arithmetic
        6. Array Types in Function Declarations
        7. Things We Aren’t Implementing
      2. The Lexer
      3. The Parser
        1. Parsing Array Declarators
        2. Parsing Abstract Array Declarators
        3. Parsing Compound Initializers
        4. Parsing Subscript Expressions
      4. The Type Checker
        1. Converting Arrays to Pointers
        2. Validating Lvalues
        3. Type Checking Pointer Arithmetic
        4. Type Checking Subscript Expressions
        5. Type Checking Cast Expressions
        6. Type Checking Function Declarations
        7. Type Checking Compound Initializers
        8. Initializing Static Arrays
        9. Initializing Scalar Variables with ZeroInit
      5. TACKY Generation
        1. Pointer Arithmetic
        2. Subscripting
        3. Compound Initializers
        4. Tentative Array Definitions
      6. Assembly Generation
        1. Converting TACKY to Assembly
        2. Replacing PseudoMem Operands
        3. Fixing Up Instructions
      7. Code Emission
      8. Summary
    6. 16. Characters and Strings
      1. Character Traits
      2. String Literals
      3. Working with Strings in Assembly
      4. The Lexer
      5. The Parser
        1. Parsing Type Specifiers
        2. Parsing Character Constants
        3. Parsing String Literals
        4. Putting It All Together
      6. The Type Checker
        1. Characters
        2. String Literals in Expressions
        3. String Literals Initializing Non-static Variables
        4. String Literals Initializing Static Variables
      7. TACKY Generation
        1. String Literals as Array Initializers
        2. String Literals in Expressions
        3. Top-Level Constants in TACKY
      8. Assembly Generation
        1. Operations on Characters
        2. Top-Level Constants
        3. The Complete Conversion from TACKY to Assembly
        4. Pseudo-Operand Replacement
        5. Instruction Fix-Up
      9. Code Emission
      10. Hello Again, World!
      11. Summary
    7. 17. Supporting Dynamic Memory Allocation
      1. The void Type
      2. Memory Management with void *
      3. Complete and Incomplete Types
      4. The sizeof Operator
      5. The Lexer
      6. The Parser
      7. The Type Checker
        1. Conversions to and from void *
        2. Functions with void Return Types
        3. Scalar and Non-scalar Types
        4. Restrictions on Incomplete Types
        5. Extra Restrictions on void
        6. Conditional Expressions with void Operands
        7. Existing Validation for Arithmetic Expressions and Comparisons
        8. sizeof Expressions
      8. TACKY Generation
        1. Functions with void Return Types
        2. Casts to void
        3. Conditional Expressions with void Operands
        4. sizeof Expressions
        5. The Latest and Greatest TACKY IR
      9. Assembly Generation
      10. Summary
    8. 18. Structures
      1. Declaring Structure Types
        1. Structure Member Declarations
        2. Tag and Member Namespaces
        3. Structure Type Declarations We Aren’t Implementing
      2. Operating on Structures
      3. Structure Layout in Memory
      4. The Lexer
      5. The Parser
      6. Semantic Analysis
        1. Resolving Structure Tags
        2. Type Checking Structures
      7. TACKY Generation
        1. Implementing the Member Access Operators
        2. Converting Compound Initializers to TACKY
      8. Structures in the System V Calling Convention
        1. Classifying Structures
        2. Passing Parameters of Structure Type
        3. Returning Structures
      9. Assembly Generation
        1. Extending the Assembly AST
        2. Copying Structures
        3. Using Structures in Function Calls
        4. Putting It All Together
        5. Replacing Pseudo-operands
      10. Code Emission
      11. Extra Credit: Unions
      12. Summary
      13. Additional Resources
  9. Part III: Optimizations
    1. 19. Optimizing Tacky Programs
      1. Safety and Observable Behavior
      2. Four TACKY Optimizations
        1. Constant Folding
        2. Unreachable Code Elimination
        3. Copy Propagation
        4. Dead Store Elimination
        5. With Our Powers Combined …
      3. Testing the Optimization Passes
      4. Wiring Up the Optimization Stage
      5. Constant Folding
        1. Constant Folding for Part I TACKY Programs
        2. Supporting Part II TACKY Programs
      6. Control-Flow Graphs
        1. Defining the Control-Flow Graph
        2. Creating Basic Blocks
        3. Adding Edges to the Control-Flow Graph
        4. Converting a Control-Flow Graph to a List of Instructions
        5. Making Your Control-Flow Graph Code Reusable
      7. Unreachable Code Elimination
        1. Eliminating Unreachable Blocks
        2. Removing Useless Jumps
        3. Removing Useless Labels
        4. Removing Empty Blocks
      8. A Little Bit About Data-Flow Analysis
      9. Copy Propagation
        1. Reaching Copies Analysis
        2. Rewriting TACKY Instructions
        3. Supporting Part II TACKY Programs
      10. Dead Store Elimination
        1. Liveness Analysis
        2. Removing Dead Stores
        3. Supporting Part II TACKY Programs
      11. Summary
      12. Additional Resources
    2. 20. Register Allocation
      1. Register Allocation in Action
        1. Take One: Put Everything on the Stack
        2. Take Two: Register Allocation
        3. Take Three: Register Allocation with Coalescing
      2. Updating the Compiler Pipeline
      3. Extending the Assembly AST
      4. Converting TACKY to Assembly
      5. Register Allocation by Graph Coloring
        1. Detecting Interference
        2. Spilling Registers
      6. The Basic Register Allocator
        1. Handling Multiple Types During Register Allocation
        2. Defining the Interference Graph
        3. Building the Interference Graph
        4. Calculating Spill Costs
        5. Coloring the Interference Graph
        6. Building the Register Map and Rewriting the Function Body
      7. Instruction Fix-Up with Callee-Saved Registers
      8. Code Emission
      9. Register Coalescing
        1. Updating the Interference Graph
        2. Conservative Coalescing
        3. Implementing Register Coalescing
      10. Summary
      11. Additional Resources
  10. Next Steps
    1. Add Some Missing Features
    2. Handle Undefined Behavior Safely
    3. Write More TACKY Optimizations
    4. Support Another Target Architecture
    5. Contribute to an Open Source Programming Language Project
    6. That’s a Wrap!
  11. Appendix A. Debugging Assembly Code With GDB or LLDB
    1. The Program
    2. Debugging with GDB
      1. Configuring the GDB UI
      2. Starting and Stopping the Program
      3. Printing Expressions
      4. Examining Memory
      5. Setting Conditional Breakpoints
      6. Getting Help
    3. Debugging with LLDB
      1. Starting and Stopping the Program
      2. Displaying Assembly Code
      3. Printing Expressions
      4. Examining Memory
      5. Setting Conditional Breakpoints
      6. Getting Help
  12. Appendix B. Assembly Generation and Code Emission Tables
    1. Part I
      1. Converting TACKY to Assembly
      2. Code Emission
    2. Part II
      1. Converting TACKY to Assembly
      2. Code Emission
    3. Part III
  13. References
  14. Index

Product information

  • Title: Writing a C Compiler
  • Author(s): Nora Sandler
  • Release date: August 2024
  • Publisher(s): No Starch Press
  • ISBN: 9781718500426