Build Your Own Programming Language - Second Edition

Book description

Embark on a journey through essential components of language design, compiler construction, preprocessors, transpilers, and runtime systems in this second edition, authored by the creator of the Unicon programming language. Purchase of the print or Kindle book includes a free PDF eBook

Key Features

  • Takes a hands-on approach; learn by building the Jzero language, a subset of Java, with example code shown in both the Java and Unicon languages
  • Learn how to create parsers, code generators, scanners, and interpreters
  • Target bytecode, native code, and preprocess or transpile code into a high-level language

Book Description

There are many reasons to build a programming language: out of necessity, as a learning exercise, or just for fun. Whatever your reasons, this book gives you the tools to succeed.

You’ll build the frontend of a compiler for your language and generate a lexical analyzer and parser using Lex and YACC tools. Then you’ll explore a series of syntax tree traversals before looking at code generation for a bytecode virtual machine or native code. In this edition, a new chapter has been added to assist you in comprehending the nuances and distinctions between preprocessors and transpilers. Code examples have been modernized, expanded, and rigorously tested, and all content has undergone thorough refreshing. You’ll learn to implement code generation techniques using practical examples, including the Unicon Preprocessor and transpiling Jzero code to Unicon. You'll move to domain-specific language features and learn to create them as built-in operators and functions. You’ll also cover garbage collection.

Dr. Jeffery’s experiences building the Unicon language are used to add context to the concepts, and relevant examples are provided in both Unicon and Java so that you can follow along in your language of choice.

By the end of this book, you'll be able to build and deploy your own domain-specific language.

What you will learn

  • Analyze requirements for your language and design syntax and semantics.
  • Write grammar rules for common expressions and control structures.
  • Build a scanner to read source code and generate a parser to check syntax.
  • Implement syntax-coloring for your code in IDEs like VS Code.
  • Write tree traversals and insert information into the syntax tree.
  • Implement a bytecode interpreter and run bytecode from your compiler.
  • Write native code and run it after assembling and linking using system tools.
  • Preprocess and transpile code into another high-level language

Who this book is for

This book is for software developers interested in the idea of inventing their own language or developing a domain-specific language. Computer science students taking compiler design or construction courses will also find this book highly useful as a practical guide to language implementation to supplement more theoretical textbooks. Intermediate or better proficiency in Java or C++ programming languages (or another high-level programming language) is assumed.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Section I: Programming Language Frontends
  3. Why Build Another Programming Language?
    1. Motivations for writing your own programming language
    2. Types of programming language implementations
    3. Organizing a bytecode language implementation
    4. Languages used in the examples
    5. The difference between programming languages and libraries
    6. Applicability to other software engineering tasks
    7. Establishing the requirements for your language
    8. Case study – requirements that inspired the Unicon language
      1. Unicon requirement #1 – preserve what people love about Icon
      2. Unicon requirement #2 – support large-scale programs working on big data
      3. Unicon requirement #3 – high-level input/output for modern applications
      4. Unicon requirement #4 – provide universally implementable system interfaces
    9. Summary
    10. Questions
  4. Programming Language Design
    1. Determining the kinds of words and punctuation to provide in your language
    2. Specifying the control flow
    3. Deciding on what kinds of data to support
      1. Atomic types
      2. Composite types
      3. Domain-specific types
    4. Overall program structure
    5. Completing the Jzero language definition
    6. Case study – designing graphics facilities in Unicon
      1. Language support for 2D graphics
      2. Adding support for 3D graphics
    7. Summary
    8. Questions
  5. Scanning Source Code
    1. Technical requirements
    2. Lexemes, lexical categories, and tokens
    3. Regular expressions
      1. Regular expression rules
      2. Regular expression examples
    4. Using UFlex and JFlex
      1. Header section
      2. Regular expressions section
        1. Writing a simple source code scanner
      3. Running your scanner
      4. Tokens and lexical attributes
      5. Expanding our example to construct tokens
    5. Writing a scanner for Jzero
      1. The Jzero flex specification
      2. Unicon Jzero code
      3. Java Jzero code
      4. Running the Jzero scanner
    6. Regular expressions are not always enough
    7. Summary
    8. Questions
  6. Parsing
    1. Technical requirements
    2. Syntax analysis
    3. Context-free grammars
      1. Writing context-free grammar rules
      2. Writing rules for programming constructs
    4. Using iyacc and BYACC/J
      1. Declaring symbols in the header section
      2. Advanced yacc declarations
      3. Putting together the yacc context-free grammar section
      4. Understanding yacc parsers
      5. Fixing conflicts in yacc parsers
      6. Syntax error recovery
      7. Putting together a toy example
    5. Writing a parser for Jzero
      1. The Jzero lex specification
      2. The Jzero yacc specification
      3. Unicon Jzero code
      4. Java Jzero parser code
      5. Running the Jzero parser
      6. Improving syntax error messages
      7. Adding detail to Unicon syntax error messages
      8. Adding detail to Java syntax error messages
      9. Using Merr to generate better syntax error messages
    6. Summary
    7. Questions
  7. Syntax Trees
    1. Technical requirements
    2. Using GNU Make
    3. Learning about trees
      1. Defining a syntax tree type
      2. Parse trees versus syntax trees
    4. Creating leaves from terminal symbols
      1. Wrapping tokens in leaves
      2. Working with YACC’s value stack
      3. Wrapping leaves for the parser’s value stack
      4. Determining which leaves you need
    5. Building internal nodes from production rules
      1. Accessing tree nodes on the value stack
      2. Using the tree node factory method
    6. Forming syntax trees for the Jzero language
    7. Debugging and testing your syntax tree
      1. Avoiding common syntax tree bugs
      2. Printing your tree in a text format
      3. Printing your tree using dot
    8. Summary
    9. Questions
  8. Section II: Syntax Tree Traversals
  9. Symbol Tables
    1. Technical requirements
    2. Establishing the groundwork for symbol tables
      1. Declarations and scopes
      2. Assigning and dereferencing variables
      3. Choosing the right tree traversal for the job
    3. Creating and populating symbol tables for each scope
      1. Adding semantic attributes to syntax trees
      2. Defining classes for symbol tables and symbol table entries
      3. Creating symbol tables
      4. Populating symbol tables
      5. Synthesizing the isConst attribute
    4. Checking for undeclared variables
      1. Identifying the bodies of methods
      2. Spotting uses of variables within method bodies
    5. Finding redeclared variables
      1. Inserting symbols into the symbol table
      2. Reporting semantic errors
    6. Handling package and class scopes in Unicon
      1. Mangling names
      2. Inserting self for member variable references
      3. Inserting self as the first parameter in method calls
    7. Testing and debugging symbol tables
    8. Summary
    9. Questions
  10. Checking Base Types
    1. Technical requirements
    2. Type representation in the compiler
      1. Defining a base class for representing types
      2. Subclassing the base class for complex types
    3. Assigning type information to declared variables
      1. Synthesizing types from reserved words
      2. Inheriting types into a list of variables
    4. Determining the type at each syntax tree node
      1. Determining the type at the leaves
      2. Calculating and checking the types at internal nodes
    5. Runtime type checks and type inference in Unicon
    6. Summary
    7. Questions
  11. Checking Types on Arrays, Method Calls, and Structure Accesses
    1. Technical requirements
    2. Checking operations on array types
      1. Handling array variable declarations
      2. Checking types during array creation
      3. Checking types during array accesses
    3. Checking method calls
      1. Calculating the parameters and return type information
      2. Checking the types at each method call site
      3. Checking the type at return statements
    4. Checking structured type accesses
      1. Handling instance variable declarations
      2. Checking types at instance creation
      3. Checking types of instance accesses
    5. Summary
    6. Questions
  12. Intermediate Code Generation
    1. Technical requirements
    2. What is intermediate code?
      1. Why generate intermediate code?
      2. Learning about the memory regions in the generated program
      3. Introducing data types for intermediate code
      4. Adding the intermediate code attributes to the tree
      5. Generating labels and temporary variables
    3. An intermediate code instruction set
      1. Instructions
      2. Declarations
    4. Annotating syntax trees with labels for control flow
    5. Generating code for expressions
    6. Generating code for control flow
      1. Generating label targets for condition expressions
      2. Generating code for loops
      3. Generating intermediate code for method calls
      4. Reviewing the generated intermediate code
    7. Summary
    8. Questions
  13. Syntax Coloring in an IDE
    1. Writing your own IDE versus supporting an existing one
    2. Downloading the software used in this chapter
    3. Adding support for your language to Visual Studio Code
      1. Configuring Visual Studio Code to do Syntax Highlighting for Jzero
      2. Visual Studio Code extensions using the JSON format
        1. JSON atomic types
        2. JSON collections
      3. File organization for Visual Studio Code extensions
        1. The extensions file
        2. The extension manifest
      4. Writing IDE tokenization rules using TextMate grammars
    4. Integrating a compiler into a programmer’s editor
      1. Analyzing source code from within the IDE
      2. Sending compiler output to the IDE
    5. Avoiding reparsing the entire file on every change
    6. Using lexical information to colorize tokens
      1. Extending the EditableTextList component to support color
      2. Coloring individual tokens as they are drawn
    7. Highlighting errors using parse results
    8. Summary
    9. Questions
  14. Section III: Code Generation and Runtime Systems
  15. Preprocessors and Transpilers
    1. Understanding preprocessors
      1. A preprocessing example
      2. Identity preprocessors and pretty printers
      3. The preprocessor within the Unicon preprocessor
    2. Code generation in the Unicon preprocessor
      1. Transforming objects into classes
      2. Generating source code from the syntax tree
      3. Closure-based inheritance in Unicon
    3. The difference between preprocessors and transpilers
    4. Transpiling Jzero code to Unicon
      1. Semantic attributes for transpiling to Unicon
      2. A code generation model for Jzero
      3. The Jzero to Unicon transpiler code generation method
      4. Transpiling the base cases: names and literals
      5. Handling the dot operator
      6. Mapping Java expressions to Unicon
      7. Transpiler code for method calls
      8. Assignments
      9. Transpiler code for control structures
      10. Transpiling Jzero declarations
      11. Transpiling Jzero block statements
      12. Transpiling a Jzero class into a Unicon package that contains a class
    5. Summary
    6. Questions
  16. Bytecode Interpreters
    1. Technical requirements
    2. Understanding what bytecode is
    3. Comparing bytecode with intermediate code
    4. Building a bytecode instruction set for Jzero
      1. Defining the Jzero bytecode file format
      2. Understanding the basics of stack machine operation
    5. Implementing a bytecode interpreter
      1. Loading bytecode into memory
      2. Initializing the interpreter state
      3. Fetching instructions and advancing the instruction pointer
      4. Instruction decoding
      5. Executing instructions
      6. Starting up the Jzero interpreter
    6. Writing a runtime system for Jzero
    7. Running a Jzero program
    8. Examining iconx, the Unicon bytecode interpreter
      1. Understanding goal-directed bytecode
      2. Leaving type information in at runtime
      3. Fetching, decoding, and executing instructions
      4. Crafting the rest of the runtime system
    9. Summary
    10. Questions
  17. Generating Bytecode
    1. Technical requirements
    2. Converting intermediate code to Jzero bytecode
      1. Adding a class for bytecode instructions
      2. Mapping intermediate code addresses to bytecode addresses
      3. Implementing the bytecode generator method
      4. Generating bytecode for simple expressions
      5. Generating code for pointer manipulation
      6. Generating bytecode for branches and conditional branches
      7. Generating code for method calls and returns
      8. Handling labels and other pseudo-instructions in intermediate code
    3. Comparing bytecode assembler with binary formats
      1. Printing bytecode in assembler format
      2. Printing bytecode in binary format
    4. Linking, loading, and including the runtime system
    5. Unicon example – bytecode generation in icont
    6. Summary
    7. Questions
  18. Native Code Generation
    1. Technical requirements
    2. Deciding whether to generate native code
    3. Introducing the x64 instruction set
      1. Adding a class for x64 instructions
      2. Mapping memory regions to x64 register-based address modes
    4. Using registers
      1. Starting from a null strategy
      2. Assigning registers to speed up the local region
    5. Converting intermediate code to x64 code
      1. Mapping intermediate code addresses to x64 locations
      2. Implementing the x64 code generator method
      3. Generating x64 code for simple expressions
      4. Generating code for pointer manipulation
      5. Generating native code for branches and conditional branches
      6. Generating code for method calls and returns
      7. Handling labels and pseudo-instructions
    6. Generating x64 output
      1. Writing the x64 code in assembly language format
      2. Going from native assembler to an object file
      3. Linking, loading, and including the runtime system
    7. Summary
    8. Questions
    9. Leave a review!
  19. Implementing Operators and Built-In Functions
    1. Implementing operators
      1. Comparing adding operators to adding new hardware
      2. Implementing string concatenation in intermediate code
      3. Adding String concatenation to the bytecode interpreter
      4. Adding String concatenation to the native runtime system
    2. Writing built-in functions
      1. Adding built-in functions to the bytecode interpreter
      2. Writing built-in functions for use with the native code implementation
    3. Integrating built-ins with control structures
    4. Developing operators and functions for Unicon
      1. Writing operators in Unicon
      2. Developing Unicon’s built-in functions
    5. Summary
    6. Questions
  20. Domain Control Structures
    1. Knowing when a new control structure is needed
    2. Scanning strings in Icon and Unicon
      1. Scanning environments and their primitive operations
      2. Eliminating excessive parameters via a control structure
    3. Rendering regions in Unicon
      1. Rendering 3D graphics from a display list
      2. Specifying rendering regions using built-in functions
      3. Varying levels of detail using nested rendering regions
      4. Creating a rendering region control structure
        1. Adding a reserved word for rendering regions
        2. Adding a grammar rule
        3. Checking wsection for semantic errors
        4. Generating code for a wsection control structure
    4. Summary
    5. Questions
  21. Garbage Collection
    1. Grasping the importance of garbage collection
    2. Counting references to objects
      1. Adding reference counting to Jzero
      2. Reducing the number of heap allocations for strings
      3. Modifying the generated code for the assignment operator
      4. Modifying the generated code for method call and return
      5. The drawbacks and limitations of reference counting
    3. Marking live data and sweeping the rest
      1. Organizing heap memory regions
      2. Traversing the basis to mark live data
        1. Marking the block region
      3. Reclaiming live memory and placing it into contiguous chunks
    4. Summary
    5. Questions
  22. Final Thoughts
    1. Reflecting on what was learned from writing this book
    2. Deciding where to go from here
      1. Studying programming language design
      2. Learning about implementing interpreters and bytecode machines
      3. Acquiring expertise in code optimization
      4. Monitoring and debugging program executions
      5. Designing and implementing IDEs and GUI builders
    3. Exploring references for further reading
      1. Studying programming language design
      2. Learning about implementing interpreters and bytecode machines
      3. Acquiring expertise in native code and code optimization
      4. Monitoring and debugging program executions
      5. Designing and implementing IDEs and GUI builders
    4. Summary
  23. Section IV: Appendix
  24. Appendix: Unicon Essentials
    1. Syntactic shorthand
    2. Running Unicon
    3. Using Unicon’s declarations and data types
      1. Declaring program components
      2. Using atomic data types
        1. Numeric
        2. Textual
      3. Aggregating multiple values using structure types
        1. Classes
        2. Lists
        3. Tables
        4. Sets
        5. Files
        6. Other types
    4. Evaluating expressions
      1. Forming expressions using operators
      2. Invoking procedures, functions, and methods
      3. Iterating and selecting what and how to execute
      4. Generators
    5. Debugging and environmental issues
      1. Learning the basics of the UDB debugger
      2. Environment variables
      3. Preprocessor
        1. Preprocessor commands
        2. Built-in macro definitions
    6. Function mini-reference
    7. Selected keywords
  25. Answers
  26. Other Books You May Enjoy
  27. Index

Product information

  • Title: Build Your Own Programming Language - Second Edition
  • Author(s): Clinton L. Jeffery
  • Release date: January 2024
  • Publisher(s): Packt Publishing
  • ISBN: 9781804618028