BUY THIS BOOK
Add to Cart

Print Book $29.95


Safari Books Online

What is this?

Add to UK Cart

Print Book £20.95

What is this?

Looking to Reprint this content?


lex & yacc
lex & yacc, Second Edition

By John Levine, Tony Mason, Doug Brown
Price: $29.95 USD
£20.95 GBP€59.00 EUR$59.95 AUD

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Lex and Yacc
Lex and yacc help you write programs that transform structured input. This includes an enormous range of applications—anything from a simple text search program that looks for patterns in its input file to a C compiler that transforms a source program into optimized object code.
In programs with structured input, two tasks that occur over and over are dividing the input into meaningful units, and then discovering the relationship among the units. For a text search program, the units would probably be lines of text, with a distinction between lines that contain a match of the target string and lines that don't. For a C program, the units are variable names, constants, strings, operators, punctuation, and so forth. This division into units (which are usually called tokens) is known as lexical analysis, or lexing for short. Lex helps you by taking a set of descriptions of possible tokens and producing a C routine, which we call a lexical analyzer, or a lexer, or a scanner for short, that can identify those tokens. The set of descriptions you give to lex is called a lex specification.
The token descriptions that lex uses are known as regular expressions, extended versions of the familiar patterns used by the grep and egrep commands. Lex turns these regular expressions into a form that the lexer can use to scan the input text extremely fast, independent of the number of expressions that it is trying to match. A lex lexer is almost always faster than a lexer that you might write in C by hand.
As the input is divided into tokens, a program often needs to establish the relationship among the tokens. A C compiler needs to find the expressions, statements, declarations, blocks, and procedures in the program. This task is known as parsing and the list of rules that define the relationships that the program understands is a grammar. Yacc takes a concise description of a grammar and produces a C routine that can parse that grammar, a parser. The yacc parser automatically detects whenever a sequence of input tokens matches one of the rules in the grammar and also detects a syntax error whenever its input doesn't match any of the rules. A yacc parser is generally not as fast as a parser you could write by hand, but the ease in writing and modifying the parser is invariably worth any speed loss. The amount of time a program spends in a parser is rarely enough to be an issue anyway.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Using Lex
In the first chapter we demonstrated how to use lex and yacc. We now show how to use lex by itself, including some examples of applications for which lex is a good tool. We're not going to explain every last detail of lex here; consult Chapter 6, A Reference for Lex Specifications .
Lex is a tool for building lexical analyzers or lexers. A lexer takes an arbitrary input stream and tokenizes it, i.e., divides it up into lexical tokens. This tokenized output can then be processed further, usually by yacc, or it can be the "end product." In Chapter 1 we demonstrated how to use it as an intermediate step in our English grammar. We now look more closely at the details of a lex specification and how to use it; our examples use lex as the final processing step rather than as an intermediate step which passes information on to a yacc-based parser.
When you write a lex specification, you create a set of patterns which lex matches against the input. Each time one of the patterns matches, the lex program invokes C code that you provide which does something with the matched text. In this way a lex program divides the input into strings which we call tokens. Lex itself doesn't produce an executable program; instead it translates the lex specification into a file containing a C routine called yylex(). Your program calls yylex() to run the lexer.
Using your regular C compiler, you compile the file that lex produced along with any other files and libraries you want. (Note that lex and the C compiler don't even have to run on the same computer. The authors have often taken the C code from UNIX lex to other computers where lex is not available but C is.)
Before we describe the structure of a lex specification, we need to describe regular expressions as used by lex. Regular expressions are widely used within the UNIX environment, and lex uses a rich regular expression language.
A regular expression is a pattern description using a "meta" language, a language that you use to describe particular patterns of interest. The characters used in this metalanguage are part of the standard ASCII character set used in UNIX and MS-DOS, which can sometimes lead to confusion. The characters that form regular expressions are:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Using Yacc
The previous chapter concentrated on lex alone. In this chapter we turn our attention to yacc, although we use lex to generate our lexical analyzers. Where lex recognizes regular expressions, yacc recognizes entire grammars. Lex divides the input stream into pieces (tokens) and then yacc takes these pieces and groups them together logically.
In this chapter we create a desk calculator, starting with simple arithmetic, then adding built-in functions, user variables, and finally user-defined functions.
Yacc takes a grammar that you specify and writes a parser that recognizes valid "sentences" in that grammar. We use the term "sentence" here in a fairly general way—for a C language grammar the sentences are syntactically valid C programs.
As we saw in Chapter 1, a grammar is a series of rules that the parser uses to recognize syntactically valid input. For example, here is a version of the grammar we'll use later in this chapter to build a calculator.
            statement → NAME = expression

        expression → NUMBER + NUMBER | NUMBER − NUMBER
         
The vertical bar, "|", means there are two possibilities for the same symbol, i.e., an expression can be either an addition or a subtraction. The symbol to the left of the → is known as the left-hand side of the rule, often abbreviated LHS, and the symbols to the right are the right-hand side, usually abbreviated RHS. Several rules may have the same left-hand side; the vertical bar is just a short hand for this. Symbols that actually appear in the input and are returned by the lexer are terminal symbols or tokens, while those that appear on the left-hand side of some rule are non-terminal symbols or non-terminals. Terminal and non-terminal symbols must be different; it is an error to write a rule with a token on the left side.
The usual way to represent a parsed sentence is as a tree. For example, if we parsed the input "fred = 12 + 13" with this grammar, the tree would look like Figure 3-1. "12 + 13" is an
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: A Menu Generation Language
The previous chapter provided a simple example of an interpreter, a desktop calculator. In this chapter, we turn our attention to compiler design by developing a menu generation language (MGL) and its associated compiler. We begin with a description of the language that we are going to create. Then we look at several iterations of developing the lex and yacc specifications. Lastly, we create the actions associated with our grammar and which implement the features of the MGL.
We'll develop a language that can be used to generate custom menu interfaces. It will take as input a description file and produce a C program that can be compiled to create this output on a user's terminal, using the standard curses library to draw the menus on the screen.
In many cases when an application requires a lot of tedious and repetitive code, it is faster and easier to design a special purpose language and write a little compiler that translates your language into C or something else your computer can already handle. Curses programming is tedious, because you have to position all of the data on the screen yourself. MGL automates most of the layout, greatly easing the job.
The menu description consists of the following:
  1. A name for the menu screen
  2. A title or titles
  3. A list of menu items, each consisting of:
    item
    [ command ]
    action
    [ attribute ]
    where item is the text string that appears on the menu, command is the mnemonic used to provide command-line access to the functions of the menu system, action is the procedure that should be performed when a menu item is chosen, and attribute indicates how this item should be handled. The bracketed items are optional.
  4. A terminator
Since a useful application usually has several menus, a description file can contain several different named menus.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 5: Parsing SQL
SQL (which stands for Structured Query Language and is usually pronounced sequel) is the most common language used to handle relational data bases. First we'll develop a SQL parser that checks the syntax of its input but doesn't do anything else with it. Then we'll turn that into a preprocessor for SQL embedded in C programs.
This parser is based on the definition of SQL in C. J. Date, A Guide to the SQL Standard, Second Edition, Addison-Wesley, 1989. Date's description is written in Backus-Naur Form or BNF, a standard form used to write formal language descriptions. Yacc input syntax is similar to BNF except for the punctuation, so in many places it was sufficient to transliterate the BNF to get the corresponding yacc rules. In most cases we use the same symbol names Date did, although in a few places we've deviated from his usage in order to make the grammar suitable for yacc.
The ultimate definitions for SQL are the standards documents, ANSI X3.135-1989 (which defines SQL itself) and ANSI X3.168-1989 (which defines the way to embed SQL in other programming languages).
SQL is a special purpose language for relational data bases. Rather than manipulating data in memory, it manipulates data in data base tables, referring to memory only incidentally.
A data base is a collection of tables, which are analogous to files. Each table contains rows and columns, which are analogous to records and fields. The rows in a table are not kept in any particular order. You create a set of tables by giving the name and type of each column:
    CREATE SCHEMA
    AUTHORIZATION JOHNL

    CREATE TABLE Foods (
          name CHAR(8) NOT NULL,
          type CHAR(5),
          flavor    CHAR(6),
          PRIMARY KEY ( name )
    )

    CREATE TABLE Courses (
          course      CHAR(8) NOT NULL PRIMARY KEY,
          flavor      CHAR(6),
          sequence INTEGER
    )
The syntax is completely free-format and there are often several different syntactic ways to write the same thing—notice the two different ways we gave the PRIMARY KEY specifier. (The
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 6: A Reference for Lex Specifications
In this chapter, we discuss the format of the lex specification and describe the features and options available. This chapter summarizes the capabilities demonstrated in previous chapters and covers features that have not been discussed.
After the section on the structure of a lex program, the sections in this chapter are in alphabetical order by feature.
A lex program consists of three parts: the definition section, the rules section, and the user subroutines.
    ...definition section ...
    %%
            
    ... rules section ...
    %%
    ... user subroutines ...
The parts are separated by lines consisting of two percent signs. The first two parts are required, although a part may be empty. The third part and the preceding %% line may be omitted. (This structure is the same as that used by yacc, from which it was copied.)
The definition section can include the literal block, definitions, internal table declarations, start conditions, and translations. (There is a section on each in this reference.) Lines that start with whitespace are copied verbatim to the C file. Typically this is used to include comments enclosed in "/*" and "*/", preceded by whitespace.
The rules section contains pattern lines and C code. A line that starts with whitespace, or material enclosed in "%{" and "%}" is C code. A line that starts with anything else is a pattern line.
C code lines are copied verbatim to the generated C file. Lines at the beginning of the rules section are placed near the beginning of the generated yylex() function, and should be declarations of variables used by code associated with the patterns and initialization code for the scanner. C code lines anywhere else are copied to an unspecified place in the generated C file, and should contain only comments. (This is how you put comments in the rules section outside of actions.)
Pattern lines contain a pattern followed by some whitespace and C code to execute when the input matches the pattern. If the C code is more than one statement or spans multiple lines, it must be enclosed in braces {}.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 7: A Reference for Yacc Grammars
In this chapter, we discuss the format of the yacc grammar and describe the various features and options available. This chapter summarizes the capabilities demonstrated in the examples in previous chapters and covers features not yet mentioned.
After the section on the structure of a yacc grammar, the sections in this chapter are in alphabetical order by feature.
A yacc grammar consists of three sections: the definition section, the rules section, and the user subroutines section.
    ... definition section ...
    %%
    ... rules section ...
    %%
    ... user subroutines section ...
         
The sections are separated by lines consisting of two percent signs. The first two sections are required, although a section may be empty. The third section and the preceding "%%" line may be omitted. (Lex uses the same structure.)
A yacc grammar is constructed from symbols, the "words" of the grammar. Symbols are strings of letters, digits, periods, and underscores that do not start with a digit. The symbol error is reserved for error recovery, otherwise yacc attaches no a priori meaning to any symbol.
Symbols produced by the lexer are called terminal symbols or tokens. Those that are defined on the left-hand side of rules are called nonterminal symbols or non-terminals. Tokens may also be literal quoted characters. (See "Literal Tokens.") A widely-followed convention makes token names all uppercase and non-terminals lowercase. We follow that convention throughout the book.
The definition section can include a literal block, C code copied verbatim to the beginning of the generated C file, usually containing declaration and #include lines. There may be %union, %start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration," "Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator Declarations.") It can also contain comments in the usual C format, surrounded by "/*" and "*/". All of these are optional, so in a very simple parser the definition section may be completely empty.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 8: Yacc Ambiguities and Conflicts
This chapter focuses on finding and correcting conflicts within a yacc grammar. Conflicts occur when yacc reports shift/reduce and reduce/reduce errors. Finding them can be challenging because yacc points to them in y.output, which we will describe in this chapter, rather than in your yacc grammar file. Before reading this chapter, you should understand the general way that yacc parsers work, described in in Chapter 3, Using Yacc .
To describe what a conflict is in terms of the yacc grammar, we introduce a model of yacc's operation. In this model, a pointer moves through the yacc grammar as each individual token is read. When you start, there is one pointer (represented here as an up-arrow, ↑) at the beginning of the start rule:
    %token A B C
    %%
    start:     ↑ A B C;
As the yacc parser reads tokens, the pointer moves. Say it reads A and B:
    %token A B C
    %%
    start:     A B ↑ C;
At times, there may be more than one pointer because of the alternatives in your yacc grammar. For example, suppose with the following grammar it reads A and B:
    %token A B C D E F
    %%
    start:      x
         |      y;
    x:    A B ↑ C D;
    y:    A B ↑ E F;
(For the rest of the examples in this chapter, we will leave out the %token and the %%.) There are two ways for pointers to disappear. One is for a token to eliminate one or more pointers because only one still matches the input. If the next token that yacc reads is C, the second pointer will disappear, and the first pointer advances:
    start:      x
         |      y;
    x:    A B C ↑ D;
    y:    A B E F;
The other way for pointers to disappear is for them to merge in a common subrule. In this example, z appears in both x and y:
    start:      x
         |      y;
    x:    A B z R;
    y:    A B z S;
    z:C D
After reading A, there are two pointers:
    start:      x
         |      y;
    x:    A ↑ B z R;
    y:    A ↑ B z S;
    z:    C D
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 9: Error Reporting and Recovery
The previous two chapters discussed techniques for finding errors within yacc grammars. In this chapter, we turn our attention to the other side of error correction and detection—how the parser and lexical analyzer detect errors. This chapter presents some techniques to incorporate error detection and reporting into the grammar. To ground the discussion in a complete example, we will refer to the menu generation language defined in Chapter 4, A Menu Generation Language .
Yacc provides the error token and the yyerror routine, which are typically sufficient for early versions of a tool. However, as any program begins to mature, especially a programming tool, it becomes important to provide better error recovery, which allows for detection of errors in the later portions of the file, and better error reporting.
Error reporting should give as much detail about the error as is possible. The default yacc error only declares that a syntax error exists and to stop parsing. In our examples, we typically added a mechanism for reporting the line number. This provides the location of the error but does not report any other errors within the file or where in the specified line the error occurs.
It is best to categorize the possible errors, perhaps building an array of error types and defining symbolic constants to identify the errors. For example, in the MGL a possible error is to fail to terminate a string. Another error might be using the wrong type of string (quoted string instead of an identifier or vice versa). At a minimum, the MGL should report:
  • General syntactic errors (e.g., a line that makes no sense)
  • A nonterminated string
  • The wrong type of string (quoted instead of unquoted or vice versa)
  • A premature end-of-file
  • Duplicate names used
Our existing mechanism that reports a syntax error with the line number is a good one; if we cannot identify the error, we will use this as a fallback. We will place other more specific error reports where we recognize the possibility of such an error. In general, this should be enough to point out the offending line in the input file, which in turn is often enough to determine the nature of the error.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : AT&T Lex
AT&T lex is the most common version found on UNIX systems. If you're not sure which version of lex you have, try running a lexer through it with the -v flag. If the produces a terse two-line summary like this, it's AT&T lex:
    5/2000 nodes(%e), 16/5000 positions(%p), 5/2500 (%n),
    4 transitions, 0/1000 packed char classes(%k), 6/5000
    packed transitions(%a), 113/5000 output slots(%o)
If produces a page of statistics with lex's version number on the first line, it's flex.
Lex processes a specification file and generates source code for a lexical analyzer. By convention, the specification file has a .l extension. The file that lex generates is named lex.yy.c.
The syntax of the AT&T lex command is:
lex [options] file
where options are as follows:
-c
Writes the lexer in C (default). The obsolescent flag is not present in many versions.
-n
Don't print the summary line with the table sizes. This is the default unless the definition section changes the size of one of lex's internal tables.
-r
Actions are written in RATFOR, a dialect of FORTRAN. This option no longer works in most versions of lex, and is not even present in many of them.
-t
Source code is sent to standard output instead of to the default file lex.yy.c. This is useful in Makefiles and shell scripts that direct the output of lex to a named file.
-v
Generates a one-line statistical summary of the finite state machine. This option is implied when any of the tables sizes are specified in the definitions section of the lex specification.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : AT&T Yacc
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : Berkeley Yacc
Berkeley yacc is a nearly exact reimplementation of AT&T yacc with few extra features.

Options

Error Messages

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : GNU Bison
The GNU project's yacc replacement is called bison. Briefly, GNU (Gnu's Not UNIX) is the project of the Free Software Foundation and is an attempt to create a UNIX-like operating system with source code available publicly (although GNU is not public domain, it is freely available and has a license intended to keep it freely available). Hence, bison is available to anyone. For more information on how to obtain bison, GNU, or the Free Software Foundation, contact:
         Free Software Foundation, Inc.
         675 Massachusetts Avenue
         Cambridge, MA 02139
         (617) 876-3296
      
Users with access to the Internet can FTP bison and all other GNU software from prep.ai.mit.edu in the directory /pub/gnu.
Parsers generated with some versions of bison are subject to the GNU "copyleft" software license which sets conditions on the distribution of GNU and GNU-derived software. If you plan to use bison to develop a program distributed to others, be sure to check the file COPYING included with the bison distribution to see if you agree to the terms.
This description reflects bison version 1.18, which was released in May 1992.

Differences

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : Flex
A freely available version of lex is flex. It is the version of lex distributed with 4.4BSD and by the GNU project. Internet users can also FTP it from ftp.ee.lbl.gov. The most significant advantages of flex are that it is much more reliable than AT&T lex, generates faster lexical analyzers, and does not have lex's limitations upon table size. Flex may be redistributed with no requirements other than reproducing the authors' copyright and disclaimer, and there are no distribution restrictions at all on flex scanners.
Flex is highly compatible with lex. Some AT&T lex scanners will need to be modified to work with flex, as detailed below.
This description reflects flex version 2.3.7, released in March 1991.

Flex Differences

Options

Error Messages

Flex Versions of Lexer Examples

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : MKS lex and yacc
Mortice Kern Systems has a lex and yacc package that runs under MS-DOS and OS/2. It includes an excellent 450 page manual, so in this discussion concentrates on the differences between MKS and other implementations. It is available from:
         Mortice Kern Systems
         35 King Street North
         Waterloo, ON N2J2W9
         Canada
         Phone: +1 519 884 2251
         or in the U.S. (800) 265-2797
E-mail: inquiry@mks.com
      

Differences

New Features

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : Abraxas lex and yacc
Abraxas Software offers pcyacc, which contains pcyacc and pclex, MS-DOS and OS/2 versions of yacc and lex. It is available from:
         Abraxas Software
         7033 SW Macadam Avenue
         Portland OR 97219
         Phone: +1 503 244 5253
      
Pclex is based on flex, so much of what we have said about flex also applies to pclex.

Differences

New Features

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : POSIX lex and yacc
The IEEE POSIX P1003.2 standard will define portable standard versions of lex and yacc. In nearly all cases the standards merely codify long-standing existing practice. POSIX lex closely resembles flex, minus the more exotic features. POSIX yacc closely resembles AT&T or Berkeley yacc. The input and output filenames are identical to those in flex and AT&T yacc.

Options

Differences

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : MGL Compiler Code
Chapter 4, A Menu Generation Language , presented the lex and yacc grammars for the MGL. Here we present the entire code of the MGL, including runtime support code not shown in the chapter. Many improvements to the runtime code are possible, such as:
  • Screen clearing after improper input.
  • Better handling of the system() call, particularly saving and restoring terminal modes and screen contents.
  • Automatic generation of a main routine. Currently, it must be defined outside the calling program. Furthermore, it must call the routine menu_cleanup before exiting.
  • More flexible command handling, e.g., allowing unique prefixes of commands rather than requiring the whole command.
  • Taking keyboard input a character at a time in cbreak mode rather than the current line at a time.
  • More flexible nesting of menus.
See the Preface for information on obtaining an online copy of this code.

MGL Yacc Source

MGL Lex Source

Supporting C Code

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : SQL Parser Code
Here we display the complete code for the embedded SQL translator, including the lexer, the parser, and the supporting C code. Since the parser is so long, we have numbered the lines and included a cross-reference of all of the symbols by line number at the end.
The main() and yyerror() routines are at the end of the lex scanner.

Yacc Parser

Cross-reference

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Appendix : SQL Parser Code
Recycle this page

Lex Scanner

Supporting Code

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!

Return to lex & yacc