Chapter 69. Lexical Analysis

Chip Salzenberg

It’s been said that “the only program that can parse Perl is perl.” If you’ve ever tried to get a smart editor like Emacs to properly indent your Perl program, you’ll probably agree. And while Ilya Zakharevich has made great strides with cperl-mode.el, Perl’s syntax is still more complex and exception-ridden than most.

Now, ask yourself: given that Perl’s syntax is riddled with oddities, exceptions, and attempts to do what you mean instead of what you say, what bizarre twists and turns must a program take to understand it? You’re about to find out.

Tokenizing

Lexical analysis consists of turning a source file—a single unbroken stream of characters—into discrete units, called tokens. (That’s why lexical analysis is often called tokenizing.) Tokens are the fundamental units of a programming language. Typical tokens are identifiers like foo, literal strings like "bar", and operator names or symbols like print or +.
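To make that concrete, here is a minimal sketch of a tokenizer, written in Perl rather than C, and nothing like Perl's real one: it recognizes only identifiers, double-quoted strings, integers, and a few single-character operators.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A toy tokenizer: walk the source string left to right,
    # emitting one token per match. (Illustrative only; Perl's
    # real tokenizer in toke.c is vastly more involved.)
    my $source = 'print "bar" + 42';

    while ($source =~ /\G\s*(
          [A-Za-z_]\w*        # identifier or named operator, e.g. print
        | "(?:[^"\\]|\\.)*"   # double-quoted string literal
        | \d+                 # integer literal
        | [-+*\/=]            # single-character operator
    )/gcx) {
        print "token: $1\n";
    }

Run on the string above, it prints the tokens print, "bar", +, and 42, one per line.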

The next stage after lexical analysis, called parsing, takes those tokens and, based on their context, figures out what they mean. After all, foo might be a subroutine name, a filehandle, or even a variable name if it follows a dollar sign. The full glory of parsing is discussed in Chapter 21.
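A quick, hypothetical illustration: the same spelling can name three different things in one program, and only the surrounding context tells the parser which is meant. (The filehandle is uppercased below, as convention dictates, to keep it from colliding with the subroutine.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub foo { "a subroutine" }        # foo as a subroutine name
    my $foo = "a variable";           # foo as a variable, after a sigil
    open FOO, '<', $0 or die $!;      # FOO as a bareword filehandle

    print foo(), "\n";                # parsed as a subroutine call
    print $foo, "\n";                 # parsed as a scalar variable
    print scalar <FOO>;               # parsed as a filehandle read
    close FOO;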

Lexical analysis of Perl is a seriously hairy job, and toke.c contains some seriously hairy code. (Like the rest of Perl, the tokenizer is written in C.) You’d probably find an exhaustive treatment of its ins and outs to be, well, exhausting. ...
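One small taste of that hairiness (my example, not the chapter's): the meaning of a bare / depends on what the tokenizer already knows about the token that precedes it.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The same character, "/", tokenizes two different ways.
    my $half = time / 2;           # time takes no arguments, so "/" is division
    my @parts = split /2/, "1234"; # after split, "/" opens a regex match

    print "$half\n";               # half the current epoch time
    print "@parts\n";              # prints "1 34"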
