Chapter 1. A Regular Expression Matcher
Brian Kernighan
Regular expressions are notations for describing patterns of text and, in effect, make up a special-purpose language for pattern matching. Although there are myriad variants, all share the idea that most characters in a pattern match literal occurrences of themselves, but some metacharacters have special meaning, such as * to indicate some kind of repetition or […] to mean any one character from the set within the brackets.
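As a small illustration of these ideas, here is a minimal sketch using Python's standard re module. That dialect is a choice made here for illustration only, and the sample patterns and strings are invented; in this dialect, * means zero or more repetitions of the item before it.

```python
import re

# Most characters match literal occurrences of themselves.
literal = re.search(r"cat", "concatenate")

# '*' repeats the preceding item zero or more times (in this dialect).
star = re.fullmatch(r"ab*c", "abbbc")
star_zero = re.fullmatch(r"ab*c", "ac")

# '[...]' matches any one character from the set within the brackets.
bracket = re.fullmatch(r"b[aeiou]g", "bug")
bracket_miss = re.fullmatch(r"b[aeiou]g", "bxg")

for name, m in [("literal", literal), ("star", star), ("star_zero", star_zero),
                ("bracket", bracket), ("bracket_miss", bracket_miss)]:
    print(name, "matched" if m else "no match")
```

Only the last case fails to match, because x is not in the bracketed set.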
In practice, most searches in programs such as text editors are for literal words, so the regular expressions are often literal strings like print, which will match printf or sprint or printer paper anywhere. In so-called wildcards used to specify filenames in Unix and Windows, a * matches any number of characters, so the pattern *.c matches all filenames that end in .c. There are many, many variants of regular expressions, even in contexts where one would expect them to be the same. Jeffrey Friedl’s Mastering Regular Expressions (O’Reilly) is an exhaustive study of the topic.
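The difference between a literal search and a filename wildcard can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard re and fnmatch modules as stand-ins for an editor search and a shell wildcard, and the sample strings and filenames are made up.

```python
import re
from fnmatch import fnmatchcase

# A literal pattern is found anywhere in the text, not only as a whole word.
texts = ["printf", "sprint", "printer paper", "pint"]
hits = [t for t in texts if re.search("print", t)]

# In a filename wildcard, '*' by itself stands for any number of characters,
# so "*.c" selects the names that end in ".c".
names = ["main.c", "regex.c", "main.h", "notes.txt"]
c_files = [n for n in names if fnmatchcase(n, "*.c")]

print(hits)
print(c_files)
```

Note that the same character, *, plays two different roles: in the wildcard it matches on its own, whereas in the regular-expression dialect used above it only modifies the item before it.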
Stephen Kleene invented regular expressions in the mid-1950s as a notation for finite automata; in fact, they are equivalent to finite automata in what they represent. They first appeared in a program setting in Ken Thompson’s version of the QED text editor in the mid-1960s. In 1967, Thompson applied for a patent on a mechanism for rapid text matching based on regular expressions. The patent was granted in 1971, one of the very first ...