Effective awk Programming, Third Edition: Text Processing and Pattern Matching

By Arnold Robbins



Table of Contents

Chapter 1: Getting Started with awk
The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk keeps processing input lines in this way until it reaches the end of the input files.
Programs in awk are different from programs in most other languages, because awk programs are data-driven; that is, you describe the data you want to work with and then what to do when you find it. Most other languages are procedural; you have to describe, in great detail, every step the program is to take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to read and write.
When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules. (It may also contain function definitions, an advanced feature that we will ignore for now. See Section 8.2 in Chapter 8.) Each rule specifies one pattern to search for and one action to perform upon finding the pattern.
Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in curly braces to separate it from the pattern. Newlines usually separate rules. Therefore, an awk program looks like this:
pattern { action }
pattern { action }
...
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
How to Run awk Programs
There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:
awk 'program' input-file1 input-file2 ...
When the program is long, it is usually more convenient to put it in a file and run it with a command like this:
awk -f program-file input-file1 input-file2 ...
This section discusses both mechanisms, along with several variations of each.
Once you are familiar with awk, you will often type in simple programs the moment you want to use them. Then you can write the program as the first argument of the awk command, like this:
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and actions, as described earlier.
This command format instructs the shell, or command interpreter, to start awk and use the program to process records in the input file(s). There are single quotes around program so the shell won't interpret any awk characters as special shell characters. The quotes also cause the shell to treat all of program as a single argument for awk, and allow program to be more than one line long.
This format is also useful for running short or medium-sized awk programs from shell scripts, because it avoids the need for a separate file for the awk program. A self-contained shell script is more reliable because there are no other files to misplace.
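As a rough illustration of the two mechanisms, the following sketch runs the same one-rule program first as a quoted command-line argument and then from a program file loaded with -f. The scratch directory, the filenames data.txt and prog.awk, and the three sample records are invented for this demonstration:

```shell
# Scratch directory and filenames are hypothetical, used only for this demo.
tmpdir=$(mktemp -d)
printf 'alpha 1\nbeta 2\ngamma 3\n' > "$tmpdir/data.txt"

# Mechanism 1: the program is the first argument, protected by single quotes
# so the shell passes $2 and $1 through to awk untouched.
inline_out=$(awk '$2 > 1 { print $1 }' "$tmpdir/data.txt")

# Mechanism 2: the same program stored in a file and loaded with -f.
printf '%s\n' '$2 > 1 { print $1 }' > "$tmpdir/prog.awk"
file_out=$(awk -f "$tmpdir/prog.awk" "$tmpdir/data.txt")

echo "$inline_out"
echo "$file_out"
rm -rf "$tmpdir"
```

Both commands should print beta and gamma, one per line, because both hand awk the same program text; only the way the text reaches awk differs.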
Section 1.3 later in this chapter presents several short, self-contained programs.
You can also run awk without any input files. If you type the following command line:
awk 'program'
awk applies the program
Additional content appearing in this section has been removed.
Datafiles for the Examples
Many of the examples in this book take their input from two sample datafiles. The first, BBS-list, represents a list of computer bulletin-board systems together with information about those systems. The second datafile, called inventory-shipped, contains information about monthly shipments. In both files, each line is considered to be one record.
In the datafile BBS-list, each record contains the name of a computer bulletin board, its phone number, the board's baud rate(s), and a code for the number of hours it is operational. An A in the last column means the board operates 24 hours a day. A B in the last column means the board operates only on evening and weekend hours. A C means the board operates only on weekends:
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
The datafile inventory-shipped represents information about shipments during the year. Each record contains the month, the number of green crates shipped, the number of red boxes shipped, the number of orange bags shipped, and the number of blue packages shipped, respectively. There are 16 entries, covering the 12 months of last year and the first 4 months of the current year:
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
Additional content appearing in this section has been removed.
Some Simple Examples
The following command runs a simple awk program that searches the input file BBS-list for the character string foo (a grouping of characters is usually called a string; the term string is based on similar usage in English, such as "a string of pearls," or "a string of cars in a train"):
awk '/foo/ { print $0 }' BBS-list
When lines containing foo are found, they are printed because print $0 means print the current line. (Just print by itself means the same thing, so we could have written that instead.)
You will notice that slashes (/) surround the string foo in the awk program. The slashes indicate that foo is the pattern to search for. This type of pattern is called a regular expression, which is covered in more detail later (see Chapter 2). The pattern is allowed to match parts of words. There are single quotes around the awk program so that the shell won't interpret any of it as special shell characters.
Here is what this program prints:
$ awk '/foo/ { print $0 }' BBS-list
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sabafoo      555-2127     1200/300          C
In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.
Thus, we could leave out the action (the print statement and the curly braces) in the previous example and the result would be the same: all lines matching the pattern foo are printed. By comparison, omitting the print statement but retaining the curly braces makes an empty action that does nothing (i.e., no lines are printed).
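The three variants described here can be compared side by side. This is a minimal sketch that assumes a three-record excerpt of BBS-list written to a scratch file; the shell variable names are ours, chosen for illustration:

```shell
# A three-record excerpt of BBS-list, written to a scratch file for the demo.
bbs=$(mktemp)
cat > "$bbs" <<'EOF'
aardvark     555-5553     1200/300          B
fooey        555-1234     2400/1200/300     B
macfoo       555-6480     1200/300          A
EOF

with_action=$(awk '/foo/ { print $0 }' "$bbs")   # explicit action
pattern_only=$(awk '/foo/' "$bbs")               # action omitted: default print
empty_action=$(awk '/foo/ { }' "$bbs")           # empty braces: nothing printed

[ "$with_action" = "$pattern_only" ] && echo "pattern-only rule prints the same lines"
[ -z "$empty_action" ] && echo "empty action prints nothing"
rm -f "$bbs"
```

The first two commands select the fooey and macfoo records; the third matches the same records but produces no output, because its action is present and empty.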
Many practical awk programs are just a line or two. Following is a collection of useful, short programs to get you started. Some of these programs contain constructs that haven't been covered yet. (The description of the program will give you a good idea of what is going on, but please read the rest of the book to become an
Additional content appearing in this section has been removed.
An Example with Two Rules
The awk utility reads the input files one line at a time. For each line, awk tries the patterns of each of the rules. If several patterns match, then several actions are run in the order in which they appear in the awk program. If no patterns match, then no actions are run.
After processing all the rules that match the line (and perhaps there are none), awk reads the next line. (However, see Section 6.4.7 and Section 6.4.8 in Chapter 6.) This continues until the program reaches the end of the file. For example, the following awk program contains two rules:
/12/  { print $0 }
/21/  { print $0 }
The first rule has the string 12 as the pattern and print $0 as the action. The second rule has the string 21 as the pattern and also has print $0 as the action. Each rule's action is enclosed in its own pair of braces.
This program prints every line that contains the string 12 or the string 21. If a line contains both strings, it is printed twice, once by each rule.
This is what happens if we run this program on our two sample datafiles, BBS-list and inventory-shipped:
$ awk '/12/ { print $0 }
>      /21/ { print $0 }' BBS-list inventory-shipped
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
sabafoo      555-2127     1200/300          C
Jan  21  36  64 620
Apr  21  70  74 514
Note how the line beginning with sabafoo in BBS-list was printed twice, once for each rule.
Additional content appearing in this section has been removed.
A More Complex Example
Now that we've mastered some simple tasks, let's look at what typical awk programs do. This example shows how awk can be used to summarize, select, and rearrange the output of another utility. It uses features that haven't been covered yet, so don't worry if you don't understand all the details:
ls -l | awk '$6 == "Nov" { sum += $5 }
             END { print sum }'
This command prints the total number of bytes in all the files in the current directory that were last modified in November (of any year). The ls -l part of this example is a system command that gives you a listing of the files in a directory, including each file's size and the date the file was last modified. Its output looks like this:
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the owner of the file. The fourth field identifies the group of the file. The fifth field contains the size of the file in bytes. The sixth, seventh, and eighth fields contain the month, day, and time, respectively, that the file was last modified. Finally, the ninth field contains the name of the file.
The $6 == "Nov" in our awk program is an expression that tests whether the sixth field of the output from ls -l matches the string Nov. Each time a line has the string Nov for its sixth field, the action sum += $5 is performed. This adds the fifth field (the file's size) to the variable sum. As a result, when awk has finished reading all the input lines,
Additional content appearing in this section has been removed.
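To try the summing rule without depending on the contents of a real directory, one option is to feed awk a canned copy of the ls -l listing shown above. The scratch file is an assumption made for this sketch; the awk program itself is the one from the example:

```shell
# Canned copy of the sample ls -l listing, so the result does not depend
# on the current directory. The scratch file is hypothetical.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
EOF

# Add up field 5 (size) for records whose field 6 (month) is Nov;
# the END rule runs once, after the last record has been read.
total=$(awk '$6 == "Nov" { sum += $5 }
             END { print sum }' "$listing")
echo "$total"    # 1933 + 10809 + 22414 + 37455 + 7989 = 80600
rm -f "$listing"
```

Five of the eight records have Nov in the sixth field, so the END rule prints the sum of those five sizes.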
awk Statements Versus Lines
Most often, each line in an awk program is a separate statement or separate rule, like this:
awk '/12/  { print $0 }
     /21/  { print $0 }' BBS-list inventory-shipped
However, gawk ignores newlines after any of the following symbols and keywords:
,    {    ?    :    ||    &&    do    else
A newline at any other point is considered the end of the statement.
If you would like to split a single statement into two lines at a point where a newline would terminate it, you can continue it by ending the first line with a backslash character (\). The backslash must be the final character on the line in order to be recognized as a continuation character. A backslash is allowed anywhere in the statement, even in the middle of a string or regular expression. For example:
awk '/This regular expression is too long, so continue it\
 on the next line/ { print $1 }'
We have generally not used backslash continuation in the sample programs in this book. In gawk, there is no limit on the length of a line, so backslash continuation is never strictly necessary; it just makes programs more readable. For this same reason, as well as for clarity, we have kept most statements short in the sample programs presented throughout the book. Backslash continuation is most useful when your awk program is in a separate source file instead of entered from the command line.
You should also note that many awk implementations are more particular about where you may use backslash continuation. For example, they may not allow you to split a string constant using backslash continuation. Thus, for maximum portability of your awk programs, it is best not to split your lines in the middle of a regular expression or a string.
Backslash continuation does not work as described with the C shell.
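The following sketch shows both ways of spreading a statement over two lines, using a program file so that shell quoting does not get in the way. It assumes a Bourne-style shell; the scratch files and the three records taken from inventory-shipped are set up only for this demonstration:

```shell
prog=$(mktemp)
data=$(mktemp)
cat > "$data" <<'EOF'
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
EOF

cat > "$prog" <<'EOF'
# A newline right after && does not end the pattern.
$2 > 13 &&
$3 > 30      { print $1, "busy" }

# A newline before + would end the statement, so a trailing backslash
# is needed to continue it.
$1 == "Mar"  { total = $2 + $3 \
                     + $4
               print $1, total }
EOF

out=$(awk -f "$prog" "$data")
echo "$out"
rm -f "$prog" "$data"
```

Only the Feb record satisfies both halves of the first pattern, and the second rule adds fields two through four of the Mar record (15 + 24 + 34), so the program should print Feb busy followed by Mar 73.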
Additional content appearing in this section has been removed.
Other Features of awk
The awk language provides a number of predefined, or built-in, variables that your programs can use to get information from awk. There are other variables your program can set as well to control how awk processes your data.
In addition, awk provides a number of built-in functions for doing common computational and string-related operations. gawk provides built-in functions for working with timestamps, performing bit manipulation, and for runtime string translation.
As we develop our presentation of the awk language, we introduce most of the variables and many of the functions. They are defined systematically in Section 6.5 in Chapter 6 and Section 8.1 in Chapter 8.
Additional content appearing in this section has been removed.
When to Use awk
Now that you've seen some of what awk can do, you might wonder how awk could be useful for you. By using utility programs, advanced patterns, field separators, arithmetic statements, and other selection criteria, you can produce much more complex output. The awk language is very useful for producing reports from large amounts of raw data, such as summarizing information from the output of other utility programs like ls. (See Section 1.5 earlier in this chapter.)
Programs written with awk are usually much smaller than they would be in other languages. This makes awk programs easy to compose and use. Often, awk programs can be quickly composed at your terminal, used once, and thrown away. Because awk programs are interpreted, you can avoid the (usually lengthy) compilation part of the typical edit-compile-test-debug cycle of software development.
Complex programs have been written in awk, including a complete retargetable assembler for eight-bit microprocessors (see the Glossary for more information), and a microcode assembler for a special-purpose Prolog computer. However, awk's capabilities are strained by tasks of such complexity.
If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language. Emacs Lisp is a good choice if you need sophisticated string or pattern matching capabilities. The shell is also good at string and pattern matching; in addition, it allows powerful use of the system utilities. More conventional languages, such as C, C++, and Java, offer better facilities for system programming and for managing the complexity of large programs. Programs in these languages may require more lines of source code than the equivalent awk programs, but they are easier to maintain and usually run more efficiently.
Additional content appearing in this section has been removed.
Chapter 2: Regular Expressions
A regular expression, or regexp, is a way of describing a set of strings. Because regular expressions are such a fundamental part of awk programming, their format and use deserve a separate chapter.
A regular expression enclosed in slashes (/) is an awk pattern that matches every input record whose text belongs to that set. The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp foo matches any string containing foo. Therefore, the pattern /foo/ matches any input record containing the three characters foo anywhere in the record. Other kinds of regexps let you specify more complicated classes of strings.
Initially, the examples in this chapter are simple. As we explain more about how regular expressions work, we will present more complicated instances.
How to Use Regular Expressions
A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, the following prints the second field of each record that contains the string foo anywhere in it:
$ awk '/foo/ { print $2 }' BBS-list
555-1234
555-6699
555-6480
555-2127
Regular expressions can also be used in matching expressions. These expressions allow you to specify the string to match against; it need not be the entire current input record. The two operators ~ and !~ perform regular expression comparisons. Expressions using these operators can be used as patterns, or in if, while, for, and do statements. (See Section 6.4 in Chapter 6.) For example:
exp ~ /regexp/
is true if the expression exp (taken as a string) matches regexp. The following example matches, or selects, all input records with the uppercase letter J somewhere in the first field:
$ awk '$1 ~ /J/' inventory-shipped
Jan  13  25  15 115
Jun  31  42  75 492
Jul  24  34  67 436
Jan  21  36  64 620
So does this:
awk '{ if ($1 ~ /J/) print }' inventory-shipped
This next example is true if the expression exp (taken as a character string) does not match regexp:
exp !~ /regexp/
The following example matches, or selects, all input records whose first field does not contain the uppercase letter J:
$ awk '$1 !~ /J/' inventory-shipped
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
...
When a regexp is enclosed in slashes, such as /foo/, we call it a regexp constant, much like 5.27
Additional content appearing in this section has been removed.
Escape Sequences
Some characters cannot be included literally in string constants ("foo") or regexp constants (/foo/). Instead, they should be represented with escape sequences, which are character sequences beginning with a backslash (\). One use of an escape sequence is to include a double-quote character in a string constant. Because a plain double quote ends the string, you must use \" to represent an actual double-quote character as a part of the string. For example:
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
He said "hi!" to her.
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written "\"\\".
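A quick way to check the two-character string described above is to print its length along with its contents. This is only a sketch; the variable name s is ours:

```shell
# Inside the single-quoted awk program, the string constant "\"\\" denotes
# exactly two characters: a double quote followed by a backslash.
out=$(awk 'BEGIN { s = "\"\\"; print length(s), s }')
echo "$out"
```

The output should be the number 2, a space, and then the two characters " and \. The shell's single quotes pass every backslash to awk unchanged, so it is awk alone that interprets the escape sequences.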
Backslash also represents unprintable characters such as tab or newline. While there is nothing to stop you from entering most unprintable characters directly in a string constant or regexp constant, they may look ugly.
The following list describes all the escape sequences used in awk and what they represent. Unless noted otherwise, all these escape sequences apply to both string constants and regexp constants:
\\
A literal backslash, \.
\a
The "alert" character, Ctrl-g, ASCII code 7 (BEL). (This usually makes some sort of audible noise.)
\b
Backspace, Ctrl-h, ASCII code 8 (BS).
Additional content appearing in this section has been removed.
Regular Expression Operators
You can combine regular expressions with special characters, called regular expression operators or metacharacters, to increase the power and versatility of regular expressions.
The escape sequences described in Section 2.2 earlier in this chapter are valid inside a regexp. They are introduced by a \ and are recognized and converted into corresponding real characters as the first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape sequences and that are not listed here stand for themselves:
\
This is used to suppress the special meaning of a character when matching. For example, \$ matches the character $.
^
This matches the beginning of a string. For example, ^@chapter matches @chapter at the beginning of a string and can be used to identify chapter beginnings in Texinfo source files. The ^ is known as an anchor, because it anchors the pattern to match only at the beginning of the string.
It is important to realize that ^ does not match the beginning of a line embedded in a string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /^L/) ...
$
This is similar to ^, but it matches only at the end of a string. For example, p$ matches a record that ends with a p. The $ is an anchor and does not match the end of a line embedded in a string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /1$/) ...
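The behavior of both anchors can be checked directly. In this sketch the two conditions from the text are evaluated along with two that do succeed; the variable names are hypothetical, and a match result prints as 1 (true) or 0 (false):

```shell
out=$(awk 'BEGIN {
    s = "line1\nLINE 2"          # one string with an embedded newline
    starts_L = (s ~ /^L/)        # 0: the string starts with "l", not "L"
    ends_1   = (s ~ /1$/)        # 0: the string ends with "2", not "1"
    starts_l = (s ~ /^l/)        # 1: anchored at the real start
    ends_2   = (s ~ /2$/)        # 1: anchored at the real end
    print starts_L, ends_1, starts_l, ends_2
}')
echo "$out"
```

The expected output is 0 0 1 1: the anchors apply only to the two ends of the whole string, never to the newline in the middle.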
Additional content appearing in this section has been removed.
Using Character Lists
Within a character list, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, using the locale's collating sequence and character set. For example, in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales, [a-dx-z] is typically not equivalent to [abcdxyz]; instead it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
To include one of the characters \, ], -, or ^ in a character list, put a \ in front of it. For example:
[d\]]
matches either d or ].
This treatment of \ in character lists is compatible with other awk implementations and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.
A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of [:, a keyword denoting the class, and :]. Table 2-1 lists the character classes defined by the POSIX standard.
Table 2-1:
Additional content appearing in this section has been removed.
gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (_):
\w
Matches any word-constituent character -- that is, it matches any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].
\W
Matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].
\<
Matches the empty string at the beginning of a word. For example, /\<away/ matches away but not stowaway.
\>
Matches the empty string at the end of a word. For example, /stow\>/ matches stow but not stowaway.
\y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, \yballs?\y matches either ball or balls, as a separate word.
\B
Additional content appearing in this section has been removed.
Case Sensitivity in Matching
Case is normally significant in regular expressions, both when matching ordinary characters (i.e., not metacharacters) and inside character sets. Thus, a w in a regular expression matches only a lowercase w and not an uppercase W.
The simplest way to do a case-independent match is to use a character list -- for example, [Ww]. However, this can be cumbersome if you need to use it often, and it can make the regular expressions harder to read. There are two alternatives that you might prefer.
One way to perform a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower or toupper built-in string functions (which we haven't discussed yet; see Section 8.1.3 in Chapter 8). For example:
tolower($1) ~ /foo/  { ... }
converts the first field to lowercase before matching against it. This works in any POSIX-compliant awk.
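Here is a small, self-contained run of that rule. The three input records are made up for the example:

```shell
# Match "foo" in the first field regardless of case, then print the field
# as it appeared in the input.
out=$(printf 'FooBar 1\nbaz 2\nFOOD 3\n' |
      awk 'tolower($1) ~ /foo/ { print $1 }')
echo "$out"
```

This should print FooBar and FOOD. Note that tolower returns a lowercase copy; $1 itself is not modified, which is why the original capitalization appears in the output.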
Another method, specific to gawk, is to set the variable IGNORECASE to a nonzero value (see Section 6.5 in Chapter 6). When IGNORECASE is not zero, all regexp and string operations ignore case. Changing the value of IGNORECASE dynamically controls the case-sensitivity of the program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero:
x = "aB"
if (x ~ /ab/) ...   # this test will fail

IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed
In general, you cannot use IGNORECASE to make certain rules case-insensitive and other rules case-sensitive, because there is no straightforward way to set IGNORECASE just for the pattern of a particular rule. To do this, use either character lists or tolower. However, one thing you can do with IGNORECASE only is dynamically turn case-sensitivity on or off for all the rules at once.
Additional content appearing in this section has been removed.
How Much Text Matches?
Consider the following:
echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
This example uses the sub function (which we haven't discussed yet; see Section 8.1.3 in Chapter 8) to make a change to the input record. Here, the regexp /a+/ indicates "one or more a characters," and the replacement text is <A>.
The input contains four a characters. awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match. Thus, all four a characters are replaced with <A> in this example:
$ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
<A>bcd
For simple match/no-match tests, this is not so important. But when doing text matching and substitutions with the match, sub, gsub, and gensub functions, it is very important. Understanding this principle is also important for regexp-based record and field splitting (see Section 3.1 and Section 3.5 in Chapter 3).
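The "leftmost" half of the rule takes priority over the "longest" half, which a second input makes visible. In this sketch (the input string is invented), the record contains a run of three a characters followed later by a run of five:

```shell
# sub replaces only the first match: the leftmost run, taken at full length.
first=$(echo 'xaaay aaaaa' | awk '{ sub(/a+/, "<A>"); print }')
# gsub replaces every match; each run is still consumed as a whole.
every=$(echo 'xaaay aaaaa' | awk '{ gsub(/a+/, "<A>"); print }')
echo "$first"
echo "$every"
```

The first command should print x<A>y aaaaa, because the earlier run wins even though the later one is longer. The second should print x<A>y <A>, with each run replaced exactly once rather than once per character.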
Additional content appearing in this section has been removed.
Using Dynamic Regexps
The righthand side of a ~ or !~ operator need not be a regexp constant (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated and converted to a string if necessary; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp:
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp    { print }
This sets digits_regexp to a regexp that describes one or more digits, and tests whether the input record matches this regexp.
When using the ~ and !~ operators, there is a difference between a regexp constant enclosed in slashes and a string constant enclosed in double quotes. If you are going to use a string constant, you have to understand that the string is, in essence, scanned twice: the first time when awk reads your program, and the second time when it goes to match the string on the lefthand side of the operator with the pattern on the right. This is true of any string-valued expression (such as digits_regexp, shown previously), not just string constants.
What difference does it make if the string is scanned twice? The answer has to do with escape sequences, and particularly with backslashes. To get a backslash into a regular expression inside a string, you have to type two backslashes.
For example, /\*/ is a regexp constant for a literal *. Only one backslash is needed. To do the same thing with a string, you have to type "\\*". The first backslash escapes the second one so that the string actually contains the two characters \ and *.
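The two spellings can be compared in a short test. This is a sketch with variable names of our own choosing; it also prints the length of the string to confirm that only one backslash survives the first scan:

```shell
out=$(awk 'BEGIN {
    s = "2*3"
    re = "\\*"                   # string constant: contents are \ and *
    as_const  = (s ~ /\*/)       # regexp constant, one backslash
    as_string = (s ~ re)         # dynamic regexp built from the string
    print length(re), as_const, as_string
}')
echo "$out"
```

The expected output is 2 1 1: the string holds two characters, and both forms match the literal * in the test string.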
Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons:
  • String constants are more complicated to write and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.
Chapter 3: Reading Input Files
In the typical awk program, all input is read either from the standard input (by default, this is the keyboard, but often it is a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, processing all the data from one before going on to the next. The name of the current input file can be found in the built-in variable FILENAME (see Section 6.5 in Chapter 6).
The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.
On rare occasions, you may need to use the getline command. The getline command is valuable, both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the awk command line (see Section 3.8 later in this chapter).
How Input Is Split into Records
The awk utility divides the input for your awk program into records and fields. awk keeps track of the number of records that have been read from the current input file. This value is stored in a built-in variable called FNR. It is reset to zero when a new file is started. Another built-in variable, NR, is the total number of input records read so far from all datafiles. It starts at zero, but is never automatically reset to zero.
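As a rough illustration of the two counters, the sketch below runs awk over two small files. The filenames first.txt and second.txt, their contents, and the shell variable names are hypothetical, created only for this demonstration:

```shell
# Two small, hypothetical input files created just for this demonstration.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For each record, print the per-file count (FNR), the running
# total across all files (NR), and the record itself.
counts=$(awk '{ print FNR, NR, $0 }' "$tmpdir/first.txt" "$tmpdir/second.txt")
echo "$counts"

# Clean up the temporary files.
rm -r "$tmpdir"
```

FNR should start again at 1 for the first record of second.txt, while NR keeps counting upward from where the first file left off.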
Records are separated by a character called the record separator. By default, the record separator is the newline character. This is why records are, by default, single lines. A different character can be used for the record separator by assigning the character to the built-in variable RS.
Like any other variable, the value of RS can be changed in the awk program with the assignment operator, = (see Section 5.7 in Chapter 5). The new record-separator character should be enclosed in quotation marks, which indicate a string constant. Often the right time to do this is at the beginning of execution, before any input is processed, so that the very first record is read with the proper separator. To do this, use the special BEGIN pattern (see Section 6.1.4 in Chapter 6). For example:
awk 'BEGIN { RS = "/" }
     { print $0 }' BBS-list
changes the value of RS to "/", before reading any input. This is a string whose first character is a slash; as a result, records are separated by slashes. Then the input file is read, and the second rule in the awk program (the action with no pattern) prints each record. Because each print statement adds a newline at the end of its output, this awk program copies the input with each slash changed to a newline. Here are the results of running the program on BBS-list:
$ awk 'BEGIN { RS = "/" }
>      { print $0 }' BBS-list
aardvark     555-5553     1200
300          B
alpo-net     555-3412     2400
1200
300     A
barfly       555-7685     1200
300          A
bites        555-1675     2400
1200
300     A
camelot      555-0542     300               C
core         555-2912     1200
300          C
fooey        555-1234     2400
1200
300     B
foot         555-6699     1200
300          B
macfoo       555-6480     1200
300          A
sdace        555-3430     2400
1200
300     A
sabafoo      555-2127     1200
300          C

$
Examining Fields
When awk reads an input record, the record is automatically parsed or separated by the interpreter into chunks called fields. By default, fields are separated by whitespace, like words in a line. Whitespace in awk means any string of one or more spaces, tabs, or newlines; other characters, such as formfeed, vertical tab, etc. that are considered whitespace by other languages, are not considered whitespace by awk.
The purpose of fields is to make it more convenient for you to refer to these pieces of the record. You don't have to use them -- you can operate on the whole record if you want -- but fields are what make simple awk programs so powerful.
A dollar-sign ($) is used to refer to a field in an awk program, followed by the number of the field you want. Thus, $1 refers to the first field, $2 to the second, and so on. (Unlike the Unix shells, the field numbers are not limited to single digits. $127 is the one hundred twenty-seventh field in the record.) For example, suppose the following is a line of input:
This seems like a pretty nice example.
Here the first field, or $1, is This, the second field, or $2, is seems, and so on. Note that the last field, $7, is example.. Because there is no space between the e and the ., the period is considered part of the seventh field.
NF is a built-in variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record. No matter how many fields there are, the last field in a record can be represented by $NF. So, $NF is the same as $7, which is example.. If you try to reference a field beyond the last one (such as $8 when the record has only seven fields), you get the empty string. (If used in a numeric operation, you get zero.)
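The behavior of NF, $NF, and an out-of-range field can be checked with a short sketch such as the following. It reuses the sample line from the text; the shell variable names line and summary are assumptions made for illustration:

```shell
# The sample input line used in the discussion above.
line='This seems like a pretty nice example.'

# Print the field count, the last field, field 8 wrapped in brackets
# (to make the empty string visible), and field 8 used as a number.
summary=$(printf '%s\n' "$line" |
    awk '{ print NF; print $NF; print "[" $8 "]"; print $8 + 0 }')
echo "$summary"
```

The brackets are there only because an empty string would otherwise be invisible in the output.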
The use of $0, which looks like a reference to the "zero-th" field, is a special case: it represents the whole input record when you are not interested in specific fields. Here are some more examples:
Non-constant Field Numbers
The number of a field does not need to be a constant. Any expression in the awk language can be used after a $ to refer to a field. The value of the expression specifies the field number. If the value is a string, rather than a number, it is converted to a number. Consider this example:
awk '{ print $NR }'
Recall that NR is the number of records read so far: one in the first record, two in the second, etc. So this example prints the first field of the first record, the second field of the second record, and so on. For the twentieth record, field number 20 is printed; most likely, the record has fewer than 20 fields, so this prints a blank line. Here is another example of using expressions as field numbers:
awk '{ print $(2*2) }' BBS-list
awk evaluates the expression (2*2) and uses its value as the number of the field to print. The * sign represents multiplication, so the expression 2*2 evaluates to four. The parentheses are used so that the multiplication is done before the $ operation; they are necessary whenever there is a binary operator in the field-number expression. This example, then, prints the hours of operation (the fourth field) for every line of the file BBS-list. (All of the awk operators are listed, in order of decreasing precedence, in Section 5.14 in Chapter 5.)
If the field number you compute is zero, you get the entire record. Thus, $(2-2) has the same value as $0. Negative field numbers are not allowed; trying to reference one usually terminates the program. (The POSIX standard does not define what happens when you reference a negative field number. gawk notices this and terminates your program. Other awk implementations may behave differently.)
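A small sketch of computed field numbers follows. The input record is a single line modeled on the BBS-list data and should be treated as an assumption, as should the shell variable names record, whole, fourth, and same:

```shell
# A sample record modeled on one line of BBS-list (an assumption).
record='aardvark 555-5553 1200/300 B'

# A computed field number of zero yields the whole record.
whole=$(printf '%s\n' "$record" | awk '{ print $(2-2) }')

# A computed field number of four yields the fourth field.
fourth=$(printf '%s\n' "$record" | awk '{ print $(2*2) }')

# Compare $(2-2) with $0 directly inside awk.
same=$(printf '%s\n' "$record" |
    awk '{ if ($(2-2) == $0) print "same"; else print "different" }')

echo "$whole"
echo "$fourth"
echo "$same"
```

Negative field numbers are deliberately left out of the sketch, because their behavior differs between awk implementations.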
As mentioned earlier in Section 3.2, awk stores the current record's number of fields in the built-in variable NF (also see Section 6.5 in Chapter 6). The expression $NF is not a special feature -- it is the direct consequence of evaluating
Changing the Contents of a Field
The contents of a field, as seen by awk, can be changed within an awk program; this changes what awk perceives as the current input record. (The actual input is untouched; awk never modifies the input file.) Consider the following example and its output:
$ awk '{ nboxes = $3 ; $3 = $3 - 10
>        print nboxes, $3 }' inventory-shipped
25 15
32 22
24 14
...
The program first saves the original value of field three in the variable nboxes. The - sign represents subtraction, so this program reassigns field three, $3, as the original value of field three minus ten: $3 - 10. (See Section 5.5 in Chapter 5.) Then it prints the original and new values for field three. (Someone in the warehouse made a consistent mistake while inventorying the red boxes.)
For this to work, the text in field $3 must make sense as a number; the string of characters must be converted to a number for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters that then becomes field three. See Section 5.4 in Chapter 5.
When the value of a field is changed (as perceived by awk), the text of the input record is recalculated to contain the new field where the old one was. In other words, $0 changes to reflect the altered field. Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line:
$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
Jan 3 25 15 115
Feb 5 32 24 226
Mar 5 24 34 228
...
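The recalculation of $0 can be observed with a sketch like the one below. The input line is modeled on the first line of inventory-shipped, but its exact spacing is an assumption, as are the shell variable names. When awk rebuilds the record after a field assignment, it joins the fields with the output field separator, which is a single space by default, so runs of spaces in the original line are not preserved:

```shell
# A line modeled on inventory-shipped; the irregular spacing is an assumption.
line='Jan   13  25  15 115'

# Without any field assignment, $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk '{ print $0 }')

# Assigning to $2 makes awk rebuild $0 from its fields,
# joined by single spaces (the default output field separator).
rebuilt=$(printf '%s\n' "$line" | awk '{ $2 = $2 - 10; print $0 }')

echo "$untouched"
echo "$rebuilt"
```

This is why the output shown above has one space between fields, whatever spacing the input file uses.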
It is also possible to assign contents to fields that are out of range. For example:
$ awk '{ $6 = ($5 + $4 + $3 + $2)
>        print $6 }' inventory-shipped
168
297
301
...
We've just created $6, whose value is the sum of fields $2, $3, $4, and $5. The + sign represents addition. For the file inventory-shipped, $6 represents the total number of parcels shipped for a particular month.
Creating a new field changes awk's internal copy of the current input record, which is the value of
Specifying How Fields Are Separated
The field separator, which is either a single character or a regular expression, controls the way awk splits an input record into fields. awk scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.
In the examples that follow, we use a caret (^) to represent spaces in the output. If the field separator is oo, then the following line:
moo goo gai pan
is split into three fields: m, ^g, and ^gai^pan. Note the leading spaces in the values of the second and third fields.
The field separator is represented by the built-in variable FS. Shell programmers take note: awk does not use the name IFS that is used by the POSIX-compliant shells (such as the Unix Bourne shell, sh, or bash).
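The oo example can be tried with a minimal sketch such as the following, which assigns FS in a BEGIN rule so that it takes effect before the first record is read. The shell variable name fields is an assumption, and square brackets are printed around each field (instead of carets) so that the leading spaces are visible:

```shell
# Split the sample line on the separator "oo" and show each field
# inside brackets so that leading spaces can be seen.
fields=$(printf 'moo goo gai pan\n' |
    awk 'BEGIN { FS = "oo" }
         { print NF; for (i = 1; i <= NF; i++) print "[" $i "]" }')
echo "$fields"
```

The first output line is the field count; each later line is one field.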
The value of FS can be changed in the awk program with the assignment operator, = (see Section 5.7 in Chapter 5). Often the right time to do this is at the beginning of execution before any input has been processed, so that the very first record is read with the proper separator. To do this, use the special BEGIN pattern (see Section 6.1.4 in Chapter 6). For example, here we set the value of FS to the string ",":