By Arnold Robbins
pattern { action }
pattern { action }
...
awk 'program' input-file1 input-file2 ...
awk -f program-file input-file1 input-file2 ...
awk 'program'
An A in the last column means the board operates 24
hours a day. A B in the last column means the board operates only on evening and weekend hours. A C
means the board operates only on weekends:
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401
Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
foo (a grouping of characters is usually
called a string; the term
string is based on similar usage in English, such
as "a string of pearls," or "a string of cars in a train"):
awk '/foo/ { print $0 }' BBS-list
foo are found, they are
printed because print $0 means print the current
line. (Just print by itself means the same thing,
so we could have written that instead.)
/) surround the string
foo in the awk program. The
slashes indicate that foo is the pattern to search
for. This type of pattern is called a regular
expression, which is covered in more detail later (see
Chapter 2).
The pattern is allowed to match parts of
words. There are single quotes around the awk
program so that the shell won't interpret any of it as special shell
characters.
$ awk '/foo/ { print $0 }' BBS-list
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sabafoo 555-2127 1200/300 C
print
statement and the curly braces) in the previous example and the result
would be the same: all lines matching the pattern
foo are printed. By comparison, omitting the
print statement but retaining the curly braces makes
an empty action that does nothing (i.e., no lines are printed).
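The difference between omitting the action and leaving it empty can be checked on a few lines of input. The following is a minimal sketch; the three input lines are made up for illustration and are supplied with printf rather than read from BBS-list:

```shell
# Pattern with no action: matching lines are printed, as if by { print $0 }.
printf 'fooey\nbarfly\nmacfoo\n' | awk '/foo/'

# Pattern with an empty action: the same lines match, but nothing is printed.
printf 'fooey\nbarfly\nmacfoo\n' | awk '/foo/ { }'
```

The first command prints fooey and macfoo; the second prints nothing at all.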
/12/ { print $0 }
/21/ { print $0 }
12 as the pattern and
print $0 as the action. The second rule has the
string 21 as the pattern and also has print
$0 as the action. Each rule's action is enclosed in its own
pair of braces.
12
or the string
21. If a line contains both strings, it is printed
twice, once by each rule.
$ awk '/12/ { print $0 }
> /21/ { print $0 }' BBS-list inventory-shipped
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
sabafoo 555-2127 1200/300 C
Jan 21 36 64 620
Apr 21 70 74 514
sabafoo in
BBS-list was printed twice, once for each rule.
ls -l | awk '$6 == "Nov" { sum += $5 }
END { print sum }'
ls -l part of this example is a system command
that gives you a listing of the files in a directory, including each
file's size and the date the file was last modified. Its output looks
like this:
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
$6 == "Nov" in our awk
program is an expression that tests whether the sixth field of the
output from ls -l matches the string
Nov. Each time a line has the string
Nov for its sixth field, the action sum +=
$5 is performed. This adds the fifth field (the file's size)
to the variable sum. As a result, when
awk has finished reading all the input lines,
sum is the total of the sizes of the files whose lines matched the pattern.
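The summation can also be tried without a real directory. This sketch feeds three lines from the sample listing to the same program through printf, so the result does not depend on the files that happen to be in the current directory:

```shell
# Two of the three lines have Nov in field 6; their sizes are 1933 and 10809.
printf '%s\n' \
  '-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile' \
  '-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h' \
  '-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h' |
awk '$6 == "Nov" { sum += $5 }
     END { print sum }'
```

This prints 12742, the sum of the two November sizes. The April line is read but contributes nothing, because its sixth field does not equal Nov.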
awk '/12/ { print $0 }
/21/ { print $0 }' BBS-list inventory-shipped
, { ? : || && do else
\). The backslash must be the
final character on the line in order to be recognized as a continuation
character. A backslash is allowed anywhere in the statement, even in
the middle of a string or regular expression. For example:
awk '/This regular expression is too long, so continue it\
on the next line/ { print $1 }'
/) is an
awk pattern that matches every input record whose
text belongs to that set. The simplest regular expression is a
sequence of letters, numbers, or both. Such a regexp matches any
string that contains that sequence. Thus, the regexp
foo matches any string containing
foo. Therefore, the pattern
/foo/ matches any input record containing the three
characters foo
anywhere in the
record. Other kinds of regexps let you specify more complicated
classes of strings.
foo anywhere in it:
$ awk '/foo/ { print $2 }' BBS-list
555-1234
555-6699
555-6480
555-2127
~ and !~ perform regular
expression comparisons. Expressions using these operators can be used
as patterns, or in if, while,
for, and do statements. (See
Section 6.4 in Chapter 6.)
For example:
exp ~ /regexp/
J somewhere in the first field:
$ awk '$1 ~ /J/' inventory-shipped
Jan 13 25 15 115
Jun 31 42 75 492
Jul 24 34 67 436
Jan 21 36 64 620
awk '{ if ($1 ~ /J/) print }' inventory-shipped
exp !~ /regexp/
J:
$ awk '$1 !~ /J/' inventory-shipped
Feb 15 32 24 226
Mar 15 24 34 228
Apr 31 52 63 420
May 16 34 29 208
...
/foo/,
we call it a regexp constant, much like
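5.27 is a numeric constant and "foo" is a string constant.
Some characters cannot be included literally in string constants
("foo") or regexp constants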
(/foo/). Instead, they should be represented with
escape sequences, which are character sequences
beginning with a backslash (\). One use of an
escape sequence is to include a double-quote character in a string
constant. Because a plain double quote ends the string, you must use
\" to represent an actual double-quote character as a
part of the string. For example:
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
He said "hi!" to her.
\\ to put one
backslash in the string or regexp. Thus, the string whose contents are
the two characters " and \ must
be written "\"\\".
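A quick check of that last string, as a minimal sketch. The variable name s is only for illustration. Because the awk program sits inside single quotes, the shell passes the backslashes through untouched and awk alone interprets them:

```shell
# "\"\\" is a two-character awk string: a double quote followed by a backslash.
awk 'BEGIN { s = "\"\\"; print s; print length(s) }'
```

The first line of output is the two characters " and \, and the second is the length, 2.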
\\
\.
\a
\b
\ and are recognized and converted
into corresponding real characters as the first step in
processing regexps.
\
\$ matches the
character $.
^
^@chapter matches @chapter
at the beginning of a string and can be used to identify chapter
beginnings in Texinfo source files. The ^ is
known as an anchor, because it anchors the
pattern to match only at the beginning of the string.
^ does not
match the beginning of a line embedded in a string. The condition
is not true in the following example:
if ("line1\nLINE 2" ~ /^L/) ...
$
^, but it matches only at the
end of a string. For example, p$ matches a
record that ends with a p. The
$ is an anchor and does not match the end of a
line embedded in a string. The condition is not true as follows:
if ("line1\nLINE 2" ~ /1$/) ...
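Both anchor conditions can be evaluated directly. The sketch below prints the result of each match (1 for true, 0 for false) on the same two-line string; the variable name s is only for illustration:

```shell
awk 'BEGIN {
    s = "line1\nLINE 2"
    print (s ~ /^L/)   # 0: the L follows a newline, not the start of the string
    print (s ~ /1$/)   # 0: the 1 precedes a newline, not the end of the string
    print (s ~ /^l/)   # 1: the string really does begin with l
    print (s ~ /2$/)   # 1: the string really does end with 2
}'
```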
[a-dx-z] is equivalent to
[abcdxyz]. Many locales sort characters in
dictionary order, and in these locales, [a-dx-z] is
typically not equivalent to [abcdxyz]; instead it
might be equivalent to [aBbCcDdxXyYz], for example.
To obtain the traditional interpretation of bracket expressions, you
can use the C locale by setting the LC_ALL environment
variable to the value C.
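As a small sketch of that advice, LC_ALL can be set for a single awk invocation. With the C locale in effect, the range a-d covers only the four lowercase letters, so the uppercase B in this made-up input is rejected. Without the setting, the result may differ from one locale and awk implementation to another:

```shell
# Only the lowercase b matches in the C locale.
printf 'b\nB\n' | LC_ALL=C awk '/^[a-d]$/'
```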
\,
], -, or ^ in a
character list, put a \ in front of it. For example:
[d\]]
d or ].
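To see the escaped bracket at work, the following sketch anchors the character list at both ends so that only one-character records consisting of d or ] are printed. The three input lines are made up for illustration:

```shell
# Prints the lines d and ], but not x.
printf 'd\n]\nx\n' | awk '/^[d\]]$/'
```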
\ in character lists is compatible
with other awk implementations and is also mandated
by POSIX. The regular expressions in awk are a
superset of the POSIX specification for Extended Regular Expressions
(EREs). POSIX EREs are based on the regular expressions accepted by the
traditional egrep utility.
[:, a keyword denoting the class, and
:].
Table 2-1 lists the character classes defined by the
POSIX standard.
_):
\w
[[:alnum:]_].
\W
[^[:alnum:]_].
\<
/\<away/ matches away but
not stowaway.
\>
/stow\>/ matches stow but
not stowaway.
\y
\yballs?\y matches either
ball or balls, as a separate
word.
\B
w in a regular expression matches
only a lowercase w and not an uppercase
W.
[Ww]. However, this can be
cumbersome if you need to use it often, and it can make the regular
expressions harder to read. There are two alternatives that you might
prefer.
tolower or toupper built-in
string functions (which we haven't discussed yet; see
Section 8.1.3 in Chapter 8).
For example:
tolower($1) ~ /foo/ { ... }
IGNORECASE to a nonzero value (see
Section 6.5 in Chapter 6).
When IGNORECASE
is not zero, all regexp and string operations
ignore case. Changing the value of IGNORECASE
dynamically controls the case-sensitivity of the program as it runs.
Case is significant by default because IGNORECASE
(like most variables) is initialized to zero:
x = "aB"
if (x ~ /ab/) ...   # this test will fail

IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed
IGNORECASE to make
certain rules case-insensitive and other rules case-sensitive, because
there is no straightforward way to set IGNORECASE
just for the pattern of a particular rule.
To do this, use either character lists or tolower.
However, one thing you can do with IGNORECASE only
is dynamically turn case-sensitivity on or off for all the rules at
once.
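The tolower approach works in any awk that provides that function and can be applied to a single rule. Here is a minimal sketch on made-up input, in which only the one rule shown ignores case:

```shell
# Fold the first field to lowercase before matching, so Foo and FOOT both match.
printf 'Foo 1\nbar 2\nFOOT 3\n' | awk 'tolower($1) ~ /foo/ { print $2 }'
```

This prints 1 and 3. The input record itself is not modified; only the temporary copy returned by tolower is lowercased.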
echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
sub function (which we haven't
discussed yet; see
Section 8.1.3 in Chapter 8)
to make a
change to the input record. Here, the regexp /a+/
indicates "one or more a characters," and the
replacement text is <A>.
a characters.
awk (and POSIX) regular expressions always match the
leftmost, longest sequence of input characters
that can match. Thus, all four a characters are
replaced with <A> in this example:
$ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
<A>bcd
match, sub,
gsub, and gensub functions, it is
very important. Understanding this principle is also important for
regexp-based record and field splitting (see
Section 3.1
and
Section 3.5 in Chapter 3).
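The match function makes the leftmost-longest rule visible, because it records where the match starts and how long it is in the variables RSTART and RLENGTH. A minimal sketch, using the same string as the sub example (the variable name s is only for illustration):

```shell
awk 'BEGIN {
    s = "aaaabcd"
    # /a+/ could match one a, but the longest match at the leftmost position wins.
    if (match(s, /a+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}'
```

This prints 1 4 aaaa: the match begins at the first character and covers all four a characters.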
~ or !~
operator need not be a regexp constant (i.e., a string of characters
between slashes). It may be any expression. The expression is
evaluated and converted to a string if necessary; the contents of the
string are used as the regexp. A regexp that is computed in this way
is called a dynamic regexp:
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }
digits_regexp to a regexp that describes
one or more digits, and tests whether the input record matches this
regexp.
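Because a dynamic regexp is just a string value, it can also come from outside the program. The following sketch passes one in with the -v option; the variable name re and the input lines are assumptions made for illustration:

```shell
# re holds the regexp as a string; it is used as a pattern only when ~ is applied.
printf 'abc\nroom 101\n42\n' | awk -v re='^[0-9]+$' '$0 ~ re { print }'
```

Only 42 is printed, since it is the only record made up entirely of digits.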
~ and !~ operators, there is a
difference between a regexp constant enclosed in slashes and a string
constant enclosed in double quotes. If you are going to use a string
constant, you have to understand that the string is, in essence,
scanned twice: the first time when
awk reads your program, and the second time when it
goes to match the string on the lefthand side of the operator with the
pattern on the right. This is true of any string-valued expression
(such as digits_regexp, shown previously), not just
string constants.
/\*/ is a regexp constant for a literal
*. Only one backslash is needed. To do the same
thing with a string, you have to type "\\*". The
first backslash escapes the second one so that the string actually
contains the two characters \ and
*.
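The two spellings can be compared side by side. In this sketch, both rules test for a literal asterisk, one with a regexp constant and one with a string constant; the input lines are made up for illustration:

```shell
printf 'a*b\nab\n' |
awk '$0 ~ /\*/  { print "regexp constant: " $0 }
     $0 ~ "\\*" { print "string constant: " $0 }'
```

Both rules fire for a*b and neither fires for ab, so the output is one line from each rule.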
FILENAME (see
Section 6.5 in Chapter 6).
getline
command. The getline command is valuable, both
because it can do explicit input from any number of files, and because
the files used with it do not have to be named on the
awk command line (see Section 3.8 later in this chapter).
FNR. It is reset to zero
when a new file is started. Another built-in variable,
NR, is the total number of input records read so far
from all datafiles. It starts at zero, but is never automatically
reset to zero.
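The difference between the two counters shows up as soon as there is more than one input file. The sketch below creates two small files in a scratch directory and prints both variables for every record. The file names one and two are hypothetical, and mktemp -d is assumed to be available:

```shell
dir=$(mktemp -d)                 # scratch directory for two sample files
printf 'a\nb\n' > "$dir/one"
printf 'c\nd\n' > "$dir/two"

# FNR restarts for the second file; NR keeps counting across both.
awk '{ print FNR, NR, $0 }' "$dir/one" "$dir/two"

rm -r "$dir"
```

The output is 1 1 a, 2 2 b, 1 3 c, and 2 4 d, one per line.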
RS.
RS can be
changed in the awk program with the assignment
operator, = (see
Section 5.7 in Chapter 5).
The new record-separator character should be enclosed in quotation
marks, which indicate a string constant. Often the right time to do
this is at the beginning of execution, before any input is processed,
so that the very first record is read with the proper separator. To do
this, use the special BEGIN pattern (see
Section 6.1.4 in Chapter 6).
For example:
awk 'BEGIN { RS = "/" }
{ print $0 }' BBS-list
RS to "/",
before reading any input. This is a string whose first character is a
slash; as a result, records are separated by slashes. Then the input
file is read, and the second rule in the awk program
(the action with no pattern) prints each record. Because each
print statement adds a newline at the end of its
output, this awk program copies
the input with each slash changed to a newline. Here are the results
of running the program on BBS-list:
$ awk 'BEGIN { RS = "/" }
> { print $0 }' BBS-list
aardvark 555-5553 1200
300 B
alpo-net 555-3412 2400
1200
300 A
barfly 555-7685 1200
300 A
bites 555-1675 2400
1200
300 A
camelot 555-0542 300 C
core 555-2912 1200
300 C
fooey 555-1234 2400
1200
300 B
foot 555-6699 1200
300 B
macfoo 555-6480 1200
300 A
sdace 555-3430 2400
1200
300 A
sabafoo 555-2127 1200
300 C
$
A dollar sign ($) is used to refer to a field in an
awk program, followed by the number of the field you
want. Thus, $1 refers to the first field,
$2 to the second, and so on. (Unlike the Unix
shells, the field numbers are not limited to single digits.
$127 is the one hundred twenty-seventh field in
the record.) For example, suppose the following is a line of input:
This seems like a pretty nice example.
Here, the first field, or $1, is
This, the second field, or $2, is
seems, and so on. Note that the last field,
$7, is example.. Because there
is no space between the e and the
., the period is considered part of the seventh
field.
NF is a built-in variable whose value is the number
of fields in the current record. awk automatically
updates the value of NF each time it reads a record.
No matter how many fields there are, the last field in a record can be
represented by $NF. So, $NF is
the same as $7, which is example..
If you try to reference a field beyond the last one (such as
$8 when the record has only seven fields), you get
the empty string. (If used in a numeric operation, you get zero.)
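These statements about NF, $NF, and fields past the end of the record can be verified on the sample line. In this sketch, the square brackets are added only to make the empty string visible:

```shell
echo 'This seems like a pretty nice example.' |
awk '{ print NF            # number of fields
       print $NF           # the last field
       print "[" $8 "]"    # a nonexistent field is the empty string
       print $8 + 0 }'     # and zero when used as a number
```

The four lines of output are 7, example., [], and 0.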
$0, which looks like a reference to the
"zero-th" field, is a special case: it represents the whole input record
when you are not interested in specific fields. Here are some more
examples:
$ to refer to a field. The value of the expression
specifies the field number. If the value is a string, rather than a
number, it is converted to a number. Consider this example:
awk '{ print $NR }'
NR is the number of records read so far:
one in the first record, two in the second, etc. So this example
prints the first field of the first record, the second field of the
second record, and so on. For the twentieth record, field number 20 is
printed; most likely, the record has fewer than 20 fields, so this
prints a blank line. Here is another example of using expressions as
field numbers:
awk '{ print $(2*2) }' BBS-list
(2*2) and uses its value as the number of the field
to print. The * sign represents multiplication, so
the expression 2*2 evaluates to four. The
parentheses are used so that the multiplication is done before the
$ operation; they are necessary whenever there is a
binary operator in the field-number expression. This example, then,
prints the hours of operation (the fourth field) for every line of the
file BBS-list. (All of the awk
operators are listed, in order of decreasing precedence, in
Section 5.14 in Chapter 5.)
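The effect of the parentheses is easiest to see by printing the expression with and without them. In this sketch, the input line is made up so that the two results differ:

```shell
# $(2*2) is field four; $2*2 is field two multiplied by two.
echo '10 20 30 45' | awk '{ print $(2*2); print $2*2 }'
```

The first value printed is 45 (the fourth field) and the second is 40 (twice the second field), because $ binds more tightly than *.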
$(2-2) has the same value as
$0. Negative field numbers are not allowed; trying
to reference one usually terminates the program. (The POSIX standard
does not define what happens when you reference a negative field
number. gawk notices this and terminates your
program. Other awk implementations may behave
differently.)
NF (also see
Section 6.5 in Chapter 6).
The expression
$NF is not a special feature; it is the direct
consequence of evaluating NF and using its value as a field number.
$ awk '{ nboxes = $3 ; $3 = $3 - 10
> print nboxes, $3 }' inventory-shipped
25 15
32 22
24 14
...
nboxes. The - sign
represents subtraction, so this program reassigns field three,
$3, as the original value of field three minus ten:
$3 - 10. (See
Section 5.5 in Chapter 5.)
Then it prints the original and new values for field three. (Someone
in the warehouse made a consistent mistake while inventorying the red
boxes.)
$3 must make
sense as a number; the string of characters must be converted to a
number for the computer to do arithmetic on it. The number resulting
from the subtraction is converted back to a string of characters that
then becomes field three. See
Section 5.4 in Chapter 5.
$0 changes to reflect the altered field. Thus, this
program prints a copy of the input file, with 10 subtracted from the
second field of each line:
$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
Jan 3 25 15 115
Feb 5 32 24 226
Mar 5 24 34 228
...
$ awk '{ $6 = ($5 + $4 + $3 + $2)
> print $6 }' inventory-shipped
168
297
301
...
$6, whose value is the sum of
fields $2, $3,
$4, and $5. The
+ sign represents addition. For the file
inventory-shipped, $6
represents the total number of parcels shipped for a particular month.
oo,
then the following line:
moo goo gai pan
m,
" g", and " gai pan" (the quotes only make the spaces visible). Note the
leading spaces in the values of the second and third fields.
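The split can be reproduced directly. This sketch sets FS in a BEGIN rule and prints each field between square brackets so that the leading spaces show up; the loop variable i is only for illustration:

```shell
echo 'moo goo gai pan' |
awk 'BEGIN { FS = "oo" }
     { print NF
       for (i = 1; i <= NF; i++)
           print "[" $i "]" }'
```

The output is 3, followed by [m], [ g], and [ gai pan].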
FS. Shell programmers take note:
awk does not use the name
IFS that is used by the POSIX-compliant shells (such
as the Unix Bourne shell, sh, or bash).
FS can be changed in the
awk program with the assignment operator,
= (see
Section 5.7 in Chapter 5).
Often
the right time to do this is at the beginning of execution before any
input has been processed, so that the very first record
is read with
the proper separator. To do this, use the special
BEGIN pattern (see
Section 6.1.4 in Chapter 6).
For example, here we set the value of FS to the
string ",":