Chapter 4. Reading Input Files
In the typical awk
program,
awk
reads all input either from the
standard input (by default, this is the keyboard, but often it is a pipe
from another command) or from files whose names you specify on the awk
command line. If you specify input files,
awk
reads them in order, processing all
the data from one before going on to the next. The name of the current input file can be found in the
predefined variable FILENAME
(see Predefined Variables).
The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.
On rare occasions, you may need to use the getline
command. The getline
command is valuable both because it can do
explicit input from any number of files, and because the files used with it
do not have to be named on the awk
command line (see Explicit Input with getline).
How Input Is Split into Records
awk
divides the input for your
program into records and fields. It keeps track of the number of records
that have been read so far from the current input file. This value is stored in a predefined variable called
FNR
, which is reset to zero every time
a new file is started. Another predefined variable, NR
, records the total number of input records
read so far from all datafiles. It starts at zero, but is never
automatically reset to zero.
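The difference between the two variables is easy to see with two small files. This is a quick sketch (the filenames `file1` and `file2` are made up for illustration) that prints FILENAME, FNR, and NR for each record:

```shell
# Create two throwaway input files (the names are arbitrary)
printf 'a\nb\n' > file1
printf 'c\n'    > file2

# FNR restarts at one for each new file; NR keeps counting across files
awk '{ print FILENAME, FNR, NR }' file1 file2
```

This prints ‘file1 1 1’, ‘file1 2 2’, and ‘file2 1 3’: when awk moves on to file2, FNR drops back to one while NR continues at three.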
Record Splitting with Standard awk
Records are separated by a character called the record
separator. By default, the record separator is the newline
character. This is why records are, by default, single lines. To use a
different character for the record separator, simply assign that
character to the predefined variable RS
.
Like any other variable, the value of RS
can be changed in the awk
program with the assignment operator,
‘=
’ (see Assignment Expressions). The new record-separator character should
be enclosed in quotation marks, which indicate a string constant.
Often, the right time to do this is at the beginning of
execution, before any input is processed, so that the very first record
is read with the proper separator. To do this, use the special BEGIN
pattern (see The BEGIN and END Special Patterns). For example:
awk 'BEGIN { RS = "u" } { print $0 }' mail-list
changes the value of RS
to
‘u
’, before reading any input. The
new value is a string whose first character is the letter “u”; as a
result, records are separated by the letter “u”. Then the input file is
read, and the second rule in the awk
program (the action with no pattern) prints each record. Because each print
statement adds a newline at the end of its output, this awk
program copies the input with each
‘u
’ changed to a newline. Here are
the results of running the program on mail-list
:
$ awk 'BEGIN { RS = "u" }
>     { print $0 }' mail-list
Amelia 555-5553 amelia.zodiac
sq
e@gmail.com F
Anthony 555-3412 anthony.assert
ro@hotmail.com A
Becky 555-7685 becky.algebrar
m@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliq
otiens@yahoo.com R
Camilla 555-2912 camilla.inf
sar
m@skynet.be R
Fabi
s 555-1234 fabi
s.
ndevicesim
s@
cb.ed
 F
J
lie 555-6699 j
lie.perscr
tabor@skeeve.com F
Martin 555-6480 martin.codicib
s@hotmail.com A
Sam
el 555-3430 sam
el.lanceolis@sh
.ed
 A
Jean-Pa
l 555-2127 jeanpa
l.campanor
m@ny
.ed
 R
Note that the entry for the name ‘Bill
’ is not split. In the original datafile
(see Datafiles for the Examples), the line looks like
this:
Bill 555-1675 bill.drowning@hotmail.com A
It contains no ‘u
’, so there is
no reason to split the record, unlike the others, which each have one or
more occurrences of the ‘u
’. In fact,
this record is treated as part of the previous record; the newline
separating them in the output is the original newline in the datafile,
not the one added by awk
when it
printed the record!
Another way to change the record separator is on the command line, using the variable-assignment feature (see Other Command-Line Arguments):
awk '{ print $0 }' RS="u" mail-list
This sets RS
to ‘u
’ before processing mail-list
.
Using an alphabetic character such as ‘u
’ for the record separator is highly likely
to produce strange results. Using an unusual character such as ‘/
’ is more likely to produce correct behavior
in the majority of cases, but there are no guarantees. The moral is:
Know Your Data.
When using regular characters as the record separator, there is
one unusual case that occurs when gawk
is being fully POSIX-compliant (see Command-Line Options). Then, the following (extreme) pipeline prints a
surprising ‘1
’:
$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
1
There is one field, consisting of a newline. The value of the
built-in variable NF
is the number of
fields in the current record. (In the normal case, gawk
treats the newline as whitespace,
printing ‘0
’ as the result. Most
other versions of awk
also act this
way.)
Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
RS
. (d.c.)
The empty string ""
(a string
without any characters) has a special meaning as the value of RS
. It means that records are separated by one
or more blank lines and nothing else. See Multiple-Line Records for more details.
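For instance, with two paragraphs separated by a blank line, each paragraph becomes one record. A quick sketch:

```shell
# With RS = "", records are "paragraphs": blank lines separate them,
# and newlines inside a record also separate fields
printf 'line 1\nline 2\n\nline 3\n' |
awk 'BEGIN { RS = "" } { print "record", NR, "has", NF, "fields" }'
```

This prints ‘record 1 has 4 fields’ and ‘record 2 has 2 fields’: the first record is the two lines before the blank line, taken together.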
If you change the value of RS
in the middle of an awk
run, the new
value is used to delimit subsequent records, but the record currently
being processed, as well as records already processed, are not
affected.
After the end of the record has been determined, gawk
sets the variable RT
to the text in the input that matched
RS
.
Record Splitting with gawk
When using gawk
, the value of
RS
is not limited to a one-character
string. It can be any regular expression (see Chapter 3). (c.e.) In general, each record ends at the next
string that matches the regular expression; the next record starts at
the end of the matching string. This general rule is actually at work in the usual case,
where RS
contains just a newline: a
record ends at the beginning of the next matching string (the next
newline in the input), and the following record starts just after the
end of this string (at the first character of the following line). The
newline, because it matches RS
, is
not part of either record.
When RS
is a single character,
RT
contains the same single
character. However, when RS
is a
regular expression, RT
contains the
actual input text that matched the regular
expression.
If the input file ends without any text matching RS
, gawk
sets RT
to the null string.
The following example illustrates both of these features. It sets
RS
equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:
$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>       { print "Record =", $0,"and RT = [" RT "]" }'
Record = record 1 and RT = [ AAAA ]
Record = record 2 and RT = [ BBBB ]
Record = record 3 and RT = [
]
The square brackets delineate the contents of RT
, letting you see the leading and trailing
whitespace. The final value of RT
is
a newline. See A Simple Stream Editor for a more useful example
of RS
as a regexp and RT
.
If you set RS
to a regular
expression that allows optional trailing text, such as ‘RS = "abc(XYZ)?"
’, it is possible, due to
implementation constraints, that gawk
may match the leading part of the regular expression, but not the
trailing part, particularly if the input text that could match the
trailing part is fairly long. gawk
attempts to avoid this problem, but currently, there’s no guarantee that
this will never happen.
Note
Remember that in awk
, the
‘^
’ and ‘$
’ anchor metacharacters match the beginning
and end of a string, and not the beginning and end of a
line. As a result, something like ‘RS = "^[[:upper:]]"
’ can only match at the
beginning of a file. This is because gawk
views the input file as one long string
that happens to contain newline characters. It is thus best to avoid
anchor metacharacters in the value of RS
.
The use of RS
as a regular
expression and the RT
variable are
gawk
extensions; they are not
available in compatibility mode (see Command-Line Options). In
compatibility mode, only the first character of the value of RS
determines the end of the record.
Examining Fields
When awk
reads an input record,
the record is automatically parsed or separated by
the awk
utility into chunks called
fields. By default, fields are separated by
whitespace, like words in a line. Whitespace in awk
means
any string of one or more spaces, TABs, or newlines;[18] other characters that are considered whitespace by other
languages (such as formfeed, vertical tab, etc.) are
not considered whitespace by awk
.
The purpose of fields is to make it more convenient for you to refer
to these pieces of the record. You don’t have to use them—you can operate
on the whole record if you want—but fields are what make simple awk
programs so powerful.
You use a dollar sign (‘$
’) to
refer to a field in an awk
program,
followed by the number of the field you want. Thus, $1
refers to the
first field, $2
to the second, and so
on. (Unlike in the Unix shells, the field numbers are not limited to
single digits. $127
is the 127th field
in the record.) For example, suppose the following is a line of
input:
This seems like a pretty nice example.
Here the first field, or $1
, is
‘This
’, the second field, or $2
, is ‘seems
’, and so on. Note that the last field,
$7
, is ‘example.
’. Because there is no space between the
‘e
’ and the ‘.
’, the period is considered part of the seventh
field.
NF
is a predefined variable
whose value is the number of fields in the current record.
awk
automatically updates the value of
NF
each time it reads a record. No
matter how many fields there are, the last field in a record can be
represented by $NF
. So, $NF
is the same as $7
, which is ‘example.
’. If you try to reference a field
beyond the last one (such as $8
when
the record has only seven fields), you get the empty string. (If used in a
numeric operation, you get zero.)
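These rules are easy to check at the shell. A quick sketch using the example line:

```shell
echo 'This seems like a pretty nice example.' |
awk '{
    print NF            # number of fields: 7
    print $NF           # last field: "example."
    print ($8 == "")    # $8 is beyond the last field: prints 1 (true)
}'
```

The comparison on the last line is true because referencing $8 in a seven-field record yields the empty string.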
The use of $0
, which looks like a
reference to the “zeroth” field, is a special case: it represents the
whole input record. Use it when you are not interested in specific fields.
Here are some more examples:
$ awk '$1 ~ /li/ { print $0 }' mail-list
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Julie 555-6699 julie.perscrutabor@skeeve.com F
This example prints each record in the file mail-list
whose first field contains the string
‘li
’.
By contrast, the following example looks for ‘li
’ in the entire record
and prints the first and last fields for each matching input
record:
$ awk '/li/ { print $1, $NF }' mail-list
Amelia F
Broderick R
Julie F
Samuel A
Nonconstant Field Numbers
A field number need not be a constant. Any expression in the
awk
language can be used after a
‘$
’ to refer to a field. The value of the expression specifies the field number. If
the value is a string, rather than a number, it is converted to a number.
Consider this example:
awk '{ print $NR }'
Recall that NR
is the number of
records read so far: one in the first record, two in the second,
and so on. So this example prints the first field of the first record, the
second field of the second record, and so on. For the twentieth record,
field number 20 is printed; most likely, the record has fewer than 20
fields, so this prints a blank line. Here is another example of using
expressions as field numbers:
awk '{ print $(2*2) }' mail-list
awk
evaluates the expression
‘(2*2)
’ and uses its value as the
number of the field to print. The ‘*
’
represents multiplication, so the expression ‘2*2
’ evaluates to four. The parentheses are used
so that the multiplication is done before the ‘$
’ operation; they are necessary whenever there is a binary operator[19] in the field-number expression. This example, then, prints
the type of relationship (the fourth field) for every line of the file
mail-list
. (All of the awk
operators are listed, in order of decreasing
precedence, in Operator Precedence (How Operators Nest).)
If the field number you compute is zero, you get the entire record.
Thus, ‘$(2-2)
’ has the same value as
$0
. Negative field numbers are not
allowed; trying to reference one usually terminates the program. (The
POSIX standard does not define what happens when you reference a negative
field number. gawk
notices this and
terminates your program. Other awk
implementations may behave differently.)
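A common use of a computed field number is grabbing the next-to-last field. A quick sketch:

```shell
# The parentheses force the subtraction before the '$' operation
echo 'This seems like a pretty nice example.' |
awk '{ print $(NF-1) }'
```

This prints ‘nice’, the sixth of the seven fields.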
As mentioned in Examining Fields, awk
stores the current record’s number of fields
in the built-in variable NF
(also see
Predefined Variables). Thus, the expression $NF
is not a special feature—it is the direct
consequence of evaluating NF
and using
its value as a field number.
Changing the Contents of a Field
The contents of a field, as seen by awk
, can be changed within an awk
program; this changes what awk
perceives as the current input record.
(The actual input is untouched; awk
never modifies the
input file.) Consider the following example and its output:
$ awk '{ nboxes = $3 ; $3 = $3 - 10
>        print nboxes, $3 }' inventory-shipped
25 15
32 22
24 14
…
The program first saves the original value of field three in the
variable nboxes
. The ‘-
’ sign represents subtraction, so this program reassigns field three, $3
, as the original value of field three minus
ten: ‘$3 - 10
’. (See Arithmetic Operators.) Then it prints the original and new values
for field three. (Someone in the warehouse made a consistent mistake while
inventorying the red boxes.)
For this to work, the text in $3
must make sense as a number; the string of characters must be converted to
a number for the computer to do arithmetic on it. The number resulting
from the subtraction is converted back to a string of characters that then
becomes field three. See Conversion of Strings and Numbers.
When the value of a field is changed (as perceived by awk
), the text of the input record is
recalculated to contain the new field where the old one was. In other
words, $0
changes to reflect the
altered field. Thus, this program prints a copy of the input file, with 10
subtracted from the second field of each line:
$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
Jan 3 25 15 115
Feb 5 32 24 226
Mar 5 24 34 228
…
It is also possible to assign contents to fields that are out of range. For example:
$ awk '{ $6 = ($5 + $4 + $3 + $2)
>        print $6 }' inventory-shipped
166
297
301
…
We’ve just created $6
, whose
value is the sum of fields $2
, $3
, $4
, and
$5
. The ‘+
’ sign represents addition. For the file
inventory-shipped
, $6
represents the total number of parcels
shipped for a particular month.
Creating a new field changes awk
’s internal copy of the current input record,
which is the value of $0
. Thus, if you
do ‘print $0
’ after adding a field, the
record printed includes the new field, with the appropriate number of
field separators between it and the previously existing fields.
This recomputation affects and is affected by NF
(the number of fields; see Examining Fields). For example, the value of NF
is set to the number of the highest field you
create. The exact format of $0
is also
affected by a feature that has not been discussed yet: the
output field separator, OFS
, used to separate the fields (see Output Separators).
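The effect of OFS on a rebuilt record can be seen with a one-liner (a sketch; OFS itself is covered in Output Separators):

```shell
# Assigning to any field forces $0 to be rebuilt using OFS
echo 'a b c' | awk '{ OFS = "-"; $1 = $1; print $0 }'
```

This prints ‘a-b-c’: the assignment to $1 (even with the same value) triggers the recomputation, and the new OFS is used between fields.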
Note, however, that merely referencing an
out-of-range field does not change the value of
either $0
or NF
. Referencing an out-of-range field only produces an empty
string. For example:
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
should print ‘everything is
normal
’, because NF+1
is
certain to be out of range. (See The if-else Statement for more
information about awk
’s if-else
statements. See Variable Typing and Comparison Expressions for more information about the
‘!=
’ operator.)
It is important to note that making an assignment to an existing
field changes the value of $0
but does
not change the value of NF
, even when
you assign the empty string to a field. For example:
$ echo a b c d | awk '{ OFS = ":"; $2 = ""
>                       print $0; print NF }'
a::c:d
4
The field is still there; it just has an empty value,
delimited by the two colons between ‘a
’ and ‘c
’.
This example shows what happens if you create a new field:
$ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
>                       print $0; print NF }'
a::c:d::new
6
The intervening field, $5
, is
created with an empty value (indicated by the second pair of adjacent
colons), and NF
is updated with the
value six.
Decrementing NF
throws away the
values of the fields after the new value of NF
and recomputes $0
. (d.c.) Here is an example:
$ echo a b c d e f | awk '{ print "NF =", NF;
>                           NF = 3; print $0 }'
NF = 6
a b c
Caution
Some versions of awk
don’t
rebuild $0
when NF
is decremented.
Finally, there are times when it is convenient to force awk
to rebuild the entire record, using the current values of the
fields and OFS
. To do this, use the
seemingly innocuous assignment:
$1 = $1   # force record to be reconstituted
print $0  # or whatever else with $0
This forces awk
to rebuild the
record. It does help to add a comment, as we’ve shown here.
There is a flip side to the relationship between $0
and the fields. Any assignment to $0
causes the record to be reparsed into fields
using the current value of FS
. This also applies to any built-in function
that updates $0
, such as sub()
and gsub()
(see String-Manipulation Functions).
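A quick sketch of this reparsing:

```shell
# Assigning to $0 splits the new text into fields using the current FS
echo 'ignored input' |
awk 'BEGIN { FS = ":" } { $0 = "a:b:c"; print NF, $2 }'
```

This prints ‘3 b’: the assigned text replaces the record that was read, and it is immediately split on the colons.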
Specifying How Fields Are Separated
The field separator, which is either a single
character or a regular expression, controls the way awk
splits an input record into fields. awk
scans the input record for character
sequences that match the separator; the fields themselves are the text
between the matches.
In the examples that follow, we use the bullet symbol (•) to
represent spaces in the output. If the field separator is ‘oo
’, then the following line:
moo goo gai pan
is split into three fields: ‘m
’,
‘•g
’, and ‘•gai•pan
’. Note the leading spaces in the values
of the second and third fields.
The field separator is represented by the predefined variable
FS
. Shell programmers take note:
awk
does not use
the name IFS
that is used by the
POSIX-compliant shells (such as the Unix Bourne shell, sh
, or Bash).
The value of FS
can be changed in
the awk
program with the assignment
operator, ‘=
’ (see Assignment Expressions). Often, the right time to do this is at the
beginning of execution before any input has been processed, so that the
very first record is read with the proper separator. To do this, use the special BEGIN
pattern (see The BEGIN and END Special Patterns). For example, here we set the value of
FS
to the string ","
:
awk 'BEGIN { FS = "," } ; { print $2 }'
Given the input line:
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
this awk
program extracts and
prints the string ‘•29•Oak•St.
’.
Sometimes the input data contains separator characters that don’t separate fields the way you thought they would. For instance, the person’s name in the example we just used might have a title or suffix attached, such as:
John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
The same program would extract ‘•LXIX
’ instead of ‘•29•Oak•St.
’. If you were expecting the program
to print the address, you would be surprised. The moral is to choose your
data layout and separator characters carefully to prevent such problems.
(If the data is not in a form that is easy to process, perhaps you can
massage it first with a separate awk
program.)
Whitespace Normally Separates Fields
Fields are normally separated by whitespace sequences (spaces,
TABs, and newlines), not by single spaces. Two spaces in a row do not
delimit an empty field. The default value of the field separator
FS
is a string containing a single
space, " "
. If awk
interpreted this value in the usual way,
each space character would separate fields, so two spaces in a row would
make an empty field between them. The reason this does not happen is
that a single space as the value of FS
is a special case—it is taken to specify
the default manner of delimiting fields.
If FS
is any other single
character, such as ","
, then each
occurrence of that character separates two fields. Two consecutive
occurrences delimit an empty field. If the character occurs at the
beginning or the end of the line, that too delimits an empty field. The
space character is the only single character that does not follow these
rules.
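A quick sketch of these rules with a comma separator:

```shell
# Leading, trailing, and doubled commas all delimit empty fields
echo ',a,,b,' | awk 'BEGIN { FS = "," } { print NF }'
```

This prints ‘5’: the fields are an empty string, ‘a’, another empty string, ‘b’, and a final empty string.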
Using Regular Expressions to Separate Fields
The previous subsection discussed the use of single characters or
simple strings as the value of FS
.
More generally, the value of FS
may be a string containing any regular
expression. In this case, each match in the record for the regular
expression separates fields. For example, the assignment:
FS = ", \t"
makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.
For a less trivial example of a regular expression, try using
single spaces to separate fields the way single commas are used.
FS
can be set to "[ ]"
(left bracket, space,
right bracket). This regular expression matches a single space and
nothing else (see Chapter 3).
There is an important difference between the two cases of
‘FS = " "
’ (a single
space) and ‘FS = "[ \t\n]+"
’
(a regular expression matching one or more spaces, TABs, or newlines).
For both values of FS
, fields are
separated by runs (multiple adjacent occurrences)
of spaces, TABs, and/or newlines. However, when the value of FS
is " ", awk first
strips leading and trailing whitespace from the record and then decides
where the fields are. For example, the following pipeline prints ‘b
’:
$ echo ' a b c d ' | awk '{ print $2 }'
b
However, this pipeline prints ‘a
’ (note the extra spaces around each
letter):
$ echo ' a b c d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                         { print $2 }'
a
In this case, the first field is null, or empty.
The stripping of leading and trailing whitespace also comes into
play whenever $0
is recomputed. For
instance, study this pipeline:
$ echo ' a b c d' | awk '{ print; $2 = $2; print }'
a b c d
a b c d
The first print
statement
prints the record as it was read, with leading whitespace intact. The
assignment to $2
rebuilds $0
by concatenating $1
through $NF
together, separated by the value of
OFS
(which is a space by default).
Because the leading whitespace was ignored when finding $1
, it is not part of the new $0
. Finally, the last print
statement prints the new $0
.
There is an additional subtlety to be aware of when using regular
expressions for field splitting. It is not well specified in the POSIX standard, or
anywhere else, what ‘^
’ means when
splitting fields. Does the ‘^
’ match
only at the beginning of the entire record? Or is each field separator a
new string? It turns out that different awk
versions answer this question differently, and you should
not rely on any specific behavior in your programs. (d.c.)
As a point of information, BWK awk
allows ‘^
’ to match only at the beginning of the
record. gawk
also works this way. For
example:
$ echo 'xxAA xxBxx C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
--><--
-->AA<--
-->xxBxx<--
-->C<--
Making Each Character a Separate Field
There are times when you may want to examine each character of a
record separately. This can be done in gawk
by simply assigning the null string
(""
) to FS
. (c.e.) In this case, each individual character in the record becomes
a separate field. For example:
$ echo a b | gawk 'BEGIN { FS = "" }
> {
>     for (i = 1; i <= NF; i = i + 1)
>         print "Field", i, "is", $i
> }'
Field 1 is a
Field 2 is  
Field 3 is b
Traditionally, the behavior of FS
equal to ""
was not defined. In this case, most
versions of Unix awk
simply treat the
entire record as only having one field. (d.c.) In compatibility mode
(see Command-Line Options), if FS
is the null string, then gawk
also behaves this way.
Setting FS from the Command Line
FS
can be set on the command
line. Use the -F
option to do so. For example:
awk -F, 'program' input-files
sets FS
to the ‘,
’ character. Notice that the option uses an
uppercase ‘F
’ instead of a lowercase
‘f
’. The latter option
(-f
) specifies a file containing an awk
program.
The value used for the argument to -F
is
processed in exactly the same way as assignments to the predefined
variable FS
. Any special characters
in the field separator must be escaped appropriately. For example, to
use a ‘\
’ as the field separator on
the command line, you would have to type:
# same as FS = "\\"
awk -F\\\\ '…' files …
Because ‘\
’ is used for quoting
in the shell, awk
sees ‘-F\\
’. Then awk
processes the ‘\\
’ for escape characters (see Escape Sequences), finally yielding a single ‘\
’ to use for the field separator.
As a special case, in compatibility mode (see Command-Line Options), if the argument to -F
is
‘t
’, then FS
is set to the TAB character. If you type
‘-F\t
’ at the shell, without any
quotes, the ‘\
’ gets deleted, so
awk
figures that you really want your
fields to be separated with TABs and not ‘t
’s. Use ‘-v
FS="t"
’ or ‘-F"[t]"
’ on the
command line if you really do want to separate your fields with
‘t
’s. Use ‘-F '\t'
’ when not in compatibility mode to
specify that TABs separate fields.
As an example, let’s use an awk
program file called edu.awk
that
contains the pattern /edu/
and
the action ‘print $1
’:
/edu/ { print $1 }
Let’s also set FS
to be the
‘-
’ character and run the program on
the file mail-list
. The following a
university, and the first three digits of their phone numbers:
$ awk -F- -f edu.awk mail-list
Fabius 555
Samuel 555
Jean
Note the third line of output. The third line in the original file looked like this:
Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R
The ‘-
’ as part of the person’s
name was used as the field separator, instead of the ‘-
’ in the phone number that was originally
intended. This demonstrates why you have to be careful in choosing your
field and record separators.
Perhaps the most common use of a single character as the field
separator occurs when processing the Unix system password file. On many
Unix systems, each user has a separate entry in the system password
file, with one line per user. The information in these lines is
separated by colons. The first field is the user’s login name and the
second is the user’s encrypted or shadow password. (A shadow password is
indicated by the presence of a single ‘x
’ in the second field.) A password file entry
might look like this:
arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash
The following program searches the system password file and prints the entries for users whose full name is not indicated:
awk -F: '$5 == ""' /etc/passwd
Making the Full Line Be a Single Field
Occasionally, it’s useful to treat the whole input line as a
single field. This can be done easily and portably simply by setting
FS
to "\n"
(a newline):[20]
awk -F'\n' 'program' files …
When you do this, $1
is the
same as $0
.
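A quick sketch of this behavior:

```shell
# With FS set to a newline, the whole line is the one and only field
echo 'one two three' | awk -F'\n' '{ print NF; print ($1 == $0) }'
```

This prints ‘1’ and then ‘1’ (true): there is a single field, and it compares equal to the whole record.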
Field-Splitting Summary
It is important to remember that when you assign a string constant
as the value of FS
, it undergoes
normal awk
string processing. For
example, with Unix awk
and gawk
, the assignment ‘FS = "\.."
’ assigns the character string
".."
to FS
(the backslash is stripped). This creates a regexp meaning “fields are separated by
occurrences of any two characters.” If instead you want fields to be
separated by a literal period followed by any single character, use
‘FS = "\\.."
’.
The following list summarizes how fields are split, based on the value of FS
(‘==
’
means “is equal to”):
FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any other single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences. The character can even be a regexp metacharacter; it does not need to be escaped.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
FS == ""
Each individual character in the record becomes a separate field. (This is a common extension; it is not specified by the POSIX standard.)
Reading Fixed-Width Data
This section discusses an advanced feature of gawk
. If you are a novice awk
user, you might want to skip it on the first
reading.
gawk
provides a facility for
dealing with fixed-width fields with no distinctive field separator. For
example, data of this nature arises in the input for old Fortran programs
where numbers are run together, or in the output of programs that did not
anticipate the use of their output as input for other programs.
An example of the latter is a table where all the columns are lined
up by the use of a variable number of spaces and empty fields
are just spaces. Clearly, awk
’s normal field splitting based on FS
does not work well in this case. Although a
portable awk
program can use a series
of substr()
calls on $0
(see String-Manipulation Functions),
this is awkward and inefficient for a large number of fields.
The splitting of an input record into fixed-width fields is
specified by assigning a string containing space-separated numbers to the
built-in variable FIELDWIDTHS
. Each number specifies the width of the field,
including columns between fields. If you want to
ignore the columns between fields, you can specify the width as a separate
field that is subsequently ignored.
It is a fatal error to supply a field width that is not a positive number.
The following data is the output of the Unix w
utility. It is useful to illustrate the use of
FIELDWIDTHS
:
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login  idle   JCPU  PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
The following program takes this input, converts the idle time to number of seconds, and prints out the first two fields and the calculated idle time:
BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
NR > 2 {
    idle = $4
    sub(/^ +/, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) {
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
}
Note
The preceding program uses a number of awk
features that haven’t been introduced
yet.
Running the program on the data produces the following results:
hzuo      ttyV0  0
hzang     ttyV3  50
eklye     ttyV5  0
dportein  ttyV6  107
gierd     ttyD3  1
dave      ttyD4  0
brent     ttyp0  286
dave      ttyq4  1296000
Another (possibly more practical) example of fixed-width input data
is the input from a deck of balloting cards. In some parts of the United
States, voters mark their choices by punching holes in computer cards.
These cards are then processed to count the votes for any particular
candidate or on any particular issue. Because a voter may choose not to
vote on some issue, any column on the card may be empty. An awk
program for processing such data could use
the FIELDWIDTHS
feature to simplify
reading the data. (Of course, getting gawk
to run on a system with card readers is
another story!)
Assigning a value to FS
causes
gawk
to use FS
for field splitting again. Use ‘FS = FS
’ to make this happen, without having to
know the current value of FS
. In order
to tell which kind of field splitting is in effect, use PROCINFO["FS"]
(see Built-in Variables That Convey Information). The value is "FS"
if regular field splitting is being used, or is "FIELDWIDTHS"
if fixed-width field splitting is
being used:
if (PROCINFO["FS"] == "FS")
    regular field splitting …
else if (PROCINFO["FS"] == "FIELDWIDTHS")
    fixed-width field splitting …
else
    content-based field splitting …    (see next section)
This information is useful when writing a function that needs to
temporarily change FS
or FIELDWIDTHS
, read some records, and then restore
the original settings (see Reading the User Database for an
example of such a function).
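As a concrete illustration, here is a minimal sketch of such a function (the helper name, the colon separator, and the temporary file are hypothetical; any POSIX awk should do): it saves FS, switches it in order to read one record from another file, and restores the caller's setting before returning.

```shell
tmpfile=$(mktemp)
printf 'root:x:0:0\n' > "$tmpfile"
out=$(awk -v f="$tmpfile" '
function first_field(file,    save_fs, result) {
    save_fs = FS               # remember the current separator
    FS = ":"                   # takes effect on the next record read
    result = ""
    if ((getline < file) > 0)  # reads and splits a record from file
        result = $1
    close(file)
    FS = save_fs               # restore the original setting
    return result
}
BEGIN { print first_field(f) }')
rm -f "$tmpfile"
echo "$out"
```

Because the record read with ‘getline < file’ is split using the value of FS in effect at the time of the read, the temporary assignment is all that is needed.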
Defining Fields by Content
This section discusses an advanced feature of gawk
. If you are a novice awk
user, you might want to skip it on the first
reading.
Normally, when using FS
, gawk
defines the fields as the parts of the
record that occur in between each field separator. In other words,
FS
defines what a field is
not, instead of what a field is. However,
there are times when you really want to define the fields by what they
are, and not by what they are not.
The most notorious such case is so-called comma-separated values (CSV) data. Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.[22] So, we might have data like this:
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
The FPAT
variable offers a
solution for cases like this. The value of FPAT
should
be a string that provides a regular expression. This regular expression describes the contents of each
field.
In the case of CSV data as presented here, each field is either
“anything that is not a comma,” or “a double quote, anything that is not a
double quote, and a closing double quote.” If written as a regular
expression constant (see Chapter 3), we would have
/([^,]+)|("[^"]+")/
. Writing this as a
string requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
Putting this to use, here is a simple program to parse the data:
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
When run, we get the following:
$ gawk -f simple-csv.awk addresses.csv
NF = 7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>
Note the embedded comma in the value of $3
.
A straightforward improvement when processing CSV data of this sort would be to remove the quotes when they occur, with something like this:
if (substr($i, 1, 1) == "\"") {
    len = length($i)
    $i = substr($i, 2, len - 2)   # Get text within the two quotes
}
As with FS
, the IGNORECASE
variable (see Built-in Variables That Control awk) affects field splitting with FPAT
.
Assigning a value to FPAT
overrides field splitting with FS
and
with FIELDWIDTHS
. Similar to FIELDWIDTHS
, the value of PROCINFO["FS"]
will be "FPAT"
if content-based field splitting is being
used.
Note
Some programs export CSV data that contains embedded newlines
between the double quotes. gawk
provides no way to deal with this. Even though a formal specification
for CSV data exists, there isn’t much more to be done; the FPAT
mechanism provides an elegant solution
for the majority of cases, and the gawk
developers are satisfied with
that.
As written, the regexp used for FPAT
requires that each field contain at least
one character. A straightforward modification (changing the first
‘+
’ to ‘*
’) allows fields to be empty:
FPAT = "([^,]*)|(\"[^\"]+\")"
Finally, the patsplit()
function
makes the same functionality available for splitting regular strings (see
String-Manipulation Functions).
To recap, gawk
provides three
independent methods to split input records into fields. The mechanism used
is based on which of the three variables—FS
, FIELDWIDTHS
, or FPAT
—was last assigned to.
Multiple-Line Records
In some databases, a single line cannot conveniently hold all the information in one entry. In such cases, you can use multiline records. The first step in doing this is to choose your data format.
One technique is to use an unusual character or string to separate
records. For example, you could use the formfeed character (written
‘\f
’ in awk
, as in C) to separate them, making each
record a page of the file. To do this, just set the variable RS
to "\f"
(a
string containing the formfeed character). Any other character could
equally well be used, as long as it won’t be part of the data in a
record.
Another technique is to have blank lines separate records. By a
special dispensation, an empty string as the value of RS
indicates that records are separated by one
or more blank lines. When RS
is set to the
empty string, each record always ends at the first blank line encountered.
The next record doesn’t start until the first nonblank line that follows.
No matter how many blank lines appear in a row, they all act as one record
separator. (Blank lines must be completely empty; lines that contain only
whitespace do not count.)
You can achieve the same effect as ‘RS =
""
’ by assigning the string "\n\n+"
to RS
. This regexp matches the newline at the end
of the record and one or more blank lines after the record. In addition, a
regular expression always matches the longest possible sequence when there
is a choice (see How Much Text Matches?). So, the next record
doesn’t start until the first nonblank line that follows—no matter how
many blank lines appear in a row, they are considered one record
separator.
However, there is an important difference between ‘RS = ""
’ and ‘RS =
"\n\n+"
’. In the first case, leading newlines in the input
datafile are ignored, and if a file ends without extra blank lines after
the last record, the final newline is removed from the record. In the
second case, this special processing is not done. (d.c.)
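The first behavior is easy to see with a quick experiment (standard awk; the sample data is made up): with ‘RS = ""’, the leading blank lines are ignored and the run of blank lines in the middle acts as a single record separator.

```shell
# Two records: "alpha one\nbeta" and "gamma two"; with RS = "" the
# newline also acts as a field separator, so $1 is the first word.
out=$(printf '\n\nalpha one\nbeta\n\n\ngamma two\n' |
      awk 'BEGIN { RS = "" } { print NR ": " $1 }')
echo "$out"
```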
Now that the input is separated into records, the second step is to
separate the fields in the records. One way to do this is to divide each
of the lines into fields in the normal manner. This happens by default as
the result of a special feature. When RS
is set to the empty string
and FS
is set to a
single character, the newline character always acts
as a field separator. This is in addition to whatever field separations
result from FS
.[23]
The original motivation for this special exception was probably to
provide useful behavior in the default case (i.e., FS is equal to " "). This feature can be a problem if you really
don’t want the newline character to separate fields, because there is no
way to prevent it. However, you can work around this by using the split()
function to break up the record manually
(see String-Manipulation Functions). If you have a single-character
field separator, you can work around the special feature in a different
way, by making FS
into a regexp for
that single character. For example, if the field separator is a percent
character, instead of ‘FS = "%"
’, use
‘FS = "[%]"
’.
Another way to separate fields is to put each field on a separate
line: to do this, just set the variable FS
to the string "\n"
. (This single-character separator matches a single newline.)
A practical example of a datafile organized this way might be a mailing
list, where blank lines separate the entries. Consider a mailing list in a
file named addresses
, which
looks like this:
Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
…
A simple program to process this file is as follows:
# addrs.awk --- simple mailing list program
# Records are separated by blank lines.
# Each line is one field.

BEGIN { RS = "" ; FS = "\n" }

{
    print "Name is:", $1
    print "Address is:", $2
    print "City and State are:", $3
    print ""
}
Running the program produces the following output:
$ awk -f addrs.awk addresses
Name is: Jane Doe
Address is: 123 Main Street
City and State are: Anywhere, SE 12345-6789
Name is: John Smith
Address is: 456 Tree-lined Avenue
City and State are: Smallville, MW 98765-4321
…
See Printing Mailing Labels for a more realistic program
dealing with address lists. The following list summarizes how records are split, based
on the value of RS
:
RS == "\n"
    Records are separated by the newline character (‘\n’). In effect,
    every line in the datafile is a separate record, including blank
    lines. This is the default.

RS == any single character
    Records are separated by each occurrence of the character. Multiple
    successive occurrences delimit empty records.

RS == ""
    Records are separated by runs of blank lines. When FS is a single
    character, then the newline character always serves as a field
    separator, in addition to whatever value FS may have. Leading and
    trailing newlines in a file are ignored.

RS == regexp
    Records are separated by occurrences of characters that match
    regexp. Leading and trailing matches of regexp delimit empty
    records. (This is a gawk extension; it is not specified by the
    POSIX standard.)
If not in compatibility mode (see Command-Line Options),
gawk
sets RT
to the input text that matched the value specified by RS
. But if the input file ended without any text
that matches RS
, then gawk
sets RT
to the null string.
Explicit Input with getline
So far we have been getting our input data from awk
’s main input stream—either the standard
input (usually your keyboard, sometimes the output from another program)
or the files specified on the command line. The awk
language has a special built-in command
called getline
that can be used to read
input under your explicit control.
The getline
command is used in
several different ways and should not be used by
beginners. The examples that follow the explanation of the getline
command include material that has not
been covered yet. Therefore, come back and study the getline
command after you
have reviewed the rest of Parts I and II and have a good knowledge of how
awk
works.
The getline
command returns 1 if
it finds a record and 0 if it encounters the end of the file. If there is
some error in getting a record, such as a file that cannot be opened, then
getline
returns −1. In this case,
gawk
sets the variable ERRNO
to a string describing the error that
occurred.
In the following examples, command
stands
for a string value that represents a shell
command.
Note
When --sandbox
is specified (see Command-Line Options), reading lines from files, pipes, and coprocesses
is disabled.
Using getline with No Arguments
The getline
command can be used
without arguments to read input from the current input file. All it does in this case is read the next input record
and split it up into fields. This is useful if you’ve finished
processing the current record, but want to do some special processing on
the next record right now. For example:
# Remove text between /* and */, inclusive
{
    if ((i = index($0, "/*")) != 0) {
        out = substr($0, 1, i - 1)    # leading part of the string
        rest = substr($0, i + 2)      # ... */ ...
        j = index(rest, "*/")         # is */ in trailing part?
        if (j > 0) {
            rest = substr(rest, j + 2)   # remove comment
        } else {
            while (j == 0) {
                # get more text
                if (getline <= 0) {
                    print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                    exit
                }
                # build up the line using string concatenation
                rest = rest $0
                j = index(rest, "*/")    # is */ in trailing part?
                if (j != 0) {
                    rest = substr(rest, j + 2)
                    break
                }
            }
        }
        # build up the output line using string concatenation
        $0 = out rest
    }
    print $0
}
This awk
program deletes
C-style comments (‘/* … */
’) from the
input. It uses a number of features we haven’t covered yet,
including string concatenation (see String Concatenation)
and the index()
and substr()
built-in functions (see String-Manipulation Functions). By replacing the ‘print
$0
’ with other statements,
you could perform more complicated processing on the
decommented input, such as searching for matches of a regular
expression. (This program has a subtle problem—it does not work if one
comment ends and another begins on the same line.)
This form of the getline
command sets NF
, NR
, FNR
,
RT
, and the value of $0
.
Note
The new value of $0
is used
to test the patterns of any subsequent rules. The original value of
$0
that triggered the rule that
executed getline
is lost. By
contrast, the next
statement reads
a new record but immediately begins processing it normally, starting
with the first rule in the program. See The next Statement.
Using getline into a Variable
You can use ‘getline var’ to read the next record from awk’s input into the variable var. No other processing is done. For example, suppose the next
line is a comment or a special string, and you want to read it without
triggering any rules. This form of getline
allows you to read that line and store
it in a variable so that the main read-a-line-and-check-each-rule loop
of awk
never sees it. The following
example swaps every two lines of input:
{
    if ((getline tmp) > 0) {
        print tmp
        print $0
    } else
        print $0
}
It takes the following list:
wan
tew
free
phore
and produces these results:
tew
wan
phore
free
The getline
command used in
this way sets only the variables NR
,
FNR
, and RT
(and, of course,
var
). The record is not split into fields, so
the values of the fields (including $0
) and the value of NF
do not change.
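A small demonstration of this (POSIX awk; the sample data is made up): after ‘getline saved’, NF still describes the first record, while the second record sits in the variable.

```shell
# NF stays 2 (from "first line") even though getline has already
# consumed "second line here" into saved.
out=$(printf 'first line\nsecond line here\n' |
      awk 'NR == 1 { getline saved; print NF, saved }')
echo "$out"
```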
Using getline from a File
Use ‘getline < file’ to read the next record from the file file. Here, file
is a string-valued expression that specifies the filename. ‘<
’ is called a
redirection because it directs input to come from
a different place. For example, the following program reads its input
record from the file secondary.input
when it encounters a first
field with a value equal to 10 in the current input file:
{
    if ($1 == 10) {
        getline < "secondary.input"
        print
    } else
        print
}
Because the main input stream is not used, the values of NR
and FNR
are not changed. However, the record it reads is split into fields in
the normal manner, so the values of $0
and the other fields are changed, resulting
in a new value of NF
. RT
is also set.
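A minimal sketch of these effects (POSIX awk; the temporary filename is only for illustration): $0 and NF come from the side file, while NR still counts records from the main input.

```shell
side=$(mktemp)
printf 'x y z\n' > "$side"
# Main input has one record ("main"); the getline replaces $0 with
# "x y z" but leaves NR at 1.
out=$(printf 'main\n' | awk -v f="$side" '{ getline < f; print NR, NF, $2 }')
rm -f "$side"
echo "$out"
```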
According to POSIX, ‘getline < expression’ is ambiguous if expression
contains unparenthesized operators
other than ‘$
’; for example,
‘getline < dir "/" file
’ is
ambiguous because the concatenation operator (not discussed yet; see
String Concatenation) is not parenthesized. You should write
it as ‘getline < (dir "/" file)
’
if you want your program to be portable to all awk
implementations.
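A runnable sketch of the portable, parenthesized form (the directory and filename here are illustrative):

```shell
dir=$(mktemp -d)
printf 'hello\n' > "$dir/data.txt"
out=$(awk -v dir="$dir" 'BEGIN {
    file = "data.txt"
    # Parenthesizing the concatenation avoids the POSIX ambiguity.
    if ((getline line < (dir "/" file)) > 0)
        print line
}')
rm -rf "$dir"
echo "$out"
```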
Using getline into a Variable from a File
Use ‘getline var < file’ to read input from the file file, and put it in the variable var. As earlier, file is a
string-valued expression that specifies the file from which to read.
In this version of getline
,
none of the predefined variables are changed and the record is not split
into fields. The only variable changed is
var
.[24] For example, the following program copies all the input
files to the output, except for records that say ‘@include
’. Such a
record is replaced by the contents of the file
filename:
{
    if (NF == 2 && $1 == "@include") {
        while ((getline line < $2) > 0)
            print line
        close($2)
    } else
        print
}
Note here how the name of the extra input file is not built into
the program; it is taken directly from the data, specifically from the
second field on the @include
line.
The close()
function is
called to ensure that if two identical @include
lines appear in the input, the entire
specified file is included twice. See Closing Input and Output Redirections.
One deficiency of this program is that it does not process nested
@include
statements (i.e., @include
statements in included files) the way
a true macro preprocessor would. See An Easy Way to Use Library Functions
for a program that does handle nested @include
statements.
Using getline from a Pipe
Omniscience has much to recommend it. Failing that, attention to details would be useful.
—Brian Kernighan
The output of a command can also be piped into getline
, using ‘command | getline’. In this case, the string command
is run as a
shell command and its output is piped into awk
to be used as input. This form of getline
reads one record at a time from the
pipe. For example, the following program copies its input to its output,
except for lines that begin with ‘@execute
’, which are replaced by the output
produced by running the rest of the line as a shell command:
{
    if ($1 == "@execute") {
        tmp = substr($0, 10)        # Remove "@execute"
        while ((tmp | getline) > 0)
            print
        close(tmp)
    } else
        print
}
The close()
function is called
to ensure that if two identical ‘@execute
’ lines appear in the input, the command is run for each one.
Given the input:
foo
bar
baz
@execute who
bletch
the program might produce:
foo
bar
baz
arnold     ttyv0   Jul 13 14:22
miriam     ttyp0   Jul 13 14:23     (murphy:0)
bill       ttyp1   Jul 13 14:23     (murphy:0)
bletch
Notice that this program ran the command who
and printed the result. (If you try this
program yourself, you will of course get different results, depending
upon who is logged in on your system.)
This variation of getline
splits the record into fields, sets the value of NF
, and recomputes the value of $0
. The values of NR
and FNR
are not changed. RT
is set.
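A minimal demonstration (POSIX awk): the fields come from the command's output, not from the main input stream.

```shell
# Reads "hello world" from the pipe; NF and $1 reflect that record.
out=$(awk 'BEGIN {
    cmd = "echo hello world"
    while ((cmd | getline) > 0)
        print NF, $1
    close(cmd)
}')
echo "$out"
```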
According to POSIX, ‘expression | getline’ is ambiguous if expression
contains unparenthesized operators other than ‘$
’—for example, ‘"echo " "date" | getline
’ is
ambiguous because the concatenation operator is not parenthesized. You
should write it as ‘("echo "
"date") | getline
’ if you want your program to be portable to
all awk
implementations.
Note
Unfortunately, gawk
has not
been consistent in its treatment of a construct like ‘"echo " "date" | getline
’. Most
versions, including the current version, treat it as ‘("echo " "date") | getline
’.
(This is also how BWK awk
behaves.)
Some versions instead treat it as ‘"echo
" ("date" | getline)
’. (This is how mawk
behaves.) In short,
always use explicit parentheses, and then you
won’t have to worry.
Using getline into a Variable from a Pipe
When you use ‘command | getline var’, the output of command
is sent through a pipe to getline
and
into the variable var
. For example, the
following program reads the current date and time into the variable
current_time
, using the date
utility, and then prints it:
BEGIN { "date" | getline current_time close("date") print "Report printed on " current_time }
In this version of getline
,
none of the predefined variables are changed and the record is not split
into fields. However, RT
is
set.
Using getline from a Coprocess
Reading input into getline
from
a pipe is a one-way operation. The command that is started with ‘command | getline’ only sends data to your awk
program.
On occasion, you might want to send data to another program for
processing and then read the results back. gawk
allows you to start a
coprocess, with which two-way communications are
possible. This is done with the ‘|&
’ operator. Typically, you write data to
the coprocess first and then read the results back, as shown in the
following:
print "some query" |& "db_server"
"db_server" |& getline
which sends a query to db_server
and then reads the results.
The values of NR
and FNR
are not changed, because the main input
stream is not used. However, the record is split into fields in the
normal manner, thus changing the values of $0
, of the other fields, and of NF
and RT
.
Coprocesses are an advanced feature. They are discussed here only
because this is the section on getline
. See Two-Way Communications with Another Process, where coprocesses are discussed in
more detail.
Using getline into a Variable from a Coprocess
When you use ‘command |& getline var’, the output from the coprocess command
is sent through a two-way pipe to
getline
and into the variable
var
.
In this version of getline
,
none of the predefined variables are changed and the record is not split
into fields. The only variable changed is
var
. However, RT
is set.
Points to Remember About getline
Here are some miscellaneous points about getline
that you should bear in mind:
- When getline changes the value of $0 and NF, awk does not
  automatically jump to the start of the program and start testing the
  new record against every pattern. However, the new record is tested
  against any subsequent rules.

- Some very old awk implementations limit the number of pipelines that
  an awk program may have open to just one. In gawk, there is no such
  limit. You can open as many pipelines (and coprocesses) as the
  underlying operating system permits.

- An interesting side effect occurs if you use getline without a
  redirection inside a BEGIN rule. Because an unredirected getline
  reads from the command-line datafiles, the first getline command
  causes awk to set the value of FILENAME. Normally, FILENAME does not
  have a value inside BEGIN rules, because you have not yet started to
  process the command-line datafiles. (d.c.) (See The BEGIN and END
  Special Patterns; also see Built-in Variables That Convey
  Information.)

- Using FILENAME with getline (‘getline < FILENAME’) is likely to be a
  source of confusion. awk opens a separate input stream from the
  current input file. However, by not using a variable, $0 and NF are
  still updated. If you’re doing this, it’s probably by accident, and
  you should reconsider what it is you’re trying to accomplish.

- The next section presents a table summarizing the getline variants
  and which variables they can affect. It is worth noting that those
  variants that do not use redirection can cause FILENAME to be
  updated if they cause awk to start reading a new input file.

- If the variable being assigned is an expression with side effects,
  different versions of awk behave differently upon encountering
  end-of-file. Some versions don’t evaluate the expression; many
  versions (including gawk) do. Here is an example, courtesy of Duncan
  Moore:

      BEGIN {
          system("echo 1 > f")
          while ((getline a[++c] < "f") > 0) { }
          print c
      }
Here, the side effect is the ‘++c’. Is c incremented if end-of-file is encountered before the element in a is assigned? gawk treats getline like a function call, and evaluates the expression ‘a[++c]’ before attempting to read from f. However, some versions of awk only evaluate the expression once they know that there is a string value to be assigned.
Summary of getline Variants
Table 4-1 summarizes the
eight variants of getline
, listing
which predefined variables are set by each one, and
whether the variant is standard or a gawk
extension. Note: for each variant,
gawk
sets the RT
predefined variable.
Variant                  Effect                        awk / gawk
getline                  Sets $0, NF, NR, FNR, RT      awk
getline var              Sets var, NR, FNR, RT         awk
getline < file           Sets $0, NF, RT               awk
getline var < file       Sets var, RT                  awk
command | getline        Sets $0, NF, RT               awk
command | getline var    Sets var, RT                  awk
command |& getline       Sets $0, NF, RT               gawk
command |& getline var   Sets var, RT                  gawk
Reading Input with a Timeout
This section describes a feature that is specific to gawk
.
You may specify a timeout in milliseconds for reading input from the
keyboard, a pipe, or two-way communication, including TCP/IP sockets. This
can be done on a per-input, per-command, or per-connection basis, by
setting a special element in the PROCINFO
array
(see Built-in Variables That Convey Information):
PROCINFO["input_name", "READ_TIMEOUT"] = timeout in milliseconds
When set, this causes gawk
to
time out and return failure if no data is available to read within the
specified timeout period. For example, a TCP client can decide to give up
on receiving any response from the server after a certain amount of
time:
Service = "/inet/tcp/0/localhost/daytime"
PROCINFO[Service, "READ_TIMEOUT"] = 100
if ((Service |& getline) > 0)
    print $0
else if (ERRNO != "")
    print ERRNO
Here is how to read interactively from the user[25] without waiting for more than five seconds:
PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000
while ((getline < "/dev/stdin") > 0)
    print $0
gawk
terminates the read
operation if input does not arrive after waiting for the timeout period,
returns failure, and sets ERRNO
to an
appropriate string value. A negative or zero value for the timeout is the
same as specifying no timeout at all.
A timeout can also be set for reading from the keyboard in the implicit loop that reads input records and matches them against patterns, like so:
$ gawk 'BEGIN { PROCINFO["-", "READ_TIMEOUT"] = 5000 }
> { print "You entered: " $0 }'
gawk
You entered: gawk
In this case, failure to respond within five seconds results in the following error message:
error→ gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': Connection timed out
The timeout can be set or changed at any time, and will take effect on the next attempt to read from the input device. In the following example, we start with a timeout value of one second, and progressively reduce it by one-tenth of a second until we wait indefinitely for the input to arrive:
PROCINFO[Service, "READ_TIMEOUT"] = 1000
while ((Service |& getline) > 0) {
    print $0
    PROCINFO[Service, "READ_TIMEOUT"] -= 100
}
Note
You should not assume that the read operation will block exactly
after the tenth record has been printed. It is possible that gawk
will read and buffer more than one
record’s worth of data the first time. Because of this, changing the
value of timeout like the preceding example is not very useful.
If the PROCINFO
element is not
present and the GAWK_READ_TIMEOUT
environment variable
exists, gawk
uses its value to
initialize the timeout value.
The exclusive use of the environment variable to specify
timeout has the disadvantage of not being able to control it on a
per-command or per-connection basis.
gawk
considers a timeout event to
be an error even though the attempt to read from the underlying device may
succeed in a later attempt. This is a limitation, and it also means that
you cannot use this to multiplex input from two or more sources.
Assigning a timeout value prevents read operations from blocking
indefinitely. But bear in mind that there are other ways gawk
can stall waiting for an input device to be
ready. A network client can sometimes take a long time to establish a
connection before it can start reading any data, or the attempt to open a
FIFO special file for reading can block indefinitely until some other
process opens it for writing.
Directories on the Command Line
According to the POSIX standard, files named on the awk
command line must be text files; it is a
fatal error if they are not. Most versions of awk
treat a directory on the command line as a fatal error.
By default, gawk
produces a
warning for a directory on the command line, but otherwise ignores it.
This makes it easier to use shell wildcards with your awk
program:
$ gawk -f whizprog.awk *
Directories could kill this program
If either of the --posix
or
--traditional
options is given, then gawk
reverts to treating a directory on the
command line as a fatal error.
See Reading Directories for a way to treat
directories as usable data from an awk
program.
Summary
- Input is split into records based on the value of RS. The
  possibilities are as follows:

      Value of RS               Records are split on …          awk / gawk
      Any single character      That character                  awk
      The empty string ("")     Runs of two or more newlines    awk
      A regexp                  Text that matches the regexp    gawk

- FNR indicates how many records have been read from the current input
  file; NR indicates how many records have been read in total.

- gawk sets RT to the text matched by RS.

- After splitting the input into records, awk further splits the
  records into individual fields, named $1, $2, and so on. $0 is the
  whole record, and NF indicates how many fields there are. The
  default way to split fields is between whitespace characters.

- Fields may be referenced using a variable, as in $NF. Fields may
  also be assigned values, which causes the value of $0 to be
  recomputed when it is later referenced. Assigning to a field with a
  number greater than NF creates the field and rebuilds the record,
  using OFS to separate the fields. Incrementing NF does the same
  thing. Decrementing NF throws away fields and rebuilds the record.

- Field splitting is more complicated than record splitting:

      Field separator value            Fields are split …              awk / gawk
      FS == " "                        On runs of whitespace           awk
      FS == any single character       On that character               awk
      FS == regexp                     On text matching the regexp     awk
      FS == ""                         Such that each individual
                                       character is a separate field   gawk
      FIELDWIDTHS == list of columns   Based on character position     gawk
      FPAT == regexp                   On the text surrounding text
                                       matching the regexp             gawk

- Using ‘FS = "\n"’ causes the entire record to be a single field
  (assuming that newlines separate records).

- FS may be set from the command line using the -F option. This can
  also be done using command-line variable assignment.

- Use PROCINFO["FS"] to see how fields are being split.

- Use getline in its various forms to read additional records, from
  the default input stream, from a file, or from a pipe or coprocess.

- Use PROCINFO[file, "READ_TIMEOUT"] to cause reads to time out for
  file.

- Directories on the command line are fatal for standard awk; gawk
  ignores them if not in POSIX mode.
[17] At least that we know about.
[18] In POSIX awk
, newlines are
not considered whitespace for separating fields.
[19] A binary operator, such as ‘*
’ for multiplication, is one that takes two
operands. The distinction is required because awk
also has unary (one-operand) and ternary
(three-operand) operators.
[20] Thanks to Andrew Schorr for this tip.
[21] The sed
utility is a
“stream editor.” Its behavior is also defined by the POSIX
standard.
[22] The CSV format lacked a formal standard definition for many years. RFC 4180 standardizes the most common practices.
[23] When FS
is the null string
(""
) or a regexp, this special
feature of RS
does not apply. It
does apply to the default field separator of a single space: ‘FS = " "
’.
[24] This is not quite true. RT
could be changed if RS
is a
regular expression.
[25] This assumes that standard input is the keyboard.