Effective awk Programming, 4th Edition

Chapter 4. Reading Input Files

In the typical awk program, awk reads all input either from the standard input (by default, this is the keyboard, but often it is a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, processing all the data from one before going on to the next. The name of the current input file can be found in the predefined variable FILENAME (see Predefined Variables).

The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.

On rare occasions, you may need to use the getline command. The getline command is valuable both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the awk command line (see Explicit Input with getline).

How Input Is Split into Records

awk divides the input for your program into records and fields. It keeps track of the number of records that have been read so far from the current input file. This value is stored in a predefined variable called FNR, which is reset to zero every time a new file is started. Another predefined variable, NR, records the total number of input records read so far from all datafiles. It starts at zero, but is never automatically reset to zero.
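
The difference between FNR and NR is easiest to see with two input files. The following is a minimal sketch, not part of the original examples; the scratch directory and the filenames first.txt and second.txt are hypothetical and exist only for this demonstration:

```shell
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'a\nb\n'    > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For every record, print FNR (per-file count), NR (running total),
# and the record itself.
counts=$(awk '{ print FNR, NR, $0 }' "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$counts"
# 1 1 a
# 2 2 b
# 1 3 c
# 2 4 d
# 3 5 e

rm -rf "$tmpdir"
```

FNR starts over at one for the first record of second.txt, while NR keeps counting.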

Record Splitting with Standard awk

Records are separated by a character called the record separator. By default, the record separator is the newline character. This is why records are, by default, single lines. To use a different character for the record separator, simply assign that character to the predefined variable RS.

Like any other variable, the value of RS can be changed in the awk program with the assignment operator, ‘=’ (see Assignment Expressions). The new record-separator character should be enclosed in quotation marks, which indicate a string constant. Often, the right time to do this is at the beginning of execution, before any input is processed, so that the very first record is read with the proper separator. To do this, use the special BEGIN pattern (see The BEGIN and END Special Patterns). For example:

awk 'BEGIN { RS = "u" }
     { print $0 }' mail-list

changes the value of RS to ‘u’, before reading any input. The new value is a string whose first character is the letter “u”; as a result, records are separated by the letter “u”. Then the input file is read, and the second rule in the awk program (the action with no pattern) prints each record. Because each print statement adds a newline at the end of its output, this awk program copies the input with each ‘u’ changed to a newline. Here are the results of running the program on mail-list:

$ awk 'BEGIN { RS = "u" }
>      { print $0 }' mail-list
Amelia       555-5553     amelia.zodiac
sq
e@gmail.com    F
Anthony      555-3412     anthony.assert
ro@hotmail.com   A
Becky        555-7685     becky.algebrar
m@gmail.com      A
Bill         555-1675     bill.drowning@hotmail.com       A
Broderick    555-0542     broderick.aliq
otiens@yahoo.com R
Camilla      555-2912     camilla.inf
sar
m@skynet.be     R
Fabi
s       555-1234     fabi
s.
ndevicesim
s@
cb.ed
    F
J
lie        555-6699     j
lie.perscr
tabor@skeeve.com   F
Martin       555-6480     martin.codicib
s@hotmail.com    A
Sam
el       555-3430     sam
el.lanceolis@sh
.ed
        A
Jean-Pa
l    555-2127     jeanpa
l.campanor
m@ny
.ed
     R

Note that the entry for the name ‘Bill’ is not split. In the original datafile (see Datafiles for the Examples), the line looks like this:

Bill         555-1675     bill.drowning@hotmail.com       A

It contains no ‘u’, so there is no reason to split the record, unlike the others, which each have one or more occurrences of the ‘u’. In fact, this record is treated as part of the previous record; the newline separating them in the output is the original newline in the datafile, not the one added by awk when it printed the record!

Another way to change the record separator is on the command line, using the variable-assignment feature (see Other Command-Line Arguments):

awk '{ print $0 }' RS="u" mail-list

This sets RS to ‘u’ before processing mail-list.

Using an alphabetic character such as ‘u’ for the record separator is highly likely to produce strange results. Using an unusual character such as ‘/’ is more likely to produce correct behavior in the majority of cases, but there are no guarantees. The moral is: Know Your Data.

When using regular characters as the record separator, there is one unusual case that occurs when gawk is being fully POSIX-compliant (see Command-Line Options). Then, the following (extreme) pipeline prints a surprising ‘1’:

$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
1

There is one field, consisting of a newline. The value of the built-in variable NF is the number of fields in the current record. (In the normal case, gawk treats the newline as whitespace, printing ‘0’ as the result. Most other versions of awk also act this way.)

Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in RS. (d.c.)
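
As a small illustration of this point (a sketch using made-up input, not one of the book's datafiles), the input below contains no trailing ‘/’ and no trailing newline, yet the final piece is still delivered as a record:

```shell
# RS is "/", so "a" and "b" end at a slash; "c" ends only because
# the input itself ends.
slash_records=$(printf 'a/b/c' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$slash_records"
# 1: a
# 2: b
# 3: c
```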

The empty string "" (a string without any characters) has a special meaning as the value of RS. It means that records are separated by one or more blank lines and nothing else. See Multiple-Line Records for more details.
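
A brief sketch of this behavior, using made-up input in which two blank lines separate two groups of lines:

```shell
# With RS = "", each run of blank lines ends a record, so the input
# below yields two records. Newlines inside a record act as field
# separators, which is why the first record has two fields.
para_records=$(printf 'one\ntwo\n\n\nthree\n' |
    awk 'BEGIN { RS = "" } { print NR, NF, $1 }')
printf '%s\n' "$para_records"
# 1 2 one
# 2 1 three
```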

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed, as well as records already processed, are not affected.

After the end of the record has been determined, gawk sets the variable RT to the text in the input that matched RS.

Record Splitting with gawk

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Chapter 3). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string. This general rule is actually at work in the usual case, where RS contains just a newline: a record ends at the beginning of the next matching string (the next newline in the input), and the following record starts just after the end of this string (at the first character of the following line). The newline, because it matches RS, is not part of either record.

When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

If the input file ends without any text matching RS, gawk sets RT to the null string.

The following example illustrates both of these features. It sets RS equal to a regular expression that matches either a newline or a series of one or more uppercase letters with optional leading and/or trailing whitespace:

$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0,"and RT = [" RT "]" }'
Record = record 1 and RT = [ AAAA ]
Record = record 2 and RT = [ BBBB ]
Record = record 3 and RT = [
]

The square brackets delineate the contents of RT, letting you see the leading and trailing whitespace. The final value of RT is a newline. See A Simple Stream Editor for a more useful example of RS as a regexp and RT.

If you set RS to a regular expression that allows optional trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due to implementation constraints, that gawk may match the leading part of the regular expression, but not the trailing part, particularly if the input text that could match the trailing part is fairly long. gawk attempts to avoid this problem, but currently, there’s no guarantee that this will never happen.

Note

Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the beginning and end of a string, and not the beginning and end of a line. As a result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of a file. This is because gawk views the input file as one long string that happens to contain newline characters. It is thus best to avoid anchor metacharacters in the value of RS.

The use of RS as a regular expression and the RT variable are gawk extensions; they are not available in compatibility mode (see Command-Line Options). In compatibility mode, only the first character of the value of RS determines the end of the record.

There are times when you might want to treat an entire datafile as a single record. The only way to make this happen is to give RS a value that you know doesn’t occur in the input file. This is hard to do in a general way, such that a program always works for arbitrary input files.

You might think that for text files, the NUL character, which consists of a character with all bits equal to zero, is a good value to use for RS in this case:

BEGIN { RS = "\0" }  # whole file becomes one record?

gawk in fact accepts this, and uses the NUL character for the record separator. This works for certain special files, such as /proc/environ on GNU/Linux systems, where the NUL character is in fact the record separator. However, this usage is not portable to most other awk implementations.

Almost all other awk implementations^[17] store strings internally as C-style strings. C strings use the NUL character as the string terminator. In effect, this means that ‘RS = "\0"’ is the same as ‘RS = ""’. (d.c.)

It happens that recent versions of mawk can use the NUL character as a record separator. However, this is a special case: mawk does not allow embedded NUL characters in strings. (This may change in a future version of mawk.)

See Reading a Whole File at Once for an interesting way to read whole files. If you are using gawk, see Reading an Entire File for another option.

Examining Fields

When awk reads an input record, the record is automatically parsed or separated by the awk utility into chunks called fields. By default, fields are separated by whitespace, like words in a line. Whitespace in awk means any string of one or more spaces, TABs, or newlines;^[18] other characters that are considered whitespace by other languages (such as formfeed, vertical tab, etc.) are not considered whitespace by awk.

The purpose of fields is to make it more convenient for you to refer to these pieces of the record. You don’t have to use them—you can operate on the whole record if you want—but fields are what make simple awk programs so powerful.

You use a dollar sign (‘$’) to refer to a field in an awk program, followed by the number of the field you want. Thus, $1 refers to the first field, $2 to the second, and so on. (Unlike in the Unix shells, the field numbers are not limited to single digits. $127 is the 127th field in the record.) For example, suppose the following is a line of input:

This seems like a pretty nice example.

Here the first field, or $1, is ‘This’, the second field, or $2, is ‘seems’, and so on. Note that the last field, $7, is ‘example.’. Because there is no space between the ‘e’ and the ‘.’, the period is considered part of the seventh field.

NF is a predefined variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record. No matter how many fields there are, the last field in a record can be represented by $NF. So, $NF is the same as $7, which is ‘example.’. If you try to reference a field beyond the last one (such as $8 when the record has only seven fields), you get the empty string. (If used in a numeric operation, you get zero.)
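
The statements above can be checked directly with the sample line. This is only a sketch; the square brackets are there to make the empty string visible:

```shell
nf_demo=$(echo 'This seems like a pretty nice example.' |
    awk '{
        print NF             # number of fields
        print $NF            # the last field
        print "[" $8 "]"     # past the last field: the empty string
        print $8 + 0         # the same field used as a number
        print NF             # merely referencing $8 did not change NF
    }')
printf '%s\n' "$nf_demo"
# 7
# example.
# []
# 0
# 7
```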

The use of $0, which looks like a reference to the “zeroth” field, is a special case: it represents the whole input record. Use it when you are not interested in specific fields. Here are some more examples:

$ awk '$1 ~ /li/ { print $0 }' mail-list
Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Julie        555-6699     julie.perscrutabor@skeeve.com   F

This example prints each record in the file mail-list whose first field contains the string ‘li’.

By contrast, the following example looks for ‘li’ in the entire record and prints the first and last fields for each matching input record:

$ awk '/li/ { print $1, $NF }' mail-list
Amelia F
Broderick R
Julie F
Samuel A

Nonconstant Field Numbers

A field number need not be a constant. Any expression in the awk language can be used after a ‘$’ to refer to a field. The value of the expression specifies the field number. If the value is a string, rather than a number, it is converted to a number. Consider this example:

awk '{ print $NR }'

Recall that NR is the number of records read so far: one in the first record, two in the second, and so on. So this example prints the first field of the first record, the second field of the second record, and so on. For the twentieth record, field number 20 is printed; most likely, the record has fewer than 20 fields, so this prints a blank line. Here is another example of using expressions as field numbers:

awk '{ print $(2*2) }' mail-list

awk evaluates the expression ‘(2*2)’ and uses its value as the number of the field to print. The ‘*’ represents multiplication, so the expression ‘2*2’ evaluates to four. The parentheses are used so that the multiplication is done before the ‘$’ operation; they are necessary whenever there is a binary operator^[19] in the field-number expression. This example, then, prints the type of relationship (the fourth field) for every line of the file mail-list. (All of the awk operators are listed, in order of decreasing precedence, in Operator Precedence (How Operators Nest).)
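
To see why the parentheses matter, compare the following two expressions on a made-up input line. This is a sketch for illustration only:

```shell
# $NF-1 is "the last field, minus one" because "$" binds more tightly
# than "-"; $(NF-1) is "the next-to-last field".
paren_demo=$(echo '10 20 30' | awk '{ print $NF-1; print $(NF-1) }')
printf '%s\n' "$paren_demo"
# 29
# 20
```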

If the field number you compute is zero, you get the entire record. Thus, ‘$(2-2)’ has the same value as $0. Negative field numbers are not allowed; trying to reference one usually terminates the program. (The POSIX standard does not define what happens when you reference a negative field number. gawk notices this and terminates your program. Other awk implementations may behave differently.)

As mentioned in Examining Fields, awk stores the current record’s number of fields in the built-in variable NF (also see Predefined Variables). Thus, the expression $NF is not a special feature—it is the direct consequence of evaluating NF and using its value as a field number.

Changing the Contents of a Field

The contents of a field, as seen by awk, can be changed within an awk program; this changes what awk perceives as the current input record. (The actual input is untouched; awk never modifies the input file.) Consider the following example and its output:

$ awk '{ nboxes = $3 ; $3 = $3 - 10
>        print nboxes, $3 }' inventory-shipped
25 15
32 22
24 14
…

The program first saves the original value of field three in the variable nboxes. The ‘-’ sign represents subtraction, so this program reassigns field three, $3, as the original value of field three minus ten: ‘$3 - 10’. (See Arithmetic Operators.) Then it prints the original and new values for field three. (Someone in the warehouse made a consistent mistake while inventorying the red boxes.)

For this to work, the text in $3 must make sense as a number; the string of characters must be converted to a number for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters that then becomes field three. See Conversion of Strings and Numbers.

When the value of a field is changed (as perceived by awk), the text of the input record is recalculated to contain the new field where the old one was. In other words, $0 changes to reflect the altered field. Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line:

$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
Jan 3 25 15 115
Feb 5 32 24 226
Mar 5 24 34 228
…

It is also possible to assign contents to fields that are out of range. For example:

$ awk '{ $6 = ($5 + $4 + $3 + $2)
>        print $6 }' inventory-shipped
168
297
301
…

We’ve just created $6, whose value is the sum of fields $2, $3, $4, and $5. The ‘+’ sign represents addition. For the file inventory-shipped, $6 represents the total number of parcels shipped for a particular month.

Creating a new field changes awk’s internal copy of the current input record, which is the value of $0. Thus, if you do ‘print $0’ after adding a field, the record printed includes the new field, with the appropriate number of field separators between it and the previously existing fields.

This recomputation affects and is affected by NF (the number of fields; see Examining Fields). For example, the value of NF is set to the number of the highest field you create. The exact format of $0 is also affected by a feature that has not been discussed yet: the output field separator, OFS, used to separate the fields (see Output Separators).

Note, however, that merely referencing an out-of-range field does not change the value of either $0 or NF. Referencing an out-of-range field only produces an empty string. For example:

if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"

should print ‘everything is normal’, because NF+1 is certain to be out of range. (See The if-else Statement for more information about awk’s if-else statements. See Variable Typing and Comparison Expressions for more information about the ‘!=’ operator.)

It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field. For example:

$ echo a b c d | awk '{ OFS = ":"; $2 = ""
>                       print $0; print NF }'
a::c:d
4

The field is still there; it just has an empty value, delimited by the two colons between ‘a’ and ‘c’. This example shows what happens if you create a new field:

$ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
>                       print $0; print NF }'
a::c:d::new
6

The intervening field, $5, is created with an empty value (indicated by the second pair of adjacent colons), and NF is updated with the value six.

Decrementing NF throws away the values of the fields after the new value of NF and recomputes $0. (d.c.) Here is an example:

$ echo a b c d e f | awk '{ print "NF =", NF;
>                           NF = 3; print $0 }'
NF = 6
a b c

Caution

Some versions of awk don’t rebuild $0 when NF is decremented.

Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current values of the fields and OFS. To do this, use the seemingly innocuous assignment:

$1 = $1   # force record to be reconstituted
print $0  # or whatever else with $0

This forces awk to rebuild the record. It does help to add a comment, as we’ve shown here.

There is a flip side to the relationship between $0 and the fields. Any assignment to $0 causes the record to be reparsed into fields using the current value of FS. This also applies to any built-in function that updates $0, such as sub() and gsub() (see String-Manipulation Functions).
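
Here is a short sketch of both cases, using made-up input. The variable name before is ours and purely illustrative:

```shell
reparse_demo=$(echo 'a-b c' | awk '{
    before = NF          # two fields so far: "a-b" and "c"
    sub(/-/, " ")        # with no third argument, sub() changes $0,
                         # so the record is split again
    print before, NF, $2
    $0 = "x y z w"       # a direct assignment to $0 is reparsed too
    print NF, $4
}')
printf '%s\n' "$reparse_demo"
# 2 3 b
# 4 w
```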

It is important to remember that $0 is the full record, exactly as it was read from the input. This includes any leading or trailing whitespace, and the exact whitespace (or other characters) that separates the fields.

It is a common error to try to change the field separators in a record simply by setting FS and OFS, and then expecting a plain ‘print’ or ‘print $0’ to print the modified record.

But this does not work, because nothing was done to change the record itself. Instead, you must force the record to be rebuilt, typically with a statement such as ‘$1 = $1’, as described earlier.
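
The following sketch, run on a made-up input line, shows the record before and after it is rebuilt:

```shell
# Setting OFS alone leaves $0 untouched; the assignment $1 = $1
# rebuilds the record with OFS between the fields.
ofs_demo=$(echo 'a b c' |
    awk 'BEGIN { OFS = "-" } { print $0; $1 = $1; print $0 }')
printf '%s\n' "$ofs_demo"
# a b c
# a-b-c
```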

Specifying How Fields Are Separated

The field separator, which is either a single character or a regular expression, controls the way awk splits an input record into fields. awk scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.

In the examples that follow, we use the bullet symbol (•) to represent spaces in the output. If the field separator is ‘oo’, then the following line:

moo goo gai pan

is split into three fields: ‘m’, ‘•g’, and ‘•gai•pan’. Note the leading spaces in the values of the second and third fields.
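
You can reproduce this split with a short loop. In this sketch, square brackets stand in for the bullet symbol so that the leading spaces are visible:

```shell
oo_demo=$(echo 'moo goo gai pan' |
    awk 'BEGIN { FS = "oo" }
         { for (i = 1; i <= NF; i++) print i ": [" $i "]" }')
printf '%s\n' "$oo_demo"
# 1: [m]
# 2: [ g]
# 3: [ gai pan]
```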

The field separator is represented by the predefined variable FS. Shell programmers take note: awk does not use the name IFS that is used by the POSIX-compliant shells (such as the Unix Bourne shell, sh, or Bash).

The value of FS can be changed in the awk program with the assignment operator, ‘=’ (see Assignment Expressions). Often, the right time to do this is at the beginning of execution before any input has been processed, so that the very first record is read with the proper separator. To do this, use the special BEGIN pattern (see The BEGIN and END Special Patterns). For example, here we set the value of FS to the string ",":

awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line:

John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this awk program extracts and prints the string ‘•29•Oak•St.’.

Sometimes the input data contains separator characters that don’t separate fields the way you thought they would. For instance, the person’s name in the example we just used might have a title or suffix attached, such as:

John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

The same program would extract ‘•LXIX’ instead of ‘•29•Oak•St.’. If you were expecting the program to print the address, you would be surprised. The moral is to choose your data layout and separator characters carefully to prevent such problems. (If the data is not in a form that is easy to process, perhaps you can massage it first with a separate awk program.)

Whitespace Normally Separates Fields

Fields are normally separated by whitespace sequences (spaces, TABs, and newlines), not by single spaces. Two spaces in a row do not delimit an empty field. The default value of the field separator FS is a string containing a single space, " ". If awk interpreted this value in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. The reason this does not happen is that a single space as the value of FS is a special case—it is taken to specify the default manner of delimiting fields.

If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character that does not follow these rules.
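
As a sketch of these rules, the made-up line below begins and ends with a comma and has two commas in a row in the middle. The square brackets make the empty fields visible:

```shell
comma_demo=$(echo ',a,,b,' |
    awk -F, '{ print NF
               for (i = 1; i <= NF; i++) print i ": [" $i "]" }')
printf '%s\n' "$comma_demo"
# 5
# 1: []
# 2: [a]
# 3: []
# 4: [b]
# 5: []
```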

Using Regular Expressions to Separate Fields

The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:

FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.

For a less trivial example of a regular expression, try using single spaces to separate fields the way single commas are used. FS can be set to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see Chapter 3).
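
A quick comparison on a made-up line containing two adjacent spaces shows the difference. This is only a sketch; the shell variable names are ours:

```shell
# Default FS: the run of two spaces is a single separator.
default_nf=$(echo 'a  b' | awk '{ print NF }')
# FS = "[ ]": each space separates, leaving an empty field in between.
single_nf=$(echo 'a  b' | awk 'BEGIN { FS = "[ ]" } { print NF }')
echo "$default_nf $single_nf"
# 2 3
```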

There is an important difference between the two cases of ‘FS = " "’ (a single space) and ‘FS = "[ \t\n]+"’ (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. For example, the following pipeline prints ‘b’:

$ echo ' a b c d ' | awk '{ print $2 }'
b

However, this pipeline prints ‘a’ (note the extra spaces around each letter):

$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
a

In this case, the first field is null, or empty.

The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:

$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
   a b c d
a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS (which is a space by default). Because the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.

There is an additional subtlety to be aware of when using regular expressions for field splitting. It is not well specified in the POSIX standard, or anywhere else, what ‘^’ means when splitting fields. Does the ‘^’ match only at the beginning of the entire record? Or is each field separator a new string? It turns out that different awk versions answer this question differently, and you should not rely on any specific behavior in your programs. (d.c.)

As a point of information, BWK awk allows ‘^’ to match only at the beginning of the record. gawk also works this way. For example:

$ echo 'xxAA  xxBxx  C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
--><--
-->AA<--
-->xxBxx<--
-->C<--

Making Each Character a Separate Field

There are times when you may want to examine each character of a record separately. This can be done in gawk by simply assigning the null string ("") to FS. (c.e.) In this case, each individual character in the record becomes a separate field. For example:

$ echo a b | gawk 'BEGIN { FS = "" }
>                  {
>                      for (i = 1; i <= NF; i = i + 1)
>                          print "Field", i, "is", $i
>                  }'
Field 1 is a
Field 2 is
Field 3 is b

Traditionally, the behavior of FS equal to "" was not defined. In this case, most versions of Unix awk simply treat the entire record as only having one field. (d.c.) In compatibility mode (see Command-Line Options), if FS is the null string, then gawk also behaves this way.

Setting FS from the Command Line

FS can be set on the command line. Use the -F option to do so. For example:

awk -F, 'program' input-files

sets FS to the ‘,’ character. Notice that the option uses an uppercase ‘F’ instead of a lowercase ‘f’. The latter option (-f) specifies a file containing an awk program.

The value used for the argument to -F is processed in exactly the same way as assignments to the predefined variable FS. Any special characters in the field separator must be escaped appropriately. For example, to use a ‘\’ as the field separator on the command line, you would have to type:

# same as FS = "\\"
awk -F\\\\ '…' files …

Because ‘\’ is used for quoting in the shell, awk sees ‘-F\\’. Then awk processes the ‘\\’ for escape characters (see Escape Sequences), finally yielding a single ‘\’ to use for the field separator.

As a special case, in compatibility mode (see Command-Line Options), if the argument to -F is ‘t’, then FS is set to the TAB character. If you type ‘-F\t’ at the shell, without any quotes, the ‘\’ gets deleted, so awk figures that you really want your fields to be separated with TABs and not ‘t’s. Use ‘-v FS="t"’ or ‘-F"[t]"’ on the command line if you really do want to separate your fields with ‘t’s. Use ‘-F '\t'’ when not in compatibility mode to specify that TABs separate fields.
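
For instance, the following sketch splits a made-up line on TABs only, so the spaces inside each field are preserved. It assumes an awk that processes escape sequences in the argument to -F, as described above:

```shell
# The quotes keep the shell from removing the backslash; awk then
# turns "\t" into a TAB character.
tab_demo=$(printf 'a b\tc d\n' | awk -F '\t' '{ print $2 }')
printf '%s\n' "$tab_demo"
# c d
```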

As an example, let’s use an awk program file called edu.awk that contains the pattern /edu/ and the action ‘print $1’:

/edu/   { print $1 }

Let’s also set FS to be the ‘-’ character and run the program on the file mail-list. The following command prints the names of the people who work at or attend a university, and the first three digits of their phone numbers:

$ awk -F- -f edu.awk mail-list
Fabius       555
Samuel       555
Jean

Note the third line of output. The corresponding line in the original file looked like this:

Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R

The ‘-’ as part of the person’s name was used as the field separator, instead of the ‘-’ in the phone number that was originally intended. This demonstrates why you have to be careful in choosing your field and record separators.

Perhaps the most common use of a single character as the field separator occurs when processing the Unix system password file. On many Unix systems, each user has a separate entry in the system password file, with one line per user. The information in these lines is separated by colons. The first field is the user’s login name and the second is the user’s encrypted or shadow password. (A shadow password is indicated by the presence of a single ‘x’ in the second field.) A password file entry might look like this:

arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash

The following program searches the system password file and prints the entries for users whose full name is not indicated:

awk -F: '$5 == ""' /etc/passwd
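
To experiment without reading the real password file, you can feed awk the sample entry shown earlier. This sketch prints the login name, the full name, and the last field of that entry:

```shell
# Only colons separate fields here, so the space inside the full
# name stays within $5.
passwd_demo=$(echo 'arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash' |
    awk -F: '{ print $1, $5, $NF }')
printf '%s\n' "$passwd_demo"
# arnold Arnold Robbins /bin/bash
```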

Making the Full Line Be a Single Field

Occasionally, it’s useful to treat the whole input line as a single field. This can be done easily and portably simply by setting FS to "\n" (a newline):^[20]

awk -F'\n' 'program' files …

When you do this, $1 is the same as $0.
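
A small sketch on a made-up line confirms this. Because the record contains no newline, there is exactly one field, and it keeps the leading and trailing spaces that default field splitting would discard:

```shell
whole_demo=$(echo '  a b  c ' |
    awk -F '\n' '{ print NF; print "[" $1 "]" }')
printf '%s\n' "$whole_demo"
# 1
# [  a b  c ]
```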

According to the POSIX standard, awk is supposed to behave as if each record is split into fields at the time it is read. In particular, this means that if you change the value of FS after a record is read, the values of the fields (i.e., how they were split) should reflect the old value of FS, not the new one.

However, many older implementations of awk do not work this way. Instead, they defer splitting the fields until a field is actually referenced. The fields are split using the current value of FS! (d.c.) This behavior can be difficult to diagnose. The following example illustrates the difference between the two methods:

sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

which usually prints:

root

on an incorrect implementation of awk, while gawk prints the full first line of the file, something like:

root:x:0:0:Root:/:

(The sed^[21] command prints just the first line of /etc/passwd.)
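
The same effect can be seen without /etc/passwd by using two made-up lines. This sketch assumes an awk that follows the POSIX rule described above; an older implementation that defers field splitting would print ‘a’ for the first record instead:

```shell
# The first record was already split with the default FS when the
# assignment runs; only the second record is split on ":".
fs_demo=$(printf 'a:b\nc:d\n' | awk '{ FS = ":"; print $1 }')
printf '%s\n' "$fs_demo"
# a:b
# c
```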

Field-Splitting Summary

It is important to remember that when you assign a string constant as the value of FS, it undergoes normal awk string processing. For example, with Unix awk and gawk, the assignment ‘FS = "\.."’ assigns the character string ".." to FS (the backslash is stripped). This creates a regexp meaning “fields are separated by occurrences of any two characters.” If instead you want fields to be separated by a literal period followed by any single character, use ‘FS = "\\.."’.
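
The following sketch tries the doubled-backslash form on a made-up input line. The separator matches ‘.b’ and ‘.d’, leaving three fields:

```shell
# Inside the awk string constant, "\\.." becomes the regexp \..
# (a literal period followed by any one character).
dot_demo=$(echo 'a.bc.de' |
    awk 'BEGIN { FS = "\\.." } { print NF, $1, $2, $3 }')
printf '%s\n' "$dot_demo"
# 3 a c e
```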

The following list summarizes how fields are split, based on the value of FS (‘==’ means “is equal to”):

FS == " ": Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any other single character: Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences. The character can even be a regexp metacharacter; it does not need to be escaped.
FS == regexp: Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
FS == "": Each individual character in the record becomes a separate field. (This is a common extension; it is not specified by the POSIX standard.)

The IGNORECASE variable (see Built-in Variables That Control awk) affects field splitting only when the value of FS is a regexp. It has no effect when FS is a single character, even if that character is a letter. Thus, in the following code:

FS = "c"
IGNORECASE = 1
$0 = "aCa"
print $1

The output is ‘aCa’. If you really want to split fields on an alphabetic character while ignoring case, use a regexp that will do it for you (e.g., ‘FS = "[c]"’). In this case, IGNORECASE will take effect.
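
A quick way to see the difference, assuming gawk is installed under the name gawk, is to run similar assignments with the bracket-expression form of FS. In this sketch IGNORECASE is set first, which is simply one safe ordering:

```shell
# FS is a regexp here, so IGNORECASE applies and "C" separates fields.
out=$(gawk 'BEGIN {
    IGNORECASE = 1
    FS = "[c]"
    $0 = "aCa"
    print $1
}')
printf '%s\n' "$out"    # prints: a
```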

Reading Fixed-Width Data

This section discusses an advanced feature of gawk. If you are a novice awk user, you might want to skip it on the first reading.

gawk provides a facility for dealing with fixed-width fields with no distinctive field separator. For example, data of this nature arises in the input for old Fortran programs where numbers are run together, or in the output of programs that did not anticipate the use of their output as input for other programs.

An example of the latter is a table where all the columns are lined up by the use of a variable number of spaces and empty fields are just spaces. Clearly, awk’s normal field splitting based on FS does not work well in this case. Although a portable awk program can use a series of substr() calls on $0 (see String-Manipulation Functions), this is awkward and inefficient for a large number of fields.
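
For comparison, here is a minimal sketch of the portable substr() approach for two columns. The column positions and widths are assumptions chosen to match the layout built by the printf call in the sketch itself, not values taken from any particular program:

```shell
# Build one fixed-width line: user (9 columns), tty (6), login (10), idle (6).
line=$(printf '%-9s%-6s%10s%6s' hzang ttyV3 6:37pm 50)

# Cut the user and idle columns out of $0 by position, then trim spaces.
out=$(printf '%s\n' "$line" | awk '{
    user = substr($0, 1, 9)
    idle = substr($0, 26, 6)
    gsub(/ /, "", user)
    gsub(/ /, "", idle)
    print user, idle
}')
printf '%s\n' "$out"    # prints: hzang 50
```

Every additional column needs its own substr() call with a hand-computed starting position, which is why this approach becomes tedious for wide records.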

The splitting of an input record into fixed-width fields is specified by assigning a string containing space-separated numbers to the built-in variable FIELDWIDTHS. Each number specifies the width of the field, including columns between fields. If you want to ignore the columns between fields, you can specify the width as a separate field that is subsequently ignored. It is a fatal error to supply a field width that is not a positive number. The following data is the output of the Unix w utility. It is useful to illustrate the use of FIELDWIDTHS:

 10:06pm  up 21 days, 14:04,  23 users
User     tty       login  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail

The following program takes this input, converts the idle time to number of seconds, and prints out the first two fields and the calculated idle time:

BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
NR > 2 {
    idle = $4
    sub(/^ +/, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) {
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
}

Note

The preceding program uses a number of awk features that haven’t been introduced yet.

Running the program on the data produces the following results:

hzuo      ttyV0  0
hzang     ttyV3  50
eklye     ttyV5  0
dportein  ttyV6  107
gierd     ttyD3  1
dave      ttyD4  0
brent     ttyp0  286
dave      ttyq4  1296000

Another (possibly more practical) example of fixed-width input data is the input from a deck of balloting cards. In some parts of the United States, voters mark their choices by punching holes in computer cards. These cards are then processed to count the votes for any particular candidate or on any particular issue. Because a voter may choose not to vote on some issue, any column on the card may be empty. An awk program for processing such data could use the FIELDWIDTHS feature to simplify reading the data. (Of course, getting gawk to run on a system with card readers is another story!)

Assigning a value to FS causes gawk to use FS for field splitting again. Use ‘FS = FS’ to make this happen, without having to know the current value of FS. In order to tell which kind of field splitting is in effect, use PROCINFO["FS"] (see Built-in Variables That Convey Information). The value is "FS" if regular field splitting is being used, or is "FIELDWIDTHS" if fixed-width field splitting is being used:

if (PROCINFO["FS"] == "FS")
    regular field splitting …
else if  (PROCINFO["FS"] == "FIELDWIDTHS")
    fixed-width field splitting …
else
    content-based field splitting … (see next section)

This information is useful when writing a function that needs to temporarily change FS or FIELDWIDTHS, read some records, and then restore the original settings (see Reading the User Database for an example of such a function).
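
The following sketch, which assumes gawk is installed under the name gawk, prints PROCINFO["FS"] before and after switching splitting methods. The widths "3 3" are arbitrary:

```shell
out=$(gawk 'BEGIN {
    print PROCINFO["FS"]     # regular splitting is the default
    FIELDWIDTHS = "3 3"
    print PROCINFO["FS"]     # fixed-width splitting is now in effect
    FS = FS
    print PROCINFO["FS"]     # back to regular splitting
}')
printf '%s\n' "$out"
```

The three lines printed should be FS, FIELDWIDTHS, and FS, in that order.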

Defining Fields by Content

This section discusses an advanced feature of gawk. If you are a novice awk user, you might want to skip it on the first reading.

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

The most notorious such case is so-called comma-separated values (CSV) data. Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.^[22] So, we might have data like this:

Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

The FPAT variable offers a solution for cases like this. The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field.

In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Chapter 3), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Putting this to use, here is a simple program to parse the data:

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}

When run, we get the following:

$ gawk -f simple-csv.awk addresses.csv
NF =  7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>

Note the embedded comma in the value of $3.

A straightforward improvement when processing CSV data of this sort would be to remove the quotes when they occur, with something like this:

if (substr($i, 1, 1) == "\"") {
    len = length($i)
    $i = substr($i, 2, len - 2)    # Get text within the two quotes
}
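
Put together with the FPAT setting shown earlier, the idea might look like the following sketch. It assumes gawk is installed under the name gawk, uses a shortened version of the sample record, and works on a copy of each field (the variable field is introduced here for illustration) rather than assigning to $i, so $0 is not rebuilt:

```shell
out=$(printf '%s\n' 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown' |
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
{
    for (i = 1; i <= NF; i++) {
        field = $i                       # work on a copy; $0 is left alone
        if (substr(field, 1, 1) == "\"")
            field = substr(field, 2, length(field) - 2)
        printf("<%s>", field)
    }
    printf("\n")
}')
printf '%s\n' "$out"
# prints: <Robbins><Arnold><1234 A Pretty Street, NE><MyTown>
```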

As with FS, the IGNORECASE variable (see Built-in Variables That Control awk) affects field splitting with FPAT.

Assigning a value to FPAT overrides field splitting with FS and with FIELDWIDTHS. Similar to FIELDWIDTHS, the value of PROCINFO["FS"] will be "FPAT" if content-based field splitting is being used.

Note

Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Even though a formal specification for CSV data exists, there isn’t much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk developers are satisfied with that.

As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:

FPAT = "([^,]*)|(\"[^\"]+\")"

Finally, the patsplit() function makes the same functionality available for splitting regular strings (see String-Manipulation Functions).
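
As a brief sketch of patsplit(), assuming gawk is installed under the name gawk, the same regexp can split an ordinary string into an array. The string and the array name parts are made up for this example:

```shell
out=$(gawk 'BEGIN {
    # patsplit() returns the number of pieces that matched the regexp.
    n = patsplit("a,\"b,c\",d", parts, /([^,]+)|("[^"]+")/)
    print n
    for (i = 1; i <= n; i++)
        print parts[i]
}')
printf '%s\n' "$out"
```

This should print 3, followed by a, "b,c" (with its quotes), and d, each on its own line.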

To recap, gawk provides three independent methods to split input records into fields. The mechanism used is based on which of the three variables—FS, FIELDWIDTHS, or FPAT—was last assigned to.

Multiple-Line Records

In some databases, a single line cannot conveniently hold all the information in one entry. In such cases, you can use multiline records. The first step in doing this is to choose your data format.

One technique is to use an unusual character or string to separate records. For example, you could use the formfeed character (written ‘\f’ in awk, as in C) to separate them, making each record a page of the file. To do this, just set the variable RS to "\f" (a string containing the formfeed character). Any other character could equally well be used, as long as it won’t be part of the data in a record.

Another technique is to have blank lines separate records. By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines. When RS is set to the empty string, each record always ends at the first blank line encountered. The next record doesn’t start until the first nonblank line that follows. No matter how many blank lines appear in a row, they all act as one record separator. (Blank lines must be completely empty; lines that contain only whitespace do not count.)

You can achieve the same effect as ‘RS = ""’ by assigning the string "\n\n+" to RS. This regexp matches the newline at the end of the record and one or more blank lines after the record. In addition, a regular expression always matches the longest possible sequence when there is a choice (see How Much Text Matches?). So, the next record doesn’t start until the first nonblank line that follows—no matter how many blank lines appear in a row, they are considered one record separator.

However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input datafile are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done. (d.c.)
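
The special processing done for ‘RS = ""’ can be observed with a small made-up input. This sketch shows only the first case; it prints each record number together with the length of the record:

```shell
# Two leading blank lines, three blank lines between the records, and
# no blank line after the last record.
out=$(printf '\n\nJane\nDoe\n\n\n\nJohn\n' |
      awk 'BEGIN { RS = "" } { print NR ": " length($0) }')
printf '%s\n' "$out"
```

The output should be "1: 8" and "2: 4": the leading blank lines do not produce an empty record, and the final newline is not part of the last record.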

Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS.^[23]

The original motivation for this special exception was probably to provide useful behavior in the default case (i.e., FS is equal to " "). This feature can be a problem if you really don’t want the newline character to separate fields, because there is no way to prevent it. However, you can work around this by using the split() function to break up the record manually (see String-Manipulation Functions). If you have a single-character field separator, you can work around the special feature in a different way, by making FS into a regexp for that single character. For example, if the field separator is a percent character, instead of ‘FS = "%"’, use ‘FS = "[%]"’.
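
The percent-sign workaround can be tried as follows. This sketch assumes gawk is installed under the name gawk; other awk implementations may treat the newline as a field separator in both cases:

```shell
# With FS = "%", newline also separates fields: a, b, c, d.
single=$(printf 'a%%b\nc%%d\n' |
         gawk 'BEGIN { RS = ""; FS = "%" } { print NF }')

# With FS = "[%]", only the percent sign does: "a", "b\nc", "d".
bracket=$(printf 'a%%b\nc%%d\n' |
          gawk 'BEGIN { RS = ""; FS = "[%]" } { print NF }')

printf '%s %s\n' "$single" "$bracket"    # prints: 4 3
```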

Another way to separate fields is to put each field on a separate line: to do this, just set the variable FS to the string "\n". (This single-character separator matches a single newline.) A practical example of a datafile organized this way might be a mailing list, where blank lines separate the entries. Consider a mailing list in a file named addresses, which looks like this:

Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
…

A simple program to process this file is as follows:

# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN { RS = "" ; FS = "\n" }

{
      print "Name is:", $1
      print "Address is:", $2
      print "City and State are:", $3
      print ""
}

Running the program produces the following output:

$ awk -f addrs.awk addresses
Name is: Jane Doe
Address is: 123 Main Street
City and State are: Anywhere, SE 12345-6789

Name is: John Smith
Address is: 456 Tree-lined Avenue
City and State are: Smallville, MW 98765-4321

…

See Printing Mailing Labels for a more realistic program dealing with address lists. The following list summarizes how records are split, based on the value of RS:

RS == "\n": Records are separated by the newline character (‘\n’). In effect, every line in the datafile is a separate record, including blank lines. This is the default.
RS == any single character: Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records.
RS == "": Records are separated by runs of blank lines. When FS is a single character, then the newline character always serves as a field separator, in addition to whatever value FS may have. Leading and trailing newlines in a file are ignored.
RS == regexp: Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records. (This is a gawk extension; it is not specified by the POSIX standard.)

If not in compatibility mode (see Command-Line Options), gawk sets RT to the input text that matched the value specified by RS. But if the input file ended without any text that matches RS, then gawk sets RT to the null string.
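
To see RT in action, the following sketch (assuming gawk is installed under the name gawk) uses a regexp record separator on a made-up input that does not end with a separator:

```shell
# Records are separated by runs of digits; RT holds the digits matched.
out=$(printf 'a1b22c' |
      gawk 'BEGIN { RS = "[0-9]+" } { print $0 " [" RT "]" }')
printf '%s\n' "$out"
```

This should print "a [1]", "b [22]", and "c []". The last record is ended by the end of the input rather than by text matching RS, so RT is the null string.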

Explicit Input with getline

So far we have been getting our input data from awk’s main input stream—either the standard input (usually your keyboard, sometimes the output from another program) or the files specified on the command line. The awk language has a special built-in command called getline that can be used to read input under your explicit control.

The getline command is used in several different ways and should not be used by beginners. The examples that follow the explanation of the getline command include material that has not been covered yet. Therefore, come back and study the getline command after you have reviewed the rest of Parts I and II and have a good knowledge of how awk works.

The getline command returns 1 if it finds a record and 0 if it encounters the end of the file. If there is some error in getting a record, such as a file that cannot be opened, then getline returns −1. In this case, gawk sets the variable ERRNO to a string describing the error that occurred.
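
The error and end-of-file return values can be observed with a short sketch. It uses forms of getline that are described later in this section, and it assumes a Unix-like system where /dev/null exists and the path /nonexistent/file does not:

```shell
out=$(awk 'BEGIN {
    missing = (getline line < "/nonexistent/file")   # cannot be opened
    empty   = (getline line < "/dev/null")           # immediate end of file
    print missing, empty
}')
printf '%s\n' "$out"    # prints: -1 0
```

Because getline can return −1, loops are best written as ‘while ((getline …) > 0)’ rather than testing only for a nonzero value.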

In the following examples, command stands for a string value that represents a shell command.

Note

When --sandbox is specified (see Command-Line Options), reading lines from files, pipes, and coprocesses is disabled.

Using getline with No Arguments

The getline command can be used without arguments to read input from the current input file. All it does in this case is read the next input record and split it up into fields. This is useful if you’ve finished processing the current record, but want to do some special processing on the next record right now. For example:

# Remove text between /* and */, inclusive
{
    if ((i = index($0, "/*")) != 0) {
        out = substr($0, 1, i - 1)  # leading part of the string
        rest = substr($0, i + 2)    # ... */ ...
        j = index(rest, "*/")       # is */ in trailing part?
        if (j > 0) {
            rest = substr(rest, j + 2)  # remove comment
        } else {
            while (j == 0) {
                # get more text
                if (getline <= 0) {
                    print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                    exit
                }
                # build up the line using string concatenation
                rest = rest $0
                j = index(rest, "*/")   # is */ in trailing part?
                if (j != 0) {
                    rest = substr(rest, j + 2)
                    break
                }
            }
        }
        # build up the output line using string concatenation
        $0 = out rest
    }
    print $0
}

This awk program deletes C-style comments (‘/* … */’) from the input. It uses a number of features we haven’t covered yet, including string concatenation (see String Concatenation) and the index() and substr() built-in functions (see String-Manipulation Functions). By replacing the ‘print $0’ with other statements, you could perform more complicated processing on the decommented input, such as searching for matches of a regular expression. (This program has a subtle problem—it does not work if one comment ends and another begins on the same line.)

This form of the getline command sets NF, NR, FNR, RT, and the value of $0.

Note

The new value of $0 is used to test the patterns of any subsequent rules. The original value of $0 that triggered the rule that executed getline is lost. By contrast, the next statement reads a new record but immediately begins processing it normally, starting with the first rule in the program. See The next Statement.
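
The effect on subsequent rules can be seen in a two-rule sketch with made-up input:

```shell
out=$(printf 'a\nb\nc\n' | awk '
/a/ { getline }               # replaces $0 ("a") with the next record
    { print NR ": " $0 }      # sees whatever $0 holds now
')
printf '%s\n' "$out"
```

This prints "2: b" and "3: c". The record "a" is never printed, because getline has replaced it by the time the second rule runs.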

Using getline into a Variable

You can use ‘getline var’ to read the next record from awk’s input into the variable var. No other processing is done. For example, suppose the next line is a comment or a special string, and you want to read it without triggering any rules. This form of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it. The following example swaps every two lines of input:

{
     if ((getline tmp) > 0) {
          print tmp
          print $0
     } else
          print $0
}

It takes the following list:

wan
tew
free
phore

and produces these results:

tew
wan
phore
free

The getline command used in this way sets only the variables NR, FNR, and RT (and, of course, var). The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.
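
A short sketch with made-up input shows which values change:

```shell
out=$(printf 'one two\nthree four five\n' | awk '{
    if ((getline tmp) > 0)
        print NR, NF, $0 "|" tmp
}')
printf '%s\n' "$out"    # prints: 2 2 one two|three four five
```

NR has advanced to 2, but NF and $0 still describe the first record; the second record is available only in tmp.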

Using getline from a File

Use ‘getline < file’ to read the next record from file. Here, file is a string-valued expression that specifies the filename. ‘< file’ is called a redirection because it directs input to come from a different place. For example, the following program reads its input record from the file secondary.input when it encounters a first field with a value equal to 10 in the current input file:

{
    if ($1 == 10) {
         getline < "secondary.input"
         print
    } else
         print
}

Because the main input stream is not used, the values of NR and FNR are not changed. However, the record it reads is split into fields in the normal manner, so the values of $0 and the other fields are changed, resulting in a new value of NF. RT is also set.

According to POSIX, ‘getline < expression’ is ambiguous if expression contains unparenthesized operators other than ‘$’; for example, ‘getline < dir "/" file’ is ambiguous because the concatenation operator (not discussed yet; see String Concatenation) is not parenthesized. You should write it as ‘getline < (dir "/" file)’ if you want your program to be portable to all awk implementations.
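
Here is a minimal sketch of the parenthesized form. The temporary directory and the filename data.txt are invented for the example, and mktemp is assumed to be available:

```shell
dir=$(mktemp -d)
printf 'first line\n' > "$dir/data.txt"

out=$(awk -v dir="$dir" 'BEGIN {
    file = "data.txt"
    # Parenthesize the concatenation so every awk parses it the same way.
    if ((getline < (dir "/" file)) > 0)
        print $0
    close(dir "/" file)
}')
printf '%s\n' "$out"    # prints: first line
rm -rf "$dir"
```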

Using getline into a Variable from a File

Use ‘getline var < file’ to read input from the file file, and put it in the variable var. As earlier, file is a string-valued expression that specifies the file from which to read.

In this version of getline, none of the predefined variables are changed and the record is not split into fields. The only variable changed is var.^[24] For example, the following program copies all the input files to the output, except for records that say ‘@include filename’. Such a record is replaced by the contents of the file filename:

{
     if (NF == 2 && $1 == "@include") {
          while ((getline line < $2) > 0)
               print line
          close($2)
     } else
          print
}

Note here how the name of the extra input file is not built into the program; it is taken directly from the data, specifically from the second field on the @include line.

The close() function is called to ensure that if two identical @include lines appear in the input, the entire specified file is included twice. See Closing Input and Output Redirections.

One deficiency of this program is that it does not process nested @include statements (i.e., @include statements in included files) the way a true macro preprocessor would. See An Easy Way to Use Library Functions for a program that does handle nested @include statements.
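
The role of close() in rereading a file can be checked with a sketch like the following. The temporary file and the counter n are made up for the example, and mktemp is assumed to be available:

```shell
tmp=$(mktemp)
printf 'x\ny\n' > "$tmp"

out=$(awk -v f="$tmp" 'BEGIN {
    while ((getline line < f) > 0) n++    # first pass: 2 records
    while ((getline line < f) > 0) n++    # still at end of file: 0 records
    close(f)
    while ((getline line < f) > 0) n++    # reopened: 2 records again
    print n
}')
printf '%s\n' "$out"    # prints: 4
rm -f "$tmp"
```

Without the call to close(), the third loop would also find the file at end-of-file, and the count would stay at 2.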

Using getline from a Pipe

Omniscience has much to recommend it. Failing that, attention to details would be useful.
—Brian Kernighan

The output of a command can also be piped into getline, using ‘command | getline’. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record at a time from the pipe. For example, the following program copies its input to its output, except for lines that begin with ‘@execute’, which are replaced by the output produced by running the rest of the line as a shell command:

{
     if ($1 == "@execute") {
          tmp = substr($0, 10)        # Remove "@execute"
          while ((tmp | getline) > 0)
               print
          close(tmp)
     } else
          print
}

The close() function is called to ensure that if two identical ‘@execute’ lines appear in the input, the command is run for each one. Given the input:

foo
bar
baz
@execute who
bletch

the program might produce:

foo
bar
baz
arnold     ttyv0   Jul 13 14:22
miriam     ttyp0   Jul 13 14:23     (murphy:0)
bill       ttyp1   Jul 13 14:23     (murphy:0)
bletch

Notice that this program ran the command who and printed the result. (If you try this program yourself, you will of course get different results, depending upon who is logged in on your system.)

This variation of getline splits the record into fields, sets the value of NF, and recomputes the value of $0. The values of NR and FNR are not changed. RT is set.

According to POSIX, ‘expression | getline’ is ambiguous if expression contains unparenthesized operators other than ‘$’—for example, ‘"echo " "date" | getline’ is ambiguous because the concatenation operator is not parenthesized. You should write it as ‘("echo " "date") | getline’ if you want your program to be portable to all awk implementations.

Note

Unfortunately, gawk has not been consistent in its treatment of a construct like ‘"echo " "date" | getline’. Most versions, including the current version, treat it as ‘("echo " "date") | getline’. (This is also how BWK awk behaves.) Some versions instead treat it as ‘"echo " ("date" | getline)’. (This is how mawk behaves.) In short, always use explicit parentheses, and then you won’t have to worry.
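
A minimal sketch of the parenthesized form, using echo with a made-up argument so that the result is predictable:

```shell
out=$(awk 'BEGIN {
    # The parentheses make the command string unambiguous.
    ("echo " "hello") | getline
    close("echo " "hello")
    print $0
}')
printf '%s\n' "$out"    # prints: hello
```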

Using getline into a Variable from a Pipe

When you use ‘command | getline var’, the output of command is sent through a pipe to getline and into the variable var. For example, the following program reads the current date and time into the variable current_time, using the date utility, and then prints it:

BEGIN {
     "date" | getline current_time
     close("date")
     print "Report printed on " current_time
}

In this version of getline, none of the predefined variables are changed and the record is not split into fields. However, RT is set.

Using getline from a Coprocess

Reading input into getline from a pipe is a one-way operation. The command that is started with ‘command | getline’ only sends data to your awk program.

On occasion, you might want to send data to another program for processing and then read the results back. gawk allows you to start a coprocess, with which two-way communications are possible. This is done with the ‘|&’ operator. Typically, you write data to the coprocess first and then read the results back, as shown in the following:

print "some query" |& "db_server"
"db_server" |& getline

which sends a query to db_server and then reads the results.

The values of NR and FNR are not changed, because the main input stream is not used. However, the record is split into fields in the normal manner, thus changing the values of $0, of the other fields, and of NF and RT.

Coprocesses are an advanced feature. They are discussed here only because this is the section on getline. See Two-Way Communications with Another Process, where coprocesses are discussed in more detail.

Using getline into a Variable from a Coprocess

When you use ‘command |& getline var’, the output from the coprocess command is sent through a two-way pipe to getline and into the variable var.

In this version of getline, none of the predefined variables are changed and the record is not split into fields. The only variable changed is var. However, RT is set.
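
As a sketch of the variable form, the following uses the sort utility as a coprocess. It assumes that gawk is installed under the name gawk and that sort is available. The call ‘close(cmd, "to")’ closes only the sending end so that sort sees end-of-file and produces its output; it is covered in Two-Way Communications with Another Process:

```shell
out=$(gawk 'BEGIN {
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")                  # finished writing; sort can now run
    while ((cmd |& getline line) > 0)
        print line
    close(cmd)
}')
printf '%s\n' "$out"
```

This should print apple and then pear.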

Points to Remember About getline

Here are some miscellaneous points about getline that you should bear in mind:

When getline changes the value of $0 and NF, awk does not automatically jump to the start of the program and start testing the new record against every pattern. However, the new record is tested against any subsequent rules.
Some very old awk implementations limit the number of pipelines that an awk program may have open to just one. In gawk, there is no such limit. You can open as many pipelines (and coprocesses) as the underlying operating system permits.
An interesting side effect occurs if you use getline without a redirection inside a BEGIN rule. Because an unredirected getline reads from the command-line datafiles, the first getline command causes awk to set the value of FILENAME. Normally, FILENAME does not have a value inside BEGIN rules, because you have not yet started to process the command-line datafiles. (d.c.) (See The BEGIN and END Special Patterns; also see Built-in Variables That Convey Information.)
Using FILENAME with getline (‘getline < FILENAME’) is likely to be a source of confusion. awk opens a separate input stream from the current input file. However, because no variable is used, $0 and NF are still updated. If you’re doing this, it’s probably by accident, and you should reconsider what it is you’re trying to accomplish.
The next section presents a table summarizing the getline variants and which variables they can affect. It is worth noting that those variants that do not use redirection can cause FILENAME to be updated if they cause awk to start reading a new input file.
If the variable being assigned is an expression with side effects, different versions of awk behave differently upon encountering end-of-file. Some versions don’t evaluate the expression; many versions (including gawk) do. Here is an example, courtesy of Duncan Moore:
```
BEGIN {
    system("echo 1 > f")
    while ((getline a[++c] < "f") > 0) { }
    print c
}
```
Here, the side effect is the ‘++c’. Is c incremented if end-of-file is encountered before the element in a is assigned?
gawk treats getline like a function call, and evaluates the expression ‘a[++c]’ before attempting to read from f. However, some versions of awk only evaluate the expression once they know that there is a string value to be assigned.
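
One way to sidestep the difference, offered here only as a sketch, is to read into a plain variable and perform the side effect after a successful read. The temporary file and the variable line are introduced for this example, and mktemp is assumed to be available:

```shell
tmp=$(mktemp)
echo 1 > "$tmp"

out=$(awk -v f="$tmp" 'BEGIN {
    while ((getline line < f) > 0)
        a[++c] = line        # c is incremented only after a successful read
    print c
}')
printf '%s\n' "$out"    # prints: 1
rm -f "$tmp"
```

Written this way, the count should not depend on when a given awk evaluates the target of getline.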

Summary of getline Variants

Table 4-1 summarizes the eight variants of getline, listing which predefined variables are set by each one, and whether the variant is standard or a gawk extension. Note: for each variant, gawk sets the RT predefined variable.

Table 4-1. getline variants and what they set

Variant	Effect	awk / gawk
`getline`	Sets `$0`, `NF`, `FNR`, `NR`, and `RT`	`awk`
`getline var`	Sets `var`, `FNR`, `NR`, and `RT`	`awk`
`getline < file`	Sets `$0`, `NF`, and `RT`	`awk`
`getline var < file`	Sets `var` and `RT`	`awk`
`command | getline`	Sets `$0`, `NF`, and `RT`	`awk`
`command | getline var`	Sets `var` and `RT`	`awk`
`command |& getline`	Sets `$0`, `NF`, and `RT`	`gawk`
`command |& getline var`	Sets `var` and `RT`	`gawk`

Reading Input with a Timeout

This section describes a feature that is specific to gawk.

You may specify a timeout in milliseconds for reading input from the keyboard, a pipe, or two-way communication, including TCP/IP sockets. This can be done on a per-input, per-command, or per-connection basis, by setting a special element in the PROCINFO array (see Built-in Variables That Convey Information):

PROCINFO["input_name", "READ_TIMEOUT"] = timeout in milliseconds

When set, this causes gawk to time out and return failure if no data is available to read within the specified timeout period. For example, a TCP client can decide to give up on receiving any response from the server after a certain amount of time:

Service = "/inet/tcp/0/localhost/daytime"
PROCINFO[Service, "READ_TIMEOUT"] = 100
if ((Service |& getline) > 0)
    print $0
else if (ERRNO != "")
    print ERRNO

Here is how to read interactively from the user^[25] without waiting for more than five seconds:

PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000
while ((getline < "/dev/stdin") > 0)
    print $0

gawk terminates the read operation if input does not arrive after waiting for the timeout period, returns failure, and sets ERRNO to an appropriate string value. A negative or zero value for the timeout is the same as specifying no timeout at all.

A timeout can also be set for reading from the keyboard in the implicit loop that reads input records and matches them against patterns, like so:

$ gawk 'BEGIN { PROCINFO["-", "READ_TIMEOUT"] = 5000 }
> { print "You entered: " $0 }'
gawk
You entered: gawk

In this case, failure to respond within five seconds results in the following error message:

error→ gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': 
    Connection timed out

The timeout can be set or changed at any time, and will take effect on the next attempt to read from the input device. In the following example, we start with a timeout value of one second, and progressively reduce it by one-tenth of a second until we wait indefinitely for the input to arrive:

PROCINFO[Service, "READ_TIMEOUT"] = 1000
while ((Service |& getline) > 0) {
    print $0
    PROCINFO[Service, "READ_TIMEOUT"] -= 100
}

Note

You should not assume that the read operation will block exactly after the tenth record has been printed. It is possible that gawk will read and buffer more than one record’s worth of data the first time. Because of this, changing the timeout value as in the preceding example is not very useful.

If the PROCINFO element is not present and the GAWK_READ_TIMEOUT environment variable exists, gawk uses its value to initialize the timeout value. The exclusive use of the environment variable to specify timeout has the disadvantage of not being able to control it on a per-command or per-connection basis.

gawk considers a timeout event to be an error even though the attempt to read from the underlying device may succeed in a later attempt. This is a limitation, and it also means that you cannot use this to multiplex input from two or more sources.

Assigning a timeout value prevents read operations from blocking indefinitely. But bear in mind that there are other ways gawk can stall waiting for an input device to be ready. A network client can sometimes take a long time to establish a connection before it can start reading any data, or the attempt to open a FIFO special file for reading can block indefinitely until some other process opens it for writing.

Directories on the Command Line

According to the POSIX standard, files named on the awk command line must be text files; it is a fatal error if they are not. Most versions of awk treat a directory on the command line as a fatal error.

By default, gawk produces a warning for a directory on the command line, but otherwise ignores it. This makes it easier to use shell wildcards with your awk program:

$ gawk -f whizprog.awk *        # Directories could kill this program

If either of the --posix or --traditional options is given, then gawk reverts to treating a directory on the command line as a fatal error.

See Reading Directories for a way to treat directories as usable data from an awk program.

Summary

Input is split into records based on the value of RS. The possibilities are as follows:

Value of RS	Records are split on…	awk / gawk
Any single character	That character	awk
The empty string (`""`)	Runs of two or more newlines	awk
A regexp	Text that matches the regexp	gawk

FNR indicates how many records have been read from the current input file; NR indicates how many records have been read in total.
gawk sets RT to the text matched by RS.
After splitting the input into records, awk further splits the records into individual fields, named $1, $2, and so on. $0 is the whole record, and NF indicates how many fields there are. The default way to split fields is between whitespace characters.
Fields may be referenced using a variable, as in $NF. Fields may also be assigned values, which causes the value of $0 to be recomputed when it is later referenced. Assigning to a field with a number greater than NF creates the field and rebuilds the record, using OFS to separate the fields. Incrementing NF does the same thing. Decrementing NF throws away fields and rebuilds the record.

Field splitting is more complicated than record splitting:

Field separator value	Fields are split …	awk / gawk
`FS == " "`	On runs of whitespace	awk
`FS == any single character`	On that character	awk
`FS == regexp`	On text matching the regexp	awk
`FS == ""`	Such that each individual character is a separate field	gawk
`FIELDWIDTHS == list of columns`	Based on character position	gawk
`FPAT == regexp`	Such that each field is the text matching the regexp	gawk

Using ‘FS = "\n"’ causes the entire record to be a single field (assuming that newlines separate records).
FS may be set from the command line using the -F option. This can also be done using command-line variable assignment.
Use PROCINFO["FS"] to see how fields are being split.
Use getline in its various forms to read additional records, from the default input stream, from a file, or from a pipe or coprocess.
Use PROCINFO[file, "READ_TIMEOUT"] to cause reads to time out for file.
Directories on the command line are fatal for standard awk; gawk warns about them and otherwise ignores them, unless --posix or --traditional is given.
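
A few of the points summarized above can be checked directly from the shell. The commands below are a minimal sketch that uses only features listed as standard awk in the tables above; the sample input strings are invented for illustration:

```shell
# RS = "" (paragraph mode): records are separated by runs of blank lines.
records=$(printf 'a\nb\n\n\nc\n' | awk 'BEGIN { RS = "" } END { print NR }')
echo "$records"      # 2

# Assigning past NF creates the field; $0 is rebuilt using OFS.
rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $5 = "e"; print $0 " " NF }')
echo "$rebuilt"      # a-b-c--e 5

# FS = "\n" leaves a one-line record as a single field.
nfields=$(echo 'a b c' | awk 'BEGIN { FS = "\n" } { print NF }')
echo "$nfields"      # 1
```

In the second command, the empty fourth field shows up as two adjacent OFS characters in the rebuilt record.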

^[17]At least that we know about.

^[18]In POSIX awk, newlines are not considered whitespace for separating fields.

^[19]A binary operator, such as ‘*’ for multiplication, is one that takes two operands. The distinction is required because awk also has unary (one-operand) and ternary (three-operand) operators.

^[20]Thanks to Andrew Schorr for this tip.

^[21]The sed utility is a “stream editor.” Its behavior is also defined by the POSIX standard.

^[22]The CSV format lacked a formal standard definition for many years. RFC 4180 standardizes the most common practices.

^[23]When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’.

^[24]This is not quite true. RT could be changed if RS is a regular expression.

^[25]This assumes that standard input is the keyboard.
