Chapter 4. Text Processing Tools

Some operations on text files are so widely applicable that standard tools for those tasks were developed early in the Unix work at Bell Labs. In this chapter, we look at the most important ones.

Sorting Text

Text files that contain independent records of data are often candidates for sorting. A predictable record order makes life easier for human users: book indexes, dictionaries, parts catalogs, and telephone directories have little value if they are unordered. Sorted records can also make programming easier and more efficient, as we will illustrate with the construction of an office directory in Chapter 5.

Like awk, cut, and join, sort views its input as a stream of records made up of fields of variable width, with records delimited by newline characters and fields delimited by whitespace or a user-specifiable single character.

Sorting by Lines

In the simplest case, when no command-line options are supplied, complete records are sorted according to the order defined by the current locale. In the traditional C locale, that means ASCII order, but you can set an alternate locale as we described in Section 2.8.

A tiny bilingual dictionary in the ISO 8859-1 encoding translates four French words differing only in accents:

$ cat french-english                         
               Show the tiny dictionary
côte    coast
cote    dimension
coté    dimensioned
côté    side

To understand the sorting, use the octal dump tool, od, to display the French words in ASCII and octal:

$ cut -f1 french-english | od -a -b          
               Display French words in octal bytes
0000000   c   t   t   e  nl   c   o   t   e  nl   c   o   t   i  nl   c
        143 364 164 145 012 143 157 164 145 012 143 157 164 351 012 143
0000020   t   t   i  nl
        364 164 351 012
0000024

Evidently, with the ASCII option -a, od strips the high-order bit of characters, so the accented letters have been mangled, but we can see their octal values: é is 3518 and ô is 3648.

On GNU/Linux systems, you can confirm the character values like this:

$ man iso_8859_1                             
               Check the ISO 8859-1 manual page
...
       Oct   Dec   Hex   Char   Description
       --------------------------------------------------------------------
...
       351   233   E9     é     LATIN SMALL LETTER E WITH ACUTE
...
       364   244   F4     ô     LATIN SMALL LETTER O WITH CIRCUMFLEX
...

First, sort the file in strict byte order:

$ LC_ALL=C sort french-english               
               Sort in traditional ASCII order
cote    dimension
coté    dimensioned
côte    coast
côté    side

Notice that e (1458) sorted before é (3518), and o (1578) sorted before ô (3648), as expected from their numerical values.

Now sort the text in Canadian-French order:

$ LC_ALL=fr_CA.iso88591 sort french-english        
               Sort in Canadian-French locale
côte    coast
cote    dimension
coté    dimensioned
côté    side

The output order clearly differs from the traditional ordering by raw byte values.

Sorting conventions are strongly dependent on language, country, and culture, and the rules are sometimes astonishingly complex. Even English, which mostly pretends that accents are irrelevant, can have complex sorting rules: examine your local telephone directory to see how lettercase, digits, spaces, punctuation, and name variants like McKay and Mackay are handled.

Sorting by Fields

For more control over sorting, the -k option allows you to specify the field to sort on, and the -t option lets you choose the field delimiter.

If -t is not specified, then fields are separated by whitespace and leading and trailing whitespace in the record is ignored. With the -t option, the specified character delimits fields, and whitespace is significant. Thus, a three-character record consisting of space-X-space has one field without -t, but three with -t' ' (the first and third fields are empty).

The -k option is followed by a field number, or number pair, optionally separated by whitespace after -k. Each number may be suffixed by a dotted character position, and/or one of the modifier letters shown in Table 4-1.

Table 4-1. Sort key field types

Letter

Description

b

Ignore leading whitespace.

d

Dictionary order.

f

Fold letters implicitly to a common lettercase.

g

Compare as general floating-point numbers. GNU version only.

i

Ignore nonprintable characters.

n

Compare as (integer) numbers.

r

Reverse the sort order.

Fields and characters within fields are numbered starting from one.

If only one field number is specified, the sort key begins at the start of that field, and continues to the end of the record (not the end of the field).

If a comma-separated pair of field numbers is given, the sort key starts at the beginning of the first field, and finishes at the end of the second field.

With a dotted character position, comparison begins (first of a number pair) or ends (second of a number pair) at that character position: -k2.4,5.6 compares starting with the fourth character of the second field and ending with the sixth character of the fifth field.

If the start of a sort key falls beyond the end of the record, then the sort key is empty, and empty sort keys sort before all nonempty ones.

When multiple -k options are given, sorting is by the first key field, and then, when records match in that key, by the second key field, and so on.

Tip

While the -k option is available on all of the systems that we tested, sort also recognizes an older field specification, now considered obsolete, where fields and character positions are numbered from zero. The key start for character m in field n is defined by + n.m, and the key end by - n.m. For example, sort +2.1 -3.2 is equivalent to sort -k3.2,4.3. If the character position is omitted, it defaults to zero. Thus, +4.0nr and +4nr mean the same thing: a numeric key, beginning at the start of the fifth field, to be sorted in reverse (descending) order.

Let's try out these options on a sample password file, sorting it by the username, which is found in the first colon-separated field:

$ sort -t: -k1,1 /etc/passwd             
               Sort by username
bin:x:1:1:bin:/bin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
root:x:0:0:root:/root:/bin/bash
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh

For more control, add a modifier letter in the field selector to define the type of data in the field and the sorting order. Here's how to sort the password file by descending UID:

$ sort -t: -k3nr /etc/passwd             
               Sort by descending UID
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
root:x:0:0:root:/root:/bin/bash

A more precise field specification would have been -k3nr,3 (that is, from the start of field three, numerically, in reverse order, to the end of field three), or -k3,3nr, or even -k3,3 -n -r, but sort stops collecting a number at the first nondigit, so -k3nr works correctly.

In our password file example, three users have a common GID in field 4, so we could sort first by GID, and then by UID, with:

$ sort -t: -k4n -k3n /etc/passwd         
               Sort by GID and UID
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93

The useful -u option asks sort to output only unique records, where unique means that their sort-key fields match, even if there are differences elsewhere. Reusing the password file one last time, we find:

$ sort -t: -k4n -u /etc/passwd           
               Sort by unique GID
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93

Notice that the output is shorter: three users are in group 1000, but only one of them was output. We show another way to select unique records later in Section 4.2.

Sorting Text Blocks

Sometimes you need to sort data composed of multiline records. A good example is an address list, which is conveniently stored with one or more blank lines between addresses. For data like this, there is no constant sort-key position that could be used in a -k option, so you have to help out by supplying some extra markup. Here's a simple example:

$ cat my-friends                         
               Show address file
# SORTKEY: Schloß, Hans Jürgen
Hans Jürgen Schloß
Unter den Linden 78
D-10117 Berlin
Germany

# SORTKEY: Jones, Adrian
Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

# SORTKEY: Brown, Kim
Kim Brown
1841 S Main Street
Westchester, NY 10502
USA

The sorting trick is to use the ability of awk to handle more-general record separators to recognize paragraph breaks, temporarily replace the line breaks inside each address with an otherwise unused character, such as an unprintable control character, and replace the paragraph break with a newline. sort then sees lines that look like this:

# SORTKEY: Schloß, Hans Jürgen^ZHans Jürgen Schloß^ZUnter den Linden 78^Z...
# SORTKEY: Jones, Adrian^ZAdrian Jones^Z371 Montgomery Park Road^Z...
# SORTKEY: Brown, Kim^ZKim Brown^Z1841 S Main Street^Z...

Here, ^Z is a Ctrl-Z character. A filter step downstream from sort restores the line breaks and paragraph breaks, and the sort key lines are easily removed, if desired, with grep. The entire pipeline looks like this:

cat my-friends |                                         Pipe in address file
  awk -v RS="" { gsub("\n", "^Z"); print }' |            Convert addresses to single lines
    sort -f |                                            Sort address bundles, ignoring case
      awk -v ORS="\n\n" '{ gsub("^Z", "\n"); print }' |  Restore line structure
        grep -v '# SORTKEY'                              Remove markup lines

The gsub( ) function performs "global substitutions." It is similar to the s/x/y/g construct in sed. The RS variable is the input Record Separator. Normally, input records are separated by newlines, making each line a separate record. Using RS="" is a special case, whereby records are separated by blank lines; i.e., each block or "paragraph" of text forms a separate record. This is exactly the form of our input data. Finally, ORS is the Output Record Separator; each output record printed with print is terminated with its value. Its default is also normally a single newline; setting it here to "\n\n" preserves the input format with blank lines separating records. (More detail on these constructs may be found in Chapter 9.)

The output of this pipeline on our address file is:

Kim Brown
1841 S Main Street
Westchester, NY 10502
USA

Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

Hans Jürgen Schloß
Unter den Linden 78
D-10117 Berlin
Germany

The beauty of this approach is that we can easily include additional keys in each address that can be used for both sorting and selection: for example, an extra markup line of the form:

# COUNTRY: UK

in each address, and an additional pipeline stage of grep '# COUNTRY: UK' just before the sort, would let us extract only the UK addresses for further processing.

You could, of course, go overboard and use XML markup to identify the parts of the address in excruciating detail:

<address>
  <personalname>Hans Jürgen</personalname>
  <familyname>Schloß</familyname><br/>
  <streetname>Unter den Linden<streetname>
  <streetnumber>78</streetnumber><br/>
  <postalcode>D-10117</postalcode>
  <city>Berlin</city><br/>
  <country>Germany</country>
</address>

With fancier data-processing filters, you could then please your post office by presorting your mail by country and postal code, but our minimal markup and simple pipeline are often good enough to get the job done.

Sort Efficiency

The obvious way to sort data requires comparing all pairs of items to see which comes first, and leads to algorithms known as bubble sort and insertion sort. These quick-and-dirty algorithms are fine for small amounts of data, but they certainly are not quick for large amounts, because their work to sort n records grows like n 2. This is quite different from almost all of the filters that we discuss in this book: they read a record, process it, and output it, so their execution time is directly proportional to the number of records, n.

Fortunately, the sorting problem has had lots of attention in the computing community, and good sorting algorithms are known whose average complexity goes like n 3/2 (shellsort), n log n (heapsort, mergesort, and quicksort), and for restricted kinds of data, n (distribution sort). The Unix sort command implementation has received extensive study and optimization: you can be confident that it will do the job efficiently, and almost certainly better than you can do yourself without learning a lot more about sorting algorithms.

Sort Stability

An important question about sorting algorithms is whether or not they are stable: that is, is the input order of equal records preserved in the output? A stable sort may be desirable when records are sorted by multiple keys, or more than once in a pipeline. POSIX does not require that sort be stable, and most implementations are not, as this example shows:

$ sort -t_ -k1,1 -k2,2 << EOF            
               Sort four lines by first two fields
> one_two
> one_two_three
> one_two_four
> one_two_five
> EOF
one_two
one_two_five
one_two_four
one_two_three

The sort fields are identical in each record, but the output differs from the input, so sort is not stable. Fortunately, the GNU implementation in the coreutils package[1] remedies that deficiency via the —stable option: its output for this example correctly matches the input.

Sort Wrap-Up

sort certainly ranks in the top ten Unix commands: learn it well because you'll use it often. More details on sort are provided in the sidebar near the start of this chapter, but consult the manual pages for sort(1) for the complete story on your system. sort is, of course, standardized by POSIX, so it should be available on every computer that you are likely to use.

Get Classic Shell Scripting now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.