The Making of Effective awk Programming

By Arnold Robbins
November 1, 2002

Introduction

O’Reilly & Associates published the third edition of Effective awk Programming in May 2001. The book provides thorough coverage of the awk programming language, as standardized by the IEEE POSIX standard for portable operating system applications. This standard is based on Unix and its utilities. Effective awk Programming also doubles as the user’s guide for GNU awk (known as gawk), explaining the extensions and features that are unique to gawk. It includes a wealth of sample programs and library functions that demonstrate good awk programming style.

gawk is the standard version of awk on GNU/Linux and most BSD-based systems. It is also popular on commercial Unix and Windows systems because it has a number of useful extensions, and because it can handle large data sets (records with hundreds or thousands of fields, arrays with thousands of elements) that often cause other implementations to give up. The third edition of Effective awk Programming describes the current version of gawk, 3.1.

The GNU project uses the Texinfo markup language for all of its documentation. Texinfo is a pleasant markup language in which to work. It is semantically driven: you mark up what something is, not how to print it; it allows easy nesting of different constructs; it is not as painful to type as HTML or DocBook XML; and it provides for translation into multiple output formats.
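For example, a line of Texinfo source might look like this (an illustrative line, not taken from gawk.texi):

@command{gawk} reads its program from the file @file{advice.awk} and processes the data in @var{datafile}.

The markup records only that gawk is a command, advice.awk is a file name, and datafile is something the user supplies; each output format then decides how those items should be rendered.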

Printed documents may be generated directly from Texinfo input files by using TeX. The Texinfo distribution includes the file texinfo.tex, which is a set of TeX macros that directly implement the Texinfo language, and scripts for running TeX. Other output formats are generated by the makeinfo program, which is a rather large and complicated C program that knows how to produce GNU Info, HTML, and these days, DocBook XML.

The use of Texinfo for Effective awk Programming presented a problem for O’Reilly. Their production process prefers the use of DocBook markup (particularly the XML variant) since it may be used to produce both printed and browsable versions of the same book. (Browsable versions are necessary for the CD-ROM editions of their books, as well as for the Safari Bookshelf.) Furthermore, O’Reilly has a series design used for all their books: the TeX output from texinfo.tex, while reasonable enough, doesn’t look anything like an O’Reilly book.

By the time of the initial discussions with O’Reilly, I had produced four O’Reilly books in DocBook SGML, so I was quite comfortable with it. And as the author of gawk.texi, I was also very comfortable with Texinfo. Therefore, because both O’Reilly and I were committed to getting Effective awk Programming published, I promised to manage the conversion from Texinfo into DocBook for the final book production.

I reasoned that since makeinfo could already produce HTML, and since HTML and DocBook are conceptually similar, it shouldn’t be that hard to modify the code to generate DocBook. I had worked with the makeinfo source code in the past, so I wasn’t scared, even if I was a bit naive.

Delaying the conversion to DocBook until the end had two other related, significant advantages. First, I was able to use the Texinfo version for the technical review, incorporating all the changes from the review into the documentation that would eventually ship with gawk. And second, O’Reilly agreed to do their copy editing on a paper copy of the Texinfo version of the manuscript. I then entered the copy edits into the Texinfo source file, again allowing the distributed version to benefit from O’Reilly’s considerable editorial expertise.

(At this point I’d like to pause and acknowledge the significant contributions made by Chuck Toporek, my editor. His comments helped to enormously improve the organization and presentation of the material in the book. Mary Sheehan’s copy edits were also very valuable. I learned a lot about good writing during the work on this book.)

Furthermore, Chuck and the rest of the people at O’Reilly bent over backwards to make sure that they complied with the GNU Free Documentation License (FDL), under which the book is published. The final DocBook XML source for the book is available from the O’Reilly Web site. The Texinfo version, of course, is part of the gawk distribution.

Converting to DocBook

Fortunately, I didn’t have to write the DocBook changes for makeinfo from scratch. Philippe Martin had done the bulk of this already, and I was able to obtain his patches to the makeinfo source code. His code did the vast majority of what I needed.

Philippe’s version generated DocBook SGML. At the time, O’Reilly was moving away from SGML, towards the XML version of DocBook. The differences boiled down mostly to using lowercase for tags, always providing a full closing tag (<emph>whatever</emph> versus <emph>whatever</>), using the trailing-slash version of tags that don’t enclose objects (such as <xref linkend="..."/>), and fully quoting all the parameters inside of tags (<colspec colnum="1"/> vs. <colspec colnum=1>).

Also, Philippe’s code often generated a single DocBook tag for multiple different Texinfo commands, when in fact DocBook has tags that correspond to the original Texinfo commands. For example, it might produce <literal> for both @command{} and @file{}. This needed to be fixed, so that the generated output would contain separate <command> and <filename> tags. In other words, as much as possible, it was necessary to preserve the semantic-based nature of the Texinfo markup in the generated DocBook.
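For instance, given an illustrative fragment of Texinfo (not from gawk.texi) such as:

@command{gawk} -f @file{fixup.awk}

the generated DocBook should preserve the distinction:

<command>gawk</command> -f <filename>fixup.awk</filename>

rather than wrapping both items in <literal> tags.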

This work was straightforward, and over a week or two, I did the bulk of it, getting makeinfo to the point where it produced a basic DocBook XML version of gawk.texi on which I could do further post-processing.

The current release of Texinfo includes Philippe’s original changes, as well as my improvements. Philippe has gone further with the development, and besides DocBook XML, makeinfo can produce a variant of XML that uses a Texinfo DTD that is similar to the DocBook XML DTD. Indeed, most of the reformatting fixes described below are no longer needed with the current version. For further details, see the Texinfo distribution.

Making Usable DocBook

Generating technically correct DocBook markup was just the beginning of the process. While the file might go through an XML parser without any problems, it would still need to be readable, so that O’Reilly’s production editors could work with it directly. It also needed to adhere to O’Reilly’s markup conventions, such as the id="..." parameter in <chapter> and section tags, and in <xref> tags for cross references. There was still a ways to go.

General Cleanups

First, the makeinfo output needed lots of simple cleanups. Some of these related to anomalies in the output, others to removing Texinfo-specific output features which were better expressed using different fonts in DocBook. The first script, fixup.awk, evolved to handle many of these. This section presents the most interesting of the changes that had to be made.

makeinfo generated some boiler-plate material at the front of the file that wasn’t necessary for O’Reilly’s DocBook tools. It looks like this:

<!-- This is /home/arnold/ORA/db/gawk.sgml, produced by makeinfo version 4.0 from gawk.texi.   --><para>
<!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
<book>
<title>The GNU Awk User's Guide</title>

</para>

Notice that the <para> and </para> tags are misplaced. This early version of makeinfo was over-zealous about wrapping things in paragraph tags. The first part of fixup.awk strips off this leading junk. It works by having the first rule look for the first <chapter> tag. When that’s seen, it sets a flag. The second rule checks the flag. If it hasn’t been seen yet, the next statement gets the next line of input:

#! /bin/gawk -f

# strip leading gunk from file
/<chapter/        { chapter_seen = 1 }
! chapter_seen    { next }

The next bit removes trailing white space (space and TAB characters) and removes leading white space inside lists and examples. The first rule uses the sub() function to unconditionally remove trailing white space. (This is needed only because I find such white space gets in the way when editing.)

The in_term variable indicates being inside the terms of a variable list. Inside list item bodies or examples, the stripspaces variable is true (non-zero), so the sub() function removes all leading white space. The closing tags set stripspaces back to zero (false):

# strip trailing white space
/[ \t]+$/         { sub(/[ \t]+$/, "") }

# strip leading spaces inside lists
/<listitem>/      { stripspaces++ ; in_term = 0 }
/<\/listitem>/    { stripspaces-- }

# fix up examples
/<screen>/        { in_screen++ ; stripspaces++ }

stripspaces != 0  { sub(/^ +/, "") }

/<\/screen>/      { in_screen-- ; stripspaces-- }

The Texinfo command @var{} is used to describe something that is variable, such as a value the user supplies. It corresponds to the DocBook <replaceable> tag. In an O’Reilly book, <replaceable> items get printed in a Constant Width Italic font. This is entirely appropriate in most contexts, such as within examples, or in lists where items represent a combination of a command and its parameters.

However, O’Reilly conventions indicate that variable items should be in regular italics when used in prose discussion. For example:

<!-- Correctly marked up DocBook XML -->
<variablelist>
<varlistentry><term>
<literal>ls -l</literal> <replaceable>file</replaceable>
</term>
<listitem><para>
The <command>ls</command> command with the <option>-l</option> option gives extra information about <emphasis>file</emphasis>.
</para></listitem>
</varlistentry>
...
</variablelist>

The generated DocBook used <replaceable> everywhere, so fixup.awk has to make the transformation context-sensitively, switching to <emphasis> only in running prose.
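A minimal sketch of such a rule, assuming a companion rule along the lines of /<term>/ { in_term = 1 } raises the flag that the <listitem> rule above clears, would be:

# Sketch only, not the original fixup.awk rule: in running prose
# (outside examples and outside variable-list terms), switch to
# <emphasis>; inside examples and terms, leave <replaceable> alone.
in_screen == 0 && in_term == 0 {
    gsub(/<replaceable>/, "<emphasis>")
    gsub(/<\/replaceable>/, "</emphasis>")
}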

O’Reilly books use a Constant Width Bold font to indicate user input in examples and a plain Constant Width font for computer output. Texinfo only uses plain Constant Width, distinguishing computer output with a leading glyph, in this case, -|. (TeX output uses a similar, but nicer-looking symbol.) Error messages are prefixed with a different glyph that comes out in the DocBook file as error-->. This next bit removes these glyphs. It also supplies the <userinput> tags for any line whose first character is either $ or > (the greater-than symbol). These represent the Bourne shell primary and secondary prompts, respectively, which are used in printed examples of interactive use:

in_screen != 0 {
    gsub(/-\| */, "");
    gsub(/error--> /, "");
    if (/^(\$|>) /)
        $0 = gensub(/ (.+)/, " <userinput>\\1</userinput>", "g")
}

The gensub() (“general substitution”) function is a gawk extension. The first argument is the regular expression to match. The second is the replacement text. The third is either a number indicating which match of the text to replace, or "g", meaning that the change should be done globally (on all matches). The fourth argument, if present, is the value of the original text. When not supplied, the current input record ($0) is used. The return value is the new text after the substitution has taken place.

Here the replacement text includes \\1, which means “use the text matched by the part of the regular expression enclosed in the first set of parentheses.” What this ends up doing is enclosing the command entered by the user in <userinput> tags, leaving the rest of the line alone.
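Here is a small, self-contained illustration of how gensub() behaves (an example for this article, not one of the conversion scripts):

#! /bin/gawk -f
# Illustration only: gensub() returns the modified string and leaves
# its target argument untouched.
BEGIN {
    s = "foo bar baz"
    print gensub(/a/, "A", 2, s)              # second match only: foo bar bAz
    print gensub(/a/, "A", "g", s)            # all matches:       foo bAr bAz
    print gensub(/(b)(a)/, "\\2\\1", "g", s)  # \\N picks groups:  foo abr abz
    print s                                   # unchanged:         foo bar baz
}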

Texinfo doesn’t have sidebars, which are blocks of text set off to the side for separate, isolated discussion of issues. They are typically used for more in-depth discussion or for longer examples. In gawk.texi, I got around the lack of sidebars by using regular sections and adding the words “Advanced Notes” to the section title. This next bit of code looks for sections that have the words “Advanced Notes” in their titles and converts them into sidebars. While it’s at it, it removes all inline font changes from the contents between <title> and </title> tags, since such font changes are against O’Reilly conventions:

# deal with Advanced Notes, turn them into sidebars
/^<sect/  { save_sect = $0 ; next }

/<title>/ {
    if (/Advanced Notes/) {
        in_sidebar++
        print "<sidebar>"
        sub(/Advanced Notes: /, "")
    } else if (save_sect) {
        print save_sect
    }
    save_sect = ""

    # remove font changes from titles
    if (match($0, /<title>.+<\/title>/)) {
        before = substr($0, 1, RSTART - 1)
        text = substr($0, RSTART + 7, RLENGTH - 15)
        after = substr($0, RSTART + RLENGTH)
        gsub(/<[^>]+>/, "", text)
        print before "<title>" text "</title>" after
        next
    }
}

/<\/sect/ {
    if (in_sidebar) {
        print "</sidebar>"
        in_sidebar = 0
        next
    }
}

There are three different kinds of dashes used in typography. “Em-dashes” are the length of the letter “m.” “En-dashes” are the length of the letter “n.” They are shorter than em-dashes. And plain dashes, or hyphens, are the shortest of all. The makeinfo output represents an em-dash as two dashes. This last chunk turns them into the &mdash; DocBook entity. This change is not done inside examples (! in_screen). The very last rule simply prints the (possibly modified) input record to the output:

/([a-z]|(<\/[a-z]+>))--[a-z]/ && ! in_screen {
    $0 = gensub(/([a-z]|(<\/[a-z]+>)?)--([a-z])/, "\\1\\&mdash;\\3", "g", $0)
}

{ print }

As mentioned earlier, the early DocBook version of makeinfo generated lots of unnecessary <para> tags. The output had numerous empty paragraphs, and removing them by hand was just too painful. The following simple script, rmpara.awk, strips out empty paragraphs.

This script works by taking advantage of gawk’s ability to specify a regular expression as the record separator. Here, records are separated by the markup for empty paragraphs. By setting the output record separator to the null string (ORS = ""), a print statement prints the preceding part of the file.

#! /usr/local/bin/gawk -f

BEGIN {
    RS = "<para>[ \t\n]+</para>\n*"
    ORS = ""
}

And since we’re working with paragraph tags, the following small rule puts <para> tags inside lists and index entries on their own lines. This makes the DocBook file easier to work with. The final rule simply prints the record, which is all text in the file up to an empty paragraph:

/(indexterm|variablelist)><para>/ {
    sub(/<para>/, "\n&")
}

{ print }

Fixing Tables

A significant problem, requiring a separate script, had to do with the formatting of tables. The Texinfo @multitable ... @end multitable construct translates pretty directly into a DocBook <table>. However, the formatting of the output, while fine for machine processing, was essentially impossible for a human to work with directly. For example:

<para>
<table> <title></title> <tgroup cols="2"><colspec colwidth="31*"> <colspec colwidth="49*"> <tbody> <row>
<entry><literal>[:alnum:]</literal> </entry>
<entry> Alphanumeric characters.  </entry> </row><row> <entry>
<literal>[:alpha:]</literal> </entry> <entry> Alphabetic characters.
</entry> </row><row> <entry><literal>[:blank:]</literal>
</entry> <entry> Space and tab characters.  </entry> </row><row>
<entry> <literal>[:cntrl:]</literal> </entry> <entry> Control
characters.  </entry> </row></tbody> </tgroup> </table>
</para>

Each row in a table should be separate, and each entry (column) in a row should have its own line (or lines). For this, I wrote the next script, fixtable.awk. It is similar to the rmpara.awk script, in that it uses a regular expression for RS. This time the regular expression matches DocBook tags. Thus the record is all text up to a tag, and the record separator is the tag itself plus any trailing white space.

The associative array tab (for “table”) contains all the table-related tags that should be on their own lines. The <colspec> and <tgroup> tags contain parameters, thus they do not have the closing > character in them:

#! /bin/gawk -f

BEGIN {
    RS = "<[^>]+> *"
    tab["<table>"] = 1
    tab["<colspec"] = 1
    tab["<tbody>"] = 1
    tab["<tgroup"] = 1
    tab["</tgroup>"] = 1
    tab["</tbody>"] = 1
    tab["<row>"] = 1
    tab["</row>"] = 1
}

gawk sets the variable RT (record terminator) to the actual text that matched the RS regular expression. Any trailing white space in RT is saved in the variable white, and then removed from RT. This is necessary in case the tag in RT isn’t one for tables. Then the white space has to be put back into the output to preserve the original file’s contents:

{
    # remove trailing white
    # gensub returns the original string if the re doesn't match
    if (RT ~ / +$/)
        white = gensub(/.*>( +$)/, "\\1", 1, RT)
    else
        white = ""
    sub(/ +$/, "", RT)

This next part does the work. It splits RT around white space. (This is necessary for the <colspec> tag.) If the tag is in the table, we print the preceding record, a newline, and then the whole tag on its own line. <entry> tags are printed on their own lines. Finally, any other tags are printed together with the preceding record, without intervening newlines, and with the original trailing white space:

    split(RT, a, " ")
    if (a[1] in tab)
        printf ("%s\n%s\n", $0, RT)
    else if (a[1] == "<entry>")
        printf ("%s\n%s", $0, RT)
    else
        printf ("%s%s", $0, RT white)
}

The result of running this script on the above input is:

<para>

<table>
<title></title>

<tgroup cols="2">
<colspec colwidth="31*">
<colspec colwidth="49*">

<tbody>

<row>
<entry><literal>[:alnum:]</literal> </entry>
<entry>Alphanumeric characters.  </entry>
</row>

<row>
<entry><literal>[:alpha:]</literal> </entry>
<entry>Alphabetic characters.  </entry>
</row>

<row>
<entry><literal>[:blank:]</literal> </entry>
<entry>Space and tab characters.  </entry>
</row>

<row>
<entry><literal>[:cntrl:]</literal> </entry>
<entry>Control characters.  </entry>
</row>

</tbody>
</tgroup>
</table>
</para>

Although there are still extra newlines, at least now the table is readable, and further manual cleaning up isn’t difficult.

Fixing Index Entries

The next task was to work on the indexing entries. The original gawk.texi file already had a number of index entries that I had placed there. makeinfo translated them into DocBook <indexterm> entries, but they still needed some work. For example, occasionally additional material appeared on the same line as the closing </indexterm> tag. More importantly, special characters in the text of an index entry, such as < and >, were not turned into &lt; and &gt; in the generated DocBook. Also, O’Reilly’s convention is to not have any font changes in the contents of an index entry. The fixindex.awk script dealt with all of these. The first part handles splitting off any trailing text:

#! /bin/gawk -f

# <indexterm> always comes at the beginning of a line.
# 1. If there's anything after the </indexterm>, insert a newline
# 2. Remove markup in the indexed items

/<indexterm>/   {
    if (match($0, /<\/indexterm>./)) {
        front = substr($0, 1, RSTART + 11);
        rest = substr($0, RSTART + RLENGTH - 1)
    } else {
        front = $0
        rest = ""
    }

If the text of the index entry has font changes in it, the next part extracts the contents of the entry, removes the font changes, and then puts the tags back in:

    if (match(front, /<(literal|command|filename)>/)) {
        text = gensub(/<indexterm>(.+)<\/indexterm>/, "\\1", 1, front)
        gsub(/<\/?(literal|command|filename)>/, "", text)
        front = "<indexterm>" text "</indexterm>"
    }

Looking at this now, sometime later, I see that the removal and restoration of the <indexterm> tags isn’t necessary. Nevertheless, I leave it here to show the code as I wrote it then.

The rest of the rule deals with index entries for the <, <=, >, and >= operators, converting them into the appropriate DocBook entities. Finally, it prints the modified line and any trailing material that may have been present, and then gets the next input line with next. The final rule simply prints lines that aren’t indexing lines:

    gsub(/><=/, ">\\&lt;=", front)
    gsub(/>< /, ">\\&lt; ", front)
    gsub(/>>=/, ">\\&gt;=", front)
    gsub(/>> /, ">\\&gt; ", front)
    print front
    if (rest)
        print rest
    next
}

{ print }

Fixing Options

As you may have noticed, the scripts have been progressing from larger-scope fixes to smaller-scope fixes. This next script deals with a fine-grained, typographical detail.

In the Italic font O’Reilly uses to represent options, the correct character to use for a hyphen or dash is the en-dash, discussed earlier. This is represented by the DocBook &ndash; entity. Furthermore, gawk’s long options start with two dashes, not one. In both the Italic font in the text and in the Roman font in the index, the two dashes run together when printed, making them difficult to distinguish.

This next script solves both problems. It converts plain dash characters to &ndash;, and inserts an &thinsp; character between two en-dashes. The &thinsp; is a very small amount of horizontal spacing whose job is to provide just such tiny amounts of separation between characters. This script also works by setting RS to a regular expression matching the text of interest, modifying the matched text saved in RT, and then printing the record and the new text back out.

The <primary> and <secondary> tags only appear inside <indexterm> tags. The <option> tags delimit options in the book’s main text:

#! /bin/awk -f

BEGIN {
    RS = "<(primary|secondary|option)>-(-|[A-Za-z])+"
}

{
    if (RT != "") {
        new = RT
        new = gensub(/--/, "\\&ndash;\\&thinsp;\\&ndash;", "g", new)
        new = gensub(/-/, "\\&ndash;", "g", new)
    } else
        new = ""
    printf("%s%s", $0, new)
}

Manual Work

After going through all the above scripts, the book was almost ready for prime time. All my scripts had produced a DocBook XML document that was quite close to what I would have produced had I been entering the text directly in DocBook. It took considerably less effort than if I had tried to convert the text from Texinfo to DocBook using either the sed stream editor, or manually, using editor commands (the colon prompt within vim).

Nevertheless, my Notes file lists a fair number of manual changes that I had to make, things that weren’t amenable to scripting. Most of these, though, could be tackled using the vim command line. (Believe me, if I could have fixed these with a script too, I would have. But sometimes there are things that a program just isn’t quite smart enough to handle.)

After all of these changes, I was at the final stage. In fact, this was during the technical review stage, and for a brief while before submitting the book to O’Reilly’s Production department, I was making edits in parallel, in both the Texinfo and the DocBook versions of the book. The main reason for this was to avoid having to remake all the manual edits. It was easier to make a few incremental changes in parallel than to just edit the Texinfo file, regenerate DocBook, and then have to redo all the manual edits.

Fixing Identifiers

One final transformation was needed before submitting the book to Production. O’Reilly has a standard convention for naming chapters, sections, tables, and figures within the id="..." clause of the appropriate tags. For example, <sect2 id="eap3-ch-3-sect-2.1">. These same identifiers are used in <xref> tags for cross references.

However, makeinfo produced identifiers based on the original names of the @node lines in the gawk.texi file. For example, <sect1 id="How20To20Contribute">. (Here, the spaces in the original node name are replaced by 20, which is the numeric value of the space character, in hexadecimal.) I needed to transform these generated identifiers into ones that followed the O’Reilly convention.

The following script, redoids.awk (re-do ids), does this job. It makes two passes over the input. The first pass extracts the existing ids from chapter, section, and table tags. It maintains the appropriate chapter and section level counts, and by using them, generates the correct new tag for the given item. The first pass builds up a table (an associative array), mapping the old ids to the new ones.

The second pass goes through the file, actually making the substitutions of new id for old. It can’t be done all in one pass since there are cross references, both forwards and backwards, scattered throughout the text.

Setting Up Two Passes

The BEGIN block checks that exactly one argument was given, and prints an error message if not. It then sets some global variables, namely, the book name and IGNORECASE, which causes gawk to ignore case when doing regular expression matching:

#! /bin/gawk -f

BEGIN {
    if (ARGC != 2) {
        print("usage: redoids file > newfile\n") > "/dev/stderr"
        abnormal = 1
        exit 1
    }

    book = "eap3"
    IGNORECASE = 1

This next part actually sets up two passes over the input. It first initializes Pass to 1. Next, it adds a variable assignment, Pass=2, to ARGV, and then the input filename, and increments ARGC.

The upshot is that gawk reads through the file twice, with the variable Pass being set appropriately each time through. The code for the two passes then distinguishes which pass is which by testing Pass:

    # set up two passes
    Pass = 1
    ARGV[ARGC++] = "Pass=2"
    ARGV[ARGC++] = ARGV[1]
}

The First Pass

Top level section headings within a chapter are often referred to in publishing as “A-level headings,” or just “A heads” for short. Similarly, the next level section headings are “B heads,” “C heads,” and so on. The variables ah, bh, ch, and dh represent heading levels. At each level, the variables for the levels below it must be set to zero. The variable tab represents the current table number within a chapter. The chnum variable tracks the current chapter. Thus, this first rule sets all the variables to zero, extracts the current id, and computes a new one:

Pass == 1 && /^<chapter/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = sprintf("ch-%d", ++chnum)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

The next few rules are similar, and handle chapter-level items that aren’t actually chapters:

Pass == 1 && /^<preface/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "ch-0"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Pass == 1 && /^<appendix/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    applet = substr("abcdefghijklmnopqrstuvwxyz", ++appnum, 1)
    curchap = sprintf("ap-%s", applet)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Pass == 1 && /^<glossary/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "glossary"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Next comes code that deals with section tags. The first rule handles a special case. Two of the appendixes in Effective awk Programming are the GNU General Public License (GPL), which covers the gawk source code, and the GNU Free Documentation License (FDL), which covers the book itself. The sections in these appendixes don’t have ids, nor do they need them. The first rule skips them.

The second rule does much of the real work. It extracts the old id, and then it extracts the level of the section (1, 2, 3, etc.). Based on the level, it resets the lower-level heading variables and sets up the new id.

The third rule handles tables. Table numbers increase monotonically through the whole chapter and have two-digit numbers:

Pass == 1 && /<sect[1-4]>/ { next }     # skip licenses

Pass == 1 && /^<sect[1-4]/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    level = substr($1, 6, 1) + 0    # get level
    if (level == 1) {
        ++ah
        sectnum = ah
        bh = ch = dh = 0
    } else if (level == 2) {
        ++bh
        sectnum = ah "." bh
        ch = dh = 0
    } else if (level == 3) {
        ++ch
        sectnum = ah "." bh "." ch
        dh = 0
    } else {
        ++dh
        sectnum = ah "." bh "." ch "." dh
    }
    newtag = sprintf("%s-%s-sect-%s", book, curchap, sectnum)
    tags[oldid] = newtag
}

Pass == 1 && /^<table/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    newtag = sprintf("%s-%s-tab-%02d", book, curchap, ++tab)
    tags[oldid] = newtag
}

The Second Pass

By using -v Debug=1 on the gawk command line, I could do debugging of the code that gathered old ids and built new ones. When debugging is true, the program simply skips the second pass, by reading through the file and doing nothing. More debug code appears in the END rule, below:

Pass == 2 && Debug { next }

If not debugging, this next rule is what replaces the old ids in the various tags with the new ones:

Pass == 2 && /^<(chapter|preface|appendix|glossary|sect[1-4]|table)/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    tagtype = gensub(/<(chapter|preface|appendix|glossary|sect[1-4]|table).*/, "\\1", 1, $0)
    printf "<%s id=\"%s\">\n", tagtype, tags[oldid]
    next
}

The following rule updates cross references. Cross-reference tags contain a linkend="..." clause pointing to the id of the place they reference. Since I knew that linkend= only appeared in cross references, that was all I had to look for. The while loop handles multiple cross references in a single line. The loop body works by splitting apart the line into three pieces: the part before the linkend=, the linkend clause itself, and the rest of the line after it. It then builds up the output line by concatenating the preceding text with the new linkend clause:

Pass == 2 && /linkend=/ {
    str = $0
    out = ""
    while (match(str, /linkend="[^"]+"/)) {
        before = substr(str, 1, RSTART - 1)
        xreftag = substr(str, RSTART, RLENGTH)
        after = substr(str, RSTART + RLENGTH)
        oldid = gensub(/linkend="([^"]+)"/, "\\1", 1, xreftag)
        if (!(oldid in tags)) {
            printf("warning: xref to %s not in tags!\n", oldid) > "/dev/stderr"
            tags[oldid] = "UNKNOWNLINK"
        }
        out = out before "linkend=\"" tags[oldid] "\""
        str = after
    }
    if (str)
        out = out str
    print out
    next
}

Finally, the last rule is the catch-all that prints out lines that don’t need updating:

Pass == 2 { print }

The END Rule

The END rule does simple cleanup. The abnormal variable is true if the wrong number of arguments were provided. The if statement tests it and exits immediately if it’s true, avoiding execution of the rest of the rule.

It turns out that the rest of the rule isn’t that involved. It simply dumps the table mapping of the old ids to the new ones if debugging is turned on:

END {
    if (abnormal)
        exit 1
    if (Debug) {
        for (i in tags)
            printf "%s -> %s\n", i, tags[i]
        exit
    }
}

Production and Post-Production

Once the new ids were in place, that was it. Since the O’Reilly DocBook tools work on separate per-chapter files, all that remained was to split the large file up into separate files, and then print them. I verified that everything went through their tools with no problems, and submitted the files to Production.

Production went quite quickly. A large part of this was due to the fact that copy editing had already been done on the Texinfo version. Usually it’s done as part of the production cycle.

O’Reilly published the book, and I released gawk 3.1.0 at about the same time. The gawk.texi shipped with gawk included all of O’Reilly’s editorial input.

It would seem that all ended happily. Alas, this was mostly true, but one non-trivial problem remained.

A major aspect of book production done after the author submits his files is indexing. While gawk.texi contained a number of index entries, most of which I had provided, this served only as an initial basis upon which to build. Indexing is a separate art, requiring training and experience to do well, and I make no pretensions that I’m good at it.

Nancy Crumpton, a professional indexer, turned my amateur index into a real one. Also, during final production, there were the few, inevitable changes made to the text to fix gaffes in English grammar or to improve the book’s layout.

I was thus left with a quandary. While the vast majority of O’Reilly’s editorial input had been used to improve the Texinfo version of the book, there were now a number of new changes that existed only in the DocBook version. I really wanted to have those included in the Texinfo version as well.

The solution involved one more script and a fair amount of manual work. The following script, desgml.awk, removes DocBook markup from a file, leaving just the text. The BEGIN block sets up a table of translations from DocBook entities to simple textual equivalents. (Some of these entities are specific to Effective awk Programming.) The specials array handles tags that must be special-cased (as opposed to entities):

#! /bin/awk -f

BEGIN {
    entities["darkcorner"]  = "(d.c.)"
    entities["TeX"] = "TeX"
    entities["LaTeX"]   = "LaTeX"
    entities["BIBTeX"]  = "BIBTeX"
    entities["vellip"]  = "\n\t.\n\t.\n\t.\n"
    entities["hellip"]  = "..."
    entities["lowbar"]  = "_"
    entities["frac18 "] = "1/8"
    entities["frac38 "] = "3/8"
    # > 300 entities removed for brevity ...

    specials["<?lb?>"] = specials["<?lb>"] = " "
    specials["<keycap>"] = " "

    RS = "<[^>]+>"
    entity = "&[^;&]+;"
}

As in many of the other scripts seen so far, this one also uses RS as a regular expression that matches tags, with the variable entity encapsulating the regular expression for an entity.

The single rule processes records, looking for entities to replace. The first part handles the simple case where there are no entities (match() returns zero). In such a case, all that’s necessary is to check the tag for special cases:

{
    if (match($0, entity) == 0) {
        printf "%s", $0
        special_case()
        next
    }

The next part handles replacing entities, again using a loop to pull the line apart around the text of the entity. If the entity exists in the table, it’s replaced. Otherwise it’s used as-is, minus the & and ; characters:

    # have a match
    text = $0
    out = ""
    do {
        front = substr(text, 1, RSTART - 1)
        object = substr(text, RSTART+1, RLENGTH-2)  # strip & and ;
        rest = substr(text, RSTART + RLENGTH)
        if (object in entities)
            replace = entities[object]
        else
            replace = object
        out = out front replace
        text = rest
    } while (match(text, entity) != 0)
    if (length(text) > 0)
        out = out text
    printf("%s", out)
    special_case()
}

The special_case() function translates any special tags into white space and handles cross references, replacing them with an upper-case version of the id:

function special_case(  rt, ref)
{
    # a few special cases
    rt = tolower(RT)
    if (rt in specials) {
        printf "%s", specials[rt]
    } else if (rt ~ /<xref/) {
        ref = gensub(/<xref +linkend="([^"]*)".*>/,"\\1", 1, rt)
        ref = toupper(ref)
        printf "%s", ref
    }
}

I ran both my original XML files and O’Reilly’s final XML files through the desgml.awk script to produce text versions of each chapter. I then used diff to produce a context-diff of the chapters, and went through each diff looking for indexing and wording changes. Each such change I then added back into gawk.texi. This process occurred over the course of several weeks, as it was tedious and time-consuming.

However, the end result is that gawk.texi is now once again the “master version” of the documentation, and whenever work starts on the fourth edition of Effective awk Programming, I expect to be able to generate new DocBook XML files that still contain all the work that O’Reilly contributed.

Conclusion and Acknowledgements

Translating something the size of a whole book from Texinfo to DocBook was certainly a challenge. Using gawk made the cleanup work fairly straightforward, so I was able to concentrate on revising the contents of the book without worrying too much about the production. Furthermore, the use of Texinfo did not impede the book’s production since O’Reilly received DocBook XML files that went through their tool suite, and the distributed version of the documentation benefited enormously from their input.

I would like to thank Philippe Martin for his original DocBook changes and Karl Berry, Texinfo’s maintainer, for his help and support. Many thanks go to Chuck Toporek and the O’Reilly production staff. Working with them on Effective awk Programming really was a pleasure. Thanks to Nelson H.F. Beebe, Karl Berry, Len Muellner, and Jim Meyering as well as O’Reilly folk Betsy Waliszewski, Bruce Stewart, and Tara McGoldrick for reviewing preliminary drafts of this article.
