Effective awk Programming, 4th Edition by Arnold Robbins

Preface

Arnold Robbins

Nof Ayalon
Israel

Several kinds of tasks occur repeatedly when working with text files. You might want to extract certain lines and discard the rest. Or you may need to make changes wherever certain patterns appear, but leave the rest of the file alone. Such jobs are often easy with awk. The awk utility interprets a special-purpose programming language that makes it easy to handle simple data-reformatting jobs.
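As a minimal sketch of the two tasks just described, the following shell fragment runs awk on three lines of sample data. The data and the helper names extract_paris and rename_london are invented for illustration and do not come from the book. The second helper uses gsub(), so it assumes that the system's awk is a POSIX-compatible one.

```shell
# Sample data, invented for illustration.
data='alice 30 paris
bob 25 london
carol 41 paris'

# Extract certain lines and discard the rest: a pattern with no
# action prints every record that matches.
extract_paris() {
    printf '%s\n' "$data" | awk '/paris/'
}

# Make a change wherever a pattern appears, but pass every other
# line through untouched.
rename_london() {
    printf '%s\n' "$data" | awk '{ gsub(/london/, "LONDON"); print }'
}

extract_paris
rename_london
```

The first helper prints only the two lines that contain "paris". The second prints all three lines, with "london" replaced on the one line where it occurs.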

The GNU implementation of awk is called gawk; if you invoke it with the proper options or environment variables, it is fully compatible with the POSIX[1] specification of the awk language and with the Unix version of awk maintained by Brian Kernighan. This means that all properly written awk programs should work with gawk. So most of the time, we don’t distinguish between gawk and other awk implementations.

Using awk you can:

  • Manage small, personal databases

  • Generate reports

  • Validate data

  • Produce indexes and perform other document-preparation tasks

  • Experiment with algorithms that you can adapt later to other computer languages

In addition, gawk provides facilities that make it easy to:

  • Extract bits and pieces of data for processing

  • Sort data

  • Perform simple network communications

  • Profile and debug awk programs

  • Extend the language with functions written in C or C++

This book teaches you about the awk language and how you can use it effectively. You should already be familiar with basic system commands, such as cat and ls,[2] as well as basic shell facilities, such as input/output (I/O) redirection and pipes.

Implementations of the awk language are available for many different computing environments. This book, while describing the awk language in general, also describes the particular implementation of awk called gawk (which stands for “GNU awk”). gawk runs on a broad range of Unix systems, ranging from Intel-architecture PC-based computers up through large-scale systems. gawk has also been ported to Mac OS X, Microsoft Windows (all versions), and OpenVMS.[3]

History of awk and gawk

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became widely available with Unix System V Release 3.1 (1987). The version in System V Release 4 (1989) added some new features and cleaned up the behavior in some of the “dark corners” of the language. The specification for awk in the POSIX Command Language and Utilities standard further clarified the language. Both the gawk designers and the original awk designers at Bell Laboratories provided feedback for the POSIX specification.

Paul Rubin wrote gawk in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from me, thoroughly reworked gawk for compatibility with the newer awk. Circa 1994, I became the primary maintainer. Current development focuses on bug fixes, performance improvements, standards compliance, and, occasionally, new features.

In May 1997, Jürgen Kahrs felt the need for network access from awk, and with a little help from me, set about adding features to do this for gawk. At that time, he also wrote the bulk of TCP/IP Internetworking with gawk (a separate document, available as part of the gawk distribution). His code finally became part of the main gawk distribution with gawk version 3.1.

John Haque rewrote the gawk internals, in the process providing an awk-level debugger. This version became available as gawk version 4.0 in 2011.

See Major Contributors to gawk for a full list of those who have made important contributions to gawk.

A Rose by Any Other Name

The awk language has evolved over the years. Full details are provided in Appendix A. The language described in this book is often referred to as “new awk.” By analogy, the original version of awk is referred to as “old awk.”

On most current systems, when you run the awk utility you get some version of new awk.[4] If your system’s standard awk is the old one, you will see something like this if you try the test program:

$ awk 1 /dev/null
error→ awk: syntax error near line 1
error→ awk: bailing out near line 1

In this case, you should find a version of new awk, or just install gawk!
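For comparison, here is a rough sketch of what the same test looks like when it succeeds. In new awk, the one-character program 1 is a pattern with no action: the pattern is always true, and the default action prints the record. The helper name is_new_awk is hypothetical, introduced only for this illustration.

```shell
# Succeeds (exit status 0) when the system awk accepts the test
# program; output and error messages are discarded.
is_new_awk() {
    awk 1 /dev/null >/dev/null 2>&1
}

if is_new_awk; then
    echo "awk accepted the test program"
else
    echo "awk rejected the test program; consider installing gawk"
fi

# With real input, the same program copies every line through unchanged.
printf 'first\nsecond\n' | awk 1
```

Because /dev/null supplies no records, a new awk prints nothing for the test itself; only the exit status shows whether the program was accepted.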

Throughout this book, whenever we refer to a language feature that should be available in any complete implementation of POSIX awk, we simply use the term awk. When referring to a feature that is specific to the GNU implementation, we use the term gawk.

Using This Book

The term awk refers to a particular program as well as to the language you use to tell this program what to do. When we need to be careful, we call the language “the awk language,” and the program “the awk utility.” This book explains both how to write programs in the awk language and how to run the awk utility. The term “awk program” refers to a program written by you in the awk programming language.

Primarily, this book explains the features of awk as defined in the POSIX standard. It does so in the context of the gawk implementation. While doing so, it also attempts to describe important differences between gawk and other awk implementations. Finally, it notes any gawk features that are not in the POSIX standard for awk.

This book has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the online Info and HTML versions of the book.

There are sidebars scattered throughout the book. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading.

Most of the time, the examples use complete awk programs. Some of the more advanced sections show only the part of the awk program that illustrates the concept being described.

Although this book is aimed principally at people who have not been exposed to awk, there is a lot of information here that even the awk expert should find useful. In particular, the description of POSIX awk and the example programs in Chapter 10 and Chapter 11 should be of interest.

This book is split into several parts, as follows:

  • Part I describes the awk language and the gawk program in detail. It starts with the basics, and continues through all of the features of awk. It contains the following chapters:

    • Chapter 1, Getting Started with awk, provides the essentials you need to know to begin using awk.

    • Chapter 2, Running awk and gawk, describes how to run gawk, the meaning of its command-line options, and how it finds awk program source files.

    • Chapter 3, Regular Expressions, introduces regular expressions in general, and in particular the flavors supported by POSIX awk and gawk.

    • Chapter 4, Reading Input Files, describes how awk reads your data. It introduces the concepts of records and fields, as well as the getline command. I/O redirection is first described here. Network I/O is also briefly introduced here.

    • Chapter 5, Printing Output, describes how awk programs can produce output with print and printf.

    • Chapter 6, Expressions, describes expressions, which are the basic building blocks for getting most things done in a program.

    • Chapter 7, Patterns, Actions, and Variables, describes how to write patterns for matching records, actions for doing something when a record is matched, and the predefined variables awk and gawk use.

    • Chapter 8, Arrays in awk, covers awk’s one and only data structure: the associative array. Deleting array elements and whole arrays is described, as well as sorting arrays in gawk. The chapter also describes how gawk provides arrays of arrays.

    • Chapter 9, Functions, describes the built-in functions awk and gawk provide, as well as how to define your own functions. It also discusses how gawk lets you call functions indirectly.

  • Part II shows how to use awk and gawk for problem solving. There is lots of code here for you to read and learn from. This part contains Chapter 10 and Chapter 11.

    Reading these two chapters allows you to see awk solving real problems.

  • Part III focuses on features specific to gawk.

  • Part IV provides the following appendices, including the GNU General Public License:

    • Appendix A describes how the awk language has evolved since its first release to the present. It also describes how gawk has acquired features over time.

    • Appendix B describes how to get gawk, how to compile it on POSIX-compatible systems, and how to compile and use it on different non-POSIX systems. It also describes how to report bugs in gawk and where to get other freely available awk implementations.

    • Appendix C presents the license that covers the gawk source code.

The version of this book distributed with gawk contains additional appendices and other end material. To save space, we have omitted them from the printed edition. You may find them online, as follows:

  • The appendix on implementation notes describes how to disable gawk’s extensions, how to contribute new code to gawk, where to find information on some possible future directions for gawk development, and the design decisions behind the extension API.

  • The appendix on basic concepts provides some very cursory background material for those who are completely unfamiliar with computer programming.

  • The glossary defines most, if not all, of the significant terms used throughout the book. If you find terms that you aren’t familiar with, try looking them up here.

  • The GNU FDL is the license that covers this book.

Some of the chapters have exercise sections; these have also been omitted from the print edition but are available online.

Typographical Conventions

This book is written in Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. Because of this, the typographical conventions are slightly different from those in other books you may have read.

Examples you would type at the command line are preceded by the common shell primary and secondary prompts, ‘$’ and ‘>’. Input that you type is shown like this. Output from the command, usually its standard output, appears like this. Error messages and other output on the command’s standard error are preceded by the glyph “error→”. For example:

$ echo hi on stdout
hi on stdout
$ echo hello on stderr 1>&2
error→ hello on stderr

In the text, almost anything related to programming, such as command names, variable and function names, and string, numeric, and regexp constants, appears in this font. Code fragments appear in the same font and quoted, ‘like this’. Things that are replaced by the user or programmer appear in this font. Options look like this: -f. Filenames are indicated like this: /path/to/ourfile. The first occurrence of a new term is usually its definition and appears in the same font as the previous occurrence of “definition” in this sentence.

Characters that you type at the keyboard look like this. In particular, there are special characters called “control characters.” These are characters that you type by holding down both the CONTROL key and another key, at the same time. For example, a Ctrl-d is typed by first pressing and holding the CONTROL key, next pressing the d key, and finally releasing both keys.

For the sake of brevity, throughout this book, we refer to Brian Kernighan’s version of awk as “BWK awk.” (See Other Freely Available awk Implementations for information on his and other versions.)

Note

Notes of interest look like this.

Caution

Cautionary or warning notes look like this.

Dark Corners

Dark corners are basically fractal—no matter how much you illuminate, there’s always a smaller but darker one.

Brian Kernighan

Until the POSIX standard (and Effective awk Programming), many features of awk were either poorly documented or not documented at all. Descriptions of such features (often called “dark corners”) are noted in this book with “(d.c.).”

But, as noted by the opening quote, any coverage of dark corners is by definition incomplete.

Extensions to the standard awk language that are supported by more than one awk implementation are marked “(c.e.)” for “common extension.”

The GNU Project and This Book

The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.

The GNU[5] Project is an ongoing effort on the part of the Free Software Foundation to create a complete, freely distributable, POSIX-compliant computing environment. The FSF uses the GNU General Public License (GPL) to ensure that its software’s source code is always available to the end user. The GPL applies to the C language source code for gawk. To find out more about the FSF and the GNU Project online, see the GNU Project’s home page. This book may also be read from GNU’s website.

The book you are reading is actually free—at least, the information in it is free to anyone. The machine-readable source code for the book comes with gawk.

The book itself has gone through multiple previous editions. Paul Rubin wrote the very first draft of The GAWK Manual; it was around 40 pages long. Diane Close and Richard Stallman improved it, yielding a version that was around 90 pages and barely described the original, “old” version of awk.

I started working with that version in the fall of 1988. As work on it progressed, the FSF published several preliminary versions (numbered 0.x). In 1996, edition 1.0 was released with gawk 3.0.0. The FSF published the first two editions under the title The GNU Awk User’s Guide. SSC published two editions of the book under the title Effective awk Programming, and O’Reilly published the third edition in 2001.

This edition maintains the basic structure of the previous editions. For FSF edition 4.0, the content was thoroughly reviewed and updated. All references to gawk versions prior to 4.0 were removed. Of significant note for that edition was the addition of Chapter 14.

For FSF edition 4.1 (the fourth edition as published by O’Reilly), the content has been reorganized into parts, and the major new additions are Chapter 15 and Chapter 16.

This book will undoubtedly continue to evolve. If you find an error in the book, please report it! See Reporting Problems and Bugs for information on submitting problem reports electronically.

How to Stay Current

You may have a newer version of gawk than the one described here. To find out what has changed, you should first look at the NEWS file in the gawk distribution, which provides a high-level summary of the changes in each release.

You can then look at the online version of this book to read about any new features.

Using Code Examples

This book is here to help you get your job done. Most of the example programs in this book come in the gawk distribution and are marked in the files as being in the public domain. So, in general, you may use the code in this book in your programs and documentation. Incorporating a significant amount of prose or example code from this book into your product’s documentation requires compliance with the GNU FDL.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Effective awk Programming, Fourth Edition, by Arnold Robbins (O’Reilly). Copyright 2015 Free Software Foundation, 978-1-491-90461-9.”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at .

Safari® Books Online

Note

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/effective-awk-programming-4e.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

The initial draft of The GAWK Manual had the following acknowledgments:

Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper A Supplemental Document for awk by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to awk implementation and to this manual, that would otherwise have escaped us.

I would like to acknowledge Richard M. Stallman, for his vision of a better world and for his courage in founding the FSF and starting the GNU Project.

The previous edition of this book had the following acknowledgments:

The following people (in alphabetical order) provided helpful comments on various versions of this book: Rick Adams, Dr. Nelson H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik, Christopher (“Topher”) Eliot, Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, Mary Sheehan, and Chuck Toporek.

Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me not to title this book How to Gawk Politely. Karl Berry helped significantly with the TeX part of Texinfo.

I would like to thank Marshall and Elaine Hartholz of Seattle and Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet vacation time in their homes, which allowed me to make significant progress on this book and on gawk itself.

Phil Hughes of SSC contributed in a very important way by loaning me his laptop GNU/Linux system, not once, but twice, which allowed me to do a lot of work while away from home.

David Trueman deserves special credit; he has done a yeoman job of evolving gawk so that it performs well and without bugs. Although he is no longer involved with gawk, working with him on this project was a significant pleasure.

The intrepid members of the GNITS mailing list, and most notably Ulrich Drepper, provided invaluable help and feedback for the design of the internationalization features.

Chuck Toporek, Mary Sheehan, and Claire Cloutier of O’Reilly & Associates contributed significant editorial help for this book for the 3.1 release of gawk.

Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio Colombo, Stephen Davies, Scott Deifik, Akim Demaille, Darrel Hankerson, Michal Jaegermann, Jürgen Kahrs, Stepan Kasal, John Malmberg, Dave Pitts, Chet Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, and Eli Zaretskii (in alphabetical order) make up the current gawk “crack portability team.” Without their hard work and help, gawk would not be nearly the robust, portable program it is today. It has been and continues to be a pleasure working with this team of fine people.

Notable code and documentation contributions were made by a number of people. See Major Contributors to gawk for the full list.

Thanks to Andy Oram of O’Reilly Media for initiating the fourth edition and for his support during the work. Thanks to Jasmine Kwityn for her copyediting work.

Thanks to Michael Brennan for the Forewords.

Thanks to Patrice Dumas for the new makeinfo program. Thanks to Karl Berry, who continues to work to keep the Texinfo markup language sane.

Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as reviewers for the 2015 edition of this book. Their feedback helped improve the final work.

I would also like to thank Brian Kernighan for his invaluable assistance during the testing and debugging of gawk, and for his ongoing help and advice in clarifying numerous points about the language. We could not have done nearly as good a job on either gawk or its documentation without his help.

Brian is in a class by himself as a programmer and technical author. I have to thank him (yet again) for his ongoing friendship and for being a role model to me for close to 30 years! Having him as a reviewer is an exciting privilege. It has also been extremely humbling...

I must thank my wonderful wife, Miriam, for her patience through the many versions of this project, for her proofreading, and for sharing me with the computer. I would like to thank my parents for their love, and for the grace with which they raised and educated me. Finally, I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities.



[1] The 2008 POSIX standard is accessible online.

[2] These utilities are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.

[3] Some other, obsolete systems to which gawk was once ported are no longer supported and the code for those systems has been removed.

[4] Only Solaris systems still use an old awk for the default awk utility. A more modern awk lives in /usr/xpg6/bin on these systems.

[5] GNU stands for “GNU’s Not Unix.”
