Several kinds of tasks occur repeatedly when working with text files.
You might want to extract certain lines and discard the rest. Or you may
need to make changes wherever certain patterns appear, but leave the rest of
the file alone. Such jobs are often easy with
awk utility interprets a special-purpose programming
language that makes it easy to handle simple data-reformatting jobs.
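As a rough illustration of the two kinds of jobs just described, here is a minimal sketch. The filename fruit.txt and its contents are hypothetical sample data, not something from the gawk distribution:

```shell
# Hypothetical sample data: a name and a count on each line.
printf 'apple 3\nbanana 12\ncherry 7\n' > fruit.txt

# Extract certain lines and discard the rest:
# print only the lines whose second field is greater than 5.
awk '$2 > 5' fruit.txt

# Change lines where a pattern appears, leaving the rest alone:
# set the count to 0 on lines matching "banana", then print every line.
awk '/banana/ { $2 = 0 } { print }' fruit.txt
```

The first command prints only the banana and cherry lines. The second prints all three lines, with the count on the banana line changed to 0.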
The GNU implementation of awk is called gawk; if you invoke it with the proper options or environment variables, it is fully compatible with the POSIX specification of the awk language and with the Unix version of awk maintained by Brian Kernighan. This means that all properly written awk programs should work with gawk. So most of the time, we don’t distinguish between gawk and other awk implementations.
Using awk you can:
Manage small, personal databases
Produce indexes and perform other document-preparation tasks
Experiment with algorithms that you can adapt later to other computer languages
Extract bits and pieces of data for processing
Perform simple network communications
Profile and debug awk programs
Extend the language with functions written in C or C++
This book teaches you about the awk language and how you can use it effectively. You should already be familiar with basic system commands, such as cat and ls, as well as basic shell facilities, such as input/output (I/O) redirection and pipes.
Implementations of the awk language are available for many different computing environments. This book, while describing the awk language in general, also describes the particular implementation of awk called gawk (which stands for “GNU awk”). gawk runs on a broad range of Unix systems, ranging from Intel-architecture PC-based computers up through large-scale systems. gawk has also been ported to Mac OS X, Microsoft Windows (all versions), and OpenVMS.
The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became widely available with Unix System V Release 3.1 (1987). The version in System V Release 4 (1989) added some new features and cleaned up the behavior in some of the “dark corners” of the language. The specification for awk in the POSIX Command Language and Utilities standard further clarified the language. Both the gawk designers and the original awk designers at Bell Laboratories provided feedback for the POSIX specification.
Paul Rubin wrote gawk in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from me, thoroughly reworked gawk for compatibility with the newer awk. Circa 1994, I became the primary maintainer. Current development focuses on bug fixes, performance improvements, standards compliance, and, occasionally, new features.
In May 1997, Jürgen Kahrs felt the need for network access from
awk, and with a little help from me, set about adding
features to do this for
gawk. At that time, he also
wrote the bulk of TCP/IP
Internetworking with gawk (a separate document,
available as part of the
gawk distribution). His code
finally became part of the main
gawk distribution with
gawk version 3.1.
John Haque rewrote the
gawk internals, in the
process providing an
awk-level debugger. This version
became available as
gawk version 4.0 in 2011.
See Major Contributors to gawk for a full list of those who have made important contributions to gawk.
The awk language has evolved over the years. Full details are provided in Appendix A. The language described in this book is often referred to as “new awk.” By analogy, the original version of awk is referred to as “old awk.”
On most current systems, when you run the awk utility you get some version of new awk. If your system’s standard awk is the old one, you will see something like this if you try the test program:
$ awk 1 /dev/null
error→ awk: syntax error near line 1
error→ awk: bailing out near line 1
In this case, you should find a version of new awk, or just install gawk!
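The same test program can be wrapped in a small shell function that reports which kind of awk is installed. This is only a sketch: the function name awk_flavor is hypothetical, and it assumes that an old awk exits with a nonzero status when it reports the syntax error:

```shell
# awk_flavor is a hypothetical helper name, not part of awk or gawk.
# It runs the test program "awk 1 /dev/null". A new awk accepts the
# pattern "1", reads no input, and exits successfully. An old awk
# is assumed to report a syntax error and exit with a nonzero status.
awk_flavor() {
    if awk 1 /dev/null 2>/dev/null; then
        echo "new awk"
    else
        echo "old awk"
    fi
}

awk_flavor
```

Error messages are discarded here so that only the one-line verdict is printed.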
Throughout this book, whenever we refer to a language feature that should be available in any complete implementation of POSIX awk, we simply use the term awk. When referring to a feature that is specific to the GNU implementation, we use the term gawk.
The term awk refers to a particular program as well as to the language you use to tell this program what to do. When we need to be careful, we call the language “the awk language,” and the program “the awk utility.” This book explains both how to write programs in the awk language and how to run the awk utility. The term “awk program” refers to a program written by you in the awk programming language.
Primarily, this book explains the features of awk as defined in the POSIX standard. It does so in the context of the gawk implementation. While doing so, it also attempts to describe important differences between gawk and other awk implementations. Finally, it notes any gawk features that are not in the POSIX standard for awk.
This book has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the online Info and HTML versions of the book.
There are sidebars scattered throughout the book. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading.
Most of the time, the examples use complete awk programs. Some of the more advanced sections show only the part of the awk program that illustrates the concept being described.
Although this book is aimed principally at people who have not been exposed to awk, there is a lot of information here that even the awk expert should find useful. In particular, the description of POSIX awk and the example programs in Chapter 10 and Chapter 11 should be of interest.
This book is split into several parts, as follows:
Part I describes the awk language and the gawk program in detail. It starts with the basics, and continues through all of the features of awk. It contains the following chapters:
Chapter 1, Getting Started with awk, provides the essentials you need to know to begin using awk.
Chapter 2, Running awk and gawk, describes how to run gawk, the meaning of its command-line options, and how it finds awk program source files.
Chapter 3, Regular Expressions, introduces regular expressions in general, and in particular the flavors supported by POSIX awk and gawk.
Chapter 4, Reading Input Files, describes how awk reads your data. It introduces the concepts of records and fields, as well as the getline command. I/O redirection is first described here. Network I/O is also briefly introduced here.
Chapter 5, Printing Output, describes how awk programs can produce output with print and printf.
Chapter 6, Expressions, describes expressions, which are the basic building blocks for getting most things done in a program.
Chapter 7, Patterns, Actions, and Variables, describes how to write patterns for matching records, actions for doing something when a record is matched, and the predefined variables awk and gawk use.
Chapter 8, Arrays in awk, covers awk’s one and only data structure: the associative array. Deleting array elements and whole arrays is described, as well as sorting arrays in gawk. The chapter also describes how gawk provides arrays of arrays.
Chapter 9, Functions, describes the built-in functions awk and gawk provide, as well as how to define your own functions. It also discusses how gawk lets you call functions indirectly.
Part II shows how to use awk and gawk for problem solving. There is lots of code here for you to read and learn from. This part contains the following chapters:
Chapter 10, A Library of awk Functions
Chapter 11, Practical awk Programs
Reading these two chapters allows you to see awk solving real problems.
Part III focuses on features specific to gawk. It contains the following chapters:
Chapter 12, Advanced Features of gawk, describes a number of advanced features. Of particular note are the abilities to control the order of array traversal, have two-way communications with another process, perform TCP/IP networking, and profile your awk programs.
Chapter 13, Internationalization with gawk, describes special features for translating program messages into different languages at runtime.
Chapter 14, Debugging awk Programs, describes the gawk debugger.
Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk, describes advanced arithmetic facilities.
Chapter 16, Writing Extensions for gawk, describes how to add new variables
and functions to
gawk by writing extensions in
C or C++.
Part IV provides the following appendices, including the GNU General Public License:
Appendix A describes how the awk language has evolved since its first release to the present. It also describes how gawk has acquired features over time.
Appendix B describes how to get gawk, how to compile it on POSIX-compatible systems, and how to compile and use it on different non-POSIX systems. It also describes how to report bugs in gawk and where to get other freely available awk implementations.
Appendix C presents the license that covers the gawk source code.
The version of this book distributed with gawk contains additional appendices and other end material. To save space, we have omitted them from the printed edition. You may find them online, as follows:
The appendix on implementation notes describes how to disable
gawk’s extensions, how to contribute new code to
gawk, where to find information on some possible
future directions for
gawk development, and the
design decisions behind the extension API.
The appendix on basic concepts provides some very cursory background material for those who are completely unfamiliar with computer programming.
The glossary defines most, if not all, of the significant terms used throughout the book. If you find terms that you aren’t familiar with, try looking them up here.
The GNU FDL is the license that covers this book.
Some of the chapters have exercise sections; these have also been omitted from the print edition but are available online.
This book is written in Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. Because of this, the typographical conventions are slightly different than in other books you may have read.
Examples you would type at the command line are preceded by the common shell primary and secondary prompts, ‘$’ and ‘>’. Input that you type is shown like this. Output from the command, usually its standard output, appears like this. Error messages and other output on the command’s standard error are preceded by the glyph “error→”. For example:
$ echo hi on stdout
hi on stdout
$ echo hello on stderr 1>&2
error→ hello on stderr
In the text, almost anything related to programming, such as command names, variable and function names, and string, numeric and regexp constants appear in this font. Code fragments appear in the same font and quoted, ‘like this’. Things that are replaced by the user or programmer appear in this font. Options look like this: -f. Filenames are indicated like this: /path/to/ourfile. The first occurrence of a new term is usually its definition and appears in the same font as the previous occurrence of “definition” in this sentence.
Characters that you type at the keyboard look like this. In particular, there are special characters called “control characters.” These are characters that you type by holding down both the CONTROL key and another key, at the same time. For example, a Ctrl-d is typed by first pressing and holding the CONTROL key, next pressing the d key, and finally releasing both keys.
For the sake of brevity, throughout this book, we refer to Brian Kernighan’s version of awk as “BWK awk.” (See Other Freely Available awk Implementations for information on his and other versions.)
Notes of interest look like this.
Cautionary or warning notes look like this.
Dark corners are basically fractal—no matter how much you illuminate, there’s always a smaller but darker one.
Until the POSIX standard (and Effective awk Programming), many features of awk were either poorly documented or not documented at all. Descriptions of such features (often called “dark corners”) are noted in this book with “(d.c.).” But, as noted by the opening quote, any coverage of dark corners is by definition incomplete.
Extensions to the standard
awk language that
are supported by more than one
awk implementation are
marked “(c.e.)” for “common extension.”
The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.
The GNU Project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment. The FSF uses the GNU General Public License (GPL)
to ensure that its software’s source code is always available to the end
user. The GPL applies to the C language source code for
gawk. To find out more about the FSF and the GNU
Project online, see the GNU Project’s home
page. This book may also be read from GNU’s website.
The book you are reading is actually free—at least, the information
in it is free to anyone. The machine-readable source code for the book comes with gawk.
The book itself has gone through multiple previous editions. Paul Rubin wrote the very first draft of The GAWK Manual; it was around 40 pages long. Diane Close and Richard Stallman improved it, yielding a version that was around 90 pages and barely described the original, “old” version of awk.
I started working with that version in the fall of 1988. As work on it progressed, the FSF published several preliminary versions (numbered 0.x). In 1996, edition 1.0 was released with gawk 3.0.0. The FSF published the first two editions under the title The GNU Awk User’s Guide. SSC published two editions of the book under the title Effective awk Programming, and O’Reilly published the third edition in 2001.
This edition maintains the basic structure of the previous editions. For FSF edition 4.0, the content was thoroughly reviewed and updated. All references to gawk versions prior to 4.0 were removed. Of significant note for that edition was the addition of Chapter 14.
This book will undoubtedly continue to evolve. If you find an error in the book, please report it! See Reporting Problems and Bugs for information on submitting problem reports electronically.
You may have a newer version of
gawk than the one
described here. To find out what has changed, you should first
look at the
NEWS file in the
gawk distribution, which provides a high-level summary
of the changes in each release.
You can then look at the online version of this book to read about any new features.
This book is here to help you get your job done.
Most of the example programs in this book come in the gawk distribution and are marked in the files as being in the public domain. So, in general, you may use the code in this book in your programs and documentation.
Incorporating a significant amount of prose or
example code from this book into your product’s documentation requires
compliance with the GNU FDL.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Effective awk Programming, Fourth Edition, by Arnold Robbins (O’Reilly). Copyright 2015 Free Software Foundation, 978-1-491-90461-9.”
If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at firstname.lastname@example.org.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/effective-awk-programming-4e.
To comment or ask technical questions about this book, send email to email@example.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
The initial draft of The GAWK Manual had the following acknowledgments:
Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper A Supplemental Document for awk by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to
awk implementation and to this manual, that would otherwise have escaped us.
I would like to acknowledge Richard M. Stallman, for his vision of a better world and for his courage in founding the FSF and starting the GNU Project.
The previous edition of this book had the following acknowledgments:
The following people (in alphabetical order) provided helpful comments on various versions of this book: Rick Adams, Dr. Nelson H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik, Christopher (“Topher”) Eliot, Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, Mary Sheehan, and Chuck Toporek.
Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me not to title this book How to Gawk Politely. Karl Berry helped significantly with the TeX part of Texinfo.
I would like to thank Marshall and Elaine Hartholz of Seattle and Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet vacation time in their homes, which allowed me to make significant progress on this book and on gawk itself.
Phil Hughes of SSC contributed in a very important way by loaning me his laptop GNU/Linux system, not once, but twice, which allowed me to do a lot of work while away from home.
David Trueman deserves special credit; he has done a yeoman job of evolving
gawk so that it performs well and without bugs. Although he is no longer involved with
gawk, working with him on this project was a significant pleasure.
The intrepid members of the GNITS mailing list, and most notably Ulrich Drepper, provided invaluable help and feedback for the design of the internationalization features.
Chuck Toporek, Mary Sheehan, and Claire Cloutier of O’Reilly & Associates contributed significant editorial help for this book for the 3.1 release of gawk.
Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio
Colombo, Stephen Davies, Scott Deifik, Akim Demaille, Darrel Hankerson,
Michal Jaegermann, Jürgen Kahrs, Stepan Kasal, John Malmberg, Dave Pitts,
Chet Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, and Eli Zaretskii
(in alphabetical order) make up the current gawk “crack
portability team.” Without their hard work and help,
gawk would not be nearly the robust, portable program
it is today. It has been and continues to be a pleasure working with this
team of fine people.
Notable code and documentation contributions were made by a number of people. See Major Contributors to gawk for the full list.
Thanks to Andy Oram of O’Reilly Media for initiating the fourth edition and for his support during the work. Thanks to Jasmine Kwityn for her copyediting work.
Thanks to Michael Brennan for the Forewords.
Thanks to Patrice Dumas for the new makeinfo
program. Thanks to Karl Berry, who continues to work to keep the Texinfo
markup language sane.
Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as reviewers for the 2015 edition of this book. Their feedback helped improve the final work.
I would also like to thank Brian Kernighan for his invaluable
assistance during the testing and debugging of gawk,
and for his ongoing help and advice in clarifying numerous points about
the language. We could not have done nearly as good a job on either
gawk or its documentation without his help.
Brian is in a class by himself as a programmer and technical author. I have to thank him (yet again) for his ongoing friendship and for being a role model to me for close to 30 years! Having him as a reviewer is an exciting privilege. It has also been extremely humbling...
I must thank my wonderful wife, Miriam, for her patience through the many versions of this project, for her proofreading, and for sharing me with the computer. I would like to thank my parents for their love, and for the grace with which they raised and educated me. Finally, I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities.
These utilities are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.
Some other, obsolete systems to which gawk was once ported are no longer supported and the code for those systems has been removed.
Only Solaris systems still use an old awk for the default awk utility. A more modern awk lives in /usr/xpg6/bin on these systems.
GNU stands for “GNU’s Not Unix.”