Learning AWK Programming

Book Description

Text processing and pattern matching simplified

About This Book

  • Master the fastest and most elegant big data munging language
  • Implement text processing and pattern matching using the advanced features of AWK and GAWK
  • Implement debugging and inter-process communication using GAWK

Who This Book Is For

This book is for developers and analysts who want to learn text processing and data extraction in a Unix-like environment. A basic understanding of the Linux operating system and shell scripting will help you get the most out of the book.

What You Will Learn

  • Create and use different expressions and control flow statements in AWK
  • Use regular expressions with AWK for effective text processing
  • Use built-in and user-defined variables to write AWK programs
  • Use redirections in AWK programs and create structured reports
  • Handle non-decimal input and two-way inter-process communication with GAWK
  • Create small scripts to reformat data to match patterns and process texts

In Detail

AWK is one of the earliest and most powerful utilities available on all Unix and Unix-like systems. It is used as a command-line utility for basic text-processing operations, and as a programming language for complex text-processing and mining tasks. With this book, you will gain the expertise required to practice advanced AWK programming through real-life examples.

The book starts with an introduction to AWK essentials. You will then be introduced to regular expressions, AWK variables and constants, arrays, AWK functions, and more. The book then moves on to more complex tasks, such as printing formatted output and using control flow statements. Finally, it covers GNU's implementation of AWK (GAWK) and its advanced features, such as network communication, debugging, and inter-process communication, which are not easily achieved with standard AWK.

By the end of this book, the reader will have worked on the practical implementation of text processing and pattern matching using AWK to perform routine tasks.

Style and approach

An easy-to-follow, step-by-step guide that will help you get to grips with real-world applications of AWK programming.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Learning AWK Programming
  3. Dedication
  4. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  5. Contributors
    1. About the author
    2. About the reviewers
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Conventions used
    4. Get in touch
      1. Reviews
  7. Getting Started with AWK Programming
    1. AWK programming language overview
      1. What is AWK?
      2. Types of AWK
      3. When and where to use AWK
    2. Getting started with AWK
      1. Installation on Linux
        1. Using the package manager
        2. Compiling from the source code
      2. Workflow of AWK
      3. Action and pattern structure of AWK
        1. Example data file
        2. Pattern-only statements
        3. Action-only statements
        4. Printing each input line/record
        5. Using the BEGIN and END blocks construct
        6. The BEGIN block
        7. The body block
        8. The END block
          1. Patterns
          2. Actions
      4. Running AWK programs
        1. AWK as a Unix command line
        2. AWK as a filter (reading input from the Terminal)
        3. Running AWK programs from the source file
        4. AWK programs as executable script files
        5. Extending the AWK command line on multiple lines
        6. Comments in AWK
        7. Shell quotes with AWK
      5. Data files used as examples in this book
      6. Some simple examples with default usage
        1. Multiple rules with AWK
        2. Using standard input with names in AWK
    3. AWK standard options
      1. Standard command-line options
        1. The -F option – field separator
        2. The -f option (read source file)
        3. The -v option (assigning variables)
      2. GAWK-only options
        1. The --dump-variables option (AWK global variables)
        2. The --profile option (profiling)
        3. The --sandbox option
        4. The -i option (including other files in your program)
        5. Include other files in the GAWK program (using @include)
        6. The -V option
    4. Summary
  8. Working with Regular Expressions
    1. Introduction to regular expressions
      1. What is a regular expression?
      2. Why use regular expressions?
      3. Using regular expressions with AWK
        1. Regular expressions as string-matching patterns with AWK
    2. Basic regular expression construct
    3. Understanding regular expression metacharacters
      1. Quoted metacharacter
      2. Anchors
        1. Matching at the beginning of a string
        2. Matching at the end of a string
      3. Dot
      4. Brackets expressions
        1. Character classes
        2. Named character classes (POSIX standard)
      5. Complemented bracket expressions
        1. Complemented character classes
        2. Complemented named character classes
      6. Alternation operator
      7. Unary operator for repetition
        1. Closure
        2. Positive closure
        3. Zero or one
      8. Repetition ranges with interval expressions
        1. A single number in brackets
        2. A single number followed by a comma in brackets
        3. Two numbers in brackets
      9. Grouping using parentheses
        1. Concatenation using alternation operator within parentheses
        2. Backreferencing in regular expressions – sed and grep
    4. Precedence in regular expressions
    5. GAWK-specific regular expression operators
      1. Matching whitespaces
      2. Matching not whitespaces
      3. Matching words (\w)
      4. Matching non-words
      5. Matching word boundaries
        1. Matching at the beginning of a word 
        2. Matching at the end of a word 
      6. Matching not as a sub-string using
      7. Matching a string as sub-string only using
    6. Case-sensitive matching
    7. Escape sequences
    8. Summary
  9. AWK Variables and Constants
    1. Built-in variables in AWK
      1. Field separator 
        1. Using a single character or simple string as a value of the FS
        2. Using regular expressions as values of the FS
        3. Using each character as a separate field
        4. Using the command line to set the FS as -F
      2. Output field separator
      3. Record separator
      4. Outputting the record separator
      5. NR and NF
      6. FILENAME
    2. Environment variables in AWK
      1. ARGC and ARGV
      2. CONVFMT and OFMT
      3. RLENGTH and RSTART
      4. FNR
      5. ENVIRON and SUBSEP
      6. FIELD (POSITIONAL) VARIABLE ($0 and $n)
    3. Environment variables in GAWK
      1. ARGIND
      2. ERRNO
      5. PROCINFO
    4. String constants
    5. Numeric constants
    6. Conversion between strings and numbers
    7. Summary
  10. Working with Arrays in AWK
    1. One-dimensional arrays
    2. Assignment in arrays
    3. Accessing elements in arrays
    4. Referring to members in arrays
    5. Processing arrays using loops
    6. Using the split() function to create arrays
    7. Delete operation in arrays
    8. Multidimensional arrays
    9. Summary
  11. Printing Output in AWK
    1. The print statement
    2. Role of output separator in print statement
    3. Pretty printing with the printf statement
    4. Escape sequences for special character printing
    5. Different format control characters in the format specifier
    6. Format specification modifiers
      1. Printing with fixed column width 
      2. Using the minus modifier (-) for left justification
      3. Printing with fixed width – right justified
      4. Using hash modifier (#)
      5. Using plus modifier (+) for prefixing with sign/symbol
      6. Printing with prefix sign/symbol
      7. Dot precision as modifier
      8. Positional modifier using integer constant followed by $ (N$):
    7. Redirecting output to file
      1. Redirecting output to a file (>)
      2. Appending output to a file (>>)
      3. Sending output on other commands using pipe (|)
      4. Special file for redirecting output (/dev/null, stderr)
      5. Closing files and pipes
    8. Summary
  12. AWK Expressions
    1. AWK variables and constants
    2. Arithmetic expressions using binary operators
    3. Assignment expressions
    4. Increment and decrement expressions
    5. Relational expressions
    6. Logical or Boolean expressions
    7. Ternary expressions
    8. Unary expressions
    9. Exponential expressions
    10. String concatenation
    11. Regular expression operators
    12. Operators' Precedence
    13. Summary
  13. AWK Control Flow Statements
    1. Conditional statements
      1. The if statement
        1. if
        2. If...else
        3. The if...else...if statement
      2. The switch statement (a GAWK-specific feature)
    2. Looping statement
      1. The while loop
      2. do...while loop statement
      3. The for loop statement
      4. For each loop statement
    3. Statements affecting flow control
      1. Break usage
      2. Usage of continue
      3. Exit usage
      4. Next usage
    4. Summary
  14. AWK Functions
    1. Built-in functions
      1. Arithmetic functions
        1. The sin (expr) function
        2. The cos (expr) function
        3. The atan2 (x, y) function 
        4. The int (expr) function
        5. The exp (expr) function
        6. The log (expr) function
        7. The sqrt (expr) function
        8. The rand() function
        9. The srand ([expr]) function 
          1. Summary table of built-in arithmetic functions
      2. String functions
        1. The index (str, sub) function
        2. The length ( string ) function
        3. The split (str, arr, regex) function
        4. The substr (str, start, [ length ]) function
        5. The sub (regex, replacement, string) function
        6. The gsub (regex, replacement, string) function
        7. The gensub (regex, replacement, occurrence, [ string ]) function
        8. The match (string, regex) function
        9. The tolower (string) function
        10. The toupper (string) function
        11. The sprintf (format, expression) function
        12. The strtonum (string) function
          1. Summary table of built-in string functions
      3. Input/output functions
        1. The close (filename [to/from]) function
        2. The fflush ([ filename ]) function
        3. The system (command) function
        4. The getline command
          1. Simple getline
          2. Getline into a variable
          3. Getline from a file
          4. Using getline to get a variable from a file
          5. Using getline to output into a pipe
          6. Using getline to change the output into a variable from a pipe
          7. Using getline to change the output into a variable from a coprocess
        5. The nextfile() function
      4. The time function
        1. The systime() function
        2. The mktime (datespec) function
        3. The strftime (format, timestamp) function
      5. Bit-manipulating functions
        1. The and (num1, num2) function
        2. The or (num1, num2) function
        3. The xor (num1, num2) function
        4. The lshift (val, count) function
        5. The rshift (val, count) function
        6. The compl (num) function
    2. User-defined functions
      1. Function definition and syntax
      2. Calling user-defined functions
      3. Controlling variable scope
      4. Return statement
      5. Making indirect function calls
    3. Summary
  15. GNU's Implementation of AWK – GAWK (GNU AWK)
    1. Things you don't know about GAWK
      1. Reading non-decimal input
      2. GAWK's built-in command line debugger
        1. What is debugging?
        2. Debugger concepts
        3. Using GAWK as a debugger
          1. Starting the debugger
          2. Set breakpoint
          3. Removing the breakpoint
          4. Running the program
          5. Looking inside the program
          6. Displaying some variables and data
          7. Setting watch and unwatch
          8. Controlling the execution
          9. Viewing environment information
          10. Saving the commands in file
          11. Exiting the debugger
      3. Array sorting
        1. Sort array by values using asort( )
        2. Sort array indexes using asorti()
      4. Two-way inter-process communication
      5. Using GAWK for network programming
        1. TCP client and server (/inet/tcp)
        2. UDP client and server ( /inet/udp )
        3. Reading a web page using HttpService
      6. Profiling
    2. Summary
  16. Practical Implementation of AWK
    1. Working with one-liners for text processing and pattern matching with AWK
      1. Selective printing of lines with AWK
      2. Modifying line spacing in a file with AWK
      3. Numbering and calculations with AWK
      4. Selective deletion of certain lines in a file with AWK
      5. String operation on selected lines with AWK
      6. Array creation with AWK one-liner
      7. Text conversion and substitution in files with AWK
      8. One-liners for system administrators
    2. Use case examples of pattern matching using AWK
      1. Parsing web server (Apache/Nginx) log files
        1. Understanding the Apache combined log format
        2. Using AWK for processing different log fields 
        3. Identifying problems with the running website
        4. Printing the top 10 request IP addresses with their GeoIP information
        5. Counting and printing unique visits to a website
        6. Real-time IP address lookup for requests
      2. Converting text to HTML table
      3. Converting decimal to binary
      4. Renaming files in a directory with AWK
      5. Printing a generated sequence of numbers in a specified columnate format
      6. Transposing a matrix
      7. Processing multiple files using AWK
    3. Summary
    4. Further reading