Classic Shell Scripting: Hidden Commands that Unlock the Power of Unix
By Arnold Robbins, Nelson H.F. Beebe
May 2005
Pages: 558



Table of Contents

Chapter 1: Background
This chapter provides a brief history of the development of the Unix system. Understanding where and how Unix developed and the intent behind its design will help you use the tools better. The chapter also introduces the guiding principles of the Software Tools philosophy, which are then demonstrated throughout the rest of the book.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Unix History
It is likely that you know something about the development of Unix, and many resources are available that provide the full story. Our intent here is to show how the environment that gave birth to Unix influenced the design of the various tools.
Unix was originally developed in the Computing Sciences Research Center at Bell Telephone Laboratories. The first version was developed in 1970, shortly after Bell Labs withdrew from the Multics project. Many of the ideas that Unix popularized were initially pioneered within the Multics operating system, most notably the concepts of devices as files, and of having a command interpreter (or shell) that was intentionally not integrated into the operating system. A well-written history may be found at http://www.bell-labs.com/history/unix.
Because Unix was developed within a research-oriented environment, there was no commercial pressure to produce or ship a finished product. This had several advantages:
  • The system was developed by its users. They used it to solve real day-to-day computing problems.
  • The researchers were free to experiment and to change programs as needed. Because the user base was small, if a program needed to be rewritten from scratch, that generally wasn't a problem. And because the users were the developers, they were free to fix problems as they were discovered and add enhancements as the need for them arose.
  • Unix itself went through multiple research versions, informally referred to with the letter "V" and a number: V6, V7, and so on. (The formal name followed the edition number of the published manual: First Edition, Second Edition, and so on. The correspondence between the names is direct: V6 = Sixth Edition, and V7 = Seventh Edition. Like most experienced Unix programmers, we use both nomenclatures.) The most influential Unix system was the Seventh Edition, released in 1979, although earlier ones had been available to educational institutions for several years. In particular, the Seventh Edition system introduced both
Additional content appearing in this section has been removed.
Software Tools Principles
Over the course of time, a set of core principles developed for designing and writing software tools. You will see these exemplified in the programs used for problem solving throughout this book. Good software tools should do the following things:
Do one thing well
In many ways, this is the single most important principle to apply. Programs that do only one thing are easier to design, easier to write, easier to debug, and easier to maintain and document. For example, a program like grep that searches files for lines matching a pattern should not also be expected to perform arithmetic.
A natural consequence of this principle is a proliferation of smaller, specialized programs, much as a professional carpenter has a large number of specialized tools in his toolbox.
Process lines of text, not binary
Lines of text are the universal format in Unix. Datafiles containing text lines are easy to process when writing your own tools, they are easy to edit with any available text editor, and they are portable across networks and multiple machine architectures. Using text files facilitates combining any custom tools with existing Unix programs.
Use regular expressions
Regular expressions are a powerful mechanism for working with text. Understanding how they work and using them properly simplifies your script-writing tasks.
Furthermore, although regular expressions varied across tools and Unix versions over the years, the POSIX standard provides only two kinds of regular expressions, with standardized library routines for regular-expression matching. This makes it possible for you to write your own tools that work with regular expressions identical to those of
Additional content appearing in this section has been removed.
Summary
Unix was originally developed at Bell Labs by and for computer scientists. The lack of commercial pressure, combined with the small capacity of the PDP-11 minicomputer, led to a quest for small, elegant programs. The same lack of commercial pressure, though, led to a system that wasn't always consistent, nor easy to learn.
As Unix spread and variant versions developed (notably the System V and BSD variants), portability at the shell script level became difficult. Fortunately, the POSIX standardization effort has borne fruit, and just about all commercial Unix systems and free Unix workalikes are POSIX-compliant.
The Software Tools principles as we've outlined them provide the guidelines for the development and use of the Unix toolset. Thinking with the Software Tools mindset will help you write clear shell programs that make correct use of the Unix tools.
Additional content appearing in this section has been removed.
Chapter 2: Getting Started
When you need to get some work done with a computer, it's best to use a tool that's appropriate to the job at hand. You don't use a text editor to balance your checkbook or a calculator to write a proposal. So too, different programming languages meet different needs when it comes time to get some computer-related task done.
Shell scripts are used most often for system administration tasks, or for combining existing programs to accomplish some small, specific job. Once you've figured out how to get the job done, you can bundle up the commands into a separate program, or script, which you can then run directly. What's more, if it's useful, other people can make use of the program, treating it as a black box, a program that gets a job done, without their having to know how it does so.
In this chapter we'll make a brief comparison between different kinds of programming languages, and then get started writing some simple shell scripts.
Additional content appearing in this section has been removed.
Scripting Languages Versus Compiled Languages
Most medium and large-scale programs are written in a compiled language, such as Fortran, Ada, Pascal, C, C++, or Java. The programs are translated from their original source code into object code which is then executed directly by the computer's hardware.
The benefit of compiled languages is that they're efficient. Their disadvantage is that they usually work at a low level, dealing with bytes, integers, floating-point numbers, and other machine-level kinds of objects. For example, it's difficult in C++ to say something simple like "copy all the files in this directory to that directory over there."
So-called scripting languages are usually interpreted. A regular compiled program, the interpreter, reads the program, translates it into an internal form, and then executes the program.
Additional content appearing in this section has been removed.
Why Use a Shell Script?
The advantage to scripting languages is that they often work at a higher level than compiled languages, being able to deal more easily with objects such as files and directories. The disadvantage is that they are often less efficient than compiled languages. Usually the tradeoff is worthwhile; it can take an hour to write a simple script that would take two days to code in C or C++, and usually the script will run fast enough that performance won't be a problem. Examples of scripting languages include awk, Perl, Python, Ruby, and the shell.
Because the shell is universal among Unix systems, and because the language is standardized by POSIX, shell scripts can be written once and, if written carefully, used across a range of systems. Thus, the reasons to use a shell script are:
Simplicity
The shell is a high-level language; you can express complex operations clearly and simply using it.
Portability
By using just POSIX-specified features, you have a good chance of being able to move your script, unchanged, to different kinds of systems.
Ease of development
You can often write a powerful, useful script in little time.
Additional content appearing in this section has been removed.
A Simple Script
Let's start with a simple script. Suppose that you'd like to know how many users are currently logged in. The who command tells you who is logged in:
$ who
george     pts/2        Dec 31 16:39    (valley-forge.example.com)
betsy      pts/3        Dec 27 11:07    (flags-r-us.example.com)
benjamin   dtlocal      Dec 27 17:55    (kites.example.com)
jhancock   pts/5        Dec 27 17:55    (:32)
camus      pts/6        Dec 31 16:22
tolstoy    pts/14       Jan  2 06:42
On a large multiuser system, the listing can scroll off the screen before you can count all the users, and doing that every time is painful anyway. This is a perfect opportunity for automation. What's missing is a way to count the number of users. For that, we use the wc (word count) program, which counts lines, words, and characters. In this instance, we want wc -l, to count just lines:
$ who | wc -l                       Count users
      6
The | (pipe) symbol creates a pipeline between the two programs: who's output becomes wc's input. The result, printed by wc, is the number of users logged in.
The next step is to make this pipeline into a separate command. You do this by entering the commands into a regular file, and then making the file executable, with chmod, like so:
$ cat > nusers                      Create the file, copy terminal input with cat
who | wc -l                         Program text
^D                                  Ctrl-D is end-of-file
$ chmod +x nusers                   Make it executable
$ ./nusers                          Do a test run
      6                             Output is what we expect
This shows the typical development cycle for small one- or two-line shell scripts: first, you experiment directly at the command line. Then, once you've figured out the proper incantations to do what you want, you put them into a separate script and make the script executable. You can then use that script directly from now on.
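The same cycle can also be carried out without typing the program text at the terminal. The following is a minimal sketch, assuming a POSIX shell; the here-document is our substitution for the interactive cat and Ctrl-D shown above, not a method taken from the text.

```shell
# Create the script from a here-document. Quoting the 'EOF' delimiter
# stops the shell from expanding anything inside the script text.
cat > nusers <<'EOF'
who | wc -l
EOF

chmod +x nusers     # make it executable
./nusers            # prints the number of lines that who reports
```

The count printed depends on how many login sessions exist on the machine where you run it.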
Additional content appearing in this section has been removed.
Self-Contained Scripts: The #! First Line
When the shell runs a program, it asks the Unix kernel to start a new process and run the given program in that process. The kernel knows how to do this for compiled programs. Our nusers shell script isn't a compiled program; when the shell asks the kernel to run it, the kernel will fail to do so, returning a "not executable format file" error. The shell, upon receiving this error, says "Aha, it's not a compiled program, it must be a shell script," and then proceeds to start a new copy of /bin/sh (the standard shell) to run the program.
The "fall back to /bin/sh" mechanism is great when there's only one shell. However, because current Unix systems have multiple shells, there needs to be a way to tell the Unix kernel which shell to use when running a particular shell script. In fact, it helps to have a general mechanism that makes it possible to directly invoke any programming language interpreter, not just a command shell. This is done via a special first line in the script file—one that begins with the two characters #!.
When the first two characters of a file are #!, the kernel scans the rest of the line for the full pathname of an interpreter to use to run the program. (Any intervening whitespace is skipped.) The kernel also scans for a single option to be passed to that interpreter. The kernel invokes the interpreter with the given option, along with the rest of the command line. For example, assume a csh script named /usr/ucb/whizprog, with this first line:
#! /bin/csh -f
Furthermore, assume that /usr/ucb is included in the shell's search path (described later). A user might type the command whizprog -q /dev/tty01. The kernel interprets the #! line and invokes csh as follows:
/bin/csh -f /usr/ucb/whizprog -q /dev/tty01
This mechanism makes it easy to invoke any interpreted language. For example, it is a good way to invoke a standalone awk program:
#! /bin/awk -f
awk program here
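To see what the interpreter actually receives, you can try a small experiment. This is only a sketch: the script name showargs is hypothetical, and it assumes that /bin/sh exists and that the current directory is writable.

```shell
# Build a tiny script whose first line names /bin/sh as its interpreter.
cat > showargs <<'EOF'
#! /bin/sh
# $0 is the script's pathname as handed to the interpreter;
# $* holds the remaining command-line arguments.
echo "script: $0"
echo "args: $*"
EOF

chmod +x showargs
./showargs -q /dev/tty01
```

Here the interpreter is started with the script's pathname followed by the arguments typed on the command line, which mirrors the /bin/csh invocation shown earlier.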
         
Shell scripts typically start with
Additional content appearing in this section has been removed.
Basic Shell Constructs
In this section we introduce the basic building blocks used in just about all shell scripts. You will undoubtedly be familiar with some or all of them from your interactive use of the shell.
The shell's most basic job is simply to execute commands. This is most obvious when the shell is being used interactively: you type commands one at a time, and the shell executes them, like so:
$ cd work ; ls -l whizprog.c
-rw-r--r--    1 tolstoy   devel       30252 Jul  9 22:52 whizprog.c
$ make
...
These examples show the basics of the Unix command line. First, the format is simple, with whitespace (space and/or tab characters) separating the different components involved in the command.
Second, the command name, rather logically, is the first item on the line. Most typically, options follow, and then any additional arguments to the command follow the options. No gratuitous syntax is involved, such as:
COMMAND=CD,ARG=WORK
COMMAND=LISTFILES,MODE=LONG,ARG=WHIZPROG.C
Such command languages were typical of the larger systems available when Unix was designed. The free-form syntax of the Unix shell was a real innovation in its time, contributing notably to the readability of shell scripts.
Third, options start with a dash (or minus sign) and consist of a single letter. Options are optional, and may require an argument (such as cc -o whizprog whizprog.c). Options that don't require an argument can be grouped together: e.g., ls -lt whizprog.c rather than ls -l -t whizprog.c (which works, but requires more typing).
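Option grouping is easy to check for yourself. The sketch below uses wc rather than ls, purely because its output does not depend on which files happen to exist; the sample input is made up for illustration.

```shell
# Feed the same three lines of input to wc twice.
printf 'one\ntwo\nthree\n' | wc -l -c    # options given separately
printf 'one\ntwo\nthree\n' | wc -lc      # options grouped after one dash
```

Both commands report the same line count and byte count; only the amount of typing differs.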
Long options are increasingly common, particularly in the GNU variants of the standard utilities, as well as in programs written for the X Window System (X11). For example:
$ cd whizprog-1.1
$ patch --verbose --backup -p1 < /tmp/whizprog-1.1-1.2-patch
            
Depending upon the program, long options start with either one dash, or with two (as just shown). (The
Additional content appearing in this section has been removed.
Accessing Shell Script Arguments
The so-called positional parameters represent a shell script's command-line arguments. They also represent a function's arguments within shell functions. Individual arguments are named by integer numbers. For historical reasons, you have to enclose the number in braces if it's greater than nine:
echo first arg is $1
echo tenth arg is ${10}
Special "variables" provide access to the total number of arguments that were passed, and to all the arguments at once. We provide the details later, in Section 6.1.2.2.
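The need for braces is easiest to see by trying it. In this sketch, show_args is a hypothetical function name, and $# is the standard parameter holding the argument count; a function is used only because its positional parameters behave like a script's.

```shell
# show_args: print a few positional parameters (hypothetical helper).
show_args() {
    echo "count: $#"         # total number of arguments
    echo "first: $1"
    echo "tenth: ${10}"      # braces are needed past nine
    echo "wrong: $10"        # parsed as $1 followed by a literal 0
}

show_args a b c d e f g h i j
```

Run with the ten arguments shown, the last line prints "wrong: a0" rather than the tenth argument, which is exactly the mistake the braces prevent.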
Suppose you want to know what terminal a particular user is using. Well, once again, you could use a plain who command and manually scan the output. However, that's difficult and error prone, especially on systems with lots of users. This time what you want to do is search through who's output for a particular user. Well, anytime you want to do searching, that's a job for the grep command, which prints lines matching the pattern given in its first argument. Suppose you're looking for user betsy because you really need that flag you ordered from her:
$ who | grep betsy                      Where is betsy?
betsy      pts/3        Dec 27 11:07    (flags-r-us.example.com)
Now that we know how to find a particular user, we can put the commands into a script, with the script's first argument being the username we want to find:
$ cat > finduser                        Create new file
#! /bin/sh

# finduser --- see if user named by first argument is logged in

who | grep $1
^D                                      End-of-file

$ chmod +x finduser                     Make it executable

$ ./finduser betsy                      Test it: find betsy
betsy      pts/3        Dec 27 11:07    (flags-r-us.example.com)

$ ./finduser benjamin                   Now look for good old Ben
benjamin   dtlocal      Dec 27 17:55    (kites.example.com)

$ 
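One possible refinement, offered as a sketch rather than as the authors' version, is to check that exactly one argument was supplied and to quote $1 so that an unusual argument is still passed to grep as a single pattern. The name finduser2 and the usage message are hypothetical.

```shell
cat > finduser2 <<'EOF'
#! /bin/sh

# finduser2 --- hypothetical variant of finduser that checks its argument

if [ $# -ne 1 ]
then
    echo "usage: finduser2 username" >&2
    exit 1
fi

# The quotes keep the argument together as one pattern; the -- stops
# grep from treating an argument that starts with a dash as an option.
who | grep -- "$1"
EOF

chmod +x finduser2

# Look for the current user; grep prints nothing if there is no match.
./finduser2 "$(id -un)" || echo "no login session found for $(id -un)"
```

Running finduser2 with no arguments prints the usage message on standard error and exits with status 1 instead of leaving grep without a pattern.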
Additional content appearing in this section has been removed.
Simple Execution Tracing
Because program development is a human activity, there will be times when your script just doesn't do what you want it to do. One way to get some idea of what your program is doing is to turn on execution tracing. This causes the shell to print out each command as it's executed, preceded by "+ "—that is, a plus sign followed by a space. (You can change what gets printed by assigning a new value to the PS4 shell variable.) For example:
$ sh -x nusers                          Run with tracing on
+ who                                   Traced commands
+ wc -l
      7                                 Actual output
You can turn execution tracing on within a script by using the command set -x, and turn it off again with set +x. This is more useful in fancier scripts, but here's a simple program to demonstrate:
$ cat > trace1.sh                       Create script
#! /bin/sh

set -x                                  Turn on tracing
echo 1st echo                           Do something

set +x                                  Turn off tracing
echo 2nd echo                           Do something else
^D                                      Terminate with end-of-file

$ chmod +x trace1.sh                    Make program executable

$ ./trace1.sh                           Run it
+ echo 1st echo                         First traced line
1st echo                                Output from command
+ set +x                                Next traced line
2nd echo                                Output from next command
When run, the set -x is not traced, since tracing isn't turned on until after that command completes. Similarly, the set +x is traced, since tracing isn't turned off until after it completes. The final echo isn't traced, since tracing is turned off at that point.
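The PS4 variable mentioned above can be tried in the same way. This sketch assumes a POSIX shell at /bin/sh; the script name trace2.sh and the prefix text are made up for illustration.

```shell
cat > trace2.sh <<'EOF'
#! /bin/sh

PS4='trace> '       # replaces the default "+ " trace prefix
set -x              # turn on tracing
echo hello
set +x              # turn off tracing (this command is still traced)
echo done
EOF

chmod +x trace2.sh
./trace2.sh
```

The trace lines are written to standard error, so redirecting with 2>/dev/null leaves only the regular output, hello and done.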
Additional content appearing in this section has been removed.
Internationalization and Localization
Writing software for an international audience is a challenging problem. The task is usually divided into two parts: internationalization (i18n for short, since that long word has 18 letters between the first and last), and localization (similarly abbreviated l10n).
Internationalization is the process of designing software so that it can be adapted for specific user communities without having to change or recompile the code. At a minimum, this means that all character strings must be wrapped in library calls that handle runtime lookup of suitable translations in message catalogs. Typically, the translations are specified in ordinary text files that accompany the software, and then are compiled by gencat or msgfmt into compact binary files organized for fast lookup. The compiled message catalogs are then installed in a system-specific directory tree, such as the GNU conventional /usr/share/locale and /usr/local/share/locale, or on commercial Unix systems, /usr/lib/nls or /usr/lib/locale. Details can be found in the manual pages for setlocale(3), catgets(3C), and gettext(3C).
Localization is the process of adapting internationalized software for use by specific user communities. This may require translating software documentation, and all text strings output by the software, and possibly changing the formats of currency, dates, numbers, times, units of measurement, and so on, in program output. The character set used for text may also have to be changed, unless the universal Unicode character set can be used, and different fonts may be required. For some languages, the writing direction has to be changed as well.
In the Unix world, ISO programming language standards and POSIX have introduced limited support for addressing these problems, but much remains to be done, and progress varies substantially across the various flavors of Unix. For the user, the feature that controls which language or cultural environment is in effect is called the
Additional content appearing in this section has been removed.
Summary
The choice of compiled language versus scripting language is usually made based on the need of the application. Scripting languages generally work at a higher level than compiled languages, and the loss in performance is often more than made up for by the speed with which development can be done and the ability to work at a higher level.
The shell is one of the most important and widely used scripting languages in the Unix environment. Because it is ubiquitous, and because of the POSIX standard, it is possible to write shell programs that will work on many different vendor platforms. Because the shell functions at a high level, shell programs have a lot of bang for the buck; you can do a lot with relatively little work.
The #! first line should be used for all shell scripts; this mechanism provides you with flexibility, and the ability to write scripts in your choice of shell or other language.
The shell is a full programming language. So far we have covered the basics of commands, options, arguments, and variables, and basic output with echo and printf. We also looked at the basic I/O redirection operators, <, >, >>, and |, with which we expect you're already familiar.
The shell looks for commands in each directory in $PATH. It's common to have a personal bin directory in which to store your own private programs and scripts, and to list it in PATH by doing an assignment in your .profile file.
We looked at the basics of accessing command-line arguments and simple execution tracing.
Finally, we discussed internationalization and localization, topics that are growing in importance as computer systems are adapted to the computing needs of more of the world's people. While support in this area for shell scripts is still limited, shell programmers need to be aware of the influence of locales on their code.
Additional content appearing in this section has been removed.
Chapter 3: Searching and Substitutions
As we discussed in Section 1.2, Unix programmers prefer to work on lines of text. Textual data is more flexible than binary data, and Unix systems provide a number of tools that make slicing and dicing text easy.
In this chapter, we look at two fundamental operations that show up repeatedly in shell scripting: text searching—looking for specific lines of text—and text substitution—changing the text that is found.
While you can accomplish many things by using simple constant text strings, regular expressions provide a much more powerful notation for matching many different actual text fragments with a single expression. This chapter introduces the two regular expression "flavors" provided by various Unix programs, and then proceeds to cover the most important tools for text extraction and rearranging.
Additional content appearing in this section has been removed.
Searching for Text
The workhorse program for finding text (or "matching text," in Unix jargon) is grep. On POSIX systems, grep can use either of the two regular expression flavors, or match simple strings.
Traditionally, there were three separate programs for searching through text files:
grep
The original text-matching program. It uses Basic Regular Expressions (BREs) as defined by POSIX, and as we describe later in the chapter.
egrep
"Extended grep." This program uses Extended Regular Expressions (EREs), which are a more powerful regular expression notation. The cost of EREs is that they can be more computationally expensive to use. On the original PDP-11s this was important; on modern systems, there is little difference.
fgrep
"Fast grep." This variant matches fixed strings instead of regular expressions using an algorithm optimized for fixed-string matching. The original version was also the only variant that could match multiple strings in parallel. In other words, grep and egrep could match only a single regular expression, whereas fgrep used a different algorithm that could match multiple strings, effectively testing each input line for a match against all the requested search strings.
The 1992 POSIX standard merged all three variants into one grep program whose behavior is controlled by different options. The POSIX version can match multiple patterns, even for BREs and EREs. Both fgrep and egrep were also available, but they were marked as "deprecated," meaning that they would be removed from a subsequent standard. And indeed, the 2001 POSIX standard only includes the merged
Additional content appearing in this section has been removed.
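As a rough illustration of the three matching styles, the sketch below uses the options that a POSIX grep provides for them: no option for BREs, -E for EREs, and -F for fixed strings, with -e to supply more than one pattern. The file fruits.txt and its contents are invented for the example.

```shell
# Sample data (hypothetical).
cat > fruits.txt <<'EOF'
apple
banana
cherry
a.c
abc
EOF

grep 'a.c' fruits.txt                # BRE: the dot matches any character
grep -F 'a.c' fruits.txt             # fixed string: the dot is literal
grep -E 'apple|cherry' fruits.txt    # ERE: alternation with |
grep -e apple -e banana fruits.txt   # two patterns in a single run
```

The first command prints both a.c and abc, whereas the -F version prints only a.c, which shows the practical difference between pattern matching and fixed-string matching.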
Regular Expressions
This section provides a brief review of regular expression construction and matching. In particular, it describes the POSIX BRE and ERE constructs, which are intended to formalize the two basic "flavors" of regular expressions found among most Unix utilities.
We expect that you've had some exposure to regular expressions and text matching prior to this book. In that case, these subsections summarize how you can expect to use regular expressions for portable shell scripting.
If you've had no exposure at all to regular expressions, the material here may be a little too condensed for you, and you should detour to a more introductory source, such as Learning the Unix Operating System (O'Reilly) or sed & awk (O'Reilly). Since regular expressions are a fundamental part of the Unix tool-using and tool-building paradigms, any investment you make in learning how to use them, and use them well, will be amply rewarded, multifold, time after time.
If, on the other hand, you've been chopping, slicing, and dicing text with regular expressions for years, you may find our coverage cursory. If such is the case, we recommend that you review the first part, which summarizes POSIX BREs and EREs in tabular form, skip the rest of the section, and move on to a more in-depth source, such as Mastering Regular Expressions (O'Reilly).
Regular expressions are a notation that lets you search for text that fits a particular criterion, such as "starts with the letter a." The notation lets you write a single expression that can select, or match, multiple data strings.
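As a minimal sketch of the "starts with the letter a" criterion, the following uses grep with an anchored pattern. The word list is hypothetical sample data, not taken from the book.

```shell
# Hypothetical sample data: one word per line.
words='apple
banana
avocado
cherry'

# ^ anchors the match at the start of the line, so ^a selects
# only the lines whose first character is the letter a.
starts_with_a=$(printf '%s\n' "$words" | grep '^a')
printf '%s\n' "$starts_with_a"
```

The single expression ^a selects both apple and avocado, which is what it means for one expression to match multiple data strings.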
Above and beyond traditional Unix regular expression notation, POSIX regular expressions let you:
  • Write regular expressions that express locale-specific character sequence orderings and equivalences
  • Write your regular expressions in a way that does not depend upon the underlying character set of the system
Additional content appearing in this section has been removed.
Working with Fields
For many applications, it's helpful to view your data as consisting of records and fields. A record is a single collection of related information, such as what a business might have for a customer, supplier, or employee, or what a school might have for a student. A field is a single component of a record, such as a last name, a first name, or a street address.
Because Unix encourages the use of textual data, it's common to store data in a text file, with each line representing a single record. There are two conventions for separating fields within a line from each other. The first is to just use whitespace (spaces or tabs):
$ cat myapp.data
# model     units sold     salesperson
xj11        23             jane
rj45        12             joe
cat6        65             chris
...
In this example, lines beginning with a # character represent comments, and are ignored. (This is a common convention. The ability to have comment lines is helpful, but it requires that your software be able to ignore such lines.) Each field is separated from the next by an arbitrary number of space or tab characters. The second convention is to use a particular delimiter character to separate fields, such as a colon:
$ cat myapp.data
# model:units sold:salesperson
xj11:23:jane
rj45:12:joe
cat6:65:chris
...
Each convention has advantages and disadvantages. When whitespace is the separator, it's difficult to have real whitespace within the fields' contents. (If you use a tab as the separator, you can use a space character within a field, but this is visually confusing, since you can't easily tell the difference just by looking at the file.) On the flip side, if you use an explicit delimiter character, it then becomes difficult to include that delimiter within your data. Often, though, it's possible to make a careful choice, so that the need to include the delimiter becomes minimal or nonexistent.
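The two conventions can be sketched with standard tools. This is only an illustration under assumptions: the temporary file stands in for myapp.data, and the choice of cut for delimited data and awk for whitespace-separated data is one reasonable pairing, not the only one.

```shell
# Hypothetical data file in the colon-delimited layout shown above.
datafile=$(mktemp)
cat > "$datafile" << 'EOF'
# model:units sold:salesperson
xj11:23:jane
rj45:12:joe
cat6:65:chris
EOF

# Drop comment lines, then keep fields 1 and 3 (model, salesperson).
colon_fields=$(grep -v '^#' "$datafile" | cut -d : -f 1,3)
printf '%s\n' "$colon_fields"

# Whitespace-separated variant: awk splits each line on runs of
# spaces or tabs, so the amount of padding does not matter.
ws_fields=$(printf 'xj11   23\tjane\nrj45 12   joe\n' | awk '{ print $1, $3 }')
printf '%s\n' "$ws_fields"

rm -f "$datafile"
```

Note that the software has to filter out the comment line itself; neither cut nor awk knows that # introduces a comment.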
Additional content appearing in this section has been removed.
Summary
The grep program is the primary tool for extracting interesting lines of text from input datafiles. POSIX mandates a single version with different options to provide the behavior traditionally obtained from the three grep variants: grep, egrep, and fgrep.
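A brief sketch of how the single POSIX grep covers the three traditional variants. The input lines are hypothetical; the -E, -F, and -e options are the POSIX ones.

```shell
# Hypothetical input lines.
input='cat6 cable
rj45 plug
a.b literal dot
axb no dot'

# Default: the pattern is a Basic Regular Expression.
bre=$(printf '%s\n' "$input" | grep '^rj')

# -E: the pattern is an Extended Regular Expression (the egrep role);
# | means alternation.
ere=$(printf '%s\n' "$input" | grep -E 'cat6|rj45')

# -F: the pattern is a fixed string (the fgrep role), so the dot
# matches only a literal dot.
fixed=$(printf '%s\n' "$input" | grep -F 'a.b')

# Several patterns can be supplied with repeated -e options.
multi=$(printf '%s\n' "$input" | grep -e '^cat' -e 'dot$')

printf '%s\n' "$bre" "$ere" "$fixed" "$multi"
```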
Although you can search for plain string constants, regular expressions provide a more powerful way to describe text to be matched. Most characters match themselves, whereas certain others act as metacharacters, specifying actions such as "match zero or more of," "match exactly 10 of," and so on.
POSIX regular expressions come in two flavors: Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs). Which programs use which regular expression flavor is based upon historical practice, with the POSIX specification reducing the number of regular expression flavors to just two. For the most part, EREs are a superset of BREs, but not completely.
Regular expressions are sensitive to the locale in which the program runs; in particular, ranges within a bracket expression should be avoided in favor of character classes such as [[:alnum:]]. Many GNU programs have additional metacharacters.
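To illustrate the advice about character classes, here is a small sketch that uses [[:alnum:]] and [[:space:]] instead of ranges such as a-z. The sample lines are hypothetical.

```shell
# Hypothetical input lines.
input='abc123
hello world
no_punct
x-y'

# Lines made up entirely of letters and digits, as defined by the
# current locale rather than by a hard-coded range.
alnum_only=$(printf '%s\n' "$input" | grep -E '^[[:alnum:]]+$')

# Lines that contain at least one whitespace character.
has_space=$(printf '%s\n' "$input" | grep '[[:space:]]')

printf '%s\n' "$alnum_only" "$has_space"
```

The underscore and the hyphen are not alphanumeric, so only abc123 passes the first test.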
sed is the primary tool for making simple string substitutions. Since, in our experience, most shell scripts use sed only for substitutions, we have purposely not covered everything sed can do. The sed & awk book listed in Chapter 16 provides more information.
The "longest leftmost" rule describes where text matches and for how long the match extends. This is important when doing text substitutions with sed, awk, or an interactive text editor. It is also important to understand when there is a distinction between a line and a string. In some programming languages, a single string may contain multiple lines, in which case ^ and $ usually apply to the beginning and end of the string.
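Both halves of the rule can be seen in a short sed sketch; the input strings are illustrative only.

```shell
# Longest: .* extends as far as it can, so the match runs from the
# T to the final y on the line, not to the first y.
longest=$(echo 'Tolstoy is worldly' | sed 's/T.*y/Camus/')
echo "$longest"

# Leftmost: b* may match the empty string, and the leftmost place
# where it can match is the empty string before the a.
leftmost=$(echo 'abc' | sed 's/b*/1/')
echo "$leftmost"
```

In the second command the b is left untouched, which often surprises people who expect the substitution to find the b first.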
For many operations, it's useful to think of each line in a text file as an individual record, with data in the line consisting of fields. Fields are separated by either whitespace or a special delimiter character, and different Unix tools are available to work with both kinds of data. The
Additional content appearing in this section has been removed.
Chapter 4: Text Processing Tools
Some operations on text files are so widely applicable that standard tools for those tasks were developed early in the Unix work at Bell Labs. In this chapter, we look at the most important ones.
Sorting Text
Text files that contain independent records of data are often candidates for sorting. A predictable record order makes life easier for human users: book indexes, dictionaries, parts catalogs, and telephone directories have little value if they are unordered. Sorted records can also make programming easier and more efficient, as we will illustrate with the construction of an office directory in Chapter 5.
Like awk, cut, and join, sort views its input as a stream of records made up of fields of variable width, with records delimited by newline characters and fields delimited by whitespace or a user-specifiable single character.
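As a hedged sketch of field-based sorting, the following sorts a few colon-delimited records on different fields. The records are hypothetical; -t and -k are the POSIX options for choosing the delimiter and the sort key.

```shell
# Hypothetical colon-delimited records: model:units:salesperson
records='xj11:23:jane
rj45:12:joe
cat6:65:chris'

# -t : sets the field delimiter; -k 2,2n restricts the key to
# field 2 and compares it numerically.
by_units=$(printf '%s\n' "$records" | sort -t : -k 2,2n)
printf '%s\n' "$by_units"

# Sort on the third field (salesperson) as text.
by_name=$(printf '%s\n' "$records" | sort -t : -k 3,3)
printf '%s\n' "$by_name"
```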
Additional content appearing in this section has been removed.
Removing Duplicates
It is sometimes useful to remove consecutive duplicate records from a data stream. We showed in Section 4.1.2 that sort -u would do that job, but we also saw that the elimination is based on matching keys rather than matching records. The uniq command provides another way to filter data: it is frequently used in a pipeline to eliminate duplicate records downstream from a sort operation:
sort ... | uniq | ...
uniq has three useful options that find frequent application. The -c option prefixes each output line with a count of the number of times that it occurred, and we will use it in the word-frequency filter in Example 5-5 in Chapter 5. The -d option shows only lines that are duplicated, and the -u option shows just the nonduplicate lines. Here are some examples:
$ cat latin-numbers                      # Show the test file
tres
unus
duo
tres
duo
tres

$ sort latin-numbers | uniq              # Show unique sorted records
duo
tres
unus

$ sort latin-numbers | uniq -c           # Count unique sorted records
      2 duo
      3 tres
      1 unus

$ sort latin-numbers | uniq -d           # Show only duplicate records
duo
tres

$ sort latin-numbers | uniq -u           # Show only nonduplicate records
unus
uniq is sometimes a useful complement to the diff utility for figuring out the differences between two similar data streams: dictionary word lists, pathnames in mirrored directory trees, telephone books, and so on. Most implementations have other options that you can find described in the manual pages for uniq(1), but their use is rare. Like sort, uniq is standardized by POSIX, so you can use it everywhere.
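Because uniq compares only adjacent records, the sort step in the pipeline matters. The following sketch, which assumes a temporary file holding the same six lines as latin-numbers, contrasts the two cases.

```shell
# Recreate the six-line test file in a temporary location.
datafile=$(mktemp)
printf '%s\n' tres unus duo tres duo tres > "$datafile"

# Without sorting, no two identical lines are adjacent, so uniq
# passes all six lines through unchanged.
unsorted=$(uniq "$datafile")
unsorted_count=$(printf '%s\n' "$unsorted" | wc -l | tr -d ' ')
echo "$unsorted_count"

# After sorting, identical lines are adjacent and collapse to one each.
sorted=$(sort "$datafile" | uniq)
printf '%s\n' "$sorted"

rm -f "$datafile"
```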
Additional content appearing in this section has been removed.
Reformatting Paragraphs
Most powerful text editors provide commands that make it easy to reformat paragraphs by changing line breaks so that lines do not exceed a width that is comfortable for a human to read; we used such commands a lot in writing this book. Sometimes you need to do this to a data stream in a shell script, or inside an editor that lacks a reformatting command but does have a shell escape. In this case, fmt is what you need. Although POSIX makes no mention of fmt, you can find it on every current flavor of Unix; if you have an older system that lacks fmt, simply install the GNU coreutils package.
Although some implementations of fmt have more options, only two find frequent use: -s means split long lines only, but do not join short lines to make longer ones, and -w n sets the output line width to n characters (default: usually about 75 or so). Here are some examples with chunks of a spelling dictionary that has just one word per line:
$ sed -n -e 9991,10010p /usr/dict/words | fmt        # Reformat 20 dictionary words
Graff graft graham grail grain grainy grammar grammarian grammatic
granary grand grandchild grandchildren granddaughter grandeur grandfather
grandiloquent grandiose grandma grandmother

$ sed -n -e 9995,10004p /usr/dict/words | fmt -w 30  # Reformat 10 words into short lines
grain grainy grammar
grammarian grammatic
granary grand grandchild
grandchildren granddaughter
If your system does not have /usr/dict/words, then it probably has an equivalent file named /usr/share/dict/words or /usr/share/lib/dict/words.
The split-only option, -s, is helpful in wrapping long lines while leaving short lines intact, and thus minimizing the differences from the original version:
$ fmt -s -w 10 << END_OF_DATA                        # Reformat long lines only
> one two three four five
> six
> seven
> eight
> END_OF_DATA
one two
three
four five
six
seven
eight
You might expect that you could split an input stream into one word per line with
Additional content appearing in this section has been removed.
Counting Lines, Words, and Characters
We have used the word-count utility, wc, a few times before. It is probably one of the oldest, and simplest, tools in the Unix toolbox, and POSIX standardizes it. By default, wc outputs a one-line report of the number of lines, words, and bytes:
$ echo This is a test of the emergency broadcast system | wc    # Report counts
      1       9      49
Request a subset of those results with the -c (bytes), -l (lines), and -w (words) options:
$ echo Testing one two three | wc -c     # Count bytes
22

$ echo Testing one two three | wc -l     # Count lines
1

$ echo Testing one two three | wc -w     # Count words
4
The -c option originally stood for character count, but with multibyte character-set encodings, such as UTF-8, in modern systems, bytes are no longer synonymous with characters, so POSIX introduced the -m option to count multibyte characters. For 8-bit character data, it is the same as -c.
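The difference between -c and -m only shows up with multibyte input. In this sketch the word café is written with the two UTF-8 bytes of é spelled as octal escapes; the locale name en_US.UTF-8 is an assumption and may not be installed on every system.

```shell
# "café" plus a newline; é is given as its two UTF-8 bytes
# (octal 303 251) so that the script itself stays plain ASCII.
bytes=$(printf 'caf\303\251\n' | wc -c | tr -d ' ')
echo "$bytes"     # c, a, f, two bytes for é, and the newline

# -m counts characters as interpreted under LC_CTYPE. The locale
# name is an assumption; if it is unavailable, wc typically falls
# back to the C locale and reports the byte count instead.
chars=$(printf 'caf\303\251\n' | LC_ALL=en_US.UTF-8 wc -m | tr -d ' ')
echo "$chars"
```

The tr -d ' ' step strips the leading padding that some wc implementations add, so the variables hold bare numbers.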
Although wc is most commonly used with input from a pipeline, it also accepts command-line file arguments, producing a one-line report for each, followed by a summary report:
$ wc /etc/passwd /etc/group              # Count data in two files
    26     68   1631 /etc/passwd
 10376  10376 160082 /etc/group
 10402  10444 161713 total
Modern versions of wc are locale-aware: set the environment variable LC_CTYPE to the desired locale to influence wc's interpretation of byte sequences as characters and word separators.
In Chapter 5, we will develop a related tool, wf, to report the frequency of occurrence of each word.
Additional content appearing in this section has been removed.