Chapter 4. Basic Shell Programming
If you have become familiar with the customization techniques we presented in the previous chapter, you have probably run into various modifications to your environment that you want to make but can’t — yet. Shell programming makes these possible.
The Korn shell has some of the most advanced programming capabilities of any command interpreter of its type. Although its syntax is nowhere near as elegant or consistent as that of most conventional programming languages, its power and flexibility are comparable. In fact, the Korn shell can be used as a complete environment for writing software prototypes.
Some aspects of Korn shell programming are really extensions of the customization techniques we have already seen, while others resemble traditional programming language features. We have structured this chapter so that if you aren’t a programmer, you can read this chapter and do quite a bit more than you could with the information in the previous chapter. Experience with a conventional programming language like Pascal or C is helpful (though not strictly necessary) for subsequent chapters. Throughout the rest of the book, we will encounter occasional programming problems, called tasks, whose solutions make use of the concepts we cover.
Shell Scripts and Functions
A script, or file that contains shell commands, is a shell program. Your .profile and environment files, discussed in Chapter 3, are shell scripts.
You can create a script using the text editor of your choice. Once you have created one, there are a number of ways to run it. One, which we have already covered, is to type . scriptname (i.e., the command is a dot). This causes the commands in the script to be read and run as if you typed them in.

Two more ways are to type ksh script or ksh < script. These explicitly invoke the Korn shell on the script, requiring that you (and your users) be aware that they are scripts.
The final way to run a script is simply to type its name and hit ENTER, just as if you were invoking a built-in command. This, of course, is the most convenient way. This method makes the script look just like any other Unix command, and in fact several “regular” commands are implemented as shell scripts (i.e., not as programs originally written in C or some other language), including spell, man on some systems, and various commands for system administrators. The resulting lack of distinction between “user command files” and “built-in commands” is one factor in Unix’s extensibility and, hence, its favored status among programmers.
You can run a script by typing its name only if . (the current directory) is part of your command search path, i.e., is included in your PATH variable (as discussed in Chapter 3). If . isn't on your path, you must type ./scriptname, which is really the same thing as typing the script's relative pathname (see Chapter 1).
Before you can invoke the shell script by name, you must also give it “execute” permission. If you are familiar with the Unix filesystem, you know that files have three types of permissions (read, write, and execute) and that those permissions apply to three categories of user (the file’s owner, a group of users, and everyone else). Normally, when you create a file with a text editor, the file is set up with read and write permission for you and read-only permission for everyone else.[46]
Therefore you must give your script execute permission explicitly, by using the chmod(1) command. The simplest way to do this is like so:
chmod +x scriptname
Your text editor preserves this permission if you make subsequent changes to your script. If you don't add execute permission to the script, and you try to invoke it, the shell prints the message:

ksh: scriptname: cannot execute [Permission denied]
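Putting the pieces together, here is a minimal sketch of creating and running a script; the name hello.sh and its one-line body are our own invention for illustration:

```shell
# Create a tiny script in the current directory.
cat > hello.sh <<'EOF'
echo "hello from a script"
EOF

# A freshly created file has no execute permission, so add it.
chmod +x hello.sh

# Now the script can be run by (relative) pathname.
./hello.sh
```

Without the chmod step, the final command would fail with the "cannot execute" message shown above.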
But there is a more important difference between the two ways of running shell scripts. While the “dot” method causes the commands in the script to be run as if they were part of your login session, the “just the name” method causes the shell to do a series of things. First, it runs another copy of the shell as a subprocess. The shell subprocess then takes commands from the script, runs them, and terminates, handing control back to the parent shell.
Figure 4-1 shows how the shell executes scripts. Assume you have a simple shell script called fred that contains the commands bob and dave. In Figure 4-1.a, typing . fred causes the two commands to run in the same shell, just as if you had typed them in by hand. Figure 4-1.b shows what happens when you type just fred: the commands run in the shell subprocess while the parent shell waits for the subprocess to finish.

You may find it interesting to compare this with the situation in Figure 4-1.c, which shows what happens when you type fred &. As you will recall from Chapter 1, the & makes the command run in the background, which is really just another term for "subprocess." It turns out that the only significant difference between Figure 4-1.c and Figure 4-1.b is that you have control of your terminal or workstation while the command runs; you need not wait until it finishes before you can enter further commands.
There are many ramifications to using shell subprocesses. An important one is that the exported environment variables that we saw in the last chapter (e.g., TERM, LOGNAME, PWD) are known in shell subprocesses, whereas other shell variables (such as any that you define in your .profile without an export statement) are not.
Other issues involving shell subprocesses are too complex to go into now; see Chapter 7 and Chapter 8 for more details about subprocess I/O and process characteristics, respectively. For now, just bear in mind that a script normally runs in a shell subprocess.
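A quick experiment makes the visibility rule concrete. The variable names are our own, and we spawn sh rather than ksh only so the sketch runs anywhere; the behavior is the same:

```shell
EXPORTED="seen"
export EXPORTED
private="hidden"     # defined without an export statement

# The subprocess inherits only the exported variable; the
# unexported one expands to the null string there.
sh -c 'echo "exported=$EXPORTED private=$private"'
# prints: exported=seen private=
```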
Functions
The Korn shell’s function feature is an expanded version of a similar facility in the System V Bourne shell and a few other shells. A function is sort of a script-within-a-script; you use it to define some shell code by name and store it in the shell’s memory, to be invoked and run later.
Functions improve the shell’s programmability significantly, for two main reasons. First, when you invoke a function, it is already in the shell’s memory (except for automatically loaded functions; see Section 4.1.1.1, later in this chapter); therefore a function runs faster. Modern computers have plenty of memory, so there is no need to worry about the amount of space a typical function takes up. For this reason, most people define as many functions as possible rather than keep lots of scripts around.
The other advantage of functions is that they are ideal for organizing long shell scripts into modular “chunks” of code that are easier to develop and maintain. If you aren’t a programmer, ask one what life would be like without functions (also called procedures or subroutines in other languages) and you’ll probably get an earful.
To define a function, you can use either one of two forms:

function functname {      # Korn shell semantics
    shell commands
}

or:

functname () {            # POSIX semantics
    shell commands
}

The first form provides access to the full power and programmability of the Korn shell. The second is compatible with the syntax for shell functions introduced in the System V Release 2 Bourne shell. This form obeys the semantics of the POSIX standard, which are less powerful than full Korn shell-style functions. (We discuss the differences in detail shortly.) We always use the first form in this book. You can delete a function definition with the command unset -f functname.
When you define a function, you tell the shell to store its name and definition (i.e., the shell commands it contains) in memory. If you want to run the function later, just type in its name followed by any arguments, as if it were a shell script.
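For example, here is a sketch of defining and invoking a trivial function; the name greet is ours, and we show the portable POSIX form so the example runs in any shell (in the Korn shell you would more likely write function greet { ... }):

```shell
# Define the function: its name and body are stored in
# the shell's memory.
greet() {
    echo "hello, $1"
}

# Invoke it by name with an argument, just like a script.
greet world
# prints: hello, world
```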
You can find out what functions are defined in your login session by typing functions.[47] (Note the s at the end of the command name.) The shell will print not just the names but also the definitions of all functions, in alphabetical order by function name. Since this may result in long output, you might want to pipe the output through more or redirect it to a file for examination with a text editor.
Apart from the advantages, there are two important differences between functions and scripts. First, functions do not run in separate processes, as scripts do when you invoke them by name; the “semantics” of running a function are more like those of your .profile when you log in or any script when invoked with the “dot” command. Second, if a function has the same name as a script or executable program, the function takes precedence.
This is a good time to show the order of precedence for the various sources of commands. When you type a command to the shell, it looks in the following places until it finds a match:
1. Keywords, such as function and several others (e.g., if and for) that we will see in Chapter 5

2. Aliases (although you can't define an alias whose name is a shell keyword, you can define an alias that expands to a keyword, e.g., alias aslongas=while; see Chapter 7 for more details)

3. Special built-ins, such as break and continue (the full list is . (dot), :, alias, break, continue, eval, exec, exit, export, login, newgrp, readonly, return, set, shift, trap, typeset, unalias, and unset)

4. Functions

5. Non-special built-ins, such as cd and whence

6. Scripts and executable programs, for which the shell searches in the directories listed in the PATH environment variable
We’ll examine this process in more detail in the section on command-line processing in Chapter 7.
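You can watch the precedence order in action by defining a function that shadows an external command. This example is our own, using the portable function form:

```shell
# A function takes precedence over anything found via $PATH,
# so this definition shadows the real date(1) command.
date() {
    echo "function date, not /bin/date"
}

date                 # runs the function
# prints: function date, not /bin/date

unset -f date        # remove the function; date(1) is back
```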
If you need to know the exact source of a command, there is an option to the whence built-in command that we saw in Chapter 3. whence by itself will print the pathname of a command if the command is a script or executable program, but it will only parrot the command's name back if it is anything else. But if you type whence -v commandname, you get more complete information, such as:

$ whence -v cd
cd is a shell builtin
$ whence -v function
function is a keyword
$ whence -v man
man is a tracked alias for /usr/bin/man
$ whence -v ll
ll is an alias for 'ls -l'
For compatibility with the System V Bourne shell, the Korn shell predefines the alias type='whence -v'. This definitely makes the transition to the Korn shell easier for long-time Bourne shell users; type is similar to whence. The whence command actually has several options, described in Table 4-1.
Table 4-1. Options for the whence command

Option    Meaning
-a        Print all interpretations of given name.
-f        Skip functions in search for name.
-p        Search $PATH, even if name is a built-in or function.
-v        Print more verbose description of name.
Throughout the remainder of this book we refer mainly to scripts, but unless we note otherwise, you should assume that whatever we say applies equally to functions.
Automatically loading functions
At first glance, it would seem that the best place to put your own function definitions is in your .profile or environment file. This is great for interactive use, since your login shell reads ~/.profile, and other interactive shells read the environment file. However, any shell scripts that you write don’t read either file. Furthermore, as your collection of functions grows, so too do your initialization files, making them hard to work with.
ksh93 works around both of these issues by integrating the search for functions with the search for commands. Here’s how it works:
Create a directory to hold your function definitions. This can be your private bin directory, or you may wish to have a separate directory, such as ~/funcs. For the sake of discussion, assume the latter.
In your .profile file, add this directory to both the variables PATH and FPATH:

PATH=$PATH:~/funcs
FPATH=~/funcs
export PATH FPATH

In ~/funcs, place the definition of each of your functions into a separate file. Each function's file should have the same name as the function:

$ mkdir ~/funcs
$ cd ~/funcs
$ cat > whoson
# whoson --- create a sorted list of logged-on users
function whoson {
    who | awk '{ print $1 }' | sort -u
}
^D
Now, the first time you type whoson, the shell looks for a command named whoson using the search order described earlier. It will not be found as a special built-in, as a function, or as a regular built-in. The shell then starts a search along $PATH. When it finally finds ~/funcs/whoson, the shell notices that ~/funcs is also in $FPATH. ("Aha!" says the shell.) When this is the case, the shell expects to find the definition of the function named whoson inside the file. It reads and executes the entire contents of the file and only then runs the function whoson, with any supplied arguments. (If the file found in both $PATH and $FPATH doesn't actually define the function, you'll get a "not found" error message.)
The next time you type whoson, the function is already defined, so the shell finds it immediately, without the need for the path search.

Note that directories listed in FPATH but not in PATH won't be searched for functions, and that as of ksh93l, the current directory must be listed in FPATH via an explicit dot; a leading or trailing colon doesn't cause the current directory to be searched.
As a final wrinkle, starting with ksh93m, each directory named in PATH may contain a file named .paths. This file may contain comments, blank lines, and specialized variable assignments. The first allowed assignment is to FPATH, where the value should name an existing directory. If that directory contains a file whose name matches the function being searched for, that file is read and executed as if via the . (dot) command, and then the function is executed.

In addition, one other environment variable may be assigned to. The intended use of this is to specify a relative or absolute path for a library directory containing the shared libraries for executables in the current bin directory. On many Unix systems, this variable is LD_LIBRARY_PATH, but some systems have a different variable; check your local documentation. The given value is prepended to the existing value of the variable when the command is executed. (This mechanism may open security holes. System administrators should use it with caution!)
For example, the AT&T Advanced Software Tools group that distributes ksh93 also has many other tools, often installed in a separate ast/bin directory. This feature allows the ast programs to find their shared libraries, without the user having to manually adjust LD_LIBRARY_PATH in the .profile file.[48] For example, if a command is found in /usr/local/ast/bin, and the .paths file in that directory contains the assignment LD_LIBRARY_PATH=../lib, the shell prepends /usr/local/ast/lib: to the value of LD_LIBRARY_PATH before running the command.
Readers familiar with ksh88 will notice that this part of the shell's behavior has changed significantly. Since ksh88 always read the environment file, whether or not the shell was interactive, it was simplest to just put function definitions there. However, this could still yield a large, unwieldy file. To get around this, you could create files in one or more directories listed in $FPATH. Then, in the environment file, you would mark the functions as being autoloaded:

autoload whoson ...

Marking a function with autoload[49] tells the shell that this name is a function, and to find the definition by searching $FPATH. The advantage to this is that the function is not loaded into the shell's memory if it's not needed. The disadvantage is that you have to explicitly list all your functions in your environment file.
ksh93's integration of PATH and FPATH searching thus simplifies the way you add shell functions to your personal shell function "library."
POSIX functions
As mentioned earlier, functions defined using the POSIX syntax obey POSIX semantics and not Korn shell semantics:
functname () {
    shell commands
}
The best way to understand this is to think of a POSIX function as being like a dot script. Actions within the body of the function affect all the state of the current script. In contrast, Korn shell functions have much less shared state with the parent shell, although they are not identical to totally separate scripts.
The technical details follow; they include information that we haven’t covered yet. So come back and reread this section after you’ve learned about the typeset command in Chapter 6 and about traps in Chapter 8.
POSIX functions share variables with the parent script. Korn shell functions can have their own local variables.
POSIX functions share traps with the parent script. Korn shell functions can have their own local traps.
POSIX functions cannot be recursive (call themselves).[50] Korn shell functions can.
When a POSIX function is run, $0 is not changed to the name of the function.
If you use the dot command with the name of a Korn shell function, that function will obey POSIX semantics, affecting all the state (variables and traps) of the parent shell:
$ function demo {                           Define a Korn shell function
> typeset myvar=3                           Set a local variable myvar
> print "demo: myvar is $myvar"
> }
$ myvar=4                                   Set the global myvar
$ demo ; print "global: myvar is $myvar"    Run the function
demo: myvar is 3
global: myvar is 4
$ . demo                                    Run with POSIX semantics
demo: myvar is 3
$ print "global: myvar is $myvar"           See the results
global: myvar is 3
Shell Variables
A major piece of the Korn shell's programming functionality relates to shell variables. We've already seen the basics of variables. To recap briefly: they are named places to store data, usually in the form of character strings, and their values can be obtained by preceding their names with dollar signs ($). Certain variables, called environment variables, are conventionally named in all capital letters, and their values are made known (with the export statement) to subprocesses.
This section presents the basics for shell variables. Discussion of certain advanced features is delayed until later in the chapter, after covering regular expressions.
If you are a programmer, you already know that just about every major programming language uses variables in some way; in fact, an important way of characterizing differences between languages is comparing their facilities for variables.
The chief difference between the Korn shell’s variable schema and those of conventional languages is that the Korn shell’s schema places heavy emphasis on character strings. (Thus it has more in common with a special-purpose language like SNOBOL than a general-purpose one like Pascal.) This is also true of the Bourne shell and the C shell, but the Korn shell goes beyond them by having additional mechanisms for handling integers and double-precision floating point numbers explicitly, as well as simple arrays.
Positional Parameters
As we have already seen, you can define values for variables with statements of the form varname=value, e.g.:

$ fred=bob
$ print "$fred"
bob
Some environment variables are predefined by the shell when you log in. There are other built-in variables that are vital to shell programming. We look at a few of them now and save the others for later.
The most important special built-in variables are called positional parameters. These hold the command-line arguments to scripts when they are invoked. Positional parameters have names 1, 2, 3, etc., meaning that their values are denoted by $1, $2, $3, etc. There is also a positional parameter 0, whose value is the name of the script (i.e., the command typed in to invoke it).
Two special variables contain all of the positional parameters (except positional parameter 0): * and @. The difference between them is subtle but important, and it's apparent only when they are within double quotes.

"$*" is a single string that consists of all of the positional parameters, separated by the first character in the variable IFS (internal field separator), which is a space, TAB, and newline by default. On the other hand, "$@" is equal to "$1" "$2" ... "$N", where N is the number of positional parameters. That is, it's equal to N separate double-quoted strings, which are separated by spaces. We'll explore the ramifications of this difference in a little while.
The variable # holds the number of positional parameters (as a character string). All of these variables are "read-only," meaning that you can't assign new values to them within scripts. (They can be changed, just not via assignment. See Section 4.2.1.2, later in this chapter.)
For example, assume that you have the following simple shell script:
print "fred: $*"
print "$0: $1 and $2"
print "$# arguments"
Assume further that the script is called fred. Then if you type fred bob dave, you will see the following output:

fred: bob dave
fred: bob and dave
2 arguments
In this case, $3, $4, etc., are all unset, which means that the shell substitutes the empty (or null) string for them (unless the option nounset is turned on).
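A short sketch of both behaviors; the arguments are our own, and we use set -- to simulate a script invoked with two arguments:

```shell
set -- bob dave          # pretend the script got two arguments

# $3 is unset, so it substitutes the null string:
echo "third argument: '$3'"
# prints: third argument: ''

# With nounset on, referencing an unset parameter is an error
# instead; we try it in a subshell so this script survives.
(set -o nounset; echo "$3") 2>/dev/null || echo "nounset: error"
# prints: nounset: error
```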
Positional parameters in functions
Shell functions use positional parameters and special variables like * and # in exactly the same way that shell scripts do. If you wanted to define fred as a function, you could put the following in your .profile or environment file:

function fred {
    print "fred: $*"
    print "$0: $1 and $2"
    print "$# arguments"
}

You get the same result if you type fred bob dave.
Typically, several shell functions are defined within a single shell script. Therefore each function needs to handle its own arguments, which in turn means that each function needs to keep track of positional parameters separately. Sure enough, each function has its own copies of these variables (even though functions don’t run in their own subprocess, as scripts do); we say that such variables are local to the function.
Other variables defined within functions are not local; they are global, meaning that their values are known throughout the entire shell script.[51] For example, assume that you have a shell script called ascript that contains this:
function afunc {
    print in function $0: $1 $2
    var1="in function"
}

var1="outside of function"
print var1: $var1
print $0: $1 $2
afunc funcarg1 funcarg2
print var1: $var1
print $0: $1 $2
If you invoke this script by typing ascript arg1 arg2, you will see this output:

var1: outside of function
ascript: arg1 arg2
in function afunc: funcarg1 funcarg2
var1: in function
ascript: arg1 arg2
In other words, the function afunc changes the value of the variable var1 from "outside of function" to "in function," and that change is known outside the function, while $0, $1, and $2 have different values in the function and the main script. Figure 4-2 shows this graphically.
It is possible to make other variables local to functions by using the typeset command, which we'll see in Chapter 6. Now that we have this background, let's take a closer look at "$@" and "$*". These variables are two of the shell's greatest idiosyncrasies, so we'll discuss some of the most common sources of confusion.
Why are the elements of "$*" separated by the first character of IFS instead of just spaces? To give you output flexibility. As a simple example, let's say you want to print a list of positional parameters separated by commas. This script would do it:

IFS=,
print "$*"

Changing IFS in a script is fairly risky, but it's probably OK as long as nothing else in the script depends on it. If this script were called arglist, the command arglist bob dave ed would produce the output bob,dave,ed. Chapter 10 contains another example of changing IFS.

Why does "$@" act like N separate double-quoted strings? To allow you to use them again as separate values. For example, say you want to call a function within your script with the same list of positional parameters, like this:

function countargs {
    print "$# args."
}

Assume your script is called with the same arguments as arglist above. Then if it contains the command countargs "$*", the function prints 1 args. But if the command is countargs "$@", the function prints 3 args.

Being able to retrieve the arguments as they came in is also important in case you need to preserve any embedded white space. If your script was invoked with the arguments "hi", "howdy", and "hello there", here are the different results you might get:

$ countargs $*
4 args
$ countargs "$*"
1 args
$ countargs $@
4 args
$ countargs "$@"
3 args

Because "$@" always exactly preserves arguments, we use it in just about all the example programs in this book.
Changing the positional parameters
Occasionally, it's useful to change the positional parameters. We've already mentioned that you cannot set them directly, using an assignment such as 1="first". However, the built-in command set can be used for this purpose.
The set command is perhaps the single most complicated and overloaded command in the shell. It takes a large number of options, which are discussed in Chapter 9. What we care about for the moment is that additional non-option arguments to set replace the positional parameters.

Suppose our script was invoked with the three arguments "bob", "fred", and "dave". Then countargs "$@" tells us that we have three arguments. Upon using set to change the positional parameters, $# is updated too.

$ set one two three "four not five"    Change the positional parameters
$ countargs "$@"                       Verify the change
4 args
The set command also works inside a shell function. The shell function's positional parameters are changed, but not those of the calling script:

$ function testme {
> countargs "$@"              Show the original number of parameters
> set a b c                   Now change them
> countargs "$@"              Print the new count
> }
$ testme 1 2 3 4 5 6          Run the function
6 args                        Original count
3 args                        New count
$ countargs "$@"              No change to invoking shell's parameters
4 args
More on Variable Syntax
Before we show the many things you can do with shell variables, we have to make a confession: the syntax of $varname for taking the value of a variable is not quite accurate. Actually, it's the simple form of the more general syntax, which is ${varname}.
Why two syntaxes? For one thing, the more general syntax is necessary if your code refers to more than nine positional parameters: you must use ${10} for the tenth instead of $10. (This ensures compatibility with the Bourne shell, where $10 means ${1}0.) Aside from that, consider the Chapter 3 example of setting your primary prompt variable (PS1) to your login name:
PS1="($LOGNAME)-> "
This happens to work because the right parenthesis immediately following LOGNAME isn't a valid character for a variable name, so the shell doesn't mistake it for part of the variable name. Now suppose that, for some reason, you want your prompt to be your login name followed by an underscore. If you type:

PS1="$LOGNAME_ "

then the shell tries to use "LOGNAME_" as the name of the variable, i.e., to take the value of $LOGNAME_. Since there is no such variable, the value defaults to null (the empty string, ""), and PS1 is set just to a single space.
For this reason, the full syntax for taking the value of a variable is ${varname}. So if we used:

PS1="${LOGNAME}_ "

we would get the desired yourname_. It is safe to omit the curly braces ({}) if the variable name is followed by a character that isn't a letter, digit, or underscore.
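Here is the difference in miniature; we assign LOGNAME ourselves so the sketch is self-contained, and use echo rather than the Korn shell's print so it runs in any shell:

```shell
LOGNAME=yourname

# Without braces, the underscore is taken as part of the variable
# name; the nonexistent LOGNAME_ expands to the null string:
echo "[$LOGNAME_ ]"
# prints: [ ]

# With braces, the variable name ends exactly where you intend:
echo "[${LOGNAME}_ ]"
# prints: [yourname_ ]
```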
Appending to a Variable
As mentioned, Korn shell variables tend to be string-oriented. One operation that’s very common is to append a new value onto an existing variable. (For example, collecting a set of options into a single string.) Since time immemorial, this was done by taking advantage of variable substitution inside double quotes:
myopts="$myopts $newopt"
The values of myopts and newopt are concatenated together into a single string, and the result is then assigned back to myopts. Starting with ksh93j, the Korn shell provides a more efficient and intuitive mechanism for doing this:

myopts+=" $newopt"

This accomplishes the same thing, but it is more efficient, and it also makes it clear that the new value is being added onto the string. (In C, the += operator adds the value on the right to the variable on the left; x += 42 is the same as x = x + 42.)
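For instance, collecting command options into a string might look like this sketch; the variable names and option letters are our own:

```shell
myopts="-l"

# The traditional way: substitution inside double quotes.
myopts="$myopts -t"      # myopts is now "-l -t"

# The ksh93j-and-newer way: append with +=.
myopts+=" -r"            # myopts is now "-l -t -r"

echo "$myopts"
# prints: -l -t -r
```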
Compound Variables
ksh93 introduces a new feature, called compound variables. They are similar in nature to a Pascal or Ada record or a C struct, and they allow you to group related items together under the same name. Here are some examples:

now="May 20 2001 19:44:57"    Assign current date to variable now
now.hour=19                   Set the hour
now.minute=44                 Set the minute
...
Note the use of the period in the variable's name. Here, now is called the parent variable, and it must exist (i.e., have a value) before you can assign a value to an individual component (such as hour or minute). To access a compound variable, you must enclose the variable's name in curly braces. If you don't, the period ends the shell's scan for the variable's name:

$ print ${now.hour}
19
$ print $now.hour
May 20 2001 19:44:57.hour
Compound Variable Assignment
Assigning to individual elements of a compound variable is tedious. In particular, the requirement that the parent variable exist previously leads to an awkward programming style:

person="John Q. Public"
person.firstname=John
person.initial=Q.
person.lastname=Public
Fortunately, you can use a compound assignment to do it all in one fell swoop:
person=(firstname=John initial=Q. lastname=Public)
You can retrieve the value of either the entire variable, or a component, using print.

$ print $person                 Simple print
( lastname=Public initial=Q. firstname=John )
$ print -r "$person"            Print in full glory
(
        lastname=Public
        initial=Q.
        firstname=John
)
$ print ${person.initial}       Print just the middle initial
Q.
The second print command preserves the whitespace that the Korn shell provides when returning the value of a compound variable. The -r option to print is discussed in Chapter 7.
Note
The order of the components is different from what was used in the initial assignment. This order depends upon how the Korn shell manages compound variables internally and cannot be controlled by the programmer.
A second assignment syntax exists, similar to the first:
person=(typeset firstname=John initial=Q. lastname=Public ; typeset -i age=42)
By using the typeset command, you can specify that a variable is a number instead of a string. Here, person.age is an integer variable. The rest remain strings. The typeset command and its options are presented in Chapter 6. (You can also use readonly to declare that a component variable cannot be changed.)
Just as you may use += to append to a regular variable, you can add components to a compound variable as well:

person+= (typeset spouse=Jane)

A space is allowed after the = but not before. This is true for compound assignments with both = and +=.
The Korn shell has additional syntaxes for compound assignment that apply only to array variables; they are also discussed in Chapter 6.
Finally, we'll mention that the Korn shell has a special compound variable named .sh. The various components almost all relate to features we haven't covered yet, except ${.sh.version}, which tells you the version of the Korn shell that you have:

$ print ${.sh.version}
Version M 1993-12-28 m

We will see another component of .sh later in this chapter, and the other components are covered as we introduce the features they relate to.
Indirect Variable References (namerefs)
Most of the time, as we've seen so far, you manipulate variables directly, by name (x=1, for example). The Korn shell allows you to manipulate variables indirectly, using something called a nameref. You create a nameref using typeset -n, or the more convenient predefined alias, nameref. Here is a simple example:

$ name="bill"                  Set initial value
$ nameref firstname=name       Set up the nameref
$ print $firstname             Actually references variable name
bill
$ firstname="arnold"           Now change the indirect reference
$ print $name                  Shazzam! Original variable is changed
arnold
To find out the name of the real variable being referenced by the nameref, use ${!variable}:

$ print ${!firstname}
name
At first glance, this doesn’t seem to be very useful. The power of namerefs comes into play when you pass a variable’s name to a function, and you want that function to be able to update the value of that variable. The following example illustrates how it works:
$ date                                   Current day and time
Wed May 23 17:49:44 IDT 2001
$ function getday {                      Define a function
>     typeset -n day=$1                  Set up the nameref
>     day=$(date | awk '{ print $1 }')   Actually change it
> }
$ today=now                              Set initial value
$ getday today                           Run the function
$ print $today                           Display new value
Wed
The default output of date(1) looks like this:
$ date
Wed Nov 14 11:52:38 IST 2001
The getday function uses awk to print the first field, which is the day of
the week. The result of this operation, which is done inside command substitution
(described later in this chapter), is assigned to the local variable day
. But day
is a nameref; the
assignment actually updates the global variable today
.
Without the nameref facility, you have to resort to advanced tricks like using eval (see Chapter 7) to make
something like this happen.
To remove a nameref, use unset -n
, which removes the
nameref itself, instead of unsetting the variable the nameref is a reference to.
Finally, note that variables that are namerefs may not have periods in their names
(i.e., be components of a compound variable). They may, though, be references to a
compound variable.
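The difference between unset -n and plain unset is easy to see in a few lines. This is a sketch, not definitive: it uses ksh93's typeset -n, and echo rather than print so the same commands also behave identically under bash 4.3 and later.

```shell
# Remove a nameref without disturbing the variable it points to.
x=hello
typeset -n ref=x      # ref is now a nameref for x
echo "$ref"           # prints hello -- reading ref really reads x
unset -n ref          # removes the nameref itself...
echo "$x"             # prints hello -- ...so x is untouched
```

Had we used plain unset ref instead, the shell would have unset x, the variable the nameref refers to.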
String Operators
The curly-brace syntax allows for the shell’s string operators. String operators allow you to manipulate values of variables in various useful ways without having to write full-blown programs or resort to external Unix utilities. You can do a lot with string-handling operators even if you haven’t yet mastered the programming features we’ll see in later chapters.
In particular, string operators let you do the following:
Ensure that variables exist (i.e., are defined and have non-null values)
Set default values for variables
Catch errors that result from variables not being set
Remove portions of variables’ values that match patterns
Syntax of String Operators
The basic idea behind the syntax of string operators is that special characters that denote operations are inserted between the variable’s name and the right curly brace. Any argument that the operator may need is inserted to the operator’s right.
The first group of string-handling operators tests for the existence of variables and allows substitutions of default values under certain conditions. These are listed in Table 4-2.
Table 4-2. Substitution operators

${varname:-word}
        If varname exists and isn’t null, return its value; otherwise
        return word.
        Purpose: Returning a default value if the variable is undefined.

${varname:=word}
        If varname exists and isn’t null, return its value; otherwise set
        it to word and then return its value.[a]
        Purpose: Setting a variable to a default value if it is undefined.

${varname:?message}
        If varname exists and isn’t null, return its value; otherwise
        print varname: message and abort the current command or script.
        Purpose: Catching errors that result from variables being
        undefined.

${varname:+word}
        If varname exists and isn’t null, return word; otherwise return
        null.
        Purpose: Testing for the existence of a variable.

[a] Pascal, Modula, and Ada programmers may find it helpful to recognize the similarity of this to the assignment operators in those languages.
The colon (:
) in each of these operators is actually optional. If the colon is
omitted, then change “exists and isn’t null” to “exists” in each definition, i.e.,
the operator tests for existence only.
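The distinction only matters for a variable that is set but null, as this small sketch (with illustrative variable names) shows:

```shell
empty=""                # set, but null
unset notset            # guaranteed not to exist

echo "${empty:-word}"   # prints word: with the colon, null counts as missing
echo "${empty-word}"    # prints an empty line: empty exists, so it wins
echo "${notset-word}"   # prints word: the variable doesn't exist at all
```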
The first two of these operators are ideal for setting defaults for command-line arguments in case the user omits them. We’ll actually use all four in Task 4-1, which is our first programming task.
By far the best approach to this type of script is to use built-in Unix utilities, combining them with I/O redirectors and pipes. This is the classic “building-block” philosophy of Unix that is another reason for its great popularity with programmers. The building-block technique lets us write a first version of the script that is only one line long:
sort -nr "$1" | head -${2:-10}
Here is how this works: the sort(1) program sorts the data in the file
whose name is given as the first argument ($1
). (The
double quotes allow for spaces or other unusual characters in file names, and also
prevent wildcard expansion.) The -n option
tells sort to interpret the first word on
each line as a number (instead of as a character string); the -r tells it to reverse the comparisons, so as
to sort in descending order.
The output of sort is piped into the head(1) utility, which, when given the
argument -N, prints the first N lines of its input on the standard output.
The expression -${2:-10}
evaluates to a dash (-
) followed by the second argument, if it is given, or
to 10
if it’s not; notice that the variable in this
expression is 2
, which is the second positional
parameter.
Assume the script we want to write is called highest. Then if the user types highest myfile
, the line that actually runs is:
sort -nr myfile | head -10
Or if the user types highest myfile 22
, the line
that runs is:
sort -nr myfile | head -22
Make sure you understand how the :-
string operator
provides a default value.
This is a perfectly good, runnable script — but it has a few problems. First, its one line is a bit cryptic. While this isn’t much of a problem for such a tiny script, it’s not wise to write long, elaborate scripts in this manner. A few minor changes make the code more readable.
First, we can add comments to the code; anything between # and the end of a line is a comment. At minimum, the script should start with a few comment lines that indicate what the script does and the arguments it accepts. Next, we can improve the variable names by assigning the values of the positional parameters to regular variables with mnemonic names. Last, we can add blank lines to space things out; blank lines, like comments, are ignored. Here is a more readable version:
# highest filename [howmany]
#
# Print howmany highest-numbered lines in file filename.
# The input file is assumed to have lines that start with
# numbers. Default for howmany is 10.

filename=$1
howmany=${2:-10}
sort -nr "$filename" | head -$howmany
The square brackets around howmany
in the comments
adhere to the convention in Unix documentation that square brackets denote
optional arguments.
The changes we just made improve the code’s readability but not how it runs. What
if the user invoked the script without any arguments? Remember that positional
parameters default to null if they aren’t defined. If there are no arguments, then
$1
and $2
are both
null. The variable howmany
($2
) is set up to default to 10, but there is no default for filename
($1
). The result
would be that this command runs:
sort -nr | head -10
As it happens, if sort is called without a filename argument, it expects input to come from standard input, e.g., a pipe (|) or a user’s keyboard. Since it doesn’t have the pipe, it will expect the keyboard. This means that the script will appear to hang! Although you could always type CTRL-D or CTRL-C to get out of the script, a naive user might not know this.
Therefore we need to make sure that the user supplies at least one argument. There are a few ways of doing this; one of them involves another string operator. We’ll replace the line:
filename=$1
with:
filename=${1:?"filename missing."}
This causes two things to happen if a user invokes the script without any arguments: first, the shell prints the somewhat unfortunate message to the standard error output:
highest: line 1: : filename missing.
Second, the script exits without running the remaining code.
With a somewhat “kludgy” modification, we can get a slightly better error message. Consider this code:
filename=$1
filename=${filename:?"missing."}
This results in the message:
highest: line 2: filename: missing.
(Make sure you understand why.) Of course, there are ways of printing whatever message is desired; we’ll find out how in Chapter 5.
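You can watch :? abort a command without logging yourself out by wrapping the expansion in a subshell. This is a sketch that assumes filename is unset; it is the subshell, not your login shell, that exits:

```shell
unset filename

# The expansion fails, printing a message to standard error (discarded
# here) and terminating the subshell with a nonzero exit status.
( : "${filename:?filename missing.}" ) 2>/dev/null
if [ $? -ne 0 ]; then
    echo "the script would have exited here"
fi
```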
Before we move on, we’ll look more closely at the two remaining operators in Table 4-2 and see how we can
incorporate them into our task solution. The :=
operator does roughly the same thing as :-
, except
that it has the side effect of setting the value of the variable to the given word
if the variable doesn’t exist.
Therefore we would like to use :=
in our script in
place of :-
, but we can’t; we’d be trying to set the
value of a positional parameter, which is not allowed. But if we replaced:
howmany=${2:-10}
with just:
howmany=$2
and moved the substitution down to the actual command line (as we did at the
start), then we could use the :=
operator:
sort -nr "$filename" | head -${howmany:=10}
Using :=
has the added benefit of setting the value
of howmany
to 10 in case we need it afterwards in
later versions of the script.
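A quick sketch makes the side effect visible; howmany keeps the default value after the command line that used it:

```shell
unset howmany
echo "head -${howmany:=10}"   # substitutes 10 on the command line...
echo "$howmany"               # ...and howmany is now set to 10
```

(This works on a regular variable like howmany; as noted above, it cannot be applied directly to a positional parameter.)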
The final substitution operator is :+
. Here is how
we can use it in our example: let’s say we want to give the user the option of
adding a header line to the script’s output. If he types the option -h, the output will be preceded by the line:
ALBUMS ARTIST
Assume further that this option ends up in the variable header
, i.e., $header
is -h
if the option is set or null if not. (Later we see
how to do this without disturbing the other positional parameters.)
The expression:
${header:+"ALBUMS ARTIST\n"}
yields null if the variable header
is null or ALBUMS ARTIST\n
if it is non-null. This means that we
can put the line:
print -n ${header:+"ALBUMS ARTIST\n"}
right before the command line that
does the actual work. The -n option to
print causes it not to print a newline after printing its arguments.
Therefore this print statement prints
nothing — not even a blank line — if header
is null;
otherwise it prints the header line and a newline (\n
).
Patterns and Regular Expressions
We’ll continue refining our
solution to Task 4-1 later in this chapter. The next type of string
operator is used to match portions of a variable’s string value against patterns. Patterns, as we saw in Chapter
1, are strings that can contain wildcard characters (*
, ?
, and []
for character sets and ranges).
Wildcards have been standard features of all Unix shells going back (at least) to the Version 6 Thompson shell.[52] But the Korn shell is the first shell to add to their capabilities. It adds a set of operators, called regular expression (or regexp for short) operators, that give it much of the string-matching power of advanced Unix utilities like awk(1), egrep(1) (extended grep(1)), and the Emacs editor, albeit with a different syntax. These capabilities go beyond those that you may be used to in other Unix utilities like grep, sed(1), and vi(1).
Advanced Unix users will find the Korn shell’s regular expression capabilities useful for script writing, although they border on overkill. (Part of the problem is the inevitable syntactic clash with the shell’s myriad other special characters.) Therefore we won’t go into great detail about regular expressions here. For more comprehensive information, the “very last word” on practical regular expressions in Unix is Mastering Regular Expressions, by Jeffrey E. F. Friedl. A more gentle introduction may be found in the second edition of sed & awk, by Dale Dougherty and Arnold Robbins. Both are published by O’Reilly & Associates. If you are already comfortable with awk or egrep, you may want to skip the following introductory section and go to Section 4.5.2.3, later in this chapter, where we explain the shell’s regular expression mechanism by comparing it with the syntax used in those two utilities. Otherwise, read on.
Regular expression basics
Think of regular expressions as strings that match patterns more powerfully than the standard shell wildcard schema. Regular expressions began as an idea in theoretical computer science, but they have found their way into many nooks and crannies of everyday, practical computing. The syntax used to represent them may vary, but the concepts are very much the same.
A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain. Table 4-3 describes the shell’s regular expression operators and their meanings.
Table 4-3. Regular expression operators

Operator            Meaning
*(exp)              0 or more occurrences of exp
+(exp)              1 or more occurrences of exp
?(exp)              0 or 1 occurrences of exp
@(exp1|exp2|...)    Exactly one of exp1 or exp2 or ...
!(exp)              Anything that doesn’t match exp[a]

[a] Actually, ...
As shown for the @(exp1|exp2|...) pattern, an exp within any of the Korn shell operators can be a series of exp1|exp2|... alternatives.
A little-known alternative notation is to separate each exp with the ampersand character, &
. In this case, all the
alternative expressions must match. Think of the |
as meaning “or,” while the &
means “and.” (You
can, in fact, use both of them in the same pattern list. The &
has higher precedence, with the meaning “match
this and that, OR match the next thing.”) Table 4-4
provides some example uses of the shell’s regular expression operators.
Table 4-4. Regular expression operator examples

Expression    Matches
x             x
*(x)          Null string, x, xx, xxx, ...
+(x)          x, xx, xxx, ...
?(x)          Null string, x
!(x)          Any string except x
@(x)          x (see below)
Regular expressions are extremely useful when dealing with arbitrary text, as you already know if you have used grep or the regular-expression capabilities of any Unix editor. They aren’t nearly as useful for matching filenames and other simple types of information with which shell users typically work. Furthermore, most things you can do with the shell’s regular expression operators can also be done (though possibly with more keystrokes and less efficiency) by piping the output of a shell command through grep or egrep.
Nevertheless, here are a few examples of how shell regular expressions can solve filename-listing problems. Some of these will come in handy in later chapters as pieces of solutions to larger tasks.
1. The Emacs editor supports customization files whose names end in .el (for Emacs LISP) or .elc (for Emacs LISP Compiled). List all Emacs customization files in the current directory.

2. In a directory of C source code, list all files that are not necessary. Assume that “necessary” files end in .c or .h or are named Makefile or README.

3. Filenames in the OpenVMS operating system end in a semicolon followed by a version number, e.g., fred.bob;23. List all OpenVMS-style filenames in the current directory.
Here are the solutions:
1. In the first of these, we are looking for files that end in .el with an optional c. The expression that matches this is *.el?(c).

2. The second example depends on the four standard subexpressions *.c, *.h, Makefile, and README. The entire expression is !(*.c|*.h|Makefile|README), which matches anything that does not match any of the four possibilities.

3. The solution to the third example starts with *\;, the shell wildcard * followed by a backslash-escaped semicolon. Then, we could use the regular expression +([0-9]), which matches one or more characters in the range [0-9], i.e., one or more digits. This is almost correct (and probably close enough), but it doesn’t take into account that the first digit cannot be 0. Therefore the correct expression is *\;[1-9]*([0-9]), which matches anything that ends with a semicolon, a digit from 1 to 9, and zero or more digits from 0 to 9.
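You can check each expression against sample names with the [[ ]] pattern test. (The shopt line in this sketch is needed only if you try it under bash, where these operators require the extglob option; ksh understands them natively.)

```shell
shopt -s extglob 2>/dev/null || true    # for bash; harmless elsewhere

[[ custom.elc == *.el?(c) ]]                   && echo "1: Emacs file"
[[ parser.y == !(*.c|*.h|Makefile|README) ]]   && echo "2: not necessary"
[[ "fred.bob;23" == *\;[1-9]*([0-9]) ]]        && echo "3: OpenVMS-style"
```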
POSIX character class additions
The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions (EREs), which are the kind used by egrep and awk.
In order to accommodate non-English environments, the POSIX standard enhanced
the ability of character set ranges (e.g., [a-z]
)
to match characters not in the English alphabet. For example, the French è is
an alphabetic character, but the typical character class [a-z]
would not match it. Additionally, the standard provides for
sequences of characters that should be treated as a single unit when matching
and collating (sorting) string data. (For example, there are locales where the
two characters ch
are treated as a unit and must
be matched and sorted that way.)
POSIX also changed what had been common terminology. What we saw earlier in Chapter
1 as a “range expression” is often called a “character class” in the
Unix literature. It is now called a “bracket expression” in the POSIX standard.
Within bracket expressions, besides literal characters such as a
, ;
, and so on, you can
also have additional components:
- Character classes

  A POSIX character class consists of keywords bracketed by [: and :]. The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 4-5).

- Collating symbols

  A collating symbol is a multicharacter sequence that should be treated as a unit. It consists of the characters bracketed by [. and .].

- Equivalence classes

  An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, bracketed by [= and =].
All three of these constructs must appear inside the square brackets of a
bracket expression. For example [[:alpha:]!]
matches any single alphabetic character or the exclamation point; [[.ch.]]
matches the collating element ch
but does not match just the letter c
or the letter h
. In a
French locale, [[=e=]]
might match any of e
, è
, or é
. Classes and matching characters are shown in Table 4-5.
Table 4-5. POSIX character classes

Class         Matching characters
[:alnum:]     Alphanumeric characters
[:alpha:]     Alphabetic characters
[:blank:]     Space and tab characters
[:cntrl:]     Control characters
[:digit:]     Numeric characters
[:graph:]     Printable and visible (non-space) characters
[:lower:]     Lowercase characters
[:print:]     Printable characters (includes whitespace)
[:punct:]     Punctuation characters
[:space:]     Whitespace characters
[:upper:]     Uppercase characters
[:xdigit:]    Hexadecimal digits
The Korn shell supports all of these features within its pattern matching facilities. The POSIX character class names are the most useful, because they work in different locales.
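Here is a small, portable sketch of a bracket expression containing a character class, using a case statement (the classify function name is just for illustration):

```shell
classify() {
    case $1 in
        [[:digit:]])  echo "digit" ;;
        [[:alpha:]!]) echo "alphabetic or exclamation point" ;;
        *)            echo "something else" ;;
    esac
}

classify 7      # digit
classify Q      # alphabetic or exclamation point
classify '!'    # alphabetic or exclamation point
classify %      # something else
```

Because [:alpha:] names a class rather than a literal range like a-z, the second pattern keeps working even in locales whose alphabets extend beyond ASCII.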
The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren’t familiar with these, skip to Section 4.5.3.
Korn shell versus awk/egrep regular expressions
Table 4-6 is an expansion of Table 4-3: the middle column shows the equivalents in awk/egrep of the shell’s regular expression operators.
Table 4-6. Shell versus egrep/awk regular expression operators

Korn shell          egrep/awk        Meaning
*(exp)              exp*             0 or more occurrences of exp
+(exp)              exp+             1 or more occurrences of exp
?(exp)              exp?             0 or 1 occurrences of exp
@(exp1|exp2|...)    exp1|exp2|...    exp1 or exp2 or ...
!(exp)              (none)           Anything that doesn’t match exp
\N                  \N (grep)        Match same text as matched by previous
                                     parenthesized subexpression number N
These equivalents are close but not quite exact. Because the shell would
interpret an expression like dave|fred|bob
as a
pipeline of commands, you must use @(dave|fred|bob)
for alternates by themselves.
The grep command has a feature called backreferences (or backrefs, for short). This facility provides a shorthand for repeating parts of a regular expression as part of a larger whole. It works as follows:
grep '\(abc\).*\1' file1 file2
This matches abc, followed by any
number of characters, followed again by abc. Up to nine parenthesized sub-expressions may be
referenced this way. The Korn shell provides an analogous capability. If you
use one or more regular expression patterns within a full pattern, you can
refer to previous ones using the \
N
notation as for grep.
For example:

- @(dave|fred|bob) matches dave, fred, or bob.

- @(*dave*&*fred*) matches davefred and freddave. (Notice the need for the * characters.)

- @(fred)*\1 matches freddavefred, fredbobfred, and so on.

- *(dave|fred|bob) means, “0 or more occurrences of dave, fred, or bob”. This expression matches strings like the null string, dave, davedave, fred, bobfred, bobbobdavefredbobfred, etc.

- +(dave|fred|bob) matches any of the above except the null string.

- ?(dave|fred|bob) matches the null string, dave, fred, or bob.

- !(dave|fred|bob) matches anything except dave, fred, or bob.
It is worth reemphasizing that
shell regular expressions can still contain standard shell wildcards. Thus, the
shell wildcard ?
(match any single character) is
equivalent to .
in egrep or awk, and
the shell’s character set operator [
...]
is the same as in those utilities.[53]
For example, the expression +([[:digit:]])
matches a number, i.e., one or more
digits. The shell wildcard character *
is
equivalent to the shell regular expression *(?)
.
You can even nest the regular expressions: +([[:digit:]]|!([[:upper:]]))
matches one or more digits or
non-uppercase letters.
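For instance, here is a sketch that tests whether a string is a number in the sense just described (again, bash users need extglob; ksh does not):

```shell
shopt -s extglob 2>/dev/null || true    # for bash; harmless elsewhere

for candidate in 2023 12ab x; do
    if [[ $candidate == +([[:digit:]]) ]]; then
        echo "$candidate is a number"
    else
        echo "$candidate is not a number"
    fi
done
```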
Two egrep and awk regexp operators do not have equivalents in the Korn shell:

- The beginning- and end-of-line operators ^ and $.

- The beginning- and end-of-word operators \< and \>.
These are hardly necessary, since the Korn shell doesn’t normally operate on
text files and does parse strings into words itself. (Essentially, the ^
and $
are implied as
always being there. Surround a pattern with *
characters to disable this.) Read on for even more features in the very latest
version of ksh.
Pattern matching with regular expressions
Starting with ksh93l, the shell provides a number of additional regular expression capabilities. We discuss them here separately, because your version of ksh93 quite likely doesn’t have them, unless you download a ksh93 binary or build ksh93 from source. The facilities break down as follows.
- New pattern matching operators
Several new pattern matching facilities are available. They are described briefly in Table 4-7. More discussion follows after the table.
- Subpatterns with options
Special parenthesized subpatterns may contain options that control matching within the subpattern or the rest of the expression.
- New [:word:] character class
The character class
[:word:]
within a bracket expression matches any character that is “word constituent.” This is basically any alphanumeric character or the underscore (_).- Escape sequences recognized within subpatterns
A number of escape sequences are recognized and treated specially within parenthesized expressions.
Table 4-7. New pattern-matching operators

Operator             Meaning
{N}(exp)             Exactly N occurrences of exp
{N,M}(exp)           Between N and M occurrences of exp
*-(exp)              0 or more occurrences of exp, shortest match
+-(exp)              1 or more occurrences of exp, shortest match
?-(exp)              0 or 1 occurrences of exp, shortest match
@-(exp1|exp2|...)    Exactly one of exp1 or exp2 or ..., shortest match
{N}-(exp)            Exactly N occurrences of exp, shortest match
{N,M}-(exp)          Between N and M occurrences of exp, shortest match
The first two operators in this table match facilities in egrep(1), called interval expressions. They let you specify that you want to match exactly N items, no more and no less, or that you want to match between N and M items.
The rest of the operators perform shortest or “non-greedy” matching. Normally, regular expressions match the longest possible text. A non-greedy match instead finds the shortest possible text that matches. Non-greedy matching was first popularized by the perl language. These operators work with the pattern matching and substitution operators described in the next section; we delay examples of greedy vs. non-greedy matching until there. Filename wildcarding effectively always does greedy matching.
Within operations such as @(
...)
, you can provide a special subpattern that enables
or disables options for case independent and greedy matching. This subpattern
has one of the following forms:
~(+options:pattern list)    Enable options
~(-options:pattern list)    Disable options
The options are one or both of i
for case-independent matching and g
for greedy matching. If the :
pattern list
is omitted, the options
apply to the rest of the enclosing pattern. If provided, they apply to just
that pattern list. Omitting the options
is possible, as well, but doing so doesn’t really provide you with any new
value.
The bracket expression [[:word:]]
is a shorthand
for [[:alnum:]_]
. It is a notational convenience,
but one that can increase program legibility.
Within parenthesized expressions, ksh recognizes all the standard ANSI C escape sequences, and they have their usual meaning. (See Section 7.3.3.1, in Chapter 7.) Additionally, the escape sequences listed in Table 4-8 are recognized and can be used for pattern matching.
Table 4-8. Escape sequences for pattern matching

Escape sequence    Meaning
\d                 Same as [[:digit:]]
\D                 Same as [![:digit:]]
\s                 Same as [[:space:]]
\S                 Same as [![:space:]]
\w                 Same as [[:word:]]
\W                 Same as [![:word:]]
Whew! This is all fairly heady stuff. If you feel a bit overwhelmed by it, don’t worry. As you learn more about regular expressions and shell programming and begin to do more and more complex text processing tasks, you’ll come to appreciate the fact that you can do all this within the shell itself, instead of having to resort to external programs such as sed, awk, or perl.
Pattern-Matching Operators
Table 4-9 lists the Korn shell’s pattern-matching operators.
Table 4-9. Pattern-matching operators

${variable#pattern}
        If the pattern matches the beginning of the variable’s value,
        delete the shortest part that matches and return the rest.

${variable##pattern}
        If the pattern matches the beginning of the variable’s value,
        delete the longest part that matches and return the rest.

${variable%pattern}
        If the pattern matches the end of the variable’s value, delete the
        shortest part that matches and return the rest.

${variable%%pattern}
        If the pattern matches the end of the variable’s value, delete the
        longest part that matches and return the rest.
These can be hard to remember, so here’s a handy mnemonic device: #
matches the front because number signs precede numbers; %
matches the rear because percent signs follow numbers. Another mnemonic comes from the typical
placement (in the U.S.A., anyway) of the #
and %
keys on the keyboard. Relative to each other, the
#
is on the left, and the %
is on the right.
The classic use for pattern-matching operators is in stripping components from
pathnames, such as directory prefixes and filename suffixes. With that in mind,
here is an example that shows how all of the operators work. Assume that the
variable path
has the value /home/billr/mem/long.file.name
; then:
Expression      Result
${path##/*/}    long.file.name
${path#/*/}     billr/mem/long.file.name
$path           /home/billr/mem/long.file.name
${path%.*}      /home/billr/mem/long.file
${path%%.*}     /home/billr/mem/long
The two patterns used here are /*/
, which matches
anything between two slashes, and .*
, which matches a
dot followed by anything.
Starting with ksh93l, these operators
automatically set the .sh.match
array variable. This
is discussed in Section 4.5.7, later in this chapter.
We will incorporate one of these operators into our next programming task, Task 4-2.
Think of a C compiler as a pipeline of data processing components. C source code is input to the beginning of the pipeline, and object code comes out of the end; there are several steps in between. The shell script’s task, among many other things, is to control the flow of data through the components and designate output files.
You need to write the part of the script that takes the name of the input C
source file and creates from it the name of the output object code file. That is,
you must take a filename ending in .c
and create a
filename that is similar except that it ends in .o
.
The task at hand is to strip the .c
off the filename
and append .o
. A single shell statement does it:
objname=${filename%.c}.o
This tells the shell to look at the end of filename
for .c
. If there is a match, return $filename
with the match deleted. So if filename
had the value fred.c
, the expression ${filename%.c}
would return fred
. The .o
is appended to make the desired fred.o
,
which is stored in the variable objname
.
If filename
had an inappropriate value (without .c
) such as fred.a
, the above expression
would evaluate to fred.a.o
: since there was no match,
nothing is deleted from the value of filename
, and
.o
is appended anyway. And, if filename
contained more than one dot — e.g., if it were
the y.tab.c that is so infamous among
compiler writers — the expression would still produce the desired y.tab.o. Notice that this would not be true if
we used %%
in the expression instead of %
.
The former operator uses the longest
match instead of the shortest, so it would match .tab.c
and evaluate to y.o rather than
rather than
y.tab.o
. So the single %
is correct in this case.
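Running the expression over a few sample filenames, including the problem cases just discussed, confirms the behavior:

```shell
for filename in fred.c y.tab.c fred.a; do
    objname=${filename%.c}.o
    echo "$filename -> $objname"
done
# fred.c -> fred.o
# y.tab.c -> y.tab.o
# fred.a -> fred.a.o   (no match, so .o is simply appended)
```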
A longest-match deletion would be preferable, however, for Task 4-3.
Clearly the objective is to remove the directory prefix from the pathname. The following line does it:
bannername=${pathname##*/}
This solution is similar to the first line in the examples shown before. If pathname
were just a filename, the pattern */
(anything followed by a slash) would not match, and
the value of the expression would be $pathname
untouched. If pathname
were something like fred/bob
, the prefix fred/
would match the pattern and be deleted, leaving just bob
as the expression’s value. The same thing would happen if pathname
were something like /dave/pete/fred/bob
: since the ##
deletes
the longest match, it deletes the entire /dave/pete/fred/
.
If we used #*/
instead of ##*/
, the expression would
have the incorrect value dave/pete/fred/bob
, because
the shortest instance of “anything followed by a slash” at the beginning of the
string is just a slash (/
).
The construct ${variable##*/}
is actually quite similar to the Unix
utility basename(1).
In typical use, basename takes a pathname as argument and returns the
filename only; it is meant to be used with the shell’s command substitution
mechanism (see below). basename is less
efficient than ${variable##*/}
because it may run in its own separate process
rather than within the shell.[55]
Another utility, dirname(1), does essentially the opposite of
basename: it returns the directory
prefix only. It is equivalent to the Korn shell expression ${
variable
%/*}
and is less efficient for the same reason.
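A short sketch of both equivalences, reusing the sample pathname from the table above:

```shell
pathname=/home/billr/mem/long.file.name
echo "${pathname##*/}"   # long.file.name   (what basename would print)
echo "${pathname%/*}"    # /home/billr/mem  (what dirname would print)
```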
Pattern Substitution Operators
Besides the pattern-matching operators that delete bits and pieces from the values of shell variables, you can do substitutions on those values, much as in a text editor. (In fact, using these facilities, you could almost write a line-mode text editor as a shell script!) These operators are listed in Table 4-10.
Table 4-10. Pattern substitution operators

${variable:start}
${variable:start:length}
        These represent substring operations. The result is the value of
        variable starting at position start and going for length
        characters. The first character is at position 0, and if no length
        is provided, the rest of the string is used. Beginning with
        ksh93m, a negative start is taken as relative to the end of the
        string. For example, if a string has 10 characters, numbered 0 to
        9, a start value of -2 means 7 (9 - 2 = 7). Similarly, if variable
        is an indexed array, a negative start yields an index by working
        backwards from the highest subscript in the array.

${variable/pattern/replace}
        If variable contains a match for pattern, the first match is
        replaced with the text of replace.

${variable//pattern/replace}
        This is the same as the previous operation, except that every
        match of the pattern is replaced.

${variable/pattern}
        If variable contains a match for pattern, delete the first match
        of pattern.

${variable/#pattern/replace}
        If variable contains a match for pattern, the first match is
        replaced with the text of replace. The match is constrained to
        occur at the beginning of variable’s value. If it doesn’t match
        there, no substitution occurs.

${variable/%pattern/replace}
        If variable contains a match for pattern, the first match is
        replaced with the text of replace. The match is constrained to
        occur at the end of variable’s value. If it doesn’t match there,
        no substitution occurs.
The ${variable/pattern} syntax is different from the #, ##, %, and %% operators we saw earlier. Those operators are constrained to match at the beginning or end of the variable’s value, whereas the syntax shown here is not. For example:
$ path=/home/fred/work/file
$ print ${path/work/play}               Change work into play
/home/fred/play/file
Let’s return to our compiler front-end example and look at how we might use these operators. When turning a C source filename into an object filename, we could do the substitution this way:
objname=${filename/%.c/.o} Change .c to .o, but only at end
If we had a list of C filenames and wanted to change all of them into object filenames, we could use the so-called global substitution operator:
$ allfiles="fred.c dave.c pete.c"
$ allobs=${allfiles//.c/.o}
$ print $allobs
fred.o dave.o pete.o
The patterns may be any Korn shell pattern expression, as discussed earlier, and the replacement text may include the \N notation to get the text that matched a subpattern.
Finally, these operations may be applied to the positional parameters and to arrays, in which case they are done on all the parameters or array elements at once. (Arrays are described in Chapter 6.)
$ print "$@"
hi how are you over there
$ print ${@/h/H}                        Change h to H in all parameters
Hi How are you over tHere
Greedy versus non-greedy matching
As promised, here is a brief demonstration of the difference between greedy and non-greedy pattern matching:
$ x='12345abc6789'
$ print ${x//+([[:digit:]])/X}          Substitution with longest match
XabcX
$ print ${x//+-([[:digit:]])/X}         Substitution with shortest match
XXXXXabcXXXX
$ print ${x##+([[:digit:]])}            Remove longest match
abc6789
$ print ${x#+([[:digit:]])}             Remove shortest match
2345abc6789
The first print replaces the longest match of “one or more digits” with a single X, everywhere throughout the string. Since this is a longest match, both groups of digits are replaced. In the second case, the shortest match for “one or more digits” is just a single digit, and thus each digit is replaced with an X.
Similarly, the third and fourth cases demonstrate removing text from the front of the value, using longest and shortest matching. In the third case, the longest match removes all the digits; in the fourth case, the shortest match removes just a single digit.
Variable Name Operators
A number of operators relate to shell variable names, as seen in Table 4-11.
Operator | Meaning
${!variable} | Return the name of the real variable referenced by the nameref variable.
${!base*} | List of all variables whose names begin with base.
${!base@} | List of all variables whose names begin with base. When quoted, each name expands to a separate word.
Namerefs were discussed in Section 4.4, earlier in this chapter. See there for an example of ${!name}.
The last two operators in Table 4-11 might be useful for debugging and/or tracing the use of variables in a large script. Just to see how they work:
$ print ${!HIST*}
HISTFILE HISTCMD HISTSIZE
$ print ${!HIST@}
HISTFILE HISTCMD HISTSIZE
Several other operators related to array variables are described in Chapter 6.
Length Operators
There are three remaining operators on variables. One is ${#varname}, which returns the number of characters in the string.[56] (In Chapter 6 we see how to treat this and similar values as actual numbers so they can be used in arithmetic expressions.) For example, if filename has the value fred.c, then ${#filename} would have the value 6. The other two operators (${#array[*]} and ${#array[@]}) have to do with array variables, which are also discussed in Chapter 6.
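You can try the length operator directly at the prompt:

```shell
$ filename=fred.c
$ print ${#filename}
6
```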
The .sh.match Variable
The .sh.match variable was introduced in ksh93l. It is an indexed array (see Chapter 6), whose values are set every time you do a pattern matching operation on a variable, such as ${filename%%*/}, with any of the # or % operators (for the shortest match), ## or %% (for the longest match), or / and // (for substitutions). .sh.match[0] contains the text that matched the entire pattern. .sh.match[1] contains the text that matched the first parenthesized subexpression, .sh.match[2] the text that matched the second, and so on. The values of .sh.match become invalid (meaning, don’t try to use them) if the variable on which the pattern matching was done changes.
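A brief illustration, assuming ksh93l or newer (the pathname is invented): after a longest-match deletion, .sh.match[0] holds the text the whole pattern matched:

```shell
$ path=/home/cam/book/ch04.xml
$ print ${path##*/}                   Delete longest match of */
ch04.xml
$ print "${.sh.match[0]}"             What the pattern */ matched
/home/cam/book/
```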
Again, this is a feature meant for more advanced programming and text processing, analogous to similar features in other languages such as perl. If you’re just starting out, don’t worry about it.
Command Substitution
From the discussion so far, we’ve seen two ways of getting values into variables: by assignment statements and by the user supplying them as command-line arguments (positional parameters). There is another way: command substitution, which allows you to use the standard output of a command as if it were the value of a variable. You will soon see how powerful this feature is.
The syntax of command substitution is:
$(Unix command)
The command inside the parentheses is run, and anything the command writes to standard output is returned as the value of the expression. (The command’s standard error is not captured; it goes to the shell’s standard error as usual.) These constructs can be nested, i.e., the Unix command can contain command substitutions.
Here are some simple examples:
- The value of $(pwd) is the current directory (same as the environment variable $PWD).
- The value of $(ls) is the names of all files in the current directory, separated by newlines.
- To find out detailed information about a command if you don’t know where its file resides, type ls -l $(whence -p command). The -p option forces whence to do a pathname lookup and not consider keywords, built-ins, etc.
- To get the contents of a file into a variable, you can use varname=$(< filename). $(cat filename) will do the same thing, but the shell catches the former as a built-in shorthand and runs it more efficiently.
- If you want to edit (with Emacs) every chapter of your book on the Korn shell that has the phrase “command substitution,” assuming that your chapter files all begin with ch, you could type:
emacs $(grep -l 'command substitution' ch*.xml)
The -l option to grep prints only the names of files that contain matches.
Command substitution, like variable expansion, is done within double quotes. (Double quotes inside the command substitution are not affected by any enclosing double quotes.) Therefore, our rule in Chapter 1 and Chapter 3 about using single quotes for strings unless they contain variables will now be extended: “When in doubt, use single quotes, unless the string contains variables, or command substitutions, in which case use double quotes.”
(For backwards compatibility, the Korn shell supports the original Bourne shell (and C shell) command substitution notation using backquotes: `...`. However, it is considerably harder to use than $(...), since quoting and nested command substitutions require careful escaping. We don’t use the backquotes in any of the programs in this book.)
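The difference shows up as soon as you nest: with $(...) the inner substitution needs no special treatment, while backquotes force you to escape the inner pair:

```shell
$ print $(print outer $(print inner))
outer inner
$ print `print outer \`print inner\``
outer inner
```

Both produce the same result, but the backquote version quickly becomes unreadable as the nesting gets deeper.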
You will undoubtedly think of many ways to use command substitution as you gain experience with the Korn shell. One that is a bit more complex than those mentioned previously relates to a customization task that we saw in Chapter 3: personalizing your prompt string.
Recall that you can personalize your prompt string by assigning a value to the variable PS1. If you are on a network of computers, and you use different machines from time to time, you may find it handy to have the name of the machine you’re on in your prompt string. Most modern versions of Unix have the command hostname(1), which prints the network name of the machine you are on to standard output. (If you do not have this command, you may have a similar one like uname.) This command enables you to get the machine name into your prompt string by putting a line like this in your .profile or environment file:
PS1="$(hostname) $ "
(Here, the second dollar sign does not need to be preceded by a backslash. If the character after the $ isn’t special to the shell, the $ is included literally in the string.) For example, if your machine had the name coltrane, then this statement would set your prompt string to “coltrane $ ”.
Command substitution helps us with the solution to the next programming task, Task 4-4, which relates to the album database in Task 4-1.
The cut(1) utility is a natural for this task. cut is a data filter: it extracts columns from tabular data.[57] If you supply the numbers of columns you want to extract from the input, cut prints only those columns on the standard output. Columns can be character positions or — relevant in this example — fields that are separated by TAB characters or other delimiters.
Assume that the data table in our task is a file called albums and that it looks like this:
Coltrane, John|Giant Steps|Atlantic|1960|Ja
Coltrane, John|Coltrane Jazz|Atlantic|1960|Ja
Coltrane, John|My Favorite Things|Atlantic|1961|Ja
Coltrane, John|Coltrane Plays the Blues|Atlantic|1961|Ja
...
Here is how we would use cut to extract the fourth (year) column:
cut -f4 -d\| albums
The -d argument is used to specify the character used as field delimiter (TAB is the default). The vertical bar must be backslash-escaped so that the shell doesn’t try to interpret it as a pipe.
From this line of code and the getfield routine, we can easily derive the solution to the task. Assume that the first argument to getfield is the name of the field the user wants to extract. Then the solution is:
fieldname=$1
cut -f$(getfield $fieldname) -d\| albums
If we ran this script with the argument year, the output would be:
1960
1960
1961
1961
...
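The getfield routine itself was developed as part of Task 4-1. If you don’t have it handy, a minimal stand-in consistent with the albums layout might look like this (the field names are our assumption; adjust them to match your database):

```shell
# Hypothetical getfield: map a field name to its column
# number in the albums file.
function getfield {
    case $1 in
        artist) print 1 ;;
        title)  print 2 ;;
        label)  print 3 ;;
        year)   print 4 ;;
        *)      print -u2 "getfield: unknown field name $1"
                return 1 ;;
    esac
}
```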
Task 4-5 is another small task that makes use of cut.
The command who(1) tells you who is logged in (as well as which terminal they’re on and when they logged in). Its output looks like this:
billr      console      May 22 07:57
fred       tty02        May 22 08:31
bob        tty04        May 22 08:12
The fields are separated by spaces, not TABs. Since we need the first field, we can get away with using a space as the field separator in the cut command. (Otherwise we’d have to use the option to cut that uses character columns instead of fields.) To provide a space character as an argument on a command line, you can surround it by quotes:
who | cut -d' ' -f1
With the above who output, this command’s output would look like this:
billr fred bob
This leads directly to a solution to the task. Just type:
mail $(who | cut -d' ' -f1)
The command mail billr fred bob will run and then you can type your message.
Task 4-6 is another task that shows how useful command pipelines can be in command substitution.
This task was inspired by the feature of the OpenVMS operating system that lets you specify files by date with BEFORE and SINCE parameters.
Here is a function that allows you to list all files that were last modified on the date you give as argument. Once again, we choose a function for speed reasons. No pun is intended by the function’s name:
function lsd {
    date=$1
    ls -l | grep -i "^.\{41\}$date" | cut -c55-
}
This function depends on the column
layout of the ls -l
command. In particular, it depends
on dates starting in column 42 and filenames starting in column 55. If this isn’t the
case in your version of Unix, you will need to adjust the column numbers.[58]
We use the grep search utility to match the date given as argument (in the form Mon DD, e.g., Jan 15 or Oct  6, the latter having two spaces) to the output of ls -l. (The regular expression argument to grep is quoted with double quotes, in order to perform the variable substitution.) This gives us a long listing of only those files whose dates match the argument. The -i option to grep allows you to use all lowercase letters in the month name, while the rather fancy argument means, “Match any line that contains 41 characters followed by the function argument.” For example, typing lsd 'jan 15' causes grep to search for lines that match any 41 characters followed by jan 15 (or Jan 15).
The output of grep is piped through our ubiquitous friend cut to retrieve just the filenames. The argument to cut tells it to extract characters in column 55 through the end of the line.
With command substitution, you can use this function with any command that accepts filename arguments. For example, if you want to print all files in your current directory that were last modified today, and today is January 15, you could type:
lp $(lsd 'jan 15')
The output of lsd is on multiple lines (one
for each filename), but because the variable IFS
(see
earlier in this chapter) contains newline by default, the shell uses newline to
separate words in lsd’s output, just as it
normally does with space and TAB.
Advanced Examples: pushd and popd
We conclude this chapter with a couple of functions that you may find handy in your everyday Unix use. They solve the problem presented by Task 4-7.
We start by implementing a significant subset of their capabilities and finish the implementation in Chapter 6. (For ease of development and explanation, our implementation ignores some things that a more bullet-proof version should handle. For example, spaces in filenames will cause things to break.)
If you don’t know what a stack is, think of a spring-loaded dish receptacle in a cafeteria. When you place dishes on the receptacle, the spring compresses so that the top stays at roughly the same level. The dish most recently placed on the stack is the first to be taken when someone wants food; thus, the stack is known as a “last-in, first-out” or LIFO structure. (Victims of a recession or company takeovers will also recognize this mechanism in the context of corporate layoff policies.) Putting something onto a stack is known in computer science parlance as pushing, and taking something off the top is called popping.
A stack is very handy for remembering directories, as we will see; it can “hold your place” up to an arbitrary number of times. The cd - form of the cd command does this, but only to one level. For example: if you are in firstdir and then you change to seconddir, you can type cd - to go back. But if you start out in firstdir, then change to seconddir, and then go to thirddir, you can use cd - only to go back to seconddir. If you type cd - again, you will be back in thirddir, because it is the previous directory.[59]
If you want the “nested” remember-and-change functionality that will take you back to firstdir, you need a stack of directories along with the dirs, pushd, and popd commands. Here is how these work:[60]
pushd dir does a cd to dir and then pushes dir onto the stack.
popd does a cd to the top directory, then pops it off the stack.
For example, consider the series of events in Table 4-12. Assume that you have just logged in and that you are in your home directory (/home/you).
We will implement a stack as an environment variable containing a list of directories separated by spaces.
Command | Stack contents (top on left) | Result directory
pushd fred | /home/you/fred | /home/you/fred
pushd /etc | /etc /home/you/fred | /etc
cd /usr/tmp | /etc /home/you/fred | /usr/tmp
popd | /home/you/fred | /etc
popd | (empty) | /home/you/fred
Your directory stack should be initialized to your home directory when you log in. To do so, put this in your .profile:
DIRSTACK="$PWD"
export DIRSTACK
Do not put this in your environment file if
you have one. The export statement guarantees
that DIRSTACK
is known to all subprocesses; you want to
initialize it only once. If you put this code in an environment file, it will get
reinitialized in every interactive shell subprocess, which you probably don’t want.
Next, we need to implement dirs, pushd, and popd as functions. Here are our initial versions:
function dirs {         # print directory stack (easy)
    print $DIRSTACK
}

function pushd {        # push current directory onto stack
    dirname=$1
    cd ${dirname:?"missing directory name."}
    DIRSTACK="$PWD $DIRSTACK"
    print "$DIRSTACK"
}

function popd {         # cd to top, pop it off stack
    top=${DIRSTACK%% *}
    DIRSTACK=${DIRSTACK#* }
    cd $top
    print "$PWD"
}
Notice that there isn’t much code! Let’s go through the functions and see how they work. dirs is easy; it just prints the stack. The fun starts with pushd. The first line merely saves the first argument in the variable dirname for readability reasons.
The second line’s main purpose is to change to the new directory. We use the :? operator to handle the error when the argument is missing: if the argument is given, the expression ${dirname:?"missing directory name."} evaluates to $dirname, but if it is not given, the shell prints the message ksh: pushd: line 2: dirname: missing directory name. and exits from the function.
The third line of the function pushes the new directory onto the stack. The expression within double quotes consists of the full pathname for the current directory, followed by a single space, followed by the contents of the directory stack ($DIRSTACK). The double quotes ensure that all of this is packaged into a single string for assignment back to DIRSTACK.
The last line merely prints the contents of the stack, with the implication that the leftmost directory is both the current directory and at the top of the stack. (This is why we chose spaces to separate directories, rather than the more customary colons as in PATH and MAILPATH.)
The popd function makes yet another use of the shell’s pattern-matching operators. The first line uses the %% operator, which deletes the longest match of " *" (a space followed by anything). This removes all but the top of the stack. The result is saved in the variable top, again for readability reasons.
The second line is similar, but going in the other direction. It uses the # operator, which deletes the shortest match of the pattern "* " (anything followed by a space) from the value of DIRSTACK. The result is that the top directory (and the space following it) is deleted from the stack.
The third line actually changes directory to the previous top of the stack. (Note that popd doesn’t care where you are when you run it; if your current directory is the one on the top of the stack, you won’t go anywhere.) The final line just prints a confirmation message.
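Putting the pieces together, here is what a short session with these functions might look like, starting in /home/you with DIRSTACK initialized as shown earlier (the directory names are illustrative):

```shell
$ pushd /tmp
/tmp /home/you
$ pushd /etc
/etc /tmp /home/you
$ popd                  Already at the top of the stack, so no move
/etc
$ popd
/tmp
```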
This code is deficient in the following ways: first, it has no provision for errors. For example:
What if the user tries to push a directory that doesn’t exist or is invalid?
What if the user tries popd and the stack is empty?
Test your understanding of the code by figuring out how it would respond to these error conditions. The second deficiency is that the code implements only some of the functionality of the C shell’s pushd and popd commands — albeit the most useful parts. In the next chapter, we will see how to overcome both of these deficiencies.
The third problem with the code is that it will not work if, for some reason, a directory name contains a space. The code will treat the space as a separator character. We’ll accept this deficiency for now. However, when you read about arrays in Chapter 6, think about how you might use them to rewrite this code and eliminate the problem.
[46] This actually depends on the setting of your umask, an advanced feature described in Chapter 10.
[48] ksh93 point releases h through l+ used a similar but more restricted mechanism, via a file named .fpath, and they hard-wired the setting of the library path variable. As this feature was not wide-spread, it was generalized into a single file starting with point release m.
[49] autoload is actually an alias for typeset -fu.
[50] This is a restriction imposed by the Korn shell, not by the POSIX standard.
[51] However, see the section on typeset in Chapter 6 for a way of making variables local to functions.
[52] The Version 6 shell was written by Ken Thompson. Stephen Bourne wrote the Bourne shell for Version 7.
[53] And, for that matter, the same as in grep, sed, ed, vi, etc. One notable difference is that the shell uses ! inside [...] for negation, while the various utilities all use ^.
[54] Don’t laugh — once upon a time, many Unix compilers had shell scripts as front-ends.
[55] basename may be built-in in some versions of ksh93. Thus it’s not guaranteed to run in a separate process.
[56] This may be more than the number of bytes for multibyte character sets.
[57] Some very old BSD-derived systems don’t have cut, but you can use awk instead. Whenever you see a command of the form cut -fN -dC filename, use this instead: awk -FC '{ print $N }' filename.
[58] For example, ls -l on GNU/Linux has dates starting in column 43 and filenames starting in column 57.
[59] Think of cd - as a synonym for cd $OLDPWD; see the previous chapter.
[60] We’ve done it here differently from the C shell. The C shell pushd pushes the initial directory onto the stack first, followed by the command’s argument. The C shell popd removes the top directory off the stack, revealing a new top. Then it cds to the new top directory. We feel that this behavior is less intuitive than our design here.