Advanced Perl Programming

By Sriram Srinivasan


Table of Contents

Chapter 1: Data References and Anonymous Storage
If I were meta-agnostic, I'd be confused over whether I'm agnostic or not — but I'm not quite sure if I feel that way; hence I must be meta-meta-agnostic (I guess).
—Douglas R. Hofstadter, Gödel, Escher, Bach
There are two aspects (among many) that distinguish toy programming languages from those used to build truly complex systems. The more robust languages have:
  • The ability to dynamically allocate data structures without having to associate them with variable names. We refer to these as "anonymous" data structures.
  • The ability to point to any data structure, independent of whether it is allocated dynamically or statically.
COBOL is the one true exception to this; it has been a huge commercial success in spite of lacking these features. But it is also why you'd balk at developing flight control systems in COBOL.
Consider the following statements that describe a far simpler problem: a family tree.
Marge is 23 years old and is married to John, 24.
Jason, John's brother, is studying computer science at MIT. He is just 19.
Their parents, Mary and Robert, are both sixty and live in Florida.
Mary and Marge's mother, Agnes, are childhood friends.
Do you find yourself mentally drawing a network with bubbles representing people and arrows representing relationships between them? Think of how you would conveniently represent this kind of information in your favorite programming language. If you were a C (or Algol, Pascal, or C++) programmer, you would use a dynamically allocated data structure to represent each person's data (name, age, and location) and pointers to represent relationships between people.
A pointer is simply a variable that contains the location of some other piece of data. This location can be a machine address, as it is in C, or a higher-level entity, such as a name or an array offset.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Referring to Existing Variables
If you have a C background (not necessary for understanding this chapter), you know that there are two ways to initialize a pointer in C. You can refer to an existing variable:
int a, *p;
p = &a;  /* p now has the "address" of a */
The memory is statically allocated; that is, it is allocated by the compiler. Alternatively, you can use malloc(3) to allocate a piece of memory at run-time and obtain its address:
p = malloc(sizeof(int));
This dynamically allocated memory doesn't have a name (unlike that associated with a variable); it can be accessed only indirectly through the pointer, which is why we refer to it as "anonymous storage."
Perl provides references to both statically and dynamically allocated storage; in this section, we'll study the former in some detail. That allows us to deal with the two concepts—references and anonymous storage—separately.
You can create a reference to an existing Perl variable by prefixing it with a backslash, like this:
# Create some variables
$a      = "mama mia";
@array  = (10, 20);
%hash   = ("laurel" => "hardy", "nick" =>  "nora");

# Now create references to them
$ra     = \$a;          # $ra now "refers" to (points to) $a
$rarray = \@array;
$rhash  = \%hash;
You can create references to constant scalars in a similar fashion:
$ra     = \10;
$rs     = \"hello world";
That's all there is to it. Since arrays and hashes are collections of scalars, you can take a reference to an individual element the same way, by prefixing it with a backslash:
$r_array_element = \$array[1];       # Refers to the scalar $array[1]

$r_hash_element  = \$hash{"laurel"}; # Refers to the scalar
                                     # $hash{"laurel"}
A reference variable, such as $ra or $rarray, is an ordinary scalar, hence the prefix "$". A scalar, in other words, can be a number, a string, or a reference and can be freely reassigned to one or the other of these (sub)types. If you print a scalar while it is a reference, you get something like this:
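As a small illustration of printing and dereferencing a scalar reference, here is a minimal sketch. The variable name $a_value is made up for this example, and the hexadecimal address in the output will differ from run to run.

```perl
use strict;
use warnings;

my $a_value = "mama mia";      # hypothetical variable name
my $ra      = \$a_value;       # $ra refers to $a_value

# Printing the reference itself shows the type of the thing referred to
# and an address, for example SCALAR(0x55d0c8a3f2b8).
print "$ra\n";

# An extra $ dereferences it and yields the value referred to.
print "$$ra\n";                # mama mia

# Assigning through the reference changes the original variable.
$$ra = "papa pia";
print "$a_value\n";            # papa pia
```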
Using References
References are absolutely essential for creating complex data structures. Since the next chapter is devoted solely to this topic, we will not say more here. This section lists the other advantages of Perl's support for indirection and memory management.
When you pass more than one array or hash to a subroutine, Perl merges all of them into the @_ array available within the subroutine. The only way to avoid this merger is to pass references to the input arrays or hashes. Here's an example that adds elements of one array to the corresponding elements of the other:
@array1 = (1, 2, 3); @array2 = (4, 5, 6, 7);
AddArrays(\@array1, \@array2);  # Passing the arrays by reference.
print "@array1\n";              # Prints "5 7 9 7"
sub AddArrays
{
    my ($rarray1, $rarray2) = @_;
    my $len2 = @$rarray2;       # Length of array2
    for (my $i = 0; $i < $len2; $i++) {
        $rarray1->[$i] += $rarray2->[$i];
    }
}
In this example, two array references are passed to AddArrays, which dereferences them, determines the length of the second array, and adds each of its elements to the corresponding element of the first. Because the second array is longer here, the first array grows to four elements.
Using references, you can efficiently pass large amounts of data to and from a subroutine.
However, passing references to scalars typically turns out not to be an optimization at all. I have often seen code like this, in which the programmer has intended to minimize copying while reading lines from a file:
while ($ref_line = GetNextLine()) {
    # ...
    # ...
}
sub GetNextLine {
    my $line = <F>;
    exit(0) unless defined($line);
    # ...
    return \$line;    # Return by reference, to avoid copying
}
GetNextLine returns the line by reference to avoid copying.
You might be surprised how little an effect this strategy has on the overall performance, because most of the time is taken by reading the file and subsequently working on
Nested Data Structures
Recall that arrays and hashes contain only scalars; they cannot directly contain another array or hash as such. But considering that references can refer to an array or a hash and that references are scalars, you can see how one or more elements in an array or hash can point to other arrays or hashes. In this section, we will study how to build nested, heterogeneous data structures.
Let us say we would like to track a person's details and that of their dependents. One approach is to create separate named hash tables for each person:
%sue = (              # Parent
    'name' => 'Sue',
    'age'  => '45');
%john = (             # Child
    'name' => 'John',
    'age'  => '20');
%peggy = (            # Child
    'name' => 'Peggy',
    'age'  => '16');
The structures for John and Peggy can now be related to Sue like this:
@children = (\%john, \%peggy);
$sue{'children'} = \@children;

# Or
$sue{'children'} = [\%john, \%peggy];
Figure 1.2 shows this structure after it has been built.
Figure 1.2: Mixing scalars with arrays and hashes.
This is how you can print Peggy's age, given %sue:
print $sue{children}->[1]->{age};
Suppose the first line in your program is this:
$sue{children}->[1]->{age} = 10;
Perl automatically creates the hash %sue, gives it a hash element indexed by the string children, points that entry to a newly allocated array, whose second element is made to refer to a freshly allocated hash, which gets an entry indexed by the string age. Talk about programmer efficiency.
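To see what that single assignment leaves behind, here is a minimal sketch that inspects each level afterwards. It runs under strict, so %sue is declared with my first; everything below the hash itself is still created automatically.

```perl
use strict;
use warnings;

my %sue;                                   # empty hash, nothing inside yet
$sue{children}->[1]->{age} = 10;           # intermediate levels spring into being

print ref($sue{children}), "\n";           # ARRAY
print scalar(@{ $sue{children} }), "\n";   # 2 (slots 0 and 1)
print defined($sue{children}->[0]) ? "defined" : "undef", "\n";   # undef
print ref($sue{children}->[1]), "\n";      # HASH
print $sue{children}->[1]->{age}, "\n";    # 10
```

Note that slot 0 of the new array is left undefined; only the path that was actually assigned to gets filled in.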
While on the subject of programmer efficiency, let us discuss one more optimization for typing. You can omit -> if (and only if) it is between subscripts. That is, the following expressions are equivalent:
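A minimal sketch of the rule, using made-up data in the shape of the %sue structure: the arrow may be dropped between two subscripts, but not between a reference variable and its first subscript.

```perl
use strict;
use warnings;

my %sue = (
    name     => 'Sue',
    children => [
        { name => 'John',  age => 20 },
        { name => 'Peggy', age => 16 },
    ],
);

# These two expressions reach the same element.
my $with_arrows    = $sue{children}->[1]->{age};
my $without_arrows = $sue{children}[1]{age};
print "$with_arrows $without_arrows\n";     # 16 16

# Starting from a reference, the first arrow is not between subscripts,
# so it must stay; the later ones are optional.
my $rsue = \%sue;
print $rsue->{children}[0]{name}, "\n";     # John
```

Writing $rsue{children} instead would refer to an unrelated hash named %rsue, which is why the first arrow cannot be omitted.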
Querying a Reference
The ref function queries a scalar to see whether it contains a reference and, if so, what type of data it is pointing to. ref returns a false value (the empty string) if its argument contains a number or a string; if the argument is a reference, ref returns one of these strings to describe the data being referred to: "SCALAR", "HASH", "ARRAY", "REF" (referring to another reference variable), "GLOB" (referring to a typeglob), "CODE" (referring to a subroutine), or the name of a package (for an object belonging to that package; we'll see more of it later).
$a = 10;
$ra = \$a;
ref($a) yields FALSE, since $a is not a reference.
ref($ra) returns the string "SCALAR", since $ra is pointing to a scalar value.
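One common use of ref is to decide how to treat an argument at run time. The following is a minimal sketch; the describe subroutine and its wording are made up for illustration.

```perl
use strict;
use warnings;

# Map the strings returned by ref to a human-readable description.
my %label = (
    SCALAR => "reference to a scalar",
    ARRAY  => "reference to an array",
    HASH   => "reference to a hash",
    CODE   => "reference to a subroutine",
    REF    => "reference to another reference",
);

sub describe {                          # hypothetical helper
    my ($thing) = @_;
    my $type = ref($thing);
    return "not a reference" if $type eq '';
    return exists $label{$type} ? $label{$type} : "something else ($type)";
}

print describe(10), "\n";               # not a reference
print describe(\10), "\n";              # reference to a scalar
print describe([1, 2]), "\n";           # reference to an array
print describe({}), "\n";               # reference to a hash
print describe(sub { 1 }), "\n";        # reference to a subroutine
print describe(\\"x"), "\n";            # reference to another reference
```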
Symbolic References
Normally, a construct such as $$var indicates that $var is a reference variable, and the programmer expects the expression to return the value that $var refers to.
What if $var is not a reference variable at all? Instead of complaining loudly, Perl checks to see whether $var contains a string. If so, it uses that string as a regular variable name and messes around with this variable! Consider the following:
$x = 10;
$var = "x";
$$var = 30;   # Modifies $x to 30 , because $var is a symbolic
              # reference !
When evaluating $$var, Perl first checks to see whether $var is a reference, which it is not; it's a string. Perl then decides to give the expression one more chance: it treats $var's contents as a variable identifier ($x). The example hence ends up modifying $x to 30.
It is important to note that symbolic references work only for global variables, not for those marked private using my.
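The restriction to global variables can be seen in a minimal sketch like the one below. It assumes strict is otherwise in force, so symbolic references are explicitly permitted with no strict 'refs'; the variable names are arbitrary.

```perl
use strict;
use warnings;
no strict 'refs';        # symbolic references are refused under strict refs
no warnings 'once';      # $main::y is mentioned by name only once below

our $x = 10;             # package (global) variable $main::x
my  $y = 10;             # lexical variable, invisible to symbolic lookups

my $var = "x";
$$var = 30;              # modifies the global $x
print "$x\n";            # 30

$var = "y";
$$var = 30;              # sets the global $main::y, not the lexical $y
print "$y\n";            # 10, the lexical $y is untouched
print "$main::y\n";      # 30
```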
Symbolic references work equally well for arrays and hashes also:
$var = "x";
@$var = (1, 2, 3);   # Sets @x to the enumerated list on the right
Note that the symbol used before $var dictates the type of variable to access: $$var is equivalent to $x, and @$var is equivalent to @x.
This facility is immensely useful, and, for those who have done this kind of thing before with earlier versions of Perl, is much more efficient than using eval. Let us say you want your script to process a command-line option such as "-Ddebug_level=3" and set the $debug_level variable. This is one way of doing it:
while ($arg = shift @ARGV){
    if ($arg =~ /-D(\w+)=(\w+)/) {
         $var_name = $1; $value = $2;
         $$var_name = $value;     # Or more compactly, $$1 = $2;
    }
}
On the other hand, Perl's eagerness to try its damnedest to get an expression to work sometimes doesn't help. In the preceding examples, if you expected the program logic to have a real reference instead of a string, then you would have wanted Perl to point it out instead of making assumptions about your usage. Fortunately, there's a way to switch this eagerness off. Perl has a number of compile-time directives, or pragmas. The
A View of the Internals
Let us now take a look inside Perl to understand how it manages memory. You can safely skip this section without loss of continuity.
A variable logically represents a binding between a name and a value, as Figure 1.3 illustrates.
Figure 1.3: A variable is a name and value pair
An array or a hash is not just a collection of numbers or strings. It is a collection of scalar values, and this distinction is important, as Figure 1.4 illustrates.
Figure 1.4: An array value is a collection of scalar values
Each box in Figure 1.4 represents a distinct value. An array has one value that represents the collection of scalar values. Each element of the array is a distinct scalar value. This is analogous to a pride of lions being treated as a single entity (which is why we refer to it in the singular) that has properties distinct from those of the individual lion.
Notice also that while a name always points to a value, a value doesn't always have to be pointed to by a name, as the array elements in Figure 1.4 or anonymous arrays and hashes exemplify.
To support painless and transparent memory management, Perl maintains a reference count for every value, whether it is directly pointed to by a name or not. Let's add this piece of information to our earlier view. Refer to Figure 1.5.
Figure 1.5: Adding reference counts to all values
As you can see, the reference count represents the number of arrows pointing to the value part of a variable. Because there is always an arrow from the name to its value, the variable's reference count is at least 1. When you obtain a reference to a variable, the corresponding value's reference count is incremented.
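Reference counts are not directly visible from a script, but their effect is. The sketch below is one way to observe it, under the assumption that a DESTROY method (part of Perl's object support, which this chapter has not covered) runs when a value's count drops to zero. The Tracker package is made up for this illustration.

```perl
use strict;
use warnings;

package Tracker;         # hypothetical class, used only to observe destruction
sub new     { my ($class, $log) = @_; return bless { log => $log }, $class; }
sub DESTROY { my ($self) = @_; push @{ $self->{log} }, "value freed"; }

package main;

my @log;
my $ref;
{
    my $obj = Tracker->new(\@log);   # the anonymous hash has one reference
    $ref = $obj;                     # a second reference to the same hash
}
# $obj is gone, but the hash survives because $ref still points to it.
push @log, "block exited";

undef $ref;                          # last reference dropped; DESTROY runs now
push @log, "ref cleared";

print join(", ", @log), "\n";        # block exited, value freed, ref cleared
```

The order of the log entries shows that the value is released at the moment its last reference disappears, not when the name that first held it goes out of scope.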
References in Other Languages
Tcl does not have a way of dynamically allocating anonymous data structures but, being a dynamic language, supports creation of new variables (names assigned automatically) at run-time. This approach is not only slow, but also highly error prone. In addition, the only way to pass a variable by reference is to pass the actual name of a variable, equivalent to Perl's symbolic references. All this makes it very difficult to create complex data structures (and very unmaintainable if you do so). But, in all fairness, it must be stressed that Tcl is meant to be a glue language between applications and toolkits, and it is expected that most complex processing happens in the C-based application itself, rather than within the script. Tcl was not designed to be used as a full-fledged scripting or development language (though I have heard that its limited scope hasn't stopped people from writing 50,000-line scripts to drive oil rigs!).
Python is similar to Java in that, except for fundamental types, all objects are passed around by reference. This means that assigning a list-valued variable to another simply results in the second list variable being an alias of the first; if you want a copy, you have to make one explicitly and pay the corresponding price in performance. I much prefer this style to Perl's because you typically refer to a structure far more often than you copy it, and it is nice to have a default that is efficient and saves typing.
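For comparison, here is a minimal sketch of the Perl side of this contrast: plain array assignment copies, and aliasing requires an explicit reference. The variable names are arbitrary.

```perl
use strict;
use warnings;

my @list  = (1, 2, 3);
my @copy  = @list;       # assignment copies every element
my $alias = \@list;      # a reference shares the original array

push @list, 4;
print scalar(@copy), "\n";      # 3, the copy is unaffected
print scalar(@$alias), "\n";    # 4, the reference sees the change

$alias->[0] = 100;
print "$list[0] $copy[0]\n";    # 100 1
```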
Like Perl, Python reference counts each of its data types, including user-defined types defined in C/C++ extensions.
C and C++ support pointers whose type safety can be checked at compile time. Since a pointer contains the raw address of the data, a reference to a piece of data is as efficient and compact as it gets. On the other hand, this puts all the responsibility of memory management on the programmer. It is worth examining the implementation of interpreters such as Tcl, Perl, and Python (all having been implemented in C) to learn about memory management strategies.
Resources
  1. perlref (Perl documentation)
  2. Uniprocessor Garbage Collection Techniques. Paul Wilson. International Workshop on Memory Management, 1992.
    This paper gives a comprehensive treatment of GC issues. Available from ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps
Chapter 2: Implementing Complex Data Structures
Don't worry, spiders, I keep house casually.
—Kobayashi Issa
The success of Perl is a tribute to the fact that many problems can be solved by using just its fundamental data types. Jon Bentley's books Programming Pearls and More Programming Pearls are further testament to how much can be achieved if the basic data structures are dynamic and memory management is automatic. But as programs become more complex, moving from the domain of the script to that of the application, there is an increasing need for representing data in much more complex ways than can sometimes be achieved with the basic data types alone.
In this chapter, we will apply the syntax and concepts learned in Chapter 1 to a few "real" examples. We will write bits of code that build complex structures from file-based data and use sequences of $'s and @'s without batting an eyelid. For each problem, we will examine different ways of representing the same data and study the trade-offs in program versus programmer efficiency. In the interest of clarity, we will not worry too much about error handling.
Tom Christiansen has written an excellent series of tutorials called FMTEYEWTK (Far More Than Everything You've Ever Wanted to Know!) [Section 2.6]. This series contains a motley collection of topics that crop up on the Perl Usenet groups. I admire them for their lucid, patient, and detailed explanations and recommend that you read them at some point. (Now is better!) Some of them are now packaged with the Perl distribution; in particular, the perldsc (data structures cookbook) document is a tutorial for building and manipulating complex structures.
Before we start the examples, we will study what it takes to create structures à la C or C++.
User-Defined Structures
The struct declaration in C provides a notion of user-defined types (though it doesn't quite have first-class status, like an int), and a typedef statement is then used to alias it to a new type name. Java and C++ have the class declaration to compose new data types out of fundamental data types. These constructs allow you to combine a bunch of named attributes under a single banner but still provide access to each individual attribute.
Perl has no such built-in template feature. One commonly used convention is to simulate structures using a hash table, as shown in Figure 2.1.
Figure 2.1: Simulating C structures with Perl hashes
Perl's implementation of hash tables is actually quite efficient in terms of both performance and space. Since hash keys are immutable strings, Perl keeps only one systemwide copy of a hash key. This prevents a hundred foo structures from creating a hundred copies of the strings a and str.
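Since the figure is not reproduced here, the following minimal sketch shows what simulating a structure with attributes a and str might look like as a hash. The values and the @records array are made up for illustration.

```perl
use strict;
use warnings;

# One "structure" with two named attributes, a and str.
my %foo = (a => 10, str => "hello");

$foo{a} = 20;                        # like foo.a = 20 in C
print "$foo{a} $foo{str}\n";         # 20 hello

# Many structures of the same shape: an array of anonymous hashes.
my @records = map { +{ a => $_, str => "item $_" } } 1 .. 3;
print $records[2]{str}, "\n";        # item 3
```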
Another way to create a user-defined collection of attributes is to use an array @foo instead, which is slightly more efficient, yet a tad more cumbersome:
$a = 0; $str = 1;     # Indices
$foo[$a]   = 10;      # Equivalent to foo.a = 10 in C.
$foo[$str] = "hello"; # equivalent to foo.str = "hello" in C.
Remember, if a certain data structure is represented far more easily in C than in Perl and requires a considerable amount of manipulation, you could consider keeping it in C/C++ itself and not bother duplicating it in Perl. You will need to provide a set of C procedures that can manipulate this data. A very simple tool called SWIG (discussed in Chapter 18) allows you to do this painlessly.
You can also use pack or sprintf to encode a set of values to get one composite entity, but accessing individual data elements is neither convenient nor efficient (in time). pack is a good option when you need to be frugal about space, because it converts a list of values into one scalar value without necessarily changing each individual item's machine representation;
Example: Matrices
Before we embark on this example, you must know that if you really want a good efficient implementation of matrices, you should check out the PDL module (Perl Data Language) from CPAN.
To gain a better understanding of different matrix representations, we will write routines to construct these structures from a data file and to multiply two matrices. The file is formatted as follows:
MAT1
1  2  4
10 30 0

MAT2
5  6 
1  10
Each matrix has a label and some data. We use these labels to create global variables with the corresponding names (@MAT1 and @MAT2).
An array of arrays is the most intuitive representation for a matrix in Perl, since there is no direct support for two-dimensional arrays:
@matrix = (
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
); 
# Change 6, the element at row 1, column 2, to 100
$matrix[1][2] = 100;
Note that @matrix is a simple array whose elements happen to be references to anonymous arrays. Further, recall that $matrix[1][2] is a simpler way of saying $matrix[1]->[2].
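As an illustration of working with this representation, here is a minimal sketch of multiplying two matrices stored as arrays of arrays. The subroutine name matrix_multiply is an assumption, not taken from the book; the data is the MAT1 and MAT2 sample from the file format above, multiplied as MAT2 x MAT1 because those are the dimensions that line up.

```perl
use strict;
use warnings;

# Multiply two matrices passed as references to arrays of arrays.
sub matrix_multiply {                      # hypothetical name
    my ($r_a, $r_b) = @_;
    my $rows  = @$r_a;                     # rows in A
    my $inner = @{ $r_a->[0] };            # columns in A
    my $cols  = @{ $r_b->[0] };            # columns in B
    die "Dimension mismatch\n" unless $inner == @$r_b;

    my @product;
    for my $i (0 .. $rows - 1) {
        for my $j (0 .. $cols - 1) {
            my $sum = 0;
            for my $k (0 .. $inner - 1) {
                $sum += $r_a->[$i][$k] * $r_b->[$k][$j];
            }
            $product[$i][$j] = $sum;       # rows are created on demand
        }
    }
    return \@product;
}

my @MAT1 = ([1, 2, 4], [10, 30, 0]);       # 2 x 3
my @MAT2 = ([5, 6], [1, 10]);              # 2 x 2
my $result = matrix_multiply(\@MAT2, \@MAT1);
print "@$_\n" for @$result;
# 65 190 20
# 101 302 4
```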
Example 2.1 reads the file and creates the array of arrays structure for each matrix. Pay particular attention to the push statement (highlighted); it uses the symbolic reference facility to create variables (@{$matrix_name}) and appends a reference to a new row in every iteration. We are assured of newly allocated rows in every iteration because @row is local to that block, and when the if statement is done, its contents live on because we squirrel away a reference to the array's value. (Recall that it is the value that is reference counted, not the name.)
Example 2.1. Reading a Matrix from a File
sub matrix_read_file {
    my ($filename) = @_;
    open (F, $filename) || die "Could not open $filename: $!";
    while ($line = <F>) {
        chomp($line);
        next if $line =~ /^\s*$/; # skip blank lines
        if ($line =~ /^([A-Za-z]\w*)/) {
            $matrix_name = $1;
        } else {
           
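The listing above is cut off in this excerpt. As a stand-in, here is a minimal sketch of a reader for the same file format. It deliberately swaps in a different technique: instead of using symbolic references to create global arrays named after each label, it collects the matrices in a hash keyed by label. The name matrix_read and the in-memory sample data are assumptions made for this example.

```perl
use strict;
use warnings;

# Read labelled matrices from an open filehandle into a hash of
# arrays of arrays, keyed by the matrix label.
sub matrix_read {                          # hypothetical name
    my ($fh) = @_;
    my (%matrices, $matrix_name);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ($line =~ /^([A-Za-z]\w*)/) {
            $matrix_name = $1;             # a label starts a new matrix
        } else {
            my @row = split ' ', $line;    # a fresh array in every iteration
            push @{ $matrices{$matrix_name} }, \@row;
        }
    }
    return \%matrices;
}

# Sample data in the format described earlier, read from memory.
my $data = "MAT1\n1  2  4\n10 30 0\n\nMAT2\n5  6\n1  10\n";
open(my $fh, '<', \$data) or die "Could not open in-memory file: $!";
my $m = matrix_read($fh);
close $fh;

print $m->{MAT1}[1][1], "\n";              # 30
print scalar(@{ $m->{MAT2} }), "\n";       # 2 (rows in MAT2)
```

As in the description above, each row survives the end of its iteration because a reference to it has been stored away.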
Professors, Students, Courses
This example shows how you might represent professor, student, and course data as hierarchical records and how to link them up. Assume that the data files look like this:
#file: professor.dat
id          : 42343                # Employee Id
Name        : E.F.Schumacher
Office Hours: Mon 3-4, Wed 8-9
Courses     : HS201, SS343         # Course taught
...

#file: student.dat
id          : 52003                # Registration id
Name        : Garibaldi
Courses     : H301, H302, M201     # Courses taken
...

#file: courses.dat
id          : HS201
Description : Small is beautiful
Class Hours : Mon 2-4, Wed 9-10, Thu 4-5
...
Each "id:" line starts a new record.
Among other tasks, let us say we are required to find out whether there is a scheduling conflict on professors' and students' hours. Because our focus is on data representation and getting a feel for Perl's reference syntax, we will look at implementing only some parts of the problem.
A hash table is a good representation for a heterogeneous record, as we mentioned earlier, so a student structure may be implemented like this:
$student{52003} = {
    'Name'    => 'Garibaldi',
    'Courses' => [ ]};
A number of subtle design choices have been made here.
We could have replaced "foreign keys" (to use the database term) such as "HS201" with references to the corresponding course data structures. We didn't, because it is then tempting to directly dereference these references, in which case the student code is aware of how the course data is structured.
We maintain separate global hash tables for students, courses, and professors — yet another effort to keep mostly unrelated data completely separate and to make it possible to change a part of the system without affecting everyone.
There is one piece of data we haven't discussed before: time ranges. Both professors and courses have certain "busy" or "active" hours. What is a good representation for this? You might choose to represent the line "Mon 2-3, Tue 4-6" as follows:
Pass the Envelope
Let us say we are given a text file containing Academy Award (Oscar) winners by year and category, formatted as follows:
1995:Actor:Nicolas Cage
1995:Picture:Braveheart
1995:Supporting Actor:Kevin Spacey
1994:Actor:Tom Hanks
1994:Picture:Forrest Gump
1928:Picture:WINGS
We would like to provide the following services:
  • Given a year and category, print the corresponding entry.
  • Given a year, print all entries for that year.
  • Given a category, print the year and title of all entries for that category.
  • Print all entries sorted by category or by year.
Since we would like to retrieve entries by category or by year, we use a double indexing scheme, as shown in Figure 2.2.
Figure 2.2: Data structure to represent Oscar winners
Each entry includes a category, a year, and the name of the corresponding winner. We choose to keep this information in an anonymous array (an anonymous hash would do just as well). The two indices %year_index and %category_index map the year and category to anonymous arrays containing references to the entries. Here is one way to build this structure:
open (F, "oscar.txt") || die "Could not open database: $!";
%category_index = (); %year_index = ();
while ($line = <F>) {
    chomp $line;
    ($year, $category, $name) = split (/:/, $line);
    create_entry($year, $category, $name) if $name;
}

sub create_entry {             # create_entry (year, category, name)
    my($year, $category, $name) = @_;
    # Create an anonymous array for each entry
    my $rlEntry = [$year, $category, $name];
    # Add this to the two indices
    push (@{$year_index{$year}}, $rlEntry);          # By Year
    push (@{$category_index{$category}}, $rlEntry);  # By Category
}
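With both indices in place, the lookup services listed earlier become short. The following is a minimal sketch under stated assumptions: the subroutine names find_entry and list_category are made up, and the entries are created inline rather than read from oscar.txt so that the example is self-contained.

```perl
use strict;
use warnings;

my (%year_index, %category_index);

sub create_entry {
    my ($year, $category, $name) = @_;
    my $entry = [$year, $category, $name];
    push @{ $year_index{$year} },         $entry;   # By year
    push @{ $category_index{$category} }, $entry;   # By category
}

create_entry(@$_) for (
    [1995, 'Picture',          'Braveheart'],
    [1995, 'Supporting Actor', 'Kevin Spacey'],
    [1994, 'Actor',            'Tom Hanks'],
    [1994, 'Picture',          'Forrest Gump'],
    [1928, 'Picture',          'WINGS'],
);

# Given a year and category, return the matching entry (or nothing).
sub find_entry {                          # hypothetical name
    my ($year, $category) = @_;
    for my $entry (@{ $year_index{$year} || [] }) {
        return $entry if $entry->[1] eq $category;
    }
    return;
}

# Given a category, return "year: name" strings, oldest first.
sub list_category {                       # hypothetical name
    my ($category) = @_;
    my @entries = sort { $a->[0] <=> $b->[0] }
                  @{ $category_index{$category} || [] };
    return map { "$_->[0]: $_->[2]" } @entries;
}

my $winner = find_entry(1994, 'Actor');
print "$winner->[2]\n";                   # Tom Hanks
print "$_\n" for list_category('Picture');
# 1928: WINGS
# 1994: Forrest Gump
# 1995: Braveheart
```

The "|| []" guards keep a lookup for an unknown year or category from adding an empty array to the index as a side effect.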
Pretty-Printing
In building complicated data structures, it is always nice to have a pretty-printer handy for debugging. There are at least two options for pretty-printing data structures. The first is the Perl debugger itself. It uses a function called dumpValue in a file called dumpvar.pl, which can be found in the standard library directory. We can help ourselves to it, with the caveat that it is an unadvertised function and could change someday. To pretty-print this structure, for example:
@sample = (11.233,{3 => 4, "hello" => [6,7]});
we write the following:
require 'dumpvar.pl';
dumpValue(\@sample); # always pass by reference
This prints something like this:
0  11.233
1  HASH(0xb75dc0)
   3 => 4
   'hello' => ARRAY(0xc70858)
      0  6
      1  7
We will cover the require statement in Chapter 6. Meanwhile, just think of it as a fancy #include (which doesn't load the file if it is already loaded).
The Data::Dumper module available from CPAN is another viable alternative for pretty-printing. Chapter 10 covers this module in some detail, so we will not say any more about it here. Both modules detect circular references and handle subroutine and glob references.
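For a quick taste before that chapter, here is a minimal sketch of dumping the same @sample structure with Data::Dumper. The two configuration variables are optional; they are set here only to make the output compact and the key order stable.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;     # print hash keys in sorted order
$Data::Dumper::Indent   = 1;     # use a compact, fixed indentation

my @sample = (11.233, { 3 => 4, "hello" => [6, 7] });

# Dumper returns a string of Perl code that describes the structure.
my $dump = Dumper(\@sample);
print $dump;
```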
It is fun and instructive to write a pretty-printer ourselves. Example 2.5 illustrates a simple effort, which accounts for circular references but doesn't follow typeglobs or subroutine references. This example is used as follows:
pretty_print(@sample); # Doesn't need a reference
This prints
11.233
{ # HASH(0xb78b00)
:  3 => 4
:  hello =>
:  :  [ # ARRAY(0xc70858)
:  :  :  6
:  :  :  7
:  :  ]
}
The following code contains specialized procedures (print_array, print_hash, or print_scalar) that know how to print specific data types. print_ref, charged with the task of pretty-printing a reference, simply dispatches control to one of the above procedures depending upon the type of argument given to it. In turn, these procedures may call
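The book's own listing is not part of this excerpt. As a stand-in, here is a minimal sketch of a pretty-printer that produces output in the format shown above. It is not the author's code: it uses a single recursive helper (the made-up name _emit) rather than separate print_array, print_hash, and print_scalar procedures, and it returns a string instead of printing directly.

```perl
use strict;
use warnings;

my %seen;    # references already visited, to guard against cycles

# Returns the formatted text for all arguments, one item per line.
sub pretty_print {
    my @out;
    %seen = ();
    _emit(\@out, 0, $_) for @_;
    return join '', map { "$_\n" } @out;
}

sub _emit {                                    # hypothetical helper
    my ($out, $level, $value) = @_;
    my $pad  = ':  ' x $level;
    my $type = ref $value;
    if (!$type) {                              # plain number or string
        push @$out, $pad . (defined $value ? $value : 'undef');
    } elsif ($seen{$value}++) {                # circular or repeated reference
        push @$out, $pad . $value . ' (already seen)';
    } elsif ($type eq 'ARRAY') {
        push @$out, $pad . '[ # ' . $value;
        _emit($out, $level + 1, $_) for @$value;
        push @$out, $pad . ']';
    } elsif ($type eq 'HASH') {
        push @$out, $pad . '{ # ' . $value;
        for my $key (sort keys %$value) {
            my $item = $value->{$key};
            if (ref $item) {                   # nested structure goes below the key
                push @$out, $pad . ':  ' . $key . ' =>';
                _emit($out, $level + 2, $item);
            } else {
                push @$out, $pad . ':  ' . $key . ' => '
                          . (defined $item ? $item : 'undef');
            }
        }
        push @$out, $pad . '}';
    } else {                                   # code, glob, etc. are not followed
        push @$out, $pad . $value;
    }
}

my @sample = (11.233, { 3 => 4, "hello" => [6, 7] });
print pretty_print(@sample);
```

Keys are sorted here so that the output is repeatable; the addresses shown after HASH and ARRAY will differ from run to run.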
Resources
  1. The FMTEYEWTK series (Far More Than Everything You've Ever Wanted to Know). Tom Christiansen. Available at http://language.perl.com/info/documentation.html.
Chapter 3: Typeglobs and Symbol Tables
We are symbols, and inhabit symbols.
—Ralph Waldo Emerson
This chapter discusses typeglobs, the symbol table, filehandles, formats, and the differences between dynamic and lexical scoping. At first sight, these topics may seem to lack a common theme, but as it happens, they are intimately tied to typeglobs and symbol tables.
Typeglobs are immensely useful. They allow us to efficiently create aliases of symbols, which is the basis for a very important module called Exporter that is used in a large number of freely available modules. Typeglobs can also be aliased to ordinary references in such a way that you don't have to use the dereferencing syntax; this is not only easier on the eye, it is faster too. At the same time, using typeglobs without understanding how they work can lead to a particularly painful problem called variable suicide. This might explain why most Perl literature gives typeglobs very little attention.
Closely related to typeglobs and symbol tables is the subject of dynamic versus lexical scoping (using local versus my). There are a couple of useful idioms that arise from these differences.
This is the only chapter that starts off by giving a picture of what is going on inside, rather than first presenting examples that you can use directly. The hope is that you will find the subsequent discussions really easy to follow.
Perl Variables, Symbol Table, and Scoping
Variables are either global or lexical (those tagged with my). In this section we briefly study how these two are represented internally. Let us start with global variables.
Perl has a curious feature that is typically not seen in other languages: you can use the same name for both data and nondata types. For example, the scalar $spud, the array @spud, the hash %spud, the subroutine &spud, the filehandle spud, and the format name spud are all simultaneously valid and completely independent of each other. In other words, Perl provides distinct namespaces for each type of entity. I do not have an explanation for why this feature is present. In fact, I consider it a rather dubious facility and recommend that you use a distinct name for each logical entity in your program; you owe it to the poor fellow who's going to maintain your code (which might be you!).
Perl uses a symbol table (implemented internally as a hash table) to map identifier names (the string "spud" without the prefix) to the appropriate values. But you know that a hash table does not tolerate duplicate keys, so you can't really have two entries in the hash table with the same name pointing to two different values. For this reason, Perl interposes a structure called a typeglob between the symbol table entry and the other data types, as shown in Figure 3.1; it is just a bunch of pointers to values that can be accessed by the same name, with one pointer for each value type. In the typical case, in which you have unique identifier names, all but one of these pointers are null.
Figure 3.1: Symbol table and typeglobs
A typeglob is a real data type accessible from script space and has the prefix "*"; while you can think of it as a wildcard representing all values sharing the identifier name, there's no pattern matching going on. You can assign typeglobs, store them in arrays, create local versions of them, or print them out, just as you can for any fundamental type. More on this in a moment.
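The arrangement can be poked at from a script. The following is a minimal sketch, with made-up values, that gives a scalar, an array, and a subroutine the same name and then reaches each of them through the single typeglob *spud using Perl's *glob{THING} syntax:

```perl
use strict;
use warnings;

# Three unrelated entities sharing one identifier (illustrative values).
our $spud = "Wow!";
our @spud = ("idaho", "russet");
sub spud { return "called spud" }

# The symbol table of package main is the hash %main::,
# and it holds a single entry for the name "spud".
my $has_entry = exists $main::{spud} ? 1 : 0;

# The typeglob *spud holds one slot per value type.
my $scalar_ref = *spud{SCALAR};   # same as \$spud
my $array_ref  = *spud{ARRAY};    # same as \@spud
my $code_ref   = *spud{CODE};     # same as \&spud

print "entry exists: $has_entry\n";
print "scalar: $$scalar_ref\n";
print "array:  @$array_ref\n";
print "code:   ", $code_ref->(), "\n";
```

The variables are declared with our only so that the sketch runs under use strict; the typeglob behaves the same way without it.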
Typeglobs
Typeglobs, we mentioned earlier, can be localized (with local only) and assigned to one another. Assigning a typeglob has the effect of aliasing one identifier name to another. Consider
$spud   = "Wow!";
@spud   = ("idaho", "russet");
*potato = *spud;     # Alias potato to spud using typeglob assignment
print "$potato\n";   # prints "Wow!"
print "@potato\n";   # prints "idaho russet"
Once the typeglob assignment is made, all entities that were called "spud" can now also be referred to as "potato"—the names are freely interchangeable. That is, $spud and $potato are the same thing, and so are the subroutines &spud and &potato. Figure 3.2 shows the picture after a typeglob assignment; both entries in the symbol table end up pointing to the same typeglob value.
Figure 3.2: Assigning *spud to *potato: both symbol table entries point to the same typeglob
The alias holds true until the typeglob is reassigned or removed. (We will shortly see how to remove a typeglob.) In the example, there is no subroutine called spud, but if we define it after the typeglobs have been assigned, that subroutine can also be invoked as potato. It turns out that the alias works the other way too. If you assign a new list to @potato, it will also be automatically accessible as @spud.
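Both claims can be checked with a short sketch. The string eval below is just one way to make sure the subroutine is defined at run time, after the typeglob assignment has already happened; the return value and list contents are illustrative:

```perl
use strict;
use warnings;

our (@spud, @potato);
@spud = ("idaho", "russet");

*potato = *spud;    # from here on, "potato" is another name for "spud"

# Define &spud at run time, after the alias is in place.
eval q{ sub spud { return "I am spud" } };
die $@ if $@;

# The alias works in the other direction too.
@potato = ("yukon", "gold");

my $via_alias = potato();   # invokes &spud through the alias
print "$via_alias\n";       # prints "I am spud"
print "@spud\n";            # prints "yukon gold"
```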
For now, there is no easy, intuitive way to get rid of an alias created by a typeglob assignment (you may reassign it, of course). You can, however, get temporary aliases using local, because it restores the typeglob's values at the end of the block.
Consider
$b = 10;
{
    local *b;    # Save *b's values
    *b = *a;     # Alias b to a
    $b = 20;     # Same as modifying $a instead
}                # *b restored at end of block
print $a;        # prints "20"
print $b;        # prints "10"
Typeglobs and References
You might have noticed that both typeglobs and references point to values. A variable $a can be seen simply as a dereference of a typeglob ${*a}. For this reason, Perl makes the two expressions ${\$a} and ${*a} refer to the same scalar value. This equivalence of typeglobs and ordinary references has some interesting properties and results in three useful idioms, described here.
Earlier, we saw how a statement like *b = *a makes everything named "a" be referred to as "b" also. There is a way to create selective aliases, using the reference syntax:
*b = \$a;     # Assigning a scalar reference to a typeglob
Perl arranges it such that $b and $a are aliases, but @b and @a (or &b and &a, and so on) are not.
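A small sketch makes the selectivity visible; the values are arbitrary, and the our declaration is there only to satisfy use strict:

```perl
use strict;
use warnings;

our ($a, $b, @a, @b);
$a = 10;
@a = (1, 2, 3);
@b = ("x");

*b = \$a;      # alias only the scalar slot of *b
$b = 20;       # really modifies $a

print "$a\n";  # prints "20"
print "@a\n";  # prints "1 2 3"
print "@b\n";  # prints "x", because @b is untouched by the alias
```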
We get read-only variables by creating references to constants, like this:
*PI = \3.1415927;
# Now try to modify it.
$PI = 10;
Perl complains: "Modification of a read-only value attempted at try.pl line 3."
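If you would rather observe the failure than have the script die, you can trap it with eval. This is only a sketch of that idea; the variable names $ok and $error are invented for the illustration:

```perl
use strict;
use warnings;

our $PI;
*PI = \3.1415927;    # $PI now aliases a constant

# Attempt the modification inside eval so the error can be inspected.
my $ok    = eval { $PI = 10; 1 };
my $error = $ok ? "" : $@;

print "PI is still $PI\n";
print "error: $error";
```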
We will cover anonymous subroutines in the next chapter, so you might want to come back to this example later.
If you find it painful to call a subroutine indirectly through a reference (&$rs()), you can assign a name to it for convenience:
sub generate_greeting {
    my ($greeting) = @_;
    sub { print "$greeting world\n"; }
}
$rs = generate_greeting("hello");
# Instead of invoking it as &$rs(), give it your own name.
*greet = $rs;
greet();    # Equivalent to calling &$rs(). Prints "hello world\n"
Of course, you can also similarly give a name to other types of references.
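For instance, an anonymous array or hash can be given a name in the same way. The names @colors and %ages below are hypothetical, chosen only for this sketch:

```perl
use strict;
use warnings;

our (@colors, %ages);
my $rl = ["red", "green"];    # anonymous array
my $rh = { calvin => 6 };     # anonymous hash

*colors = $rl;    # @colors is now a name for the anonymous array
*ages   = $rh;    # %ages is now a name for the anonymous hash

push @colors, "blue";         # the change is visible through $rl as well
print "@$rl\n";               # prints "red green blue"
print "$ages{calvin}\n";      # prints "6"
```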
We have seen how references and typeglobs are equivalent (in the sense that references can be assigned to typeglobs). Perl also allows you to take references
Filehandles, Directory Handles, and Formats
The built-in functions open and opendir initialize a filehandle and a directory handle, respectively:
open(F, "/home/calvin");
opendir (D, "/usr");
The symbols F and D are user-defined identifiers, but without a prefix symbol. Unfortunately, these handles don't have some basic facilities enjoyed by the important data types such as scalars, arrays, and hashes—you cannot assign handles, and you cannot create local handles:
local (G);   # invalid 
G = F;       # also invalid
Before we go further, it is important to know that the standard Perl distribution comes with a module called FileHandle that provides an object-oriented version of filehandles. This allows you to create filehandle "objects," to assign one to the other, and to create them local to the block. Similarly, directory handles are handled by DirHandle. Developers are now encouraged to use these facilities instead of the techniques described next. But you still need to wade through the next discussion because there is a large amount of freeware code in which you will see these constructs; in fact, the standard modules FileHandle, DirHandle, and Symbol, as well as the entire IO hierarchy of modules, are built on this foundation.
Why is it so important to be able to assign handles and create local filehandles? Without assignment, you cannot pass filehandles as parameters to subroutines or maintain them in data structures. Without local filehandles, you cannot create recursive subroutines that open files (for processing included files, which themselves might include more, for example).
The simple solution to this problem is to use typeglob assignment. That is, if you feel the urge to say,
G = F;
# or,
local(F);
you can write it instead in terms of typeglobs:
*G = *F;
# or, 
local (*F);
Similarly, if you want to store filehandles in data structures or create references to them, you use the corresponding typeglob. All I/O operators that require filehandles also accept typeglob references. Let us take a look at what we can do with assigning filehandles and localizing them (using typeglobs, of course).
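As a rough sketch of passing a filehandle to a subroutine, the code below hands over a reference to the typeglob *F. The subroutine name count_lines is invented for the example, and the handle is opened on an in-memory string (which assumes Perl 5.8 or later) so that the sketch does not depend on any file on disk:

```perl
use strict;
use warnings;

# Count the lines available on the handle passed in as a typeglob reference.
sub count_lines {
    my ($fh) = @_;    # $fh holds \*F
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    return $n;
}

my $text = "line one\nline two\nline three\n";   # stand-in for a real file
open(F, "<", \$text) or die "cannot open in-memory file: $!";
my $count = count_lines(\*F);    # pass the handle via its typeglob
close(F);

print "$count\n";    # prints "3"
```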
Chapter 4: Subroutine References and Closures
Many are called, but few are called back.
—Sister Mary Tricky
As with ordinary variables, subroutines can be named or anonymous, and Perl has a syntax for taking a reference to either type. Such references work rather like pointers to functions in C, and they can be used to create such sophisticated structures as the following:
  • Dispatch tables, or data structures that map events to subroutine references. When an event comes in, a dispatch table is used to look up the corresponding subroutine. This is useful in creating large and efficient switch statements, finite state machines, signal handlers, and GUI toolkits.
  • Higher-order procedures. A higher-order procedure takes other procedures as arguments (like the C library procedure qsort) or returns new procedures. The latter feature is available only in interpreted languages such as Perl, Python, and LISP (hey, LISPers, you have lambda functions!).
  • Closures. A closure is a subroutine that, when created, packages its containing subroutine's environment (all the variables it requires and that are not local to itself).
In the following sections, we look at the syntax for taking and managing subroutine references and subsequently use them in the applications listed.
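To make the second and third items concrete before the details arrive, here is a minimal sketch. Both subroutine names, select_if and greater_than, are made up for the illustration: the first takes a procedure as an argument, and the second returns a new procedure that holds on to $limit.

```perl
use strict;
use warnings;

# Takes a subroutine reference and a list; returns the elements it accepts.
sub select_if {
    my ($rtest, @items) = @_;
    my @kept;
    foreach my $item (@items) {
        push @kept, $item if $rtest->($item);
    }
    return @kept;
}

# Returns a new subroutine that tests "greater than $limit".
sub greater_than {
    my ($limit) = @_;
    return sub { $_[0] > $limit };
}

my @big = select_if(greater_than(10), 3, 12, 7, 40);
print "@big\n";    # prints "12 40"
```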
Subroutine References
There's nothing particularly fancy or magical about subroutine references. In this section, we'll study how to create references to named and anonymous subroutines and how to dereference them.
We saw earlier that to take a reference to an existing variable, we prefix it with a backslash. It is much the same with subroutines. \&mysub is a reference to &mysub. For example:
sub greet {
    print "hello \n";
}
$rs = \&greet; # Create a reference to subroutine greet
It is important to note that we are not calling the greet subroutine here, in the same way that we don't evaluate the value of a scalar when we take a reference to it.
Contrast this to the following code, which uses parentheses:
$rs = \&greet();
This expression likely doesn't do what you expect. It calls greet and produces a reference to its return value, which is the value of the last expression evaluated inside that subroutine. Since print executed last and returned a 1 or a 0 (indicating whether or not it was successful in printing the value), the result of this expression is a reference to a scalar containing 1 or 0! This is the kind of mistake that makes you wish for type-safety once in a while!
To summarize, do not use parentheses when taking a subroutine reference.
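The difference is easy to see with ref. In this sketch, greet is changed to return a string rather than print one, purely so that the result is easy to inspect:

```perl
use strict;
use warnings;

sub greet { return "hello" }    # returns a string instead of printing

my $rs_code  = \&greet;      # reference to the subroutine itself
my $rs_value = \&greet();    # calls greet, then references its return value

print ref($rs_code), "\n";     # prints "CODE"
print ref($rs_value), "\n";    # prints "SCALAR"
print $$rs_value, "\n";        # prints "hello"
```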
You can create an anonymous subroutine simply by omitting the name in a subroutine declaration. In every other respect, the declaration is identical to a named one.
$rs = sub {
           print "hello \n";
      };
This expression returns a reference to the newly declared subroutine. Notice that because it is an expression, it requires the semicolon at the end, unlike the declaration of a named subroutine.
Dereferencing a subroutine reference calls the subroutine indirectly. As with data references, Perl does not care whether $rs
Using Subroutine References
Let's look at some common examples of using subroutine references: callback functions and higher-order procedures.
A callback function is an ordinary subroutine whose reference is passed around. The caller (who uses that reference) doesn't necessarily have an idea of which subroutine is getting invoked. Let's examine three simple examples involving callback functions: dispatch tables, signal handlers, and plotting functions.
A typical dispatch table is an array or hash of subroutine references. The following example shows %options, a hash used as a dispatch table that maps a set of command-line options to different subroutines:
%options = (       # For each option, call appropriate subroutine.
   "-h"         => \&help,
   "-f"         => sub {$askNoQuestions = 1},
   "-r"         => sub {$recursive = 1},
   "_default_"  => \&default,
);

ProcessArgs (\@ARGV, \%options); # Pass both as references
Some of the references in this code are to named subroutines. Others don't do much, so it is just simpler to code them as inline, anonymous subroutines. ProcessArgs can now be written in a very generic way. It takes two arguments: a reference to an array that it parses and a mapping of options that it refers to while processing the array. For each option, it calls the appropriate "mapped" function, and if an invalid flag is supplied in @ARGV, it calls the function corresponding to the string _default_.
ProcessArgs is shown in Example 4.1.
Example 4.1. ProcessArgs
ProcessArgs (\@ARGV, \%options); # Pass both as references
sub ProcessArgs {
    # Notice the notation: rl = ref. to array, rh = ref. to hash
    my ($rlArgs, $rhOptions) = @_;
    foreach my $arg (@$rlArgs) {
        if (exists $rhOptions->{$arg}) {
            # The value must be a reference to a subroutine
            my $rsub = $rhOptions->{$arg};
            &$rsub();   # Call it.
        } else {        # Option does not exist.
            if (exists $rhOptions->{"_default_"}) {
                # Note the arrow: $rhOptions is a reference to a hash
                &{$rhOptions->{"_default_"}}();
            }
        }
    }
}
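To watch the dispatch happen, here is a self-contained sketch along the same lines. It uses a slightly condensed variant of ProcessArgs, a sample argument list instead of @ARGV, and handlers that merely record their names in a hypothetical @log array:

```perl
use strict;
use warnings;

my @log;    # records which handlers ran (for illustration only)

my %options = (
    "-h"        => sub { push @log, "help" },
    "-r"        => sub { push @log, "recursive" },
    "_default_" => sub { push @log, "default" },
);

sub ProcessArgs {
    my ($rlArgs, $rhOptions) = @_;
    foreach my $arg (@$rlArgs) {
        # Fall back to the _default_ entry for unknown flags.
        my $rsub = exists $rhOptions->{$arg}
                 ? $rhOptions->{$arg}
                 : $rhOptions->{"_default_"};
        $rsub->() if defined $rsub;
    }
}

ProcessArgs(["-r", "-x", "-h"], \%options);    # "-x" is not in the table
print "@log\n";    # prints "recursive default help"
```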
Closures
Instead of returning data, a Perl subroutine can return a reference to a subroutine. This is really no different from any other way of passing subroutine references around, except for a somewhat hidden feature involving anonymous subroutines and lexical (my) variables. Consider
$greeting = "hello world";
$rs = sub {
    print $greeting;
};
&$rs();  #prints "hello world"
In this example, the anonymous subroutine makes use of the global variable $greeting. No surprises here, right? Now, let's modify this innocuous example slightly:
sub generate_greeting {
    my($greeting) = "hello world";
    return sub {print $greeting};
}
$rs = generate_greeting();
&$rs(); # Prints "hello world"
The generate_greeting subroutine returns the reference to an anonymous subroutine, which in turn prints $greeting. The curious thing is that $greeting is a my variable that belongs to generate_greeting. Once generate_greeting finishes executing, you would expect all its local variables to be destroyed. But when you invoke the anonymous subroutine later on, using &$rs(), it manages to still print $greeting. How does it work?
Any other expression in place of the anonymous subroutine definition would have used $greeting right away. A subroutine block, on the other hand, is a package of code to be invoked at a later time, so it keeps track of all the variables it is going to need later on (taking them "to go," in a manner of speaking). When this subroutine is called subsequently and invokes print "$greeting", the subroutine remembers the value that $greeting had when that subroutine was created.
Let's modify this a bit more to really understand what this idiom is capable of:
sub generate_greeting {
    my($greeting) = @_;     # $greeting primed by arguments
    return sub {
                 my($subject)= @_;
                 print "$greeting $subject \n";
           };
}
$rs1 = generate_greeting("hello");
$rs2 = generate_greeting("my fair");

# $rs1 and $rs2 are two subroutines holding on to different $greeting's
&$rs1 ("world") ;  # prints "hello world"
&$rs2 ("lady") ;   # prints "my fair lady"
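The captured variable is not frozen; a closure can also update it between calls. The following sketch, in which make_counter is a name invented for the illustration, builds two counters that each keep their own private $count:

```perl
use strict;
use warnings;

sub make_counter {
    my ($count) = @_;                  # private to each closure
    return sub { return $count++ };    # returns the current value, then bumps it
}

my $c1 = make_counter(1);
my $c2 = make_counter(100);

# The two counters advance independently of each other.
my @seen = ($c1->(), $c1->(), $c2->(), $c1->());
print "@seen\n";    # prints "1 2 100 3"
```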
Using Closures
Closures are used in two somewhat distinct ways. The most common usage is as "smart" callback procedures. The other idiom is that of "iterators" (or "streams," as they are known in the LISP world).
Since closures are subroutine references with a bit of private data thrown in, they are very convenient to use as callback procedures in graphical user interfaces.
Let's say you create a button using the Tk toolkit and give it a subroutine reference. When the button is pressed, it calls this subroutine back. Now if the same subroutine is given to two different buttons on the screen, there's a problem: How does the subroutine know which button is calling it? Simple. Instead of giving the button a reference to an ordinary subroutine, you give it a "smart" callback subroutine—a closure. This closure stores away some data specific to a button (such as its name), and when the subroutine is called, it magically has access to that data, as shown in Example 4.2.
This example creates two buttons that, when clicked, print out their title strings. Though the discussion about packages and, specifically, the Tk module is slated for chapters to come, you might still understand the gist of the code in Example 4.2. For the moment, pay attention only to the part that uses closures (highlighted in boldface) and not to the mechanics of using the Tk module.
CreateButton creates a GUI button and feeds it a reference to an anonymous subroutine ($callback_proc), which holds on to $title, a my variable in its enclosing environment. When the user clicks on the button, the callback is invoked, whereupon it uses its stored value of $title.
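The closure part of the idea can be tried without a GUI toolkit at all. In the sketch below, which is not the Tk code, a "button" is just a hash holding a callback, create_button is a hypothetical stand-in for CreateButton, and the clicks are simulated by calling the callbacks directly:

```perl
use strict;
use warnings;

my @output;    # collects what the callbacks report

# Hypothetical stand-in for a GUI button: a hash holding a callback.
sub create_button {
    my ($title) = @_;
    my $callback_proc = sub { push @output, "clicked: $title" };
    return { command => $callback_proc };
}

my $b1 = create_button("hello");
my $b2 = create_button("world");

# Simulate the toolkit invoking the callbacks on button presses.
$b2->{command}->();
$b1->{command}->();
print "$_\n" for @output;
```

Both buttons share the same callback code, yet each invocation reports its own title because each closure holds on to its own $title.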
Example 4.2. Passing Closures Instead of Ordinary Subroutines
use Tk;
# Creates a top level window
$topwindow = MainWindow->new();
# Create two buttons. The buttons print their names when clicked on. 
CreateButton($topwindow, "hello"); 
CreateButton($topwindow, "world");
Tk::MainLoop();  # Dispatch events.
#--------------------------------------------------------------------
sub CreateButton {
    my ($parent, 
Comparisons to Other Languages
Tcl programmers rely heavily on dynamic evaluation (using eval) to pass around bits and pieces of code. While you can do this in Perl also, Perl's anonymous subroutines are packets of precompiled code, which definitely work faster than dynamic evaluation. Perl closures give you other advantages that are not available in Tcl: the ability to share private variables between different closures (in Tcl, they have to be global variables for them to be sharable) and not worry about variable interpolation rules (in Tcl, you have to take care to completely expand all the variables yourself using interpolation before you pass a piece of code along to somebody else).
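As an illustration of the sharing point on the Perl side, here is a minimal sketch in which two closures created in the same scope operate on one private variable. The names make_account, $deposit, and $report are invented for the example:

```perl
use strict;
use warnings;

# Two closures sharing one private variable.
sub make_account {
    my $balance = 0;    # visible only to the two subroutines below
    my $deposit = sub { $balance += $_[0] };
    my $report  = sub { return $balance };
    return ($deposit, $report);
}

my ($deposit, $report) = make_account();
$deposit->(50);
$deposit->(25);
print $report->(), "\n";    # prints "75"
```

Nothing outside make_account can reach $balance directly, so no global variable is needed for the two subroutines to cooperate.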
Python offers a weak form of closures: a subroutine can pick up variables only from its immediate containing environment. This is called "shallow binding," while Perl offers "deep binding." Mark Lutz's Programming Python (O'Reilly, 1996) shows a workaround to achieve deep binding, by setting default arguments to values in the immediately enclosing scope.
I prefer the environment to handle this stuff automatically for me, as Perl does.
C++ supports pointers to subroutines but does not support closures. You have to use the callback object idiom wherever a callback subroutine needs some contextual data to operate. If you don't want a separate callback object, you can inherit your object from a standard callback class and override a method called, say, "execute," so that the caller can simply say callback_object->execute().
Java offers neither closures nor pointers to subroutines (methods). Interfaces can be used to provide a standardized callback interface so that the caller doesn't have to care about the specific class of the object (as long as it implements that interface).
Resources