BUY THIS BOOK
Add to Cart

Print Book $34.95


Safari Books Online

What is this?

Add to UK Cart

Print Book £24.95

What is this?

Looking to Reprint this content?


Mastering Algorithms with Perl
Mastering Algorithms with Perl

By Jon Orwant, Jarkko Hietaniemi, John Macdonald
Price: $34.95 USD
£24.95 GBP

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Introduction
Computer Science is no more about computers than astronomy is about telescopes.
—E. W. Dijkstra
In this chapter, we'll discuss how to "think algorithms"—how to design and analyze programs that solve problems. We'll start with a gentle introduction to algorithms and a not-so-gentle introduction to Perl, then consider some of the tradeoffs involved in choosing the right implementation for your needs, and finally introduce some themes pervading the field: recursion, divide-and-conquer, and dynamic programming.
An algorithm is simply a technique—not necessarily computational—for solving a problem step by step. Of course, all programs solve problems (except for the ones that create problems). What elevates some techniques to the hallowed status of algorithm is that they embody a general, reusable method that solves an entire class of problems. Programs are created; algorithms are invented. Programs eventually become obsolete; algorithms are permanent.
Of course, some algorithms are better than others. Consider the task of finding a word in a dictionary. Whether it's a physical book or an online file containing one word per line, there are different ways to locate the word you're looking for. You could look up a definition with a linear search, by reading the dictionary from front to back until you happen across your word. That's slow, unless your word happens to be at the very beginning of the alphabet. Or, you could pick pages at random and scan them for your word. You might get lucky. Still, there's obviously a better way. That better way is the binary search algorithm, which you'll learn about in Chapter 5. In fact, the binary search is provably the best algorithm for this task.
We'll use binary search to explore what an algorithm is, how we implement one in Perl, and what it means for an algorithm to be general and efficient. In what follows, we'll assume that we have an alphabetically ordered list of words, and we want to determine where our chosen word appears in the list, if it even appears at all. In our program, each word is represented in Perl as a
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is an Algorithm?
An algorithm is simply a technique—not necessarily computational—for solving a problem step by step. Of course, all programs solve problems (except for the ones that create problems). What elevates some techniques to the hallowed status of algorithm is that they embody a general, reusable method that solves an entire class of problems. Programs are created; algorithms are invented. Programs eventually become obsolete; algorithms are permanent.
Of course, some algorithms are better than others. Consider the task of finding a word in a dictionary. Whether it's a physical book or an online file containing one word per line, there are different ways to locate the word you're looking for. You could look up a definition with a linear search, by reading the dictionary from front to back until you happen across your word. That's slow, unless your word happens to be at the very beginning of the alphabet. Or, you could pick pages at random and scan them for your word. You might get lucky. Still, there's obviously a better way. That better way is the binary search algorithm, which you'll learn about in Chapter 5. In fact, the binary search is provably the best algorithm for this task.
We'll use binary search to explore what an algorithm is, how we implement one in Perl, and what it means for an algorithm to be general and efficient. In what follows, we'll assume that we have an alphabetically ordered list of words, and we want to determine where our chosen word appears in the list, if it even appears at all. In our program, each word is represented in Perl as a scalar, which can be an integer, a floating-point number, or (as in this case) a string of characters. Our list of words is stored in a Perl array: an ordered list of scalars. In Perl, all scalars begin with an $ sign, and all arrays begin with an @ sign. The other common datatype in Perl is the hash, denoted with a % sign. Hashes "map" one set of scalars (the "keys") to other scalars (the "values").
Here's how our binary search works. At all times, there is a range of words, called a
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Efficiency
Central to the study of algorithms is the notion of efficiency—how well an implementation of the algorithm makes use of its resources. There are two resources that every programmer cares about: space and time. Most books about algorithms focus on time (how long it takes your program to execute), because the space used by an algorithm (the amount of memory or disk required) depends on your language, compiler, and computer architecture.
There's often a tradeoff between space and time. Consider a program that determines how bright an RGB value is; that is, a color expressed in terms of the red, green, and blue phosphors on your computer's monitor or your TV. The formula is simple: to convert an (R,G,B) triplet (three integers ranging from 0 to 255) to a brightness between 0 and 100, we need only this statement:
$brightness = $red * 0.118 + $green * 0.231 + $blue * 0.043;
Three floating-point multiplications and two additions; this will take any modern computer no longer than a few milliseconds. But even more speed might be necessary, say, for high-speed Internet video. If you could trim the time from, say, three milliseconds to one, you can spend the time savings on other enhancements, like making the picture bigger or increasing the frame rate. So can we calculate $brightness any faster? Surprisingly, yes.
In fact, you can write a program that will perform the conversion without any arithmetic at all. All you have to do is precompute all the values and store them in a lookup table—a large array containing all the answers. There are only 256 × 256 × 256 = 16,777,216 possible color triplets, and if you go to the trouble of computing all of them once, there's nothing stopping you from mashing the results into an array. Then, later, you just look up the appropriate value from the array.
This approach takes 16 megabytes (at least) of your computer's memory. That's memory that other processes won't be able to use. You could store the array on disk, so that it needn't be stored in memory, at a cost of 16 megabytes of disk space. We've saved time at the expense of space.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Recurrent Themes in Algorithms
Each algorithm in this book is a strategy—a particular trick for solving some problem. The remainder of this chapter looks at three intertwined ideas, recursion, divide and conquer, and dynamic programming, and concludes with an observation about representing data.
re·cur·sion \ri-'ker-zhen\ n See RECURSION
Something that is defined in terms of itself is said to be recursive. A function that calls itself is recursive; so is an algorithm defined in terms of itself. Recursion is a fundamental concept in computer science; it enables elegant solutions to certain problems. Consider the task of computing the factorial of , denoted and defined as the product of all the numbers from 1 to . You could define a factorial() subroutine without recursion:
# factorial($n) computes the factorial of $n,
#   using an iterative algorithm.
sub factorial {
    my ($n) = shift;
    my ($result, $i) = (1, 2);
    for ( ; $i <= $n; $i++) {
        $result *= $i;
    }
    return $result;
}
It's much cleaner to use recursion:
# factorial_recursive($n) computes the factorial of $n,
#   using a recursive algorithm.
sub factorial_recursive {
    my ($n) = shift;
    return $n if $n <= 2;
    return $n * factorial_recursive($n - 1);
}
Both of these subroutines are , since computing the factorial of requires multiplications. The recursive implementation is cleaner, and you might suspect faster. However, it takes four times as long on our computers, because there's overhead involved whenever you call a subroutine. The nonrecursive (or iterative) subroutine just amasses the factorial in an integer, while the recursive subroutine has to invoke itself repeatedly—and subroutine invocations take a lot of time.
As it turns out, there is an algorithm to approximate the factorial. That speed comes at a price: it's not exact.
sub factorial_approx {
    return sqrt (1.5707963267949 * $_[0]) *
        (($_[0] / 2.71828182845905) ** $_[0]);
}
We could have implemented binary search recursively also, with binary_search() accepting $low and $high as arguments, checking the current word, adjusting
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Basic Data Structures
What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
—Larry Wall
There are calendars that hang on a wall, and ones that fit in your pocket. There are calendars that have a separate row for each hour of the day, and ones that squeeze a year or two onto a page. Each has its use; you don't use a five year calendar to check whether you have time for a meeting after lunch tomorrow, nor do you use a day-at-a-time planner to schedule a series of month-long projects. Every calendar provides a different way to organize time—and each has its own strengths and weaknesses. Each is a data structure for time.
In this chapter and the next, we describe a wide variety of data structures and show you how to choose the ones that best suit your task. All computer programs manipulate data, usually representing some phenomenon in the real world. Data structures help you organize your data and minimize complexity; a proper data structure is the foundation of any algorithm. No matter how fast an algorithm is, at bottom it will be limited by how efficiently it can access your data.
As we explore the data structures fundamental to any study of algorithms, we'll see that many of them are already provided by Perl, and others can be easily implemented using the building blocks that Perl provides. Some data structures, such as sets and graphs, merit a chapter of their own; others are discussed in the chapter that makes use of them, such as B-trees in Chapter 5. In this chapter, we explore the data structures that Perl provides: arrays, hashes, and the simple data structures that result naturally from their use. In Chapter 3, we'll use those building blocks to create the old standbys of computer science, including linked lists, heaps, and binary trees.
There are many kinds of data structures, and while it's important for a programming language to provide built-in data structures, it's even more important to provide convenient and powerful ways to develop new structures that meet the particular needs of the task at hand. Just as computer languages let you write subroutines that enhance how you process data, they should also let you create new structures that give you new ways to store data.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Perl's Built-in Data Structures
Let's look at Perl's data structures and investigate how they can be combined to create more complex data structures tailored for a particular task. Then, we'll demonstrate how to implement the favorite data structures of computer science: queues and stacks. They'll all be used in algorithms in later chapters.
Many Perl programs never need any data structures other than those provided by the language itself, shown in Table 2-1.
Table 2-1: Basic Perl Datatypes
Type and Designating Symbol
Meaning
$scalar
number
integer or float
string
arbitrary length sequence of characters
reference
"pointer" to another Perl data structure
object
a Perl data structure that has been blessed into a class (accessed through a reference)
@array
an ordered sequence of scalars indexed by integers; arrays are sometimes called lists, but the two are not quite identical
%hash an unordered collection of scalars selected by strings (also known as associative arrays, and in some languages as dictionaries)
Every scalar contains a single value of any of the subtypes. Perl automatically converts between numbers and strings as necessary:
# start with a string
$date = "98/07/22";

# extract the substrings containing the numeric values
($year, $month, $day) = ($date =~ m[(\d\d)/(\d\d)/(\d\d)]);

# but they can just be used as numbers
$year += 1900;                            # Y2K bug!
$month = $month_name[$month-1];

# and then again as strings
$printable_date = "$month $day, $year";
Arrays and hashes are collections of scalars. The key to building more advanced data structures is understanding how to use arrays and hashes whose scalars also happen to be references.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Build Your Own Data Structure
The big trick for constructing elaborate data structures is to store references in arrays and hashes. Since a reference can refer to any type of variable you wish, and since arrays and hashes can contain multiple scalars (any of which can be references), you can create arbitrarily complicated structures.
One convenient way to manage complex structures is to augment them into objects. An object is a collection of data tied internally to a collection of subroutines called methods that provide customized access to the data structure.
If you adopt an object-oriented approach, your programs can just call methods instead of plodding through the data structure directly. A Point object might contain explicit values for x- and y-coordinates, while the corresponding Point class might have methods to synthesize and coordinates from them. This approach isolates the rest of the code from the internal representation; indeed, as long as the methods behave, the underlying structure can be changed without requiring any change to the rest of the program. You could change Point to use angular coordinates internally instead of Cartesian coordinates, and the x(), y(), rho(), and theta() methods would still return the correct values.
The main disadvantage of objects is speed. Invoking a method requires a subroutine call, while a direct implementation of a data structure can often use inline code, avoiding the overhead of subroutines. If you're using inheritance, which allows one class to use the methods of another, the situation becomes even more grim. Perl has to search through a hierarchy of classes to find the method. While Perl caches the result of that search, that first search takes time.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
A Simple Example
Consider an address—you know, what your grandparents used to write on paper envelopes for delivery by someone in a uniform. There are many components of an address: apartment or suite number, street number (perhaps with a fraction or letter), street name, rural route, municipality, state or province, postal code, and country. An individual location uses a subset of those components for its address. In a small village, you might use only the recipient's name.
Addresses seem simple only because we use them every day. Like many real-world phenomena, there are complicated relationships between the components. To deal with addresses, computer programs need an understanding of the disparate components and the relationships between them. They also need to store the components so that necessary manipulations can be made easily: whatever structure we use to store our addresses, it had better be easy to retrieve or change individual fields. You'd rather be able to say $address{city} than have to parse the city out of the middle of an address string with something like get_address(line=>4,/^[\s,]+/). There are many different data structures that could do the job. We'll now consider a few alternatives, starting with simple arrays and hashes. We could use one array per address:
@Watson_Address = (
    "Dr. Watson",
    "221b Baker St.",
    "London",
    "NW1",
    "England",
);
@Sam_Address = (
    "Sam Gamgee",
    "Bagshot Row",
    "Hobbiton",
    "The Shire",
);
Or, we could use a hash:
%Watson_Address = (
    name    => "Dr. Watson",
    street  => "221b Baker St.",
    city    => "London",
    zone    => "NW1",
    country => "England",
);
%Sam_Address = (
    name    => "Sam Gamgee",
    street  => "Bagshot Row",
    city    => "Hobbiton",
    country => "The Shire",
);
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Perl Arrays: Many Data Structures in One
Perl's arrays are more powerful than the arrays provided by C and many other languages. The built-in operators for manipulating arrays allows Perl programs to provide all of the capabilities for which other languages must resort to a multitude of different data structures.
Algorithm analysis often assumes that changing the length of an array is expensive, making it important to determine the exact size of arrays before the program starts. For this reason, many data structures are designed to restrict the way that they are accessed so that it is easier to implement them efficiently in such languages.
But in Perl, arrays can vary in length dynamically. Extending, contracting, and reordering mechanisms are built into the language. The traditional costs of reorganizing arrays are swept under the rug, but Perl provides a very plush rug indeed and the sweepings are rarely large enough to be detectable.
When an array must be grown, Perl allocates multiple additional elements at one time, choosing a number proportional to the current size of the array. That way, most array operations won't require individual allocation, but instead use one of the extra entries that was allocated the last time an allocation was required.
Traditional algorithms also take pains to ensure that structures that are no longer needed are carefully tracked so that their memory can be freed and reused for other purposes. Perl provides automatic garbage collection: detecting when data is no longer accessible and freeing it. Few Perl algorithms need to manage their own garbage (we'll discuss an exception in Section 3.1 in Chapter 3.)
The Perl programmer usually needn't worry about these issues. The result is code that's easier to understand and modify, making it possible to implement major improvements that more than make up for any minor inefficiencies that might occur from Perl's helpfulness.
If you are concerned that some of the costs hidden by Perl are too high, you can investigate as follows:
  1. Measure your program to see whether it is too slow—if it's not, stop worrying. There is a great danger that an attempt to speed up a program will make it harder to understand, harder to adapt to future needs, more likely to have bugs, and finally, not noticeably faster anyhow.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Advanced Data Structures
Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of a program lies. Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.
—Frederick P. Brooks, Jr., The Mythical Man-Month
There is a dynamic interplay between data structures and algorithms. Just as the right data structure is necessary to make some algorithms possible, the right algorithms are necessary to maintain certain data structures. In this chapter, we'll explore advanced data structures—structures that are extraordinarily useful, but complex enough to require algorithms of their own to keep them organized.
Despite the versatility of Perl's hashes and arrays, there are traditional data structures that they cannot emulate so easily. These structures contain interrelated elements that need to be manipulated in carefully prescribed ways. They can be encapsulated in objects for ease of programming, but often only at a high performance cost.
In later chapters, algorithms will take center stage, and the data structures in those chapters will be chosen to fit the algorithm. In this chapter, however, the data structures take center stage. We'll describe the following advanced data structures:
Linked list
A chain of elements linked together.
Binary tree
A pyramid of elements linked together, each with two child elements.
Heap
A collection of elements linked together in a tree-like order so that the smallest is easily available.
We'll leave some other structures for later in the book:
B-tree
A pyramid of elements where each element can have references to many others (in Chapter 5).
Set
An unstructured collection in which the only important information is who belongs and who doesn't in Chapter 6.
Graph
A collection of nodes and edges connecting them in Chapter 8.
Like a simple array, a linked list contains elements in a fixed order. In the discussion in the previous chapter of the splice
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Linked Lists
Like a simple array, a linked list contains elements in a fixed order. In the discussion in the previous chapter of the splice operator used for Perl lists, we described how splicing elements into or out of the middle of a large array can be expensive. To cut down the expense of copying large chunks of an array you can use a linked list. Instead of using memory as compactly as possible, placing one element right after the previous one as an array does, a linked list uses a separate structure for each element. Each of these structures has two fields: the value of the element and a reference to the next element in the list.
Linked lists are useful for ordering elements where you have to insert or delete them often, because you can just change a reference instead of copying the entire list. Nearly all word processors store text as a linked list. That's why cutting and pasting large amounts of text is so quick. Figure 3-1 shows the memory layout of the two types of lists.
Figure 3-1: Perl array and linked list
One difference between the array and the linked list is obvious: the linked list uses more space. Instead of 5 values in 1 structure, there are 10 values in 5 structures. In addition to the visible extra space for the 5 links, extra space is needed for the internal Perl overhead for each separate array.
Since the linked list contains 5 separate elements, it cannot be created as simply as an array. Often, you will find it easiest to add elements to the front of a list, which means that you must create it backwards. For instance, the following code creates a linked list of the first 5 squares:
$list = undef;
foreach (reverse 1..5) {
    $list = [ $list, $_ * $_ ];
}
If you are not used to dealing with references, or links, Figure 3-2 will be helpful in understanding how the list grows with each iteration of that loop.
Figure 3-2: Creating and adding links to a list
Each element of the linked list is a list containing two scalars. The first scalar, [0], is a reference that points to the next element of the linked list. The second scalar,
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Circular Linked Lists
One common variation of the linked list is the circular linked list, which has no beginning and no end. Here, instead of using undef to denote the end of the list, the last element points back to the first. Because of the circular link, the idea of the head and tail of the list gets fuzzier. The list pointer (e.g., $list) is no longer the only way to access the element at the head of the linked list—you can get to it from any element by following the right number of links. This means that you can simply reassign the list pointer to point to a different element to change which element is to be considered the head.
You can use circular lists when a list of items to be processed can require more than one processing pass for each item. A server process might be an example, since it would try to give each of its requests some time in turn rather than permit one possibly large request from delaying all of the others excessively.
A circular linked list gives you most of the capabilities of a deque. You can easily add elements to the end or beginning. (Just keep the list pointer always pointing at the tail, whose successor is by definition the head. Add new elements after the tail, either leaving the list pointer unchanged or changing it to point to the new element. The first option leaves the new element at the head of the list, while the second leaves the new element at the tail.)
Removing elements from the head is equally easy. Deleting the element after the tail removes the head element. However, you can't delete the last element of the list without scanning the entire list to find its predecessor. This is the one way that a circular linked list is less capable than a deque.
The circular linked list also has one capability that a deque lacks: you can inexpensively rotate the circle simply by reassigning the list pointer. A deque implemented as an array requires two splice operations to accomplish a rotation, which might be expensive if the array is long.
In practice, however, the most common change to the list pointer is to move it to the next element, which is an inexpensive operation for either a circular linked list or a deque (just
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Garbage Collection in Perl
Normally, Perl determines when a value is still needed using a technique called reference counting, which is simple and quick and creates no unpredictable delays in operation. The Perl interpreter keeps a reference counter for each value. When a value is created and assigned to a variable, the counter is set to one. If an additional reference is created to point to it, the count is incremented. A reference can go away for two reasons. First, when a block is exited, any variables that were defined in that scope are destroyed. The reference counts for their values is decremented. Second, if a new value is assigned that replaces a reference value, the count of the value that was previously referenced is decremented. Whenever a reference count goes to zero, there are no more variables referring to that value, so it can be destroyed. (If the deleted value is a reference, deletion causes a cascading effect for a while, since destroying the reference can reduce the reference count of the value that it refers to.)
my $p;
{
    my $x = "abc";
    my $y = "def";
    $p = \$x;       # the value "abc" now has a count of two
}
# "def" is freed
# "abc" remains in use

$p = 1;
# "abc" is freed
At the end of the block, $y has gone out of scope. Its value, "def", had a count of 1 so it can be freed. $x has also gone out of scope, but its value "abc" had a count of 2. The count is decremented to 1 and the value is not freed—it is still accessible through $p. Later, $p is reassigned, overwriting the reference to "abc". This means that the count for "abc" is decremented. Since its count is now zero, it is freed.
Reference counting is usually quite effective, but it breaks down when you have a circle of reference values. When the last outside variable that points to any of them is destroyed or changed, they all still have a nonzero count. Here's an example (shown in Figure 3-7):
# start a new scope
{
    # two variables
    my $p1 = 1;
    my $p2 = 2;

    # point them at each other
    $p1 = \$p2;
    $p2 = \$p1;
}
# end scope
Figure 3-7: Memory leak caused by deleting circular references
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Doubly-Linked Lists
A prime candidate for the cleanup mechanism just described is the doubly-linked list. Instead of one link field in each element, there are two. One points to the next element, as in the previous linked lists; the other points back to the previous element. It is also common for the ends of a doubly-linked list to be joined in a circle. Note that this data structure creates cycles from the circular linking of the ends, as well as a cycle from the forward and backward links between every adjacent pair of elements.
The link to the previous element means that it is not necessary to search through the entire list to find a node's predecessor. It is also possible to move back multiple positions on the list, which you can't do by keeping only a predecessor pointer. Of course, this flexibility comes at a cost: whenever a link is changed, the back link must also be changed, so every linking operation is twice as expensive. Sometimes it's worth it.
When using circular doubly-linked lists, it is useful to keep an element linked to itself when it is not on any list. That bit of hygiene makes it possible to have many of the operations work consistently for either a single element or a list of multiple elements. Consider, for example, the append() and prepend() functions about to be described, which insert one or many elements before or after a specific element. These functions work on a list that has only a single element so long as it points to itself. They fail if you have removed that element from another list without relinking the standalone element to point to itself. (The code for a singly-linked list earlier in this chapter overwrites the link field whenever it inserts an element into a list, so the code will work fine whatever old value was in the link field.)
Here's a package double that can carry out doubly-linked list operations. Parts of it are designed to coexist with the package double_head shown later in this chapter. The new method is a typical object creation function. The _link_to method is only for internal use; it connects two elements as neighbors within a list:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Infinite Lists
An interesting variation on lists is the infinite list, described by Mark-Jason Dominus in The Perl Journal, Issue #7. (The module is available from http://tpj.com/tpj/programs.)Infinite lists are helpful for cases in which you'll never be able to look at all of your elements. Maybe the elements are tough to compute, or maybe there are simply too many of them. For example, if your program had an occasional need to test whether a particular number belongs to an infinite series (prime numbers or Fibonacci numbers, perhaps), you could keep an infinite list around and search through it until you find a number that is the same or larger. As the list expands, the infinite list would cache all of the values that you've already computed, and would compute more only if the newly requested number was "deeper" into the list.
In infinite lists, the element's link field is always accessed with a next() method. Internally, the link value can have two forms. When it is a normal reference pointing to the next element, the next() method just returns it immediately. But when it is a code reference, the next() method invokes the code. The code actually creates the next node and returns a reference to it. Then, the next() method changes the link field of the old element from the code reference to a normal reference pointing to the newly found value. Finally, next() returns that new reference for use by the calling program. Thus, the new node is remembered and will be returned immediately on subsequent calls to the next() method. The new node's link field will usually be a code reference again—ready to be invoked in its turn, if you choose to continue advancing through the list when you've dealt with the current (freshly created) element.
Dominus describes the code reference instances as a promise to compute the next and subsequent elements whenever the user actually needs them.
If you ever reach a point in your program when you will never again need some of the early elements of the infinite list, you can just forget them by reassigning the list pointer to refer to the first element that you might still need and letting Perl's garbage collection deal with the predecessors. In this way, you can use a potentially huge number of elements of the list without requiring that they all fit in memory at the same time. This is similar to processing a file by reading it a line at a time, forgetting previous lines as you go along.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Cost of Traversal
Finding an element that is somewhere on a linked list can be a problem. All you can do is to scan through the list until you find the element you want: an O(N) process.
You can avoid the long search if you keep the list in order so that the item you will next use is always at the front of the list. Sometimes that works very well, but sometimes it just shifts the problem. To keep the list in order, new items must be inserted into their proper place. Finding that proper place, unless it is always near an end of the list, requires a long search through the list—just what we were trying to avoid by ordering entries.
If you break the list into smaller lists, the smaller lists will be faster to search. For example, a personal pocket address book provides alphabetic index tabs that separate your list of addresses into 26 shorter lists.
Dividing the list into pieces only postpones the problem. An unorganized address list becomes hard to use after a few dozen entries. The addition of tabbed pages will allow easy handling of a few hundred entries, about ten times as many. (Twenty-six tabbed pages does not automatically mean you are 26 times as efficient. The book becomes hard to use when the popular pages like S or T become long, while many of the less heavily used pages would still be relatively empty.) But there is another data structure that remains neat and extensible: a binary tree.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Binary Trees
A binary tree has elements with pointers, just like a linked list. However, instead of one link to the next element, it has two, called left and right.
In the address book, turning to a page with an index tab reduces the number of elements to be examined by a significant factor. But after that, subsequent decisions simply eliminate one element from consideration; they don't divide the remaining number of elements to search. Binary trees offer a huge speed-up in retrieving elements because the program makes a choice as it examines every element. With binary trees, every decision removes an entire subtree of elements from consideration.
To proceed to the next element, the program has to decide which of these two links to use. Usually, the decision is made by comparing the value in the element with the value that you are searching for. If the desired value is less, take the left link; if it is more, take the right link. Of course, if it is equal, you are already at the desired element. Figure 3-8 shows how our list of square numbers might be arranged in a binary tree. A word of caution: computer scientists like to draw their trees upside down, with the root at the top and the tree growing downwards. You can spot budding computer scientists by the fact that when other kids climb trees, they reach for a shovel.
Figure 3-8: Binary tree
Suppose you were trying to find Macdonald in an address book that contained a million names. After choosing the M "page" you have only 100,000 names to search. But, after that, it might take you 100,000 examinations to find the right element.
If the address book were kept in a binary tree, it would take at most four checks to get to a branch containing less than 100,000 elements. That seems slower than jumping directly to the M "page", but you continue to halve the search space with each check, finding the desired element with at most 20 additional checks. The reductions combine so that you only need to do checks.
In the 2,000-page Toronto phone book (with about 1,000,000 names), four branches take you to the page "Lee" through "Marshall." After another six checks, you're searching only Macdonalds. Ten more checks are required to find the right entry—there are a lot of those Macdonalds out there, and the Toronto phone book does not segregate those myriad MacDonalds (capital D). Still, all in all, it takes only 20 checks to find the name.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Heaps
A binary heap is an interesting variation on a binary tree. It is used when the only important operations are (1) finding (and removing) the smallest item in a collection and (2) adding additional elements to the collection. In particular, it does not support accessing items in random order. Focusing on doing a single task well allows a heap to be more efficient at finding the smallest element.
A heap differs from a standard binary tree in one crucial way: the ordering principle. Instead of completely ordering the entire tree, a heap requires only that each node is less than either of its subnodes. A heap imposes no particular order on the subnodes. It is sorted from the leaves toward the root, and a parent is always smaller than a child, but there is no order specified between siblings. This means you are not able to find a particular node without searching the entire tree; if a node is not the root, you have no way to decide whether to go left or right.
So use a heap only if you won't be using it to look for specific nodes (though you might tolerate rare searches, or maintain external info for finding elements). So why would you use a heap? If you are always interested only in the smallest value, it is obtained in time and it can be removed and the heap updated in time. Since you don't keep the heap's tree fully ordered, operations on the heap can be carried out faster. We will see heaps used as a component of many algorithms through the rest of this book.
One example of heaps is the list of tasks to be executed by an operating system. The OS will have many processes, some of which are ready to be run. When the OS is able to run a process, it would like to quickly choose the highest priority process that is ready. Keeping the available processes fully sorted would accomplish this, of course, but much of that sorting effort would be wasted. The first two or three processes are likely to be run in order, but as they are running, external events will make additional processes ready to run and those processes could easily be higher in priority than any of the processes that are already waiting to run. Perhaps one process will kill other processes; they then will have to be removed from their position in the middle of the queue.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Binary Heaps
We'll show a relatively simple heap implementation algorithm first: binary heap. There are faster algorithms, but the simple heap algorithm will actually be more useful if you want to include some heap characteristics within another data structure. The faster algorithms—the binomial heap and the Fibonacci heap—are more complicated. We have coded them into modules that are available from CPAN. Their interface is described a little later. The following table (taken from Cormen et al.) compares the performance of the three forms of heap:
Binary Heap (worst case)
Binomial Heap (worst case)
Fibonacci Heap (amortized)
create empty heap
θ(1)
θ(1)
θ(1)
insert new element
θ(log N)
θ(log N)
θ(1)
view minimum
θ(1)
θ(log N)
θ(1)
extract minimum
θ(log N)
θ(log N)
θ(log N)
union two heaps
θ(N)
θ(log N)
θ(1)
decrease key
θ(log N)
θ(log N)
θ(1)
delete element
θ(log N)
θ(log N)
θ(log N)
Note that the amortized bounds for Fibonacci heap are not worst-case bounds. Some of the θ(1) operations can take θ(log N) time, but that happens rarely enough that the average time is guaranteed to be θ(1) even for those operations.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Janus Heap
We came up with an interesting augmentation of the binary heap. It was prompted by considering how to provide a heap that limited the maximum number of elements that would ever be stored in the heap. When an attempt to add a new element was made to a full heap, the largest element would be discarded to make room (if it was larger than the provided element). But the heap is organized to provide easy access to the smallest element, not the largest! Our solution was to heap-order the array toward its tail end, using the inverse comparison, to find the largest element. Since the heap has two heads, we called it Janus heap. While it does solve the original desire for a bounded heap, an attempt to use it to sort the entire array failed—it is quite easy to find arrays that are heap-ordered from both ends but not fully sorted, e.g., the array ( 1, 3, 2, 4 ). There are unexplored possibilities for further development here—applying bidirectional heap ordering to slices of the full array seems to be worth examining, for example.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Heap Modules
The CPAN has three different implementations of heaps, written by John Macdonald. The first, Heap::Binary, uses the array and computed links described earlier. The other two, Heap::Binomial and Heap::Fibonacci, use separate nodes with links of varying complexities. Both of them use a separate structure for each element in the heap instead of sharing a common array as is done with binary heaps and use an asymmetric hierarchy instead of a fully balanced binary tree. This is advantageous because merging multiple heaps is much faster, and Fibonacci heaps delay many of the operations and perform a number of them together, making the amortized cost instead. The actual algorithms implemented are described in detail in the book Introduction to Algorithms, by Cormen, Leiserson, and Rivest.
All three modules use a common interface, so you can switch from one to another simply by changing which package you load with use and specify for the new() function. In practice, if you need to use one of these modules (rather than managing existing arrays as described earlier) you will be best off using Heap::Fibonacci. There are two possible exceptions. One is if your problem is small enough that the time required to load the larger Fibonacci package is significant. The other is if your problem is precisely the wrong size for the memory management of your operating system: the extra memory requirements of the Heap::Fibonacci causes significant degradation, but Heap::Binary is small enough that no degradation occurs. Neither case is especially likely, so use Heap::Fibonacci.
The interface used is as follows:
use Heap::Fibonacci;
# or Heap::Binary or Heap::Binomial

$heap = new Heap::Fibonacci;
# or Heap::Binary or Heap::Binomial

# Add a value (defined below) into the heap.
$heap->add($val);

# Look at the smallest value.
$val = $heap->minimum;

# Remove the smallest value.
$val = $heap->extract_minimum;

# Merge two heaps - $heap2 will end up empty; all of its
# elements will be merged into $heap.
$heap->absorb($heap2);

# Two operations on an element:
#  1. Decrease an item's value.
$val->val($new_value);
$heap->decrease_key($val);

#  2. Remove an element from the heap.
$heap->delete($val);
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Future CPAN Modules
A future release of the Heaps module will provide the ability to inherit the heap forms in an ISA arrangement. That will allow user-provided elements to be put directly onto the heap instead of having to use the heap method to connect the user data structure to a separate Elem structure used to determine its heap order. Additionally, the routines to apply binary heap ordering to a user-provided array will be put in a separate module called Array::Heap.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Sorting
The Librarian had seen many weird things in histime,butthathadtobethe57thstrangest. [footnote:hehadatidymind]
—Terry Pratchett, Moving Pictures
Sorting—the act of comparing and rearranging a collection of items—is one of the most important tasks computers perform. Sorting crops up everywhere; whenever you have a collection of items that need to be processed in a particular order, sorting helps you do it quickly.
In this chapter, we will explain what sorting is, how to do it efficiently using Perl's own sort function, what comparing actually means, and how you can code your own sort algorithms with Perl.
Sorting seems so simple. Novices don't see why it should be difficult, and experts know that there are canned solutions that work very well. Nevertheless, there are tips that will speed up your sorts, and traps that will slow them down. We'll explore them in this section. But first, the basics.
As in the two previous chapters, we'll use addresses for our demonstrations. Addresses are an ideal choice, familiar to everyone while complex enough to demonstrate the most sophisticated attributes of data structures and algorithms.
On to sorting terminology. The items to be sorted are called records; the parts of those items used to determine the order are called keys or sometimes fields. The difference is subtle. Sometimes the keys are the records themselves, but sometimes they are just pieces of the records. Sometimes there is more than one key.
Consider three records from a telephone book:
Munro, Alice     15 Brigham Road          623-2448
Munro, Alice     48 Hammersley Place      489-1073
Munro, Alicia    62 Evergreen Terrace     623-6099
The last names are the primary keys because they are the first criterion for ordering entries. When two people have the same last name, the first names must be considered; those are the secondary keys. In the example above, even that isn't enough, so we need tertiary keys: the street addresses. The rest of the data is irrelevant to our sort and is often called
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
An Introduction to Sorting
Sorting seems so simple. Novices don't see why it should be difficult, and experts know that there are canned solutions that work very well. Nevertheless, there are tips that will speed up your sorts, and traps that will slow them down. We'll explore them in this section. But first, the basics.
As in the two previous chapters, we'll use addresses for our demonstrations. Addresses are an ideal choice, familiar to everyone while complex enough to demonstrate the most sophisticated attributes of data structures and algorithms.
On to sorting terminology. The items to be sorted are called records; the parts of those items used to determine the order are called keys or sometimes fields. The difference is subtle. Sometimes the keys are the records themselves, but sometimes they are just pieces of the records. Sometimes there is more than one key.
Consider three records from a telephone book:
Munro, Alice     15 Brigham Road          623-2448
Munro, Alice     48 Hammersley Place      489-1073
Munro, Alicia    62 Evergreen Terrace     623-6099
The last names are the primary keys because they are the first criterion for ordering entries. When two people have the same last name, the first names must be considered; those are the secondary keys. In the example above, even that isn't enough, so we need tertiary keys: the street addresses. The rest of the data is irrelevant to our sort and is often called satellite data: here, the phone numbers. The index of this book contains primary and secondary keys, and an occasional tertiary key. The page numbers are satellite data.
We will explore several different sorting techniques in this chapter. Some are worse (usually time) than others (usually time). Some perform much better on certain input; others work well regardless of the input.
However, you may never need any of them, because Perl supplies you with a very fast built-in function: sort(). We will explore it first because we can use it to demonstrate what you need to think about when orchestrating a sort operation. The important thing to remember is that
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
All Sorts of Sorts
P