Chapter 4. Sorting

The Librarian had seen many weird things in his time, but that had to be the 57th strangest. [footnote: he had a tidy mind]

Terry Pratchett, Moving Pictures

Sorting—the act of comparing and rearranging a collection of items—is one of the most important tasks computers perform. Sorting crops up everywhere; whenever you have a collection of items that need to be processed in a particular order, sorting helps you do it quickly.

In this chapter, we will explain what sorting is, how to do it efficiently using Perl’s own sort function, what comparing actually means, and how you can code your own sort algorithms with Perl.

An Introduction to Sorting

Sorting seems so simple. Novices don’t see why it should be difficult, and experts know that there are canned solutions that work very well. Nevertheless, there are tips that will speed up your sorts, and traps that will slow them down. We’ll explore them in this section. But first, the basics.

As in the two previous chapters, we’ll use addresses for our demonstrations. Addresses are an ideal choice, familiar to everyone while complex enough to demonstrate the most sophisticated attributes of data structures and algorithms.

On to sorting terminology. The items to be sorted are called records; the parts of those items used to determine the order are called keys or sometimes fields. The difference is subtle. Sometimes the keys are the records themselves, but sometimes they are just pieces of the records. Sometimes there is more than one key.

Consider three records from a telephone book:

Munro, Alice     15 Brigham Road          623-2448
Munro, Alice     48 Hammersley Place      489-1073
Munro, Alicia    62 Evergreen Terrace     623-6099

The last names are the primary keys because they are the first criterion for ordering entries. When two people have the same last name, the first names must be considered; those are the secondary keys. In the example above, even that isn’t enough, so we need tertiary keys: the street addresses. The rest of the data is irrelevant to our sort and is often called satellite data: here, the phone numbers. The index of this book contains primary and secondary keys, and an occasional tertiary key. The page numbers are satellite data.

We will explore several different sorting techniques in this chapter. Some are worse (usually O(N²) time) than others (usually O(N log N) time). Some perform much better on certain input; others work well regardless of the input.

However, you may never need any of them, because Perl supplies you with a very fast built-in function: sort(). We will explore it first because we can use it to demonstrate what you need to think about when orchestrating a sort operation. The important thing to remember is that sort is often—but not always—the best possible solution.

Perl’s sort Function

Under the hood, Perl’s sort() function uses the quicksort algorithm, which we’ll describe later in the chapter. This is a standard sorting algorithm, provided by most operating systems as qsort(3).[12] In Versions 5.004_05 and higher, Perl uses its own quicksort implementation instead of the one provided by the operating system. Two primary motivations were behind this change. First, the implementation has been highly optimized for Perl’s particular uses. Second, some vendors’ implementations are buggy and cause errant behavior, sometimes even causing programs to crash.

sort accepts two parameters: a sorting routine and the list of items to sort. The sorting routine can be expressed as a block of code or the name of a subroutine defined elsewhere in the program, or you can omit it altogether. If you do provide a sorting routine, it’s faster to provide it as a block than as a subroutine. Here’s how to provide a subroutine:

@sorted = sort my_comparison @array;

sub my_comparison {
    if    ( $a > $b ) { return  1 }
    elsif ( $b > $a ) { return -1 }
    else              { return  0 }
}

Here’s the same operation, but with the sorting routine expressed as a block:

@sorted = sort { if    ( $a > $b ) { return  1 }
                 elsif ( $b > $a ) { return -1 }
                 else              { return  0 } } @array;

Each of these code snippets places a copy of @array in @sorted, sorted by the criterion we expressed in the sorting routine. The original @array is unchanged. Every sorting routine, whether it’s a subroutine or an actual block, is implicitly given two special variables: $a and $b. These are the items to be compared. Don’t modify them, ever. They are passed by reference, so changing them changes the actual list elements. Changing $a and $b midsort works about as well as changing your tires mid-drive.

The sorting routine must return a number meeting these criteria:

  • If $a is less than $b, the return value should be less than zero.

  • If $a is greater than $b, the return value should be greater than zero.

  • If $a is equal to $b, the return value should be exactly zero.

As we hinted at before, the sorting routine is optional:

@sorted = sort @array;

This sorts @array in ASCII order, which is sometimes what you want—not always.

ASCII Order

Perl’s default comparison rule is ASCII ordering.[13] Briefly, this means:

control characters <
most punctuation <
numbers <
uppercase letters <
lowercase letters

The complete ASCII table is available in Appendix B.
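
A quick demonstration of this ordering (our own example; the strings are arbitrary):

@sorted = sort qw( apple Apple 42 +12 _private );
print "@sorted\n";

This prints:

+12 42 Apple _private apple

because + and the digits precede the uppercase letters, and _ falls between the uppercase and lowercase ranges.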

Numeric Order

ASCII order won’t help you to sort numbers. You’ll be unpleasantly surprised if you attempt the following:

@array = qw( 1234 +12 5 -3 );
@sorted = sort @array;
print "sorted = @sorted\n";

This produces the strange result:

sorted = +12 -3 1234 5

This is a correct ASCII ordering. ASCII order is very methodical: it always looks at the keys one character at a time, starting from the beginning. As soon as differing ASCII values for those characters are found, the comparison rule is applied. For example, when comparing 1234 to 5, 1234 is smaller because 1 is less than 5. That’s one of the three reasons why ASCII is bad for comparing numbers:

  1. Numbers can start with a + or -. They can also have an e followed by another + or -, or nothing at all, and then some digits. Perl numbers can even have underscores in them to facilitate legibility: one million can be written as 1000000 or 1e6 or +1e+6 or 1_000_000.

  2. If you’re going to look at numbers character-by-character, then you need to look at all of the digits. Quick, which is bigger, 1345978066354223549678 or 926534216574835246783?

  3. Length isn’t good either: 4 is bigger than 3.14, which is bigger than 5e-100.

Fortunately, it’s easy to have Perl sort things in numeric order. You can just subtract $b from $a, or use the more efficient Perl operator designed specifically for comparing numbers: the so-called spaceship operator, <=>.

You can sort numbers as follows:

@sorted_nums = sort { $a <=> $b } @unsorted;

We can use the <=> operator in our example, as follows:

@array = qw(1234 +12 5 -3);
@sorted_nums = sort { $a <=> $b } @array;
print "sorted_nums = @sorted_nums\n";

This produces the result we want:

sorted_nums = -3 5 +12 1234

Reverse Order: From Highest To Lowest

To sort an array from highest to lowest, just flip $a and $b. To order an array of words from highest ASCII value to lowest, you can say:

@words = sort { $b cmp $a } @words;

cmp is Perl’s string comparison operator, the counterpart of the numerical comparison operator, <=>. To sort an array of numbers from highest to lowest:

@numbers = sort { $b <=> $a } @numbers;

These examples also demonstrate something we haven’t yet seen: replacing an array with a sorted copy of itself. We’ve done away with the @sorted variable and simply stored the results in the original array.

Sort::Fields

If you don’t want to concoct your own sorting routines, you might be able to use Joseph N. Hall’s Sort::Fields module, available from CPAN. With it you can say convoluted things like “alphabetic sort on column 4, a numeric sort on column 1, and finally a reverse numeric sort on column 3.” You’d express this as follows:

use Sort::Fields;
print fieldsort [4, '1n', '-3n'], @data;

The alphabetic sort is an ASCII sort—unless you include the use locale statement, which we’ll discuss shortly. fieldsort() is just a wrapper for the module’s make_fieldsort() function, which returns a subroutine:

use Sort::Fields;
my $sort = make_fieldsort [4, '1n', '-3n'];
print $sort->( @data );

If you are going to perform several Sort::Fields operations using the same sorting rules, use make_fieldsort() directly because fieldsort() will call it each time. It’s faster to create the sorting subroutine once and reuse it later than to create it anew each time you call fieldsort(). The module also has stable versions of these functions: stable_fieldsort() and make_stable_fieldsort(). We’ll discuss stability in Section 4.2.

Sort::Versions

Software version numbers don’t sort like regular numbers. There can be several fields, separated by dots. The fields might also have letters. For example:

1a
1.1
1.2
1.2a
1.2.1
1.2.a
1.2.b
1.03

The module Sort::Versions, by Kenneth Albanowski, provides two subroutines: versions() and versioncmp(). The former is used as a sorting routine, the latter as a general function for comparing two Perl scalars as version numbers:

use Sort::Versions;
@releases = sort versions qw( 2.3 2.4 2.3.1 2.3.0 2.4b );

print "earlier" if versioncmp( "3.4", "3.4a" ) == -1;

Note: if you use underscores to enhance the readability of your “numbers”, like 5.004_05, you need to remove the underscores before attempting a numeric comparison. An aside about underscores: Perl recognizes and removes them only from literal numbers at compile time. If you say perl -e "print 1_000_000", Perl prints 1000000. However, Perl won’t do the same for strings: The underscores in $version = "5.004_05" stay put. So for sorting version numbers, you’ll want to remove them:

# Beware: tr/// operates on $_, an alias into @array, so this also
# removes the underscores from @array itself.
@releases = sort versions map { tr/_//d; $_ } @array;

This is a nuisance, but it’s necessary for backward compatibility: if Perl suddenly started parsing numbers after the underscore, thousands of existing scripts would break.

Dictionary Order

Dictionary order is another commonly used ordering. The strings are first transformed by removing everything except letters and numbers. Uppercase and lowercase variants are considered equal. These rules make words like re-evaluate, reevaluating, and Reevaluator sort close together. In ASCII order, they would be widely separated:

Reevaluator
Rembrandt
...
Zorro
...
chthonic
...
re-evaluate
rectangle
...
reevaluating

The difficulties don’t end here. In telephone books, finding people with names like De Lorean is troublesome. Is that under D or L? Similarly for abbreviations: should they be sorted according to the abbreviation itself or by the full name? Does IBM go between IAA and ICA or between Immigration and Ionization?

Further confusion arises from variations in spelling: Munro/Monroe, MacTavish/McTavish, Krysztof/Christoph, Peking/Beijing. In principle it would be nice to be able to find each pair at the same place when searching; a way to do this is shown in Section 9.3.1 in Chapter 9. Accommodating such a complicated criterion might introduce extra keys into the records—the primary key might even not be part of the original record at all!

Yet more fun occurs when the elements contain multibyte characters. In the world of ASCII, this never happens: every character takes up one byte. But in, say, Spanish, ch is a letter of its own, to be sorted between c and d: so chocolate follows color.[14] The international Unicode standard and Asian legacy standards define several different multibyte encodings. Especially nasty from the sorting viewpoint are those that have variable widths. For more information about different character encodings, see http://www.unicode.org/ and http://www.czyborra.com/.

A simple version (one that doesn’t handle quirky names, abbreviations, or multibyte letters) for dictionary order sorting follows. Remember, $a and $b must never ever be modified, so we make “dictionary versions” of the items to be compared: $da and $db.

@dictionary_sorted =
    sort {
        my $da = lc $a;         # Convert to lowercase.
        my $db = lc $b;
        $da =~ s/\W+//g;        # Remove all nonalphanumerics.
        $db =~ s/\W+//g;
        $da cmp $db;            # Compare.
     } @array;

There are at least two problems with the preceding code, however. They aren’t bugs, since the above sorting routine works correctly—sometimes.

Sorting Efficiency

The preceding program runs very slowly on long lists. Unnecessarily slowly. The problem is that the sorting routine is called every time two elements need to be compared. The same elements will enter the sorting routine several times, sometimes as $a and sometimes as $b. This in turn means that the transformation to the dictionary version will be performed again and again for each word, even though we should only need to do it once. Let’s illustrate this with a sort routine:

my @sorted =
    sort { my $cmp = $a cmp $b;
           $saw{ $a }++;
           $saw{ $b }++;
           print "a = $a, b = $b, cmp = $cmp, ",
                 "a is ",
                 $cmp < 0 ?
                   "smaller" : ( $cmp > 0 ? "bigger" : "equal" ),
                 " ",
                 $cmp ? "than" : "to", " b",
                 "\n";
           return $cmp
          } qw(you can watch what happens);

foreach ( sort keys %saw ) {
    print "$_ $saw{ $_ } times\n";
}

This displays the following:

a = you, b = can, cmp = 1, a is bigger than b
a = you, b = watch, cmp = 1, a is bigger than b
a = can, b = watch, cmp = -1, a is smaller than b
a = you, b = what, cmp = 1, a is bigger than b
a = watch, b = what, cmp = -1, a is smaller than b
a = you, b = happens, cmp = 1, a is bigger than b
a = what, b = happens, cmp = 1, a is bigger than b
a = watch, b = happens, cmp = 1, a is bigger than b
a = can, b = happens, cmp = -1, a is smaller than b
can 3 times
happens 4 times
watch 4 times
what 3 times
you 4 times

Every word is compared three or four times. If our list were larger, there would have been even more comparisons per word. For large lists or a computationally expensive sorting routine, the performance degradation is substantial.

There is a Perl trick for avoiding the unnecessary work: the Schwartzian Transform, named after Randal Schwartz. The basic idea of the Schwartzian Transform is this: take the list to be sorted and create a second list combining both the original value and a transformed value to be used for the actual sorting. After the sort, the new value is thrown away, leaving only the elements of the original list.[15]

The Schwartzian Transform is described in more detail later in this chapter, but here is some dictionary sorting code that uses it. Thanks to the transform, the dictionary order transformation is performed only once for each word.

use locale;

# Fill @array here.

@dictionary_sorted =
       map { $_->[0] }
           sort { $a->[1] cmp $b->[1] }
             map {
                   my $d = lc;          # Convert into lowercase.
                   $d =~ s/[\W_]+//g;   # Remove nonalphanumerics.
                   [ $_, $d ]           # [original, transformed]
                 }
           @array;

In this particular case we can do even better and eliminate the anonymous lists. Creating and accessing them is slow compared to handling strings, so this will speed up our code further:

use locale;

@dictionary_sorted =
    map { /^\w* (.*)/ }
       sort
          map {
               my $d = lc;          # Convert into lowercase.
               $d =~ s/[\W_]+//g;   # Remove nonalphanumerics.
               "$d $_"              # Concatenate new and original words.
              }
        @array;

We transform the original strings into new strings containing both the transformed version and the original version. Then we sort on those transformed strings, and finally snip off the sorting keys and the space in between them, leaving only the original strings. However, this technique only works under these conditions:

  • You have to be able to produce sort keys that sort correctly with string comparison. Integers work only if you add leading spaces or zeros to align them on the right (see the sketch below).

  • You have to be able to stringify and later destringify the data—the stringification must be exactly reversible. Floating point numbers and objects need not apply.

  • You have to be able to decouple the transformed sort key from the original data: in our sort, we did this by first destroying all [\W_] characters and then using such a character, the space, as a separator.

Now our dictionary sort is robust, accurate, and fast.
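
As an illustration of the first condition (our example, not from the original text), numeric keys can be made string-sortable by zero-padding them to a fixed width with sprintf() before concatenating them with the original value:

# Assumes nonnegative integers below 10**10.
@data = ( 42, 7, 100000, 512 );
@sorted =
    map  { /^\d{10} (.*)/ }               # Snip off the padded key.
    sort                                  # Plain string sort.
    map  { sprintf "%010d %s", $_, $_ }   # "paddedkey original"
    @data;
print "@sorted\n";                        # 7 42 512 100000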

The Schwartzian Transform

The Schwartzian Transform is a caching technique that lets you perform the time-consuming preprocessing stage of a sort only once. You can think of the Transform as a nested series of operations, modeled in Figure 4-1.

The structure of the Schwartzian Transform
Figure 4-1. The structure of the Schwartzian Transform

The map function transforms one list into another, element by element. We’ll use

@array = qw(opal-shaped opalescent Opalinidae);

as the list and the dictionary transformation from the previous section:

my $d = lc;            # Convert into lowercase.
$d =~ s/[\W_]+//g;
[ $_, $d ]

so that the Schwartzian Transform in our case looks like Figure 4-2.

The Schwartzian Transform for our example
Figure 4-2. The Schwartzian Transform for our example

As the first step in the operation, the list to be sorted:

opal-shaped opalescent Opalinidae

is transformed into another list by the innermost (rightmost) map:

[ 'opal-shaped', 'opalshaped' ]
[ 'opalescent',  'opalescent' ]
[ 'Opalinidae',  'opalinidae' ]

The old words are on the left; the transformed values are on the right. The actual sort is then performed using the new transformed values, on the right:[16]

[ 'opalescent',  'opalescent' ]
[ 'Opalinidae',  'opalinidae' ]
[ 'opal-shaped', 'opalshaped' ]

However, the desired sort results are the plain old elements, not these intermediate lists. These elements are retrieved by peeling away the now-useless transformed words with the outermost (leftmost) map:

opalescent Opalinidae opal-shaped

This is what ends up in @sorted.

Long duration caching

The Schwartzian Transform caches only for the duration of one sort. If you’re going to sort the same elements several times but with different orderings or with different subselections of the elements, you can use a different strategy for even greater savings: the sort keys can be precomputed and stored in a separate data structure, such as an array or hash:

# Initialize the comparison cache.

%sort_by = ();

foreach $word ( @full_list ) {
    $sort_by{ $word } =
        some_complex_time_consuming_function($word);
}

The %sort_by hash can then be used like this:

@sorted_list =
    sort
        { $sort_by{ $a } <=> $sort_by{ $b } }
        @partial_list;

This technique, computing derived values and storing them for later use, is called memoizing. The Memoize module, by Mark-Jason Dominus, described briefly in Section 1.2.5 in Chapter 1, is available on CPAN.
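
If the Memoize module is installed, the cache can be managed for you. A minimal sketch, assuming some_complex_time_consuming_function() is defined as above: memoize() replaces the named function with a caching wrapper, so each distinct argument is computed only once, no matter how often the sorting routine asks for it.

use Memoize;

# Wrap the expensive function with a transparent cache.
memoize( 'some_complex_time_consuming_function' );

@sorted_list =
    sort { some_complex_time_consuming_function( $a )
               <=>
           some_complex_time_consuming_function( $b ) }
        @partial_list;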

Deficiency: missing internationalization (locales)

ASCII contains the 26 letters familiar to U.S. readers, but not their exotic relatives:

déjà vu
façade
naïve
Schrödinger

You can largely blame computers for why you don’t often see the ï of naïve: for a long time, support for “funny characters” was nonexistent. However, writing foreign words and names correctly is a simple matter of courtesy. The graphical differences might seem insignificant but then again, so are the differences between 0 and O, or 1 and l. When spoken, a and ä may have completely different sounds, and the meanings of words can change when letters are replaced with an ASCII substitute. For example, stripping the diaereses from Finnish säästää (“to save”) leaves saastaa (“filth”).

These multicultural hardships are alleviated in part by locales. A locale is a set of rules represented by a language-country-encoding triplet. Locales are encoded as strings, for example fr_CA.ISO8859-1 for French-Canadian-ISO Latin 1.[17] The rules specify things like which characters are letters and how they should be sorted.

Earlier, we mentioned how multibyte characters can impact naïve sorting. Even single byte characters can present obstacles; for example, in Swedish å is sorted after z, and nowhere near a.

One way to refer to an arbitrary alphanumeric character regardless of locale is with the Perl regular expression metacharacter \w. And even that isn’t quite right because \w includes _. The reason for this is historical: _ is often used in computers as if it were a true letter, as parts of names that are really phrases, like_this. A rule of thumb is that \w matches Perl identifiers; [A-Z] matches only a range of 26 ASCII letters.

Even if we use \w, Perl still won’t treat the funny letters as true characters. The actual way of telling Perl to understand such letters is a long and system-dependent story. Please see the perllocale documentation bundled with Perl for details. For now, we’ll assume your operating system has locale support installed and that your own personal locale setup is correct. If so, all Perl needs is the locale pragma placed near the beginning of your script:

use locale;

This tells Perl to use your locale environment to decide which characters are letters and how to order them, among other things. We can update our sorting program to handle locales as follows:

use locale;

# Fill @array here...

@dictionary_sorted =
    sort {
        my $da = lc $a;         # Translate into lowercase.
        my $db = lc $b;
        $da =~ s/[\W_]+//g;     # Remove all nonalphanumerics.
        $db =~ s/[\W_]+//g;
        $da cmp $db;            # Compare.
     } @array;

print "@dictionary_sorted";

Sort::ArbBiLex

Often, vendor-supplied locales are lacking, broken, or completely missing. In this case, the Sort::ArbBiLex module by Sean M. Burke comes in handy. It lets you construct arbitrary bi-level lexicographic sort routines that specify in great detail how characters and character groups should be sorted. For example:

use Sort::ArbBiLex;

*Swedish_sort = Sort::ArbBiLex::maker(
  "a A
   o O
   ä Ä
   ö Ö
  "
);
*German_sort = Sort::ArbBiLex::maker(
  "a A
   ä Ä
   o O
   ö Ö
  "
);
@words = qw(Möller Märtz Morot Mayer Mortenson Mattson);
foreach (Swedish_sort(@words)) { print "på svenska:  $_\n" }
foreach (German_sort (@words)) { print "auf Deutsch: $_\n" }

This prints:

på svenska:  Mayer
på svenska:  Mattson
på svenska:  Morot
på svenska:  Mortenson
på svenska:  Märtz
på svenska:  Möller
auf Deutsch: Mayer
auf Deutsch: Mattson
auf Deutsch: Märtz
auf Deutsch: Morot
auf Deutsch: Mortenson
auf Deutsch: Möller

Notice how Märtz and Möller are sorted differently.

See for yourself: use the Benchmark module

How substantial are the savings of the Schwartzian Transform? You can measure phenomena like this yourself with the Benchmark module (see Section 1.2.2 in Chapter 1 for more information). We will use Benchmark::timethese() to benchmark with and without the Schwartzian Transform:

use Benchmark;

srand; # Randomize.
       # NOTE: for Perls < 5.004 
       # use srand(time + $$ + ($$ << 15)) for better results

# Generate a nice random input array.
@array = reverse 'aaa'..'zaz';

# Mutate the @array.
for ( @array ) {
     if (rand() < 0.5) {   # Randomly capitalize.
         $_ = ucfirst;
     }
     if (rand() < 0.25) {  # Randomly insert underscores.
         substr($_, rand(length), 0) = '_';
     }
     if (rand() < 0.333) { # Randomly double.
         $_ .= $_;
     }
     if (rand() < 0.333) { # Randomly mirror double.
         $_ .= reverse $_;
     }
     if (rand() > 1/length) { # Randomly delete characters.
         substr($_, rand(length), rand(length)) = '';
     }
}

# timethese() comes from Benchmark.

timethese(10, {
    'ST' =>
    '@sorted =
        map { $_->[0] }
            sort { $a->[1] cmp $b->[1] }
                map { # The dictionarization.
                  my $d = lc;
                  $d =~ s/[\W_]+//g;
                  [ $_, $d ]
                }
                @array',
   'nonST' =>
   '@sorted =
       sort { my ($da, $db) = ( lc( $a ), lc( $b ) );
              $da =~ s/[\W_]+//g;
              $db =~ s/[\W_]+//g;
              $da cmp $db;
            }
            @array'
    });

We generate a reasonably random input array for our test. On one particular machine,[18] this code produces the following:

Benchmark: timing 10 iterations of ST, nonST...
        ST: 22 secs (19.86 usr  0.55 sys = 20.41 cpu)
     nonST: 44 secs (43.08 usr  0.15 sys = 43.23 cpu)

The Schwartzian Transform is more than twice as fast.

The Schwartzian Transform can transform more than strings. For instance, here’s how you’d sort files based on when they were last modified:

@modified =
        map { $_->[0] }
            sort { $a->[1] <=> $b->[1] }
                # -M is when $_ was last modified
                map { [ $_, -M ] }
                    @filenames;

Sorting Hashes Is Not What You Might Think

There is no such thing as a sorted hash. To be more precise: sorting a simple hash is unthinkable. However, you can create a complex hash that allows for sorting with tie.

In Perl, it is possible to tie arrays and hashes so that operations like storing and retrieving can trigger special operations, such as maintaining order within a hash. One example is the BTREE method for sorted, balanced binary trees, available in the DB_File module bundled with the Perl distribution and maintained by Paul Marquess, or the Tie::IxHash module by Gurusamy Sarathy available from CPAN.
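
For example, here is a minimal sketch (our own, not from the original text) of a BTREE-tied hash; with undef as the filename, DB_File keeps the database in memory, and the keys always come back in sorted order:

use DB_File;
use Fcntl;

my %sorted;
tie %sorted, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "tie failed: $!";

$sorted{$_} = 1 foreach qw(pear apple orange);
print join( " ", keys %sorted ), "\n";   # apple orange pear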

But back to simple hashes: As you know, a hash is a list of key-value pairs. You can find a value by knowing its key—but not vice versa. The keys are unique; the values need not be. Let’s look at the bookshelf of a science fiction buff. Here are the number of books (the values) for each author (the keys):

%books = ("Clarke" => 20, "Asimov" => 25, "Lem" => 20);

You can walk through this hash in “hash order” with Perl’s built-in keys, values, and each operators, but that’s not really a sorted hash. As was mentioned in Chapter 2, the internal hash ordering is determined by Perl so that it can optimize retrieval. This order changes dynamically as elements are added and deleted. You can, however, walk through the keys in sorted order:

foreach $author ( sort keys %books ) {
    print "author = $author, books = $books{$author}\n";
}

You can also walk through the hash in the order of the values. But be careful, since the values aren’t guaranteed to be unique:

foreach $author ( sort { $books{ $a } <=> $books{ $b } } keys %books ) {
    print "author = $author, ";
    print "books = $books{$author}\n";
}

As you can see, the keys aren’t sorted at all:

author = Lem, books = 20
author = Clarke, books = 20
author = Asimov, books = 25

We can make sort adjudicate ties (that is, when <=> yields 0). When that happens, we’ll resort to an alphabetical ordering (cmp) of the author names:

foreach $author ( sort {
                         my $numcmp = $books{ $a } <=> $books{ $b };
                         return $numcmp if $numcmp;
                         return $a cmp $b;
                       } keys %books ) {
    print "author = $author, ";
    print "books = $books{$author}\n";
}

This outputs:

author = Clarke, books = 20
author = Lem, books = 20
author = Asimov, books = 25

Note that we didn’t do this: sort { $a <=> $b } values %books—and for a good reason: it would make no sense, because there’s no way to retrieve the key given the value.

It is possible to “reverse” a hash, yielding a new hash where the keys become values and the values become keys. You can do that with hashes of lists or, more precisely, a hash of references to lists. We need lists because a given hash might not be a one-to-one mapping. If two different keys have the same value, it’s a one-to-many mapping.

%books = ("Clarke" => 20, "Asimov" => 25, "Lem" => 20);
%books_by_number = ();

while ( ($key, $value) = each %books ) {
    push @{ $books_by_number{ $value } }, $key;
}

foreach $number ( sort { $a <=> $b } keys %books_by_number ) {
    print "number = $number, ";
    print "authors = @{ $books_by_number{ $number } }\n";
}

This displays:

number = 20, authors = Clarke Lem
number = 25, authors = Asimov

After all this talk about the trickiness involved in sorting hashes, prepare yourself for the horror that occurs if you mistakenly try to sort a hash directly. Had we tried %torn_books = sort %books;, we would have ended up with this:

Clarke => 'Lem',
20     => 20,
25     => 'Asimov'

Clarke has written “Lem” books, and 25 has written “Asimov” books?

So don’t do that.

All Sorts of Sorts

Perl’s own sort is very fast, and it’s useful to know why it’s fast—and when it’s not. Eventually, you’ll stumble upon situations in which you can improve performance by using some of the algorithms in this section. Here, we compare several families of sorting algorithms and describe the situations in which you’ll want to use them. The guiding light for choosing an algorithm is this: the more you know about your data, the better.

Sorting algorithms can scale well or poorly. An algorithm scales well when the running time of the sort doesn’t increase much as the number of elements increases. A poorly scaling algorithm is typically O(N²): when the number of elements doubles, the running time quadruples. For sorting, “scaling well” usually means O(N log N); we’ll call this log-linear.

In addition to their running times, sorting algorithms can be categorized by their stability and sensitivity. Stability refers to the fate of records with identical keys: a stable algorithm preserves their original order, while an unstable algorithm might not. Stability is a good thing, but it’s not vital; often we’ll want to sacrifice it for speed.
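
To see stability in action, consider sorting the phone book records from earlier by last name only (our own illustration, with the records compressed into comma-separated strings):

@records = ( "Munro,48 Hammersley", "Munro,15 Brigham", "Adams,1 Main" );
@by_name = sort { (split /,/, $a)[0] cmp (split /,/, $b)[0] } @records;
print "$_\n" foreach @by_name;

A stable algorithm is guaranteed to print Munro,48 Hammersley before Munro,15 Brigham, because that is their order in @records; an unstable one is free to swap them.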

Sensitive algorithms are volatile.[19] They react strongly (either very well or very poorly) to certain kinds of input data. Sensitive sorting algorithms that normally perform well might perform unexpectedly poorly on some hard-to-predict random order or a nearly sorted order or a reversed order. For some algorithms, the order of input does not matter as much as the distribution of the input. Insensitive algorithms are better because they behave more predictably.

In the remainder of this chapter all the algorithms sort strings. If you want numeric sorting, change the string operators to their numeric equivalents: gt should become >, eq should become ==, and so on. Alternatively, the subroutines could be implemented in a more general (but slower) way to accept a sorting routine as a parameter.
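
For instance, the selection sort shown later in this section could be generalized like this (a sketch of ours, not the form used below): the comparison subroutine is passed in as a code reference, just as with sort itself.

sub selection_sort_generic {
    my ( $cmp, $array ) = @_;

    for ( my $i = 0; $i < $#$array; $i++ ) {
        my $m = $i;   # The index of the minimum element so far.
        for ( my $j = $i + 1; $j < @$array; $j++ ) {
            $m = $j if $cmp->( $array->[ $j ], $array->[ $m ] ) < 0;
        }
        @$array[ $m, $i ] = @$array[ $i, $m ] unless $m == $i;
    }
}

# String sort:  selection_sort_generic( sub { $_[0] cmp $_[1] }, \@array );
# Numeric sort: selection_sort_generic( sub { $_[0] <=> $_[1] }, \@array );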

Unlike Perl’s sort, most of these algorithms sort arrays in place (also known as in situ), operating directly on their arguments instead of making copies. This is a major benefit if the arrays are large because there’s no need to store both the original array and the sorted one; you get an instant 50% savings in memory consumption. This also means you should provide your list as an array reference, not as a regular array. Passing references to subroutines avoids copying the array and is therefore faster.

We show graphs that compare the performance of these sorting techniques at the end of the chapter.

Quadratic Sorting Algorithms

Here we present the three most basic sorting algorithms. They also happen to be the three worst techniques for the typical use: sorting random data. The first of these three algorithms, selection sort, fares quite poorly as a general sorting algorithm but is good for finding the minimum and maximum of unordered data.

The next two quadratic sorts, bubble sort and insertion sort, are also poor choices for random data, but in certain situations they are the fastest of all.

If there are constraints in how data can be moved around, these two sorts might be the best choices. An analogy of this would be moving heavy boxes around or moving the armature of a jukebox to select the appropriate CD. In these cases, the cost of moving elements is very high.

Selection sort

The selection sort is the simplest sorting algorithm. Find the smallest element and put it in the appropriate place. Lather. Rinse. Repeat.

The first steps of selection sort: alternating minimum-finding scans and swaps
Figure 4-3. The first steps of selection sort: alternating minimum-finding scans and swaps

Figure 4-3 illustrates selection sort. The unsorted part of the array is scanned (as shown by the horizontal line), and the smallest element is swapped with the lowest element in that part of the array (as shown by the curved lines). Here’s how it’s implemented for sorting strings:

sub selection_sort {
    my $array = shift;

    my $i;      # The starting index of a minimum-finding scan.
    my $j;      # The running  index of a minimum-finding scan.

    for ( $i = 0; $i < $#$array ; $i++ ) {
        my $m = $i;             # The index of the minimum element.
        my $x = $array->[ $m ]; # The minimum value.

        for ( $j = $i + 1; $j < @$array; $j++ ) {
            ( $m, $x ) = ( $j, $array->[ $j ] ) # Update minimum.
              if $array->[ $j ] lt $x;
        }

        # Swap if needed.
        @$array[ $m, $i ] = @$array[ $i, $m ] unless $m == $i;
    }
}

We can invoke selection_sort() as follows:

@array = qw(able was i ere i saw elba);
selection_sort(\@array);
print "@array\n";
able elba ere i i saw was

Don’t use selection sort as a general-purpose sorting algorithm. It’s dreadfully slow, O(N²), which is a pity because it’s insensitive. (Note that this swap-based implementation is not stable, either: a long-distance swap can reorder equal elements.)

A short digression: pay particular attention to the last line in selection_sort(), where we use array slices to swap two elements in a single statement.

Minima and maxima

The selection sort finds the minimum value and moves it into place, over and over. If all you want is the minimum (or the maximum) value of the array, you don’t need to sort the rest of the values—you can just loop through the elements, an O(N) procedure. On the other hand, if you want to find the extremum multiple times in a rapidly changing data collection, use a heap, described in Section 3.8 in Chapter 3. Or, if you want a set of extrema (“Give me the ten largest”), use the percentile() function described in Section 4.2.2.4 later in this chapter.

For unordered data, minimum() and maximum() are simple to implement since all the elements must be scanned.

A more difficult issue is which comparison to use. Usually, the minimum and the maximum would be needed for numerical data; here, we provide both numeric and string variants. The s-prefixed versions are for string comparisons, and the g-prefixed versions are generic: they take a subroutine reference as their first parameter, and that subroutine is used to compare the elements. The return value of the subroutine must behave just like the comparison subroutine of sort: a negative value if the first argument is less than the second, a positive value if the first argument is greater than the second, and zero if they are equal. One critical difference: because it’s a regular subroutine, the arguments to be compared are $_[0] and $_[1] and not $a and $b.

The algorithms for the minimum are as follows:

sub min { # Numbers.
    my $min = shift;
    foreach ( @_ ) { $min = $_ if $_ < $min }
    return $min;
}

sub smin { # Strings.
    my $s_min = shift;
    foreach ( @_ ) { $s_min = $_ if $_ lt $s_min }
    return $s_min;
}

sub gmin { # Generic.
    my $g_cmp = shift;
    my $g_min = shift;
    foreach ( @_ ) { $g_min = $_ if $g_cmp->( $_, $g_min ) < 0 }
    return $g_min;
}

Here are the algorithms for the maximum:

sub max { # Numbers.
    my $max = shift;
    foreach ( @_ ) { $max = $_ if $_ > $max }
    return $max;
}

sub smax { # Strings.
    my $s_max = shift;
    foreach ( @_ ) { $s_max = $_ if $_ gt $s_max }
    return $s_max;
}

sub gmax { # Generic.
    my $g_cmp = shift;
    my $g_max = shift;
    foreach ( @_ ) { $g_max = $_ if $g_cmp->( $_, $g_max ) > 0 }
    return $g_max;
}

In the generic subroutines, you’ll notice that we invoke the user-provided subroutine as $code_reference->(arguments). That’s less punctuation-intensive than the equivalent &{$code_reference}(arguments).

If you want to know which element contains the minimum instead of the actual value, you can do that as follows:

sub mini {
    my $l = $_[ 0 ];
    my $n = @{ $l };
    return ( ) unless $n;        # Bail out if no list is given.
    my $v_min = $l->[ 0 ];       # Initialize indices.
    my @i_min = ( 0 );

    for ( my $i = 1; $i < $n; $i++ ) {
        if ( $l->[ $i ] < $v_min ) {
            $v_min = $l->[ $i ]; # Update minimum and
            @i_min = ( $i );     # reset indices.
        } elsif ( $l->[ $i ] == $v_min ) {
            push @i_min, $i;     # Accumulate minimum indices.
        }
    }

    return @i_min;
}

sub maxi {
    my $l = $_[ 0 ];
    my $n = @{ $l };
    return ( ) unless $n;        # Bail out if no list is given.
    my $v_max = $l->[ 0 ];       # Initialize indices.
    my @i_max = ( 0 );

    for ( my $i = 1; $i < $n; $i++ ) {
        if ( $l->[ $i ] > $v_max ) {
            $v_max = $l->[ $i ]; # Update maximum and
            @i_max = ( $i );     # reset indices.
        } elsif ( $l->[ $i ] == $v_max ) {
            push @i_max, $i;     # Accumulate maximum indices.
        }
    }

    return @i_max;
}

smini(), gmini(), smaxi(), and gmaxi() can be written similarly. Note that these functions should return arrays of indices instead of a single index since the extreme values might lie in several array locations:

# Index:    0  1  2  3  4  5  6  7  8  9 10 11
my @x = qw(31 41 59 26 59 26 35 89 35 89 79 32);

my @i_max = maxi(\@x);    # @i_max should now contain 7 and 9.
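
As a sketch of the pattern (one of the variants left unwritten above), here is smaxi(), which differs from maxi() only in its comparison operators:

sub smaxi {
    my $l = $_[ 0 ];
    my $n = @{ $l };
    return ( ) unless $n;        # Bail out if no list is given.
    my $v_max = $l->[ 0 ];       # Initialize indices.
    my @i_max = ( 0 );

    for ( my $i = 1; $i < $n; $i++ ) {
        if ( $l->[ $i ] gt $v_max ) {
            $v_max = $l->[ $i ]; # Update maximum and
            @i_max = ( $i );     # reset indices.
        } elsif ( $l->[ $i ] eq $v_max ) {
            push @i_max, $i;     # Accumulate maximum indices.
        }
    }

    return @i_max;
}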

Lastly, we present a general extrema-finding subroutine. It uses a generic sorting routine and returns the minima- or maxima-holding indices:

sub gextri {
   my $g_cmp = $_[ 0 ];
   my $l     = $_[ 1 ];
   my $n     = @{ $l };
   return ( ) unless $n;                # Bail out if no list is given.
   my $v_min = $l->[ 0 ];
   my $v_max = $v_min;                  # The maximum so far.
   my @i_min = ( 0 );                   # The minima indices.
   my @i_max = ( 0 );                   # The maxima indices.
   my $v_cmp;                           # The result of comparison.

   for ( my $i = 1; $i < $n; $i++ ) {
       $v_cmp = $g_cmp->( $l->[ $i ], $v_min );
       if ( $v_cmp < 0 ) {
           $v_min = $l->[ $i ];         # Update minimum and reset minima.
           @i_min = ( $i );
       } elsif ( $v_cmp == 0 ) {
           push @i_min, $i ;            # Accumulate minima if needed.
       } else {                         # Not minimum: maybe maximum?
           $v_cmp = $g_cmp->( $l->[ $i ], $v_max );
           if ( $v_cmp > 0 ) {
               $v_max = $l->[ $i ];     # Update maximum and reset maxima.
               @i_max = ( $i );
           } elsif ( $v_cmp == 0 ) {
               push @i_max, $i;         # Accumulate maxima.
           }
        }                               # Else neither minimum nor maximum.
    }
    return ( \@i_min, \@i_max );
}

This returns a list of two anonymous arrays (array references) containing the indices of the minima and maxima:

#           0  1  2  3  4  5  6  7  8  9 10 11
my @x = qw(31 41 59 26 59 26 35 89 35 89 79 32);

my ($i_min, $i_max) = gextri(sub { $_[0] <=> $_[1] }, \@x);

# @$i_min now contains 3 and 5.
# @$i_max now contains 7 and 9.

Remember that the preceding extrema-finding subroutines make sense only for unordered data. They make only one linear pass over the data—but they do that each time they are called. If you want to search the data quickly or repeatedly, see Section 3.8 in Chapter 3.

Bubble sort

The bubble sort has the cutest and most descriptive name of all the sort algorithms—but don’t be tempted by a cute name.

This sort makes multiple scans through the array, swapping adjacent pairs of elements if they’re in the wrong order, until no more swaps are necessary. If you follow an element as it propagates through the array, that’s the “bubble.”

The first steps of bubble sort: large elements bubble forward
Figure 4-4. The first steps of bubble sort: large elements bubble forward

Figure 4-4 illustrates the first full scan (stages a to g) and the first stages of the second scan (stages h and i).

sub bubblesort {
    my $array = shift;

    my $i;              # The initial index for the bubbling scan.
    my $j;              # The running index for the bubbling scan.
    my $ncomp = 0;      # The number of comparisons.
    my $nswap = 0;      # The number of swaps.

    for ( $i = $#$array; $i; $i-- ) {
        for ( $j = 1; $j <= $i; $j++ ) {
            $ncomp++;
            # Swap if needed.
            if ( $array->[ $j - 1 ] gt $array->[ $j ] ) {
                @$array[ $j, $j - 1 ] = @$array[ $j - 1, $j ];
                $nswap++;
            }
        }
    }
    print "bubblesort:  ", scalar @$array,
          " elements, $ncomp comparisons, $nswap swaps\n";
}

We have included comparison and swap counters, $ncomp and $nswap, for comparison with a variant of this routine to be shown later. The later variant greatly reduces the number of comparisons, especially if the input is sorted or almost sorted.

Avoid using bubble sort as a general-purpose sorting algorithm. Its worst-case performance is O(N²), and its average performance is one of the worst because it might traverse the list as many times as there are elements. True, the unsorted part of the list does get one element shorter each time, yielding the series of scan lengths N-1, N-2, ..., 2, 1, but their sum, N(N-1)/2, is still O(N²).

However, bubble sort has a very interesting property: for fully or almost fully sorted data it is the fastest algorithm of all. It might sound strange to sort sorted data, but it’s a frequent situation: suppose you have a ranked list of sports teams. Whenever teams play, their ranks change—but not by much. The rankings are always nearly sorted. To reduce the left and right bounds of the sorted area more quickly when the data is already mostly sorted, we can use the following variant:

sub bubblesmart {
    my $array = shift;
    my $start = 0;        # The start index of the bubbling scan.
    my $ncomp = 0;        # The number of comparisons.
    my $nswap = 0;        # The number of swaps.

    my $i = $#$array;

    while ( 1 ) {
        my $new_start;    # The new start index of the bubbling scan.
        my $new_end = 0;  # The new end index of the bubbling scan.

        for ( my $j = $start || 1; $j <= $i; $j++ ) {
            $ncomp++;
            if ( $array->[ $j - 1 ] gt $array->[ $j ] ) {
                @$array[ $j, $j - 1 ] = @$array[ $j - 1, $j ];
                $nswap++;
                $new_end   = $j - 1;
                $new_start = $j - 1 unless defined $new_start;
            }
        }
        last unless defined $new_start; # No swaps: we're done.
        $i     = $new_end;
        $start = $new_start;
    }
    print "bubblesmart: ", scalar @$array,
          " elements, $ncomp comparisons, $nswap swaps\n";
}

You can compare this routine and the original bubblesort with the following code:

@a = "a".."z";

# Reverse sorted, both equally bad.
@b = reverse @a;

# Few inserts at the end.
@c = ( @a, "a".."e" );

# Random shuffle.
srand();
foreach ( @d = @a ) {
    my $i = rand @a;
    ( $_, $d[ $i ] ) = ( $d[ $i ], $_ );
}

my @label = qw(Sorted Reverse Append Random);
my %label;
@label{\@a, \@b, \@c, \@d} = 0..3;
foreach my $var ( \@a, \@b, \@c, \@d ) {
    print $label[$label{$var}], "\n";
    bubblesort  [ @$var ];
    bubblesmart [ @$var ];
}

This will output the following (the number of comparisons at the last line will vary slightly):

Sorted
bubblesort:  26 elements, 325 comparisons, 0 swaps
bubblesmart: 26 elements, 25 comparisons, 0 swaps
Reverse
bubblesort:  26 elements, 325 comparisons, 325 swaps
bubblesmart: 26 elements, 325 comparisons, 325 swaps
Append
bubblesort:  31 elements, 465 comparisons, 115 swaps
bubblesmart: 31 elements, 145 comparisons, 115 swaps
Random
bubblesort:  26 elements, 325 comparisons, 172 swaps
bubblesmart: 26 elements, 279 comparisons, 172 swaps

As you can see, the number of comparisons is lower with bubblesmart() and significantly lower for already sorted data. This reduction in the number of comparisons does not come for free, of course: updating the start and end indices consumes cycles.

For sorted data, the bubble sort runs in linear time, O(N), because it quickly realizes that there is very little (if any) work to be done: fully sorted data requires no swaps at all. Additionally, if the size of the array is small, so is N². There is not a lot of work done in each of the steps, so bubble sort can be faster than an algorithm that does more work per step. This feature makes bubble sort very useful for hybrid sorts, which we’ll encounter later in the chapter.

Insertion sort

Insertion sort scans all elements, finds the smallest, and “inserts” it in its proper place. As each correct place is found, the remaining unsorted elements are shifted forward to make room, and the process repeats. A good example of insertion sort is inserting newly bought books into an alphabetized bookshelf. This is also the trick people use for sorting card hands: the cards are arranged according to their value one at a time.[20]

In Figure 4-5, steps a, c, and e find the minimums; steps b, d, and f insert those minimums into their rightful places in the array. insertion_sort() implements the procedure:

sub insertion_sort {
    my $array = shift;

    my $i;      # The initial index for the minimum element.
    my $j;      # The running index for the minimum-finding scan.

    for ( $i = 0; $i < $#$array; $i++ ) {
        my $m = $i;             # The final index for the minimum element.
        my $x = $array->[ $m ]; # The minimum value.

        for ( $j = $i + 1; $j < @$array; $j++ ) {
            ( $m, $x ) = ( $j, $array->[ $j ] ) # Update minimum.
              if $array->[ $j ] lt $x;
        }

        # The double-splice simply moves the $m-th element to be
        # the $i-th element.  Note: splice is O(N), not O(1).
        # As far as the time complexity of the algorithm is concerned
        # it makes no difference whether we do the block movement
        # using the preceding loop or using splice().  Still, splice()
        # is faster than moving the block element by element.
        splice @$array, $i, 0, splice @$array, $m, 1 if $m > $i;
    }
}

The first steps of insertion sort
Figure 4-5. The first steps of insertion sort

Do not use insertion sort as a general-purpose sorting algorithm. It has O(N²) worst-case performance, and its average performance is one of the worst of the sorting algorithms in this chapter. However, like bubble sort, insertion sort is very fast for sorted or almost sorted data, running in O(N) time, and for the same reasons. The two sorting algorithms are actually very similar: bubble sort bubbles large elements up through an unsorted area to the end, while insertion sort bubbles elements down through a sorted area to the beginning.

The preceding insertion sort code is actually optimized for already sorted data. If the $j loop were written like this:

# $small is the minimum element.
for ( $j = $i; $j > 0 && $array->[ --$j ] gt $small; ) { }
$j++ if $array->[ $j ] le $small;

sorting random or reversed data would slightly speed up (by a couple of percentage points), while sorting already sorted data would slow down by about the same amount.

One hybrid situation is especially appropriate for insertion sort: let’s say you have a large sorted array and you wish to add a small number of elements to it. The best procedure here is to sort the small group of newcomers and then merge them into the large array. Because both arrays are sorted, this insertion_merge() routine can merge them together in one pass through the larger array:

sub insertion_merge {
    my ( $large, $small ) = @_;

    my $merge;  # The merged result.
    my $i;      # The index to @$merge.
    my $l;      # The index to @$large.
    my $s;      # The index to @$small.

    $#$merge = @$large + @$small - 1; # Pre-extend.

    for ( ($i, $l, $s) = (0, 0, 0); $i < @$merge; $i++ ) {
        $merge->[ $i ] =
          $l < @$large &&
            ( $s == @$small || $large->[ $l ] < $small->[ $s ] ) ?
              $large->[ $l++ ] :
              $small->[ $s++ ] ;
    }

    return $merge;
}

Here’s how we’d use insertion_merge() to insert some primes into squares:

@large = qw( 1  4  9 16 25 36 49 64 81 100);
@small = qw( 2  5 11 17 23);
$merge = insertion_merge( \@large, \@small );
print "@{$merge}\n";
1 2 4 5 9 11 16 17 23 25 36 49 64 81 100

Shellsort

Shellsort is an advanced cousin of bubble sort. While bubble sort swaps only adjacent elements, shellsort swaps the elements over much longer distances. With each iteration, that distance shortens until it reaches one, and after that pass, the array is sorted. The distance is called the shell. The term isn’t so great a metaphor as one would hope; the sort is named after its creator, Donald Shell.

The shell spirals from the size of the array down to one element. That spiraling can happen via many paths. For instance, it might halve at each step:

..., 16, 8, 4, 2, 1

Or it might follow the series our own sample code uses below:

..., 31, 15, 7, 3, 1

No series is always the best: the optimal series must be customized for each input. Of course, figuring that out might take as long as the sort, so it’s better to use a reasonably well-performing default. Besides, if we really knew the input intimately, there would be even better choices than shellsort. More about that in Section 4.2.3.

In our sample code we will calculate the shell by starting with k_0 = 1 and repeatedly calculating k_{i+1} = 2*k_i + 1, resulting in the series 1, 3, 7, 15, .... We will use the series backwards, starting with the largest value that is smaller than the size of the array, and ending with 1:

sub shellsort {
    my $array = shift;

    my $i;              # The initial index for the bubbling scan.
    my $j;              # The running index for the bubbling scan.
    my $shell;          # The shell size.
    my $ncomp = 0;      # The number of comparisons.
    my $nswap = 0;      # The number of swaps.

    for ( $shell = 1; $shell < @$array; $shell = 2 * $shell + 1 ) {
        # Do nothing here, just let the shell grow.
    }

    do {
        $shell = int( ( $shell - 1 ) / 2 );
        for ( $i = $shell; $i < @$array; $i++ ) {
            for ( $j = $i - $shell;
                  $j >= 0 && ++$ncomp &&
                    $array->[ $j ] gt $array->[ $j + $shell ];
                  $j -= $shell ) {
                @$array[ $j, $j + $shell ] = @$array[ $j + $shell, $j ];
                $nswap++;
            }
        }
    } while $shell > 1;
    print "shellsort:   ", scalar @$array,
          " elements, $ncomp comparisons, $nswap swaps\n";
}

If we test shellsort alongside the earlier bubblesort() and bubblesmart() routines, we will see results similar to:

Sorted
bubblesort:  26 elements, 325 comparisons, 0 swaps
bubblesmart: 26 elements, 25 comparisons, 0 swaps
shellsort:   26 elements, 78 comparisons, 0 swaps
Reverse
bubblesort:  26 elements, 325 comparisons, 325 swaps
bubblesmart: 26 elements, 325 comparisons, 325 swaps
shellsort:   26 elements, 97 comparisons, 35 swaps
Append
bubblesort:  31 elements, 465 comparisons, 115 swaps
bubblesmart: 31 elements, 145 comparisons, 115 swaps
shellsort:   31 elements, 133 comparisons, 44 swaps
Random
bubblesort:  26 elements, 325 comparisons, 138 swaps
bubblesmart: 26 elements, 231 comparisons, 138 swaps
shellsort:   26 elements, 115 comparisons, 44 swaps

The first steps of shellsort
Figure 4-6. The first steps of shellsort

In Figure 4-6, the shell distance begins at 6, and the innermost loop makes shell-sized hops backwards in the array, swapping whenever needed. The shellsort() subroutine implements this sort.

The average performance of shellsort is very good, but somewhat hard to analyze; it is thought to be something like O(N (log N)²), or possibly O(N^1.25). The worst case is O(N^1.5). The exact performance characteristics of shellsort are difficult to analyze because they depend on the series chosen for $shell.

Log-Linear Sorting Algorithms

In this section, we’ll explore some O(N log N) sorts: mergesort, heapsort, and quicksort.

Mergesort

Mergesort is a divide-and-conquer strategy (see Section 1.3 in Chapter 1). The “divide” step literally divides the array in half. The “conquer” is the merge operation: the halved arrays are recombined to form the sorted array.

To illustrate these steps, assume we have only two elements in each subarray. Either the elements are already in the correct order, or they must be swapped. The merge step scans those two already sorted subarrays (which can be done in linear time), and from the elements picks the smallest and places it in the result array. This is repeated until no more elements remain in the two subarrays. Then, on the next iteration, the resulting larger subarrays are merged, and so on. Eventually, all the elements are merged into one array:

sub mergesort {
    mergesort_recurse ($_[0], 0, $#{ $_[0] });
}

sub mergesort_recurse {
    my ( $array, $first, $last ) = @_;

    if ( $last > $first ) {
        local $^W = 0;               # Silence deep recursion warning.
        my $middle = int(( $last + $first ) / 2);

        mergesort_recurse( $array, $first,       $middle );
        mergesort_recurse( $array, $middle + 1,  $last   );
        merge( $array, $first, $middle, $last );
    }
}

my @work; # A global work array.

sub merge {
    my ( $array, $first, $middle, $last ) = @_;

    my $n = $last - $first + 1;

    # Initialize work with relevant elements from the array.
    for ( my $i = $first, my $j = 0; $i <= $last; ) {
        $work[ $j++ ] = $array->[ $i++ ];
    }

    # Now do the actual merge.  Proceed through the work array
    # and copy the elements in order back to the original array.
    # $i is the index for the merge result, $j is the index in
    # first half of the working copy, $k the index in the second half.

    $middle = int(($first + $last) / 2) if $middle > $last;

    my $n1 = $middle - $first + 1;    # The size of the 1st half.

    for ( my $i = $first, my $j = 0, my $k = $n1; $i <= $last; $i++ ) {
        $array->[ $i ] =
            $j < $n1 &&
              ( $k == $n || $work[ $j ] lt $work[ $k ] ) ?
                $work[ $j++ ] :
                $work[ $k++ ];
    }
}

Notice how we silence warnings with local $^W = 0;. Silencing warnings is bad etiquette, but currently that’s the only way to make Perl stop groaning about the deep recursion of mergesort. If a subroutine calls itself more than 100 times and Perl is run with the -w switch, Perl gets worried and exclaims, Deep recursion on subroutine .... The -w switch sets $^W to true; we locally set it to false for the duration of the sort.

Mergesort is a very good sort algorithm. It scales well and is insensitive to the key distribution of the input: O(N log N). This is obvious because each merge is O(N), and repeatedly halving N elements takes O(log N) rounds. The bad news is that the traditional implementation of mergesort requires additional temporary space equal in size to the input array.

Mergesort’s recursion can be avoided easily by walking over the array with a merge window that starts at 2 and doubles its size at each iteration. The inner loop merges consecutive runs of the same size.

sub mergesort_iter ($) {
    my ( $array ) = @_;

    my $N    = @$array;
    my $Nt2  = $N * 2; # N times 2.
    my $Nm1  = $N - 1; # N minus 1.

    $#work = $Nm1;

    for ( my $size = 2; $size < $Nt2; $size *= 2 ) {
        for ( my $first = 0; $first < $N; $first += $size ) {
            my $last = $first + $size - 1;
            merge( $array,
                   $first,
                   int(($first + $last) / 2),
                   $last < $N ? $last : $Nm1 );
        }
    }
}

Heapsort

As its name suggests, the heapsort uses the heap data structure described in Section 3.8 in Chapter 3. In a sense, heapsort is similar to selection sort. It finds the largest element and moves it to the end. But the heap structure permits heapsort to avoid the expense of a full search to find each element, allowing the previously determined order to be used in subsequent passes.

use integer;
sub heapify;

sub heapsort {
    my $array = shift;

    foreach ( my $index = 1 + @$array / 2; $index--; ) {
        heapify $array, $index;
    }

    foreach ( my $last = @$array; --$last; ) {
        @{ $array }[ 0, $last ] = @{ $array }[ $last, 0 ];
        heapify $array, 0, $last;
    }
}

sub heapify {
    my ($array, $index, $last) = @_;

    $last = @$array unless defined $last;

    my $swap = $index;
    my $high = $index * 2 + 1;

    foreach ( my $try = $index * 2;
                 $try < $last && $try <= $high;
                 $try ++ ) {
        $swap = $try if $array->[ $try ] gt $array->[ $swap ];
    }

    unless ( $swap == $index ) {
        # The heap is in disorder: must reshuffle.
        @{ $array }[ $swap, $index ] = @{ $array }[ $index, $swap ];
        heapify $array, $swap, $last;
    }
}
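A quick usage sketch (sample data ours; note that heapify() compares with gt, so this version sorts strings):

my @gems = qw(pearl onyx amber jade);
heapsort( \@gems );
print "@gems\n";    # amber jade onyx pearl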

Heapsort is a nice overall algorithm. It is one of the fastest sorting algorithms, it scales well, and it is insensitive, yielding O(N log N) performance. Furthermore, the first element is available in O(N) time, and each subsequent element takes O(log N) time. If you only need the first k elements of a set of N, in order, you can sort them in O(N + k log N) time in general, and in O(N log k) time if k is known in advance, by maintaining a heap of only k elements.

Heapsort is unstable, but for certain data structures, particularly those used in graph algorithms (see Chapter 8), it is the sorting algorithm of choice.

Quicksort

Quicksort is a well-known divide-and-conquer algorithm. So well-known, in fact, that Perl uses it for implementing its own sort. Quicksort is a good compromise when no characteristics of the input are known.

The basic idea is to pick one element of the array and shuffle it to its final place. That element is known as the pivot, and the shuffling is known as partitioning. The pivot divides the array into two partitions (at some points three; more about this shortly). These two partitions are then recursively quicksorted. A moderately good first guess for the pivot is the last element, but that can lead into trouble with certain input data, as we’ll see.

The partitioning does all the work of comparing and exchanging the elements. Two scans proceed in parallel, one from the beginning of the array and the other from the end. The first scan continues until an element larger than the pivot is found. The second scan continues until an element smaller than the pivot is found. If the scans cross, both stop. If none of the conditions terminating the scans are triggered, the elements at the first and second scan positions are exchanged. After the scans, we exchange the element at the first scan and the pivot.

The partitioning algorithm is as follows:

  1. At Point 1 (see the partition() subroutine) the elements in positions $first..$i-1 are all less than or equal to the pivot, the elements in $j+1..$last-1 are all greater than or equal to the pivot, and the element in $last is equal to the pivot.

  2. At Point 2 the elements in $first..$i-1 are all less than or equal to the pivot, the elements in $j+1..$last-1 are all greater than or equal to the pivot, the elements in $j+1..$i-1 are all equal to the pivot, and the element at $last is equal to the pivot.

  3. At Point 3 we have a three way partitioning. The first partition contains elements that are less than or equal to the pivot; the second partition contains elements that are all equal to the pivot. (There must be at least one of these—the original pivot element itself.) The third partition contains elements that are greater than or equal to the pivot. Only the first and third partitions need further sorting.

The quicksort algorithm is illustrated in Figure 4-7.

The first steps of quicksort
Figure 4-7. The first steps of quicksort

First, let’s look at the partition subroutine:

sub partition {
    my ( $array, $first, $last ) = @_;

    my $i = $first;
    my $j = $last - 1;
    my $pivot = $array->[ $last ];

SCAN: {
        do {
            # $first <= $i <= $j <= $last - 1
            # Point 1.

            # Move $i as far as possible.
            while ( $array->[ $i ] le $pivot ) {
                $i++;
                last SCAN if $j < $i;
            }

            # Move $j as far as possible.
            while ( $array->[ $j ] ge $pivot ) {
                $j--;
                last SCAN if $j < $i;
            }

            # $i and $j did not cross over, so swap a low and a high value.
            @$array[ $j, $i ] = @$array[ $i, $j ];
        } while ( --$j >= ++$i );
    }
    # $first - 1 <= $j < $i <= $last
    # Point 2.

    # Swap the pivot with the first larger element (if there is one).
    if ( $i < $last ) {
        @$array[ $last, $i ] = @$array[ $i, $last ];
        ++$i;
    }

    # Point 3.

    return ( $i, $j );   # The new bounds exclude the middle.
}

You can think of the partitioning process as a filter: the pivot introduces a little structure to the data by dividing the elements into less-or-equal and greater-or-equal portions. After the partitioning, the quicksort itself is quite simple. We again silence the deep recursion warning, as we did in mergesort().

sub quicksort_recurse {
    my ( $array, $first, $last ) = @_;

    if ( $last > $first ) {
        my ( $first_of_last, $last_of_first, ) =
                                partition( $array, $first, $last );

        local $^W = 0;               # Silence deep recursion warning.
        quicksort_recurse $array, $first,         $last_of_first;
        quicksort_recurse $array, $first_of_last, $last;
    }
}

sub quicksort {
    # The recursive version is bad with BIG lists
    # because the function call stack gets REALLY deep.
    quicksort_recurse $_[ 0 ], 0, $#{ $_[ 0 ] };
}
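A usage sketch (sample data ours; partition() compares with le and ge, so this sorts strings):

my @metals = qw(gold silver bronze tin copper);
quicksort( \@metals );
print "@metals\n";    # bronze copper gold silver tin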

The performance of the recursive version can be enhanced by turning recursion into iteration; see the section “Removing recursion from quicksort” later in this chapter.

If you expect that many of your keys will be the same, try adding this before the return in partition():

# Extend the middle partition as much as possible.
++$i while $i <= $last  && $array->[ $i ] eq $pivot;
--$j while $j >= $first && $array->[ $j ] eq $pivot;

This is the possible third partition we hinted at earlier.

On average, quicksort is a very good sorting algorithm. But not always: if the input is fully sorted, nearly sorted, or reverse sorted, the algorithm spends a lot of effort exchanging and moving elements, and it becomes as slow as bubble sort on random data: O(N²).

This worst case can be avoided most of the time by techniques such as the median-of-three: Instead of choosing the last element as the pivot, sort the first, middle, and last elements of the array, and then use the middle one. Insert the following before $pivot = $array->[ $last ] in partition():

my $middle = int( ( $first + $last ) / 2 );

@$array[ $first, $middle ] = @$array[ $middle, $first ]
   if $array->[ $first ] gt $array->[ $middle ];

@$array[ $first, $last ] = @$array[ $last, $first ]
   if $array->[ $first ] gt $array->[ $last ];

# $array[$first] is now the smallest of the three.
# The smaller of the remaining two is the median of the three:
# move it to the end to be used as the pivot.
@$array[ $middle, $last ] = @$array[ $last, $middle ]
   if $array->[ $middle ] lt $array->[ $last ];

Another well-known shuffling technique is simply to choose the pivot randomly. This makes the worst case unlikely, and even if it does occur, the next time we choose a different pivot, it will be extremely unlikely that we again hit the worst case. Randomization is easy; just insert this before $pivot = $array->[ $last ]:

my $random = $first + rand( $last - $first + 1 );
@$array[ $random, $last ] = @$array[ $last, $random ];

With this randomization technique, any input gives an expected running time of O(N log N). We can say the randomized running time of quicksort is O(N log N). However, this is slower than median-of-three, as you’ll see in Figure 4-8 and Figure 4-9.

Removing recursion from quicksort

Quicksort uses a lot of stack space because it calls itself many times. You can avoid this recursion and save time by using an explicit stack. Using a Perl array for the stack is slightly faster than using Perl’s function call stack, which is what straightforward recursion would normally use:

sub quicksort_iterate {
    my ( $array, $first, $last ) = @_;
    my @stack = ( $first, $last );

    do {
        if ( $last > $first ) {
            my ( $first_of_last, $last_of_first ) =
                partition $array, $first, $last;

            # Push the larger partition; loop on the smaller one.
            if ( $last_of_first - $first > $last - $first_of_last ) {
                push @stack, $first, $last_of_first;
                $first = $first_of_last;
            } else {
                push @stack, $first_of_last, $last;
                $last = $last_of_first;
            }
        } else { 
            ( $first, $last ) = splice @stack, -2, 2;  # Double pop.
        }
    } while @stack;
}

sub quicksort_iter {
    quicksort_iterate $_[0], 0, $#{ $_[0] };
}

Instead of letting the quicksort subroutine call itself with the new partition limits, we push the new limits onto a stack using push and, when we’re done, pop the limits off the stack with splice. An additional optimizing trick is to push the larger of the two partitions onto the stack and process the smaller partition first. This keeps @stack shallow. The effect is shown in Figure 4-8.

Effect of the quicksort enhancements for random data
Figure 4-8. Effect of the quicksort enhancements for random data

As you can see from Figure 4-8, these changes don’t help if you have random data. In fact, they hurt. But let’s see what happens with ordered data.

Effect of the quicksort enhancements for ordered data
Figure 4-9. Effect of the quicksort enhancements for ordered data

The enhancements in Figure 4-9 are quite striking. Without them, ordered data takes quadratic time; with them, the log-linear behavior is restored.

In Figure 4-8 and Figure 4-9, the x-axis is the number of records, scaled to 1.0. The y-axis is the relative running time, 1.0 being the time taken by the slowest algorithm (bubble sort). As you can see, the iterative version provides a slight advantage, and the two shuffling methods slow down the process a bit. But for already ordered data, the shuffling boosts the algorithm considerably. Furthermore, median-of-three is clearly the better of the two shuffling methods.

Quicksort is common in operating system and compiler libraries. As long as the code developers sidestepped the stumbling blocks we discussed, the worst case is unlikely to occur.

Quicksort is unstable: records having identical keys aren’t guaranteed to retain their original ordering. If you want a stable sort, use mergesort.

Median, quartile, percentile

A common task in statistics is finding the median of the input data. The median is the element in the middle: it has as many elements less than itself as it has elements greater than itself. For example, the median of (5, 1, 8) is 5.

median() finds the median element. percentile() allows even more finely grained slicing of the input data; for example, percentile($array, 95) finds the element at the 95th percentile. The percentile() subroutine can be used to create subroutines like quartile() and decile().

We’ll use a worst-case linear algorithm, subroutine selection(), for finding the kth element, and build median() and further functions on top of it. The basic idea of the algorithm is first to find the median of the medians of small (5-element) partitions of the original array. Depending on how the wanted rank compares with that median of medians, we then either recurse into the smaller elements, return the median we just found, or recurse into the larger elements:

use constant PARTITION_SIZE => 5;

# NOTE 1: the $index in selection() is one-based, not zero-based as usual.
# NOTE 2: when $N is even, selection() returns the larger of
#         "two medians", not their average as is customary--
#         write a wrapper if this bothers you.

sub selection {
    # $array:   an array reference from which the selection is made.
    # $compare: a code reference for comparing elements,
    #           must return -1, 0, 1.
    # $index:   the wanted index in the array.
    my ($array, $compare, $index) = @_;

    my $N = @$array;

    # Short circuit for small partitions.
    return (sort { $compare->($a, $b) } @$array)[ $index-1 ]
         if $N <= PARTITION_SIZE;

    my $medians;

    # Find the median of each of the roughly $N/5 partitions.
    for ( my $i = 0; $i < $N; $i += PARTITION_SIZE ) {
        my $s =                 # The size of this partition.
            $i + PARTITION_SIZE < $N ?
                PARTITION_SIZE : $N - $i;
        my @s =                 # This partition sorted.
            sort { $compare->( $array->[ $i + $a ], $array->[ $i + $b ] ) }
                 0 .. $s-1;
        push @{ $medians },     # Accumulate the medians.
             $array->[ $i + $s[ int( $s / 2 ) ] ];
    }

    # Recurse to find the median of the medians.
    my $median = selection( $medians, $compare, int( @$medians / 2 ) );
    my @kind;

    use constant LESS    => 0;
    use constant EQUAL   => 1;
    use constant GREATER => 2;

    # Less-than    elements end up in @{$kind[LESS]},
    # equal-to     elements end up in @{$kind[EQUAL]},
    # greater-than elements end up in @{$kind[GREATER]}.
    foreach my $elem (@$array) {
        push @{ $kind[$compare->($elem, $median) + 1] }, $elem;
    }

    return selection( $kind[LESS], $compare, $index )
        if $index <= @{ $kind[LESS]  };

    $index -= @{ $kind[LESS] };

    return $median
        if $index <= @{ $kind[EQUAL] };
 
    $index -= @{ $kind[EQUAL] };

    return selection( $kind[GREATER], $compare, $index );
}

sub median {
    my $array = shift;
    return selection( $array,
                      sub { $_[0] <=> $_[1] },
                      int( @$array / 2 ) + 1 );
}

sub percentile {
    my ($array, $percentile) = @_;
    return selection( $array,
                      sub { $_[0] <=> $_[1] },
                      int( ( @$array * $percentile ) / 100 ) );
}

We can find the top decile of a range of test scores as follows:

@scores = qw(40 53 77 49 78 20 89 35 68 55 52 71);

print percentile(\@scores, 90), "\n";

This prints:

77
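NOTE 2 above suggested a wrapper if you want the customary average of the two middle elements for even-sized input; a possible sketch (the name median_avg is ours):

sub median_avg {
    my $array = shift;
    my $N = @$array;
    return median( $array ) if $N % 2;    # Odd: the plain median.

    # Even: average the two middle elements (ranks N/2 and N/2 + 1).
    my $lo = selection( $array, sub { $_[0] <=> $_[1] }, $N / 2     );
    my $hi = selection( $array, sub { $_[0] <=> $_[1] }, $N / 2 + 1 );
    return ( $lo + $hi ) / 2;
}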

Beating O(N log N)

All the sort algorithms so far have been “comparison” sorts—they compare keys with each other. It can be proven that comparison sorts cannot be faster than O(N log N): however you try to order the comparisons, swaps, and inserts, there will always be at least O(N log N) of them. Otherwise, you couldn’t collect enough information to perform the sort.

It is possible to do better. Doing better requires knowledge about the keys before the sort begins. For instance, if you know the distribution of the keys, you can beat O(N log N). You can even beat it knowing only the length of the keys. That’s what the radix sort does.

Radix sorts

There are many radix sorts. What they all have in common is that each uses the internal structure of the keys to speed up the sort. The radix is the unit of structure; you can think of it as the base of the number system used. Radix sorts treat the keys as numbers (even if they’re strings) and look at them digit by digit. For example, the string ABCD can be seen as a number in base 256 as follows: ord("A")*256**3 + ord("B")*256**2 + ord("C")*256 + ord("D").

The keys have to have the same length, because the radix algorithms walk through them digit by digit. If some keys were shorter than others, the algorithms would have no way of knowing whether a key really ended or just had zeroes at the end. Variable-length strings therefore have to be padded with zeroes (\x00) to equalize the lengths.
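A minimal padding sketch (ours, not from the original code): right-pad every key with \x00 to the length of the longest key before sorting, and strip the pads afterwards:

my $maxlen = 0;
for my $key (@array) { $maxlen = length $key if length $key > $maxlen }
$_ .= "\x00" x ( $maxlen - length $_ ) for @array;

radix_sort( \@array );

s/\x00+$// for @array;    # Remove the padding again.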

Here, we present the straight radix sort, which is interesting because of its rather counterintuitive logic: the keys are inspected starting from their ends. We’ll use a radix of 256 because it holds all 8-bit characters. We assume that all the keys are of equal length and consider one character at a time. (To consider n characters at a time, the keys would have to be zero-padded to a length evenly divisible by n.) For each pass, $from contains the results of the previous pass: 256 arrays, each containing all of the elements with that 8-bit value in the inspected character position. For the first pass, $from contains only the original array.

Radix sort is illustrated in Figure 4-10 and implemented in the radix_sort() subroutine as follows:

sub radix_sort {
    my $array = shift;

    my $from = $array;
    my $to;

    # All lengths expected equal.
    for ( my $i = length $array->[ 0 ] - 1; $i >= 0; $i-- ) {
        # A new sorting bin.
        $to = [ ];
        foreach my $card ( @$from ) {
            # Stability is essential, so we use push().
            push @{ $to->[ ord( substr $card, $i ) ] }, $card;
        }

        # Concatenate the bins.

        $from = [ map { @{ $_ || [ ] } } @$to ];
    }

    # Now copy the elements back into the original array.

    @$array = @$from;
}

The radix sort
Figure 4-10. The radix sort

We walk through the characters of each key, starting with the last. On each iteration, the record is appended to the “bin” corresponding to the character being considered. This operation maintains the stability of the original order, which is critical for this sort. Because of the way the bins are allocated, ASCII ordering is unavoidable, as we can see from the misplaced wolf in this sample run:

@array = qw(flow loop pool Wolf root sort tour);
radix_sort(\@array);
print "@array\n";
Wolf flow loop pool root sort tour

For you old-timers out there, yes, this is how card decks were sorted when computers were real computers and programmers were real programmers. The deck was passed through the machine several times, one round for each of the card columns in the field containing the sort key. Ah, the flapping of the cards...

Radix sort is fast: O(Nk), where k is the length of the keys, in bits. The price is the time spent padding the keys to equal length.

Counting sort

Counting sort works for (preferably not too sparse) integer data. It simply first establishes enough counters to span the range of integers and then counts the integers. Finally, it constructs the result array based on the counters.

sub counting_sort {
    my ($array, $max) = @_; # All @$array elements must be 0..$max.
    my @counter = (0) x ($max+1);
    foreach my $elem ( @$array ) { $counter[ $elem ]++ }
    return map { ( $_ ) x $counter[ $_ ] } 0..$max;
}
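A usage sketch (sample data ours); note that counting_sort() returns the sorted list instead of sorting in place:

my @data   = ( 3, 1, 4, 1, 5, 9, 2, 6 );
my @sorted = counting_sort( \@data, 9 );
print "@sorted\n";    # 1 1 2 3 4 5 6 9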

Hybrid sorts

Often it is worthwhile to combine sort algorithms, first using a sort that quickly and coarsely arranges the elements close to their final positions, like quicksort, radix sort, or mergesort. Then you can polish the result with a shell sort, bubble sort, or insertion sort—preferably the latter two because of their unparalleled speed for nearly sorted data. You’ll need to tune your switch point to the task at hand.

Bucket sort

Earlier we noted that inserting new books into a bookshelf resembles an insertion sort. However, if you’ve only just recently learned to read and suddenly have many books to insert into an empty bookcase, you need a bucket sort. With four shelves in your bookcase, a reasonable first approximation would be to pile the books by the authors’ last names: A-G, H-N, O-S, T-Z. Then you can lift the piles to the shelves, and polish the piles with a fast insertion sort.

Bucket sort is very hard to beat for uniformly distributed numerical data. The records are first dropped into the right bucket. Items near each other (after sorting) belong to the same bucket. The buckets are then sorted using some other sort; here we use an insertion sort. If the buckets stay small, the running time of insertion sort doesn’t hurt. After this, the buckets are simply concatenated. The keys must be uniformly distributed; otherwise, the size of the buckets becomes unbalanced and the insertion sort slows down. Our implementation is shown in the bucket_sort() subroutine:

use constant BUCKET_SIZE => 10;

sub bucket_sort {
    my ($array, $min, $max) = @_;
    my $N = @$array or return;

    my $range    = $max - $min;
    my $N_BUCKET = $N / BUCKET_SIZE;
    my @bucket;

    # Create the buckets.
    for ( my $i = 0; $i < $N_BUCKET; $i++ ) {
        $bucket[ $i ] = [ ];
    }

    # Fill the buckets.
    for ( my $i = 0; $i < $N; $i++ ) {
        my $slot = $N_BUCKET * ( ( $array->[ $i ] - $min ) / $range );
        push @{ $bucket[ $slot ] }, $array->[ $i ];
    }

    # Sort inside the buckets.
    for ( my $i = 0; $i < $N_BUCKET; $i++ ) {
        insertion_sort( $bucket[ $i ] );
    }

    # Concatenate the buckets.

    @{ $array } = map { @{ $_ } } @bucket;
}

If the numbers are uniformly distributed, the bucket sort is quite possibly the fastest way to sort numbers.
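A usage sketch (ours), assuming the insertion_sort() defined earlier in the chapter is in scope and compares numerically; the data must lie between the given $min and $max:

my @samples = map { rand() * 100 } 1 .. 1000;    # Uniform in [0, 100).
bucket_sort( \@samples, 0, 100 );
print "$samples[0] ... $samples[-1]\n";          # Smallest ... largest.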

Quickbubblesort

To further demonstrate hybrid sorts, we’ll marry quicksort and bubble sort to produce quickbubblesort, or qbsort() for short. We partition until our partitions are narrower than a predefined threshold width, and then we bubble sort the entire array. The partitionMo3() subroutine is the same as the partition() subroutine we used earlier, except that the median-of-three code has been inserted immediately after the input arguments are copied.

sub qbsort_quick;
sub partitionMo3;

sub qbsort {
    qbsort_quick $_[0], 0, $#{ $_[0] }, defined $_[1] ? $_[1] : 10;
    bubblesmart   $_[0]; # Use the variant that's fast for almost sorted data.
}

# The first half of the quickbubblesort: quicksort.
# A completely normal quicksort (using median-of-three)
# except that only partitions larger than $width are sorted.

sub qbsort_quick {
    my ( $array, $first, $last, $width ) = @_;
    my @stack = ( $first, $last );

    do {
        if ( $last - $first > $width ) {
            my ( $first_of_last, $last_of_first ) =
                partitionMo3( $array, $first, $last );

            # Push the larger partition; loop on the smaller one.
            if ( $last_of_first - $first > $last - $first_of_last ) {
                push @stack, $first, $last_of_first;
                $first = $first_of_last;
            } else {
                push @stack, $first_of_last, $last;
                $last = $last_of_first;
            }
        } else { # Pop.
            ( $first, $last ) = splice @stack, -2, 2;
        }
    } while @stack;
}
sub partitionMo3 {
    my ( $array, $first, $last ) = @_;

    use integer;

    my $middle = int( ( $first + $last ) / 2 );

    # Shuffle the first, middle, and last elements so that the median
    # of the three ends up at $last, where it will serve as the pivot.
    @$array[ $first, $middle ] = @$array[ $middle, $first ]
        if ( $$array[ $first ] gt $$array[ $middle ] );

    @$array[ $first, $last ] = @$array[ $last, $first ]
        if ( $$array[ $first ] gt $$array[ $last ] );

    @$array[ $middle, $last ] = @$array[ $last, $middle ]
        if ( $$array[ $middle ] lt $$array[ $last ] );

    my $i = $first;
    my $j = $last - 1;
    my $pivot = $$array[ $last ];

    # Now do the partitioning around the median.
SCAN: {
        do {
            # $first <= $i <= $j <= $last - 1
            # Point 1.
# Move $i as far as possible.
            while ( $$array[ $i ] le $pivot ) {
                $i++;
                last SCAN if $j < $i;
            }

            # Move $j as far as possible.
            while ( $$array[ $j ] ge $pivot ) {
                $j--;
                last SCAN if $j < $i;
            }

            # $i and $j did not cross over,
            # swap a low and a high value.
            @$array[ $j, $i ] = @$array[ $i, $j ];
        } while ( --$j >= ++$i );
    }
    # $first - 1 <= $j < $i <= $last
    # Point 2.

    # Swap the pivot with the first larger element
    # (if there is one).
    if( $i < $last ) {
        @$array[ $last, $i ] = @$array[ $i, $last ];
        ++$i;
    }

    # Point 3.

    return ( $i, $j );   # The new bounds exclude the middle.
}

The qbsort() default threshold width of 10 can be changed with the optional second parameter. We will see in the final summary (Figure 4-14) how well this hybrid fares.
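A usage sketch (sample data ours), assuming the bubblesmart() subroutine from earlier in the chapter is in scope:

my @cities = qw(Lima Oslo Cairo Quito Bonn);
qbsort( \@cities );        # Default threshold width of 10.
print "@cities\n";         # Bonn Cairo Lima Oslo Quito

qbsort( \@cities, 25 );    # The same, with a wider threshold.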

External Sorting

Sometimes it’s simply not possible to contain all your data in memory. Maybe there’s not enough virtual (or real) memory, or maybe some of the data has yet to arrive when the sort begins. Maybe the items being sorted permit only sequential access, like tapes in a tape drive. This makes all of the algorithms described so far completely impractical: they assume random access devices like disks and memories. When the cost of retrieving or storing an element becomes, say, linearly dependent on its position, all the algorithms we’ve studied so far become at least O(N²), because swapping two elements is no longer O(1), as we have assumed, but O(N).

We can solve these problems using a divide-and-conquer technique, and the easiest is mergesort. Mergesort is ideal because it reads its inputs sequentially, never looking back. The partial solutions (saved on disk or tape) can then be combined over several stages into the final result. Furthermore, the finished output is generated sequentially, and each datum can therefore be finalized as soon as the merge “pointer” has passed by.

The mergesort we described earlier in this chapter divided the sorting problem into two parts. But there’s nothing special about the number two: in our dividing and conquering, there’s no reason we can’t divide into three or more parts. In external sorting, such multiway merging may be needed, so that instead of merging only two subsolutions, we combine several simultaneously.
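As an illustration of the sequential-access idea, here is a two-way file merge sketch (ours, not from the original): it assumes two input filehandles whose lines are already sorted, and writes one sorted output stream, reading each input strictly front to back.

sub merge_files {
    my ( $in_a, $in_b, $out ) = @_;

    my $line_a = <$in_a>;
    my $line_b = <$in_b>;

    # Emit the smaller of the two current lines, then advance that input.
    while ( defined $line_a && defined $line_b ) {
        if ( $line_a le $line_b ) { print $out $line_a; $line_a = <$in_a> }
        else                      { print $out $line_b; $line_b = <$in_b> }
    }

    # Drain whichever input still has lines left.
    while ( defined $line_a ) { print $out $line_a; $line_a = <$in_a> }
    while ( defined $line_b ) { print $out $line_b; $line_b = <$in_b> }
}

Sorted runs of a large file can be produced in memory, written out, and then combined with repeated merges like this until a single sorted file remains.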

Sorting Algorithms Summary

Most of the time Perl’s own sort is enough because it implements a fine-tuned quicksort in C. However, if you need a customized sort algorithm, here are some guidelines for choosing one.

Reminder: In our graphs, both axes are scaled to 1.0 because the absolute numbers are irrelevant—that’s the beauty of O-analysis. The 1.0 of the running time axis is the slowest case: bubblesort for random data.

The data set used was a collection of randomly generated strings (except for our version of bucket sort, which understands only numbers). There were 100, 200,..., 1000 strings, with lengths varying from 20 to 100 characters (except for radix sort, which demands equal-length strings). For each algorithm, the tests were run with all three orderings: random, already ordered, and already reverse-ordered. To avoid statistical flutter (the computer used was a multitasking server), each test was run 10 times and the running times (CPU time, not real time) were averaged.

To illustrate the fact that the asymptotic behavior of an algorithm has very little to do with the computing power of the hardware, comprehensive tests were run on four different computers, resulting in Figure 4-11. An insertion sort on random data was chosen for the benchmark because it curves quite nicely. The computers sported three different CPU families, the frequencies of the CPUs varied by a factor of 7, and the real memory sizes of the hosts varied by a factor of 64. Due to these large differences the absolute running times varied by a factor of 4, but since the complexity class doesn’t change, the curves all look similar.

The irrelevance of the computer architecture
Figure 4-11. The irrelevance of the computer architecture

O(N²) Sorts

In this section, we’ll compare selection sort, bubble sort, and insertion sort.

Selection sort

Selection sort is insensitive, but to little gain: performance is always O(N²). It always does the maximum amount of work that one can actually do without repeating effort. It is possible to code it stably, but it’s not worth the trouble.

Bubble sort and insertion sort

Don’t use bubble sort or insertion sort by themselves because of their horrible average performance, O(N²), but remember their phenomenal, nearly linear, performance when the data is already nearly sorted. Either is good for the second stage of a hybrid sort. insertion_merge() can be used for merging two sorted collections.

In Figure 4-12, the three upward curving lines are the O(N²) algorithms, showing you how the bubble, selection, and insertion sorts perform for random data. To avoid cluttering the figure, we show only one log-linear curve and one linear curve. We’ll zoom in to the speediest region soon.

The quadratic, merge, and radix sorts for random data
Figure 4-12. The quadratic, merge, and radix sorts for random data

The bubble sort is the worst, but as you can see, the more records there are, the quicker the deterioration for all three. The second lowest line is the archetypal O(N log N) algorithm: mergesort. It looks like a straight line, but actually curves slightly upwards (much more gently than the O(N²) sorts). The best-looking (lowest) curve belongs to radix sort: for random data, it’s linear in the number of records.

Shellsort

The shellsort, with its hard-to-analyze time complexity, is in a class of its own:

  • unstable

  • sensitive

Time complexity possibly O(N (log N)²), depending on the gap sequence used.

O(N log N) Sorts

Figure 4-13 zooms in on the bottom region of Figure 4-12. In the upper left, the O(N²) algorithms shoot up aggressively. At the diagonal and clustering below it, the O(N log N) algorithms curve up in a much more civilized manner. At the bottom right are the four O(N) curves: from top to bottom, they are radix sort, bucket sort for uniformly distributed numbers, and the bubble and insertion sorts for nearly ordered records.

All the sorting algorithms, mostly for random data
Figure 4-13. All the sorting algorithms, mostly for random data

Mergesort

Always performs well: O(N log N). The large space requirement (as large as the input) of traditional implementations is a definite minus. The algorithm is inherently recursive, but can and should be coded iteratively. Useful for external sorting.

Quicksort

Almost always performs well, at O(N log N), but it is very sensitive in its basic form. Its Achilles’ heel is ordered or reversed data, yielding O(N²) performance. Avoid recursion and use the median-of-three technique to make the worst case very unlikely. Then the behavior reverts to log-linear even for ordered and reversed data. Unstable. If you want stability, choose mergesort.

How Well Did We Do?

In Figure 4-14, we present the fastest general-purpose algorithms (disqualifying radix, bucket, and counting): the iterative mergesort, the iterative quicksort, our iterative median-of-three-quickbubblesort, and Perl’s sort, for both random and ordered data. The iterative quicksort for ordered data is not shown because of its aggressive quadratic behavior.

The fastest general-purpose sorting algorithms
Figure 4-14. The fastest general-purpose sorting algorithms

As you can see, we can approach Perl’s built-in sort, which as we said before is a quicksort under the hood.[21] You can see how creatively combining algorithms gives us much higher and more balanced performance than blindly using one single algorithm.

Here are two tables that summarize the behavior of the sorting algorithms described in this chapter. As mentioned at the very beginning of this chapter, Perl has used its own quicksort implementation since Version 5.004_05. It is a hybrid of quicksort-with-median-of-three (quick+mo3 in the tables that follow) and insertion sort. The terminally curious may browse pp_ctl.c in the Perl source code.

Table 4-1 summarizes the performance behavior of the algorithms as well as their stability and sensitivity.

Table 4-1. Performance of Sorting Algorithms

Sort        Random          Ordered         Reversed        Stability   Sensitivity

selection   O(N²)           O(N²)           O(N²)           stable      insensitive
bubble      O(N²)           O(N)            O(N²)           unstable    sensitive
insertion   O(N²)           O(N)            O(N²)           stable      sensitive
shell       O(N (log N)²)   O(N (log N)²)   O(N (log N)²)   stable      sensitive
merge       O(N log N)      O(N log N)      O(N log N)      stable      insensitive
heap        O(N log N)      O(N log N)      O(N log N)      unstable    insensitive
quick       O(N log N)      O(N²)           O(N²)           unstable    sensitive
quick+mo3   O(N log N)      O(N log N)      O(N log N)      unstable    insensitive
radix       O(Nk)           O(Nk)           O(Nk)           stable      insensitive
counting    O(N)            O(N)            O(N)            stable      insensitive
bucket      O(N)            O(N)            O(N)            stable      sensitive

The quick+mo3 is quicksort with the median-of-three enhancement. “Almost ordered” and “almost reversed” behave like their perfect counterparts...almost.

Table 4-2 summarizes the pros and cons of the algorithms.

Table 4-2. Pros and Cons of Sorts

Sort        Advantages                         Disadvantages

selection   stable, insensitive                always O(N²)
bubble      O(N) for nearly sorted             O(N²) otherwise
insertion   O(N) for nearly sorted             O(N²) otherwise
shell       O(N (log N)²)                      worse than O(N log N)
merge       O(N log N), stable, insensitive    temporary workspace
heap        O(N log N), insensitive            unstable
quick       O(N log N)                         unstable, sensitive (O(N²) at worst)
quick+mo3   O(N log N), insensitive            unstable
radix       O(Nk), stable, insensitive         only for strings of equal length
counting    O(N), stable, insensitive          only for integers
bucket      O(N), stable                       only for uniformly distributed numbers

“No, not at the rear!” the slave-driver shouted. “Three files up. And stay there, or you’ll know it, when I come down the line!”

J. R. R. Tolkien, The Lord of the Rings



[12] The (3) is Unix-speak and means documentation section 3, the libraries. On a Unix system, man qsort will display the documentation.

[13] Actually, there is at least one port of Perl, to the IBM System/390, which uses another ordering, EBCDIC.

[14] The Royal Academy at Madrid recently gave in a bit, thanks to the stupidity of computers: handling the letter ch as c and h is now acceptable.

[15] You LISP hackers will recognize the trick.

[16] Strictly speaking, the “left” and “right” are misnomers: left means “the first elements of the anonymous lists” and right means “the second elements of the anonymous lists.”

[17] ISO Latin 1 is a character encoding like ASCII. In fact, ASCII and the first half of ISO Latin 1 are identical. The second half of ISO Latin 1 contains many of the accented characters of several Western European languages.

[18] 200-MHz Pentium Pro, 64 MB memory, NetBSD 1.2G.

[19] As in Rigoletto: La donna è mobile.

[20] Expert poker and bridge players don’t do this, however. They leave their cards unsorted because moving the cards around reveals information.

[21] The better qsort() implementations actually are also hybrids, often quicksort combined with insertion sort.
