Tony Hoare’s famous quote—“Premature optimization is the root of all evil”—usually doesn’t come with its setup: “We should forget about small efficiencies, say about 97% of the time.” That is, don’t sweat the small stuff until you need to. In this chapter, I show how I can look into my Perl programs to see where the slow parts are. Before I start working to improve the performance of my program, I should check to see what the program is actually doing. Once I know where the slow parts are, I concentrate on those.
The term benchmark comes from surveyors. They create a physical mark in something to denote a known elevation and use that mark to determine other elevations. Those computed elevations can only be right if the original mark is right. Even if the original mark started off right, maybe it changed because it sunk into the ground, the ground moved because of an earthquake, or global warming redefined the ultimate benchmark we call sea level.[27] Benchmarks are comparisons, not absolutes.
For computers, a benchmark compares the performance of one system against another. They measure in many dimensions, including time to completion, resource use, network activity, or memory use. Several tools already exist for measuring the parts outside of Perl so I won’t cover those here. I want to look inside Perl to see what I can find. I want to know if one bit of code is faster or uses less memory.
Measuring things and extracting numbers is easy, and it’s often easy for us to believe the numbers that computers give us. This makes benchmarking dangerous. Unlike those surveyors, we can’t stand on a hill and know if we are higher or lower than the next hill by just looking. We have to carefully consider not only the numbers that we get from benchmarks, but the method we use to generate the numbers.
Benchmarking isn’t as popular as it used to be. The speed and storage of computers and the bandwidth of networks are not as limiting as they used to be, so we don’t feel like we have to work hard to conserve them. We also don’t have to pay (as in money, literally) for CPU cycles (in most cases), so we don’t care how many we actually use. At least, we don’t care as much as programmers used to care. After all, you’re using Perl, aren’t you?
Any measurement comes with risk. If I don’t understand what I’m measuring, what affects the measurement, or what the numbers actually mean, I can easily misinterpret the results. If I’m not careful about how I measure things, my numbers may be meaningless. I can let the computer do the benchmarking work for me, but I shouldn’t let it do the thinking for me.
A Perl program doesn’t run on its own. It depends on a perl interpreter, an operating system, and hardware. Each of those things depends on other things. Even if I use the same machine, different perl interpreters, even of the same version of Perl, may give different results. I could have compiled them with different C compilers that have different levels of optimization, I could have included different features in one interpreter, and so on. I’ll talk about this more toward the end of the chapter when I discuss perlbench.
You probably don’t have to imagine a situation where you develop on one platform but deploy on another. I get to visit many companies in my travels as a consultant with Stonehenge, so I’ve been able to see a lot of different setups. Often, teams develop on one system that only they use, and then deploy the result to a busy server that has a different version of Perl, a different version of the operating system, and a completely different load profile. What was quite speedy on a lightly used machine becomes unbearably slow when people start to use it. A good example of this is CGI programs, which become quite slow with increased load, versus speedy mod_perl programs, which scale quite nicely.
Any benchmark only applies to its situation. Extrapolating my results might not get me in trouble, but the extrapolations aren’t really valid either. The only way for me to really know what will happen in a particular situation is to test that situation. Along with my numbers, I have to report the details. It’s not enough to say, for instance, that I’m writing this on a Powerbook G4 running Mac OS 10.4.4. I have to tell you the details of my perl interpreter, how I compiled it (that’s just perl -V), and how I’ve tuned my operating system.
Also, I can’t measure something without interacting with it, and that interaction changes the situation. If I want to watch the details of Perl’s memory management, for instance, I can compile Perl with -DDEBUGGING_MSTATS, but then it’s not the same Perl interpreter. Although I can now watch the memory, I’ve probably slowed the entire program down (and I verify that at the end of this chapter when I show perlbench). If I add code to time the program, I have to execute that code, which means my program takes longer. In any case, I might have to use additional modules, which means that Perl has to find, load, and compile more code.
To measure the time it takes my program to run, I could just look at the clock on the wall when I start the program and look again when the program finishes. That’s the simplest way, and the most naive, too. This method might work in extreme circumstances. If I can reduce the run time of my program from an entire workday to a couple of minutes, then I don’t care that the wallclock method might be a bit inaccurate.
I don’t have to really look at my watch, though. I can time my program directly in my program if I like:
#!/usr/bin/perl

my $start = time;

#... the meat of my program

my $end = time;

print "The total time is ", $end - $start;
For a short-running program, this method only tests a portion of the runtime. What about all that time Perl spent compiling the code? If I used a lot of modules, a significant part of the time the whole process takes might be in the parts before Perl even starts running the program.
Jean-Louis Leroy wrote an article for Perl.com[28] about slow startup times in a Perl FTP program because Perl had to look through 23 different directories to find everything Net::FTP needed to load. The runtime portion is still pretty speedy, but the startup time was relatively long. Remember that Perl has to compile the program every time I run it (forgetting about things like mod_perl for the moment). If I use many modules, I make a lot of work for Perl to find them and compile them every time I run my program.
If I want to time the whole process, compile time and runtime, I can create a wrapper around the program to do the wallclock timing. I could take this number and compare it to the runtime numbers to estimate the compilation times:
#!/usr/bin/perl

my $start = time;

system( "@ARGV" );

my $end = time;

printf "The whole time was %d seconds", $end - $start;
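If I save that wrapper as whole-time.pl (just my name for it here), I can point it at any command. Loading a module from the command line with -M is a quick way to see how much the compile phase adds, for instance with the Net::FTP module I just mentioned:

$ perl whole-time.pl perl -MNet::FTP -e 1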
The wallclock method breaks down, though, because the operating system can switch between tasks, or even run different tasks at the same time. I can’t tell how much time the computer worked on my program by only looking at my watch. The situation is even worse when my program has to wait for resources that might be busy or for network latency. I can’t really blame my program in those cases.
The time program (not the Perl built-in) that comes with most unix-like systems solves this by reporting only the time that the operating system thinks about my program. Your particular shell may even have a built-in command for it.[29]
From the command line, I tell the time command what it should measure. It runs the command and reports its results. It breaks the runtime down into the real time, the user time, and the system time. The real time is the wallclock time. The other two deal with how the operating system divides tasks between the system and my process. Mostly I don’t care about that distinction; only their sum matters to me.
When I time the sleep program (not the Perl built-in), the real time is the time I told it to sleep, but since that’s all that program does, the user and system times are minuscule. The output for your particular version of time may be different:

$ time sleep 5

real    0m5.094s
user    0m0.002s
sys     0m0.011s
Behind the scenes, the time program just uses the times function from the standard C library, and that carries along accounting information (although we’re fortunate that we don’t have to pay for clock cycles anymore). The times Perl built-in does the same thing. In list context, it returns four times: the total user and system time, and the user and system time for the children of the process. I take the end times and subtract the starting times to get the real times:
#!/usr/bin/perl

my @start = times;

#... the meat of my program

my @end = times;

my @diffs = map { $end[$_] - $start[$_] } 0 .. $#end;

print "The total time is @diffs";
I don’t have to do those calculations myself, though, because the Benchmark module, which comes with Perl, already does it for me. Again, this approach only measures the runtime:
#!/usr/bin/perl
use Benchmark;

my $start = Benchmark->new;

#... the meat of my program

my $end = Benchmark->new;

my $diff = timediff( $end, $start );
print "My program took: " . timestr( $diff ) . "\n";

my( $real, $child_user, $child_system ) = @$diff[0,3,4];

# I'm pretty sure this is POSIX format
printf STDERR "\nreal\t%.3f\nuser\t%.3f\nsys\t%.3f\n",
    $real, $child_user, $child_system;
The output looks like the times output I showed previously, but now it comes completely from within my Perl program and just for the parts of the program inside of the calls to Benchmark->new. Instead of timing the entire program, I can focus on the part I want to examine.
This is almost the same thing David Kulp did to create the Perl Power Tools version of time. Take a benchmark, run the command of interest using system (so those are the children times), and then take another benchmark once system returns. Since this version of time is pure Perl, it runs anywhere that Perl runs:

#!/usr/bin/perl
use Benchmark;

$t0 = Benchmark->new;

$rc = system( @ARGV );

$t1 = Benchmark->new;

$diffs = timediff( $t1, $t0 );

printf STDERR "\nreal %.2f\nuser %.2f\nsys %.2f\n", @$diffs[0,3,4];

$rc &= 0xffff;
if ($rc == 0xff00) { exit 127; }
else               { exit ($rc >> 8); }
There’s a big problem with measuring CPU times and comparing them to program performance: they only measure the time my program used the CPU. They don’t include the time that my program waits to get input, to send output, or to get control of some other resource. Those times might be much more important than the CPU time.
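To see the difference for myself, I can time something that spends almost all of its time waiting. This is only a sketch, with a sleep standing in for any blocked I/O; the point is that the wallclock number grows while the CPU numbers stay near zero:

#!/usr/bin/perl
use Benchmark;

my $t0 = Benchmark->new;

sleep 5;    # stands in for waiting on input, a network, a lock...

my $t1 = Benchmark->new;

# timestr() reports wallclock seconds along with user and system
# CPU; the CPU numbers stay tiny because the program just waited.
print timestr( timediff( $t1, $t0 ) ), "\n";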
Benchmarks by themselves aren’t very useful. I file them under the heading of “decision support.” I might be able to use them to decide that I need to change a program to improve a number, but the number itself doesn’t tell me what to do. Sure, I know how long it takes to run my program, but it doesn’t tell me if I can make it any faster. I need to compare one implementation to another.
I could compare entire programs to each other, but that’s not very useful. If I’m trying to speed up a program, for instance, I’m going to change the parts that I think are slow. Most of the other parts will be the same, and the time to run all of those same parts ends up in the total time. I really just want to compare the bits that are different. The times for the rest of the code skew the results, so I need to isolate the parts that I want to compare.
If I extract the different parts, I can create small programs with just those. Most of the time the sample program takes to run then only applies to the interesting bits. I’ll talk more about that later, but as I go through this next section, remember that anything I do has some overhead and every measurement changes the situation a bit, so I should think about the numbers before I accept them. For now, I’ll go back to the Benchmark module.
If I want to compare two small bits of code instead of entire programs, I can use some of the functions from Benchmark. I can compare either by running a certain number of iterations and comparing the total time, or the inverse of that: running for a total time and comparing the total number of iterations.
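The second form looks like this sketch: a negative first argument tells timethese to run each snippet for at least that many CPU seconds and to report iterations per second (the snippets here are just placeholders I made up):

#!/usr/bin/perl
use Benchmark qw(timethese);

# Run each piece of code for at least 2 CPU seconds and report
# how many times Benchmark managed to run it in that time.
timethese( -2, {
    'Placeholder A' => 'my $x = 2 ** 10',
    'Placeholder B' => 'my @x = (1) x 10',
    } );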
In the timethese function from Benchmark, I give it a number of iterations as the first argument. The second argument is an anonymous hash where the keys are labels I give the snippets and the hash values represent the code I want to compare, in this case as string values that Perl will eval. In this sample program, I want to compare the speed of opendir and glob for getting a list of files:

#!/usr/bin/perl
use Benchmark;

my $iterations = 10_000;

timethese( $iterations, {
    'Opendir' => 'opendir my( $dh ), "."; my @f = readdir( $dh )',
    'Glob'    => 'my @f = glob("*")',
    } );
The timethese function prints a nice report that shows me the three times I discussed earlier:

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  6 wallclock secs ( 2.12 usr +  3.47 sys =  5.59 CPU) @ 1788.91/s (n=10000)
   Opendir:  3 wallclock secs ( 0.85 usr +  1.70 sys =  2.55 CPU) @ 3921.57/s (n=10000)
These aren’t “The Numbers,” though. People try to get away with running the measurement once. Try it again. Then again. The results vary a little bit every time you run it; certainly some of this is merely round-off error:
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  6 wallclock secs ( 2.10 usr +  3.47 sys =  5.57 CPU) @ 1795.33/s (n=10000)
   Opendir:  3 wallclock secs ( 0.86 usr +  1.70 sys =  2.56 CPU) @ 3906.25/s (n=10000)

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  7 wallclock secs ( 2.11 usr +  3.51 sys =  5.62 CPU) @ 1779.36/s (n=10000)
   Opendir:  3 wallclock secs ( 0.87 usr +  1.71 sys =  2.58 CPU) @ 3875.97/s (n=10000)

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  7 wallclock secs ( 2.11 usr +  3.47 sys =  5.58 CPU) @ 1792.11/s (n=10000)
   Opendir:  3 wallclock secs ( 0.85 usr +  1.69 sys =  2.54 CPU) @ 3937.01/s (n=10000)
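The Benchmark module can also summarize a comparison for me. Its cmpthese function takes the same arguments as timethese but prints a table of rates and relative speeds, which makes differences like this easier to see at a glance. Here is a minimal sketch reusing the same snippets:

#!/usr/bin/perl
use Benchmark qw(cmpthese);

# cmpthese reports each snippet's rate and how the snippets
# compare to each other in percent.
cmpthese( 10_000, {
    'Opendir' => 'opendir my( $dh ), "."; my @f = readdir( $dh )',
    'Glob'    => 'my @f = glob("*")',
    } );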
Benchmarking can be deceptive if I let the computer do the thinking for me. The Benchmark module can spit out numbers all day long, but if I don’t think about what I’m doing and what those numbers actually mean, they aren’t useful. They may even lead me to believe something that isn’t true, and I have a nice example from my personal experience of mistrusting a benchmark.
Part of Stonehenge’s Intermediate Perl course covers the Schwartzian Transform, which uses a cached sort-key to avoid duplicating work during a sort. The Schwartzian Transform should be faster, especially for more elements and more complicated sort-key computations. We covered this in Chapter 9 of Intermediate Perl.
In one of the course exercises, to prove to our students that the transform actually boosts performance, we ask them to sort a bunch of filenames in order of their modification date. Looking up the modification time is an expensive operation, especially when I have to do it N*log(N) times. Since we got the answer we wanted, we didn’t investigate as fully as we should have.
The answer we used to give in the course materials was not the best answer. It is short so it fits on one slide, but it makes things seem worse than they really are. The Schwartzian Transform comes out ahead, as it should, but I always thought it should be faster.
Our example used Benchmark’s timethese to compare two methods to sort filenames by their modification age. The “Ordinary” sort computes the file modification age, -M $a, every time it needs to make a comparison. The “Schwartzian” method uses the Schwartzian Transform to compute the modification age once per file and store it with the filename. It’s a cached-key sort:

use Benchmark qw{ timethese };

timethese( -2, {
    Ordinary => q{
        my @results = sort { -M $a <=> -M $b } glob "/bin/*";
        },
    Schwartzian => q{
        map $_->[0],
            sort { $a->[1] <=> $b->[1] }
            map [$_, -M], glob "/bin/*";
        },
    });
This code has a number of problems. If I am going to compare two things, they need to be as alike as I can make them. Notice that in the “Ordinary” case I assign to @results and in the “Schwartzian” case I use map() in a void context. They do different things: one sorts and stores, and one just sorts. To compare them, they need to produce the same thing. In this case, they both need to store their result.
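One minimal fix is to make the “Schwartzian” string store its result too, just as the “Ordinary” one does; this is the same change that shows up later as schwartz_orig_assign:

use Benchmark qw{ timethese };

timethese( -2, {
    Ordinary => q{
        my @results = sort { -M $a <=> -M $b } glob "/bin/*";
        },
    Schwartzian => q{
        my @results =
            map  $_->[0],
            sort { $a->[1] <=> $b->[1] }
            map  [ $_, -M ],
            glob "/bin/*";
        },
    });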
Also, I need to isolate the parts that are different and abstract the parts that are the same. In each code string, I do a glob(), which I already know is an expensive operation. The glob() taints the results because it adds to the time for the two sorts of, um, sorts.
During one class, while the students were doing their lab exercises, I did my own homework by rewriting our benchmark, following the same process I should follow in any benchmark.
I broke up the task into parts and timed the different bits to see how they impact the overall task. I identified three major parts to benchmark: creating a list of files, sorting the files, and assigning the sorted list. I want to time each of those individually, and I also want to time the bigger task. This seems like such a simple task, comparing two bits of code, but I can mess up in several ways if I’m not careful.
I also want to see how much the numbers improve from the example we have in the course slides, so I use the original code strings, too. I try a bunch of different snippets to see how each part of the task contributes to the final numbers. How much of it comes from the list assignment, or from the filename generation through glob()? I build up a bunch of code strings from the various common parts.
First, I create some package variables. Benchmark turns my code strings into subroutines, and I want those subroutines to find these variables. They have to be global (package) variables. Although I know Benchmark puts these subroutines in the main:: package, I use L::*, which is short for Local. It’s not important that I do it in this particular way so much as that I abstract the common parts so they have as little effect as possible on the results.
The $L::glob variable is just the pattern I want glob to use, and I get that from @ARGV so I can run this program over different directories to see how the times change with different numbers of files. I specify it once and use it everywhere I use glob(). That way, every code string gets the same list of files. I expect the Schwartzian Transform to get better and better as the number of files increases.
I also want to run some code strings that don’t use glob(), so I pre-glob the directory and store the list in @L::files. I think glob() is going to significantly affect the times, so I want to see the results with and without it.
The $code anonymous hash has the code strings. I want to test the pieces as well as the whole thing, so I start off with control strings to assign the list of files to a variable and to run a glob().
Benchmark also runs an empty subroutine behind the scenes so it can adjust its time for that overhead too. I expect the “assign” times to be insignificant and the glob() times to be a big deal. At the outset, I suspect the glob() may be as much as a third of the time of the benchmarks, but that’s just a guess.
The next set of code strings measure the sort. The sort_names string tries it in void context, and the sort_names_assign does the same thing but assigns its result to an array. I expect a measurable difference, and the difference to be the same as the time for the assign string.
Then I try the original code strings from our exercise example, and call that ordinary_orig. That one uses a glob(), which I think inflates the time significantly. The ordinary_mod string uses the list of files in @L::files, which is the same thing as the glob() without the glob(). I expect these two to differ by the time of the glob code string.
The last set of strings compare three things. The schwartz_orig string is the one I started with. In schwartz_orig_assign, I fix that to assign to an array, just like I did with the other original code string. If I want to compare them, they have to do the same thing. The final code string, schwartz_mod, gets rid of the glob():

#!/usr/bin/perl
# schwartzian-benchmark.pl
use strict;
use Benchmark;

$L::glob  = $ARGV[0];
@L::files = glob $L::glob;

print "Testing with " . @L::files . " files\n";

my $transform = q|map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, -M ]|;
my $sort      = q|sort { -M $a <=> -M $b }|;

my $code = {
    assign               =>  q| my @r = @L::files |,
    'glob'               =>  q| my @files = glob $L::glob |,

    sort_names           =>  q| sort { $a cmp $b } @L::files |,
    sort_names_assign    =>  q| my @r = sort { $a cmp $b } @L::files |,
    sort_times_assign    => qq| my \@r = $sort \@L::files |,

    ordinary_orig        => qq| my \@r = $sort glob \$L::glob |,
    ordinary_mod         => qq| my \@r = $sort \@L::files |,

    schwartz_orig        => qq| $transform, glob \$L::glob |,
    schwartz_orig_assign => qq| my \@r = $transform, glob \$L::glob |,
    schwartz_mod         => qq| my \@r = $transform, \@L::files |,
    };

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
print "Timing for 2 CPU seconds...\n";
timethese( -2, $code );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
my $iterations = 1_000;

print "\n", "-" x 73, "\n\n";
print "Timing for $iterations iterations\n";
timethese( $iterations, $code );
The Benchmark module provides the report, which I reformatted to make it a bit easier to read (so some of the output is missing and some lines are shorter). The results are not surprising, although I like to show the students that they didn’t waste an hour listening to me talk about how wonderful the transform is:

$ perl benchmark
Testing with 380 files
Timing for 2 CPU seconds...
Benchmark: running assign, glob, ordinary_mod, ordinary_orig, schwartz_mod,
    schwartz_orig, schwartz_orig_assign, sort_names, sort_names_assign
    for at least 2 CPU seconds...
              assign: (2.03 usr + 0.00 sys = 2.03 CPU) (n=   6063)
                glob: (0.81 usr + 1.27 sys = 2.08 CPU) (n=    372)
        ordinary_mod: (0.46 usr + 1.70 sys = 2.16 CPU) (n=     80)
       ordinary_orig: (0.51 usr + 1.64 sys = 2.15 CPU) (n=     66)
        schwartz_mod: (1.54 usr + 0.51 sys = 2.05 CPU) (n=    271)
       schwartz_orig: (1.06 usr + 1.03 sys = 2.09 CPU) (n=    174)
schwartz_orig_assign: (1.20 usr + 0.87 sys = 2.07 CPU) (n=    156)
          sort_names: (2.09 usr + 0.01 sys = 2.10 CPU) (n=3595626)
   sort_names_assign: (2.16 usr + 0.00 sys = 2.16 CPU) (n=   5698)

-------------------------------------------------------------------------

Timing for 1000 iterations
Benchmark: timing 1000 iterations of assign, glob, ordinary_mod, ordinary_orig,
    schwartz_mod, schwartz_orig, schwartz_orig_assign, sort_names,
    sort_names_assign ...
              assign:  1 secs ( 0.33 usr +  0.00 sys =  0.33 CPU)
                glob:  6 secs ( 2.31 usr +  3.30 sys =  5.61 CPU)
        ordinary_mod: 28 secs ( 5.57 usr + 21.49 sys = 27.06 CPU)
       ordinary_orig: 34 secs ( 7.86 usr + 24.74 sys = 32.60 CPU)
        schwartz_mod:  8 secs ( 5.12 usr +  2.47 sys =  7.59 CPU)
       schwartz_orig: 12 secs ( 6.63 usr +  5.52 sys = 12.15 CPU)
schwartz_orig_assign: 14 secs ( 7.76 usr +  5.41 sys = 13.17 CPU)
          sort_names:  0 secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
   sort_names_assign:  0 secs ( 0.39 usr +  0.00 sys =  0.39 CPU)
The sort_names result stands out. It ran almost two million times a second. It also doesn’t do anything since it’s in a void context. It runs really fast, and it runs just as fast no matter what I put in the sort() block. A sort() in void context will always be the fastest. The difference between the sort() and the map() in void context is not as pronounced in schwartz_orig and schwartz_orig_assign because only the last map is in void context. Both still have the rightmost map() and the sort() to compute before it can optimize for void context. There is an approximately 10 percent difference in the number of extra iterations the schwartz_orig can go through, so the missing assignment gave it an apparent but unwarranted boost in our original example.
I like to look at the second set of results for the comparisons, and use the wallclock seconds even though they are not as exact as the CPU seconds. Remember that the CPU times are only measuring time spent in the CPU, and that I’m really doing a lot of filesystem stuff here. The CPU times aren’t any more accurate than the wallclock times.
The glob code string took about 6 seconds, and the schwartz_orig_assign code string took 14 seconds. If I subtract those extra 6 seconds from the 14, I get the wallclock time for schwartz_mod, just like I expected. That’s over a third of the time! The ordinary_* times drop 6 seconds, too, but from 34 to 28 seconds, so the percent difference is not as alarming.
I try the same benchmark with more and more files, and I should see the Schwartzian Transform doing even better as the number of files grows. For the rest of the comparisons, I’ll use the actual CPU time since the round-off error is a lot higher now.
- 873 files
I’ll try 873 files because I have a directory with that many files in it. Notice that the glob() still has a significant effect on the times and that the original transform that was in the void context is still shaving off about 10 percent of the real time. The quotient between ordinary_mod and schwartz_mod is 73 / 20 = 3.7, which is a little bit higher than before. That’s much better than the 2.9 I get for ordinary_orig and schwartz_orig:

Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod...
                glob: 14 secs ( 6.28 usr +  8.00 sys = 14.28 CPU)
        ordinary_mod: 73 secs (14.25 usr + 57.05 sys = 71.30 CPU)
       ordinary_orig: 93 secs (20.83 usr + 66.14 sys = 86.97 CPU)
        schwartz_mod: 20 secs (14.06 usr +  5.52 sys = 19.58 CPU)
       schwartz_orig: 32 secs (17.38 usr + 13.59 sys = 30.97 CPU)
schwartz_orig_assign: 34 secs (19.95 usr + 13.60 sys = 33.55 CPU)
- 3162 files
Idle CPUs are wasted CPUs, but I think I’d rather have an idle CPU instead of one doing this benchmark again. My disk was spinning quite a bit as I ran it. The quotient between ordinary_mod and schwartz_mod is 675 / 151 = 4.5, so the Schwartzian Transform is doing even better, but ordinary_orig and schwartz_orig give me the ratio 2.8, less than before, so the incorrect benchmark has the gap closing. That’s not what should be happening!
Look at the huge penalty from the glob()! Now the glob() takes almost as much time as the transform itself. If I stuck with the original solution, students might think that the transform wasn’t so hot:

Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod...
                glob: 148 secs (  31.26 usr + 102.59 sys = 133.85 CPU)
        ordinary_mod: 675 secs (  86.64 usr + 517.19 sys = 603.83 CPU)
       ordinary_orig: 825 secs ( 116.55 usr + 617.62 sys = 734.17 CPU)
        schwartz_mod: 151 secs (  68.88 usr +  67.32 sys = 136.20 CPU)
       schwartz_orig: 297 secs (  89.33 usr + 174.51 sys = 263.84 CPU)
schwartz_orig_assign: 294 secs (  96.68 usr + 168.76 sys = 265.44 CPU)
When a programmer talks about benchmarking, she’s probably talking about speed. After all, that’s what the Benchmark Perl module measures and what most articles on the subject discuss. Time is an easy thing to measure, so it’s understandable, though not necessarily right, that people measure what they can. Sometimes time is not the major constraint; something else, such as memory use, is causing the problem.
The perldebguts documentation says:
There is a saying that to estimate memory usage of Perl, assume a reasonable algorithm for memory allocation, multiply that estimate by 10, and while you still may miss the mark, at least you won’t be quite so astonished.
Perl trades memory for processing speed. Instead of doing a lot of computation, Perl does a lot of lookup. Higher level languages handle memory management so the developer can think more about the task at hand than about getting more memory, releasing memory, or creating memory management bugs.[30]
This ease of use comes at an expense, though. Since I don’t control the memory and Perl doesn’t know what I plan to do ahead of time, Perl has to guess. When Perl needs more memory, it grabs a big chunk of it. If I need memory now, I’ll probably need more later too, so Perl gets more than I need immediately. If I watch the memory use of my program carefully, I’ll see it jump in big amounts, stay that way for a bit, then jump again. Perl doesn’t give memory back to the operating system, either. It needed the memory before, so it might need it again. It tries to reuse the space it doesn’t need anymore, but it’s going to keep what it’s got.
Also, Perl is built on top of C, but it doesn’t have C’s data types. Perl has scalars, arrays, hashes, and a couple of others. Perl doesn’t expose the actual storage to me, so I don’t have to think about it. Not only that, but Perl has to deal with context. Are those data strings, or numbers, or both? Where, in memory, are all of the elements of the array? What other variables does this thing reference?
That’s the long way to say that a number in Perl is more than a number. It’s really a movie star with a big entourage. It may be a 32-bit integer, but it’s really 12 bytes. The Devel::Peek module lets me see what’s going on by letting me inspect the variable’s data structure to see how Perl stores it:
#!/usr/bin/perl
use Devel::Peek;

my $a;
print_message( "Before I do anything" );
Dump( $a );

print_message( "After I assign a string" );
$a = '123456789';
Dump( $a );

print_message( "After I use it as a number" );
$b = $a + 1;
Dump( $a );

sub print_message {
    print STDERR "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n";
    }
The output shows me what Perl is tracking for that scalar at each point in the program. When I first create the variable, it doesn’t have a value. I can see that Perl created the scalar (in internals parlance, the SV, or “scalar value”), it has a reference count of 1, and that it has some flags set. The SV doesn’t have anything in it (that’s the NULL(0x0)), but it has an address, 0x1808248, because the scalar infrastructure is set up and ready to go when I’m ready to give it a value.
When I assign a string to $a, it has more flags set and now has a PV, a “pointer value,” which really means it’s just a string (or char * for you C people). Now the scalar value points to the string data.
When I use this scalar as a number for the first time, Perl has to convert the string to a number. Once it does that, it stores the number value too, turning my scalar into a PVIV, meaning that it has a pointer value and an integer value. Perl sets more flags to indicate that it’s done the conversion and it has both values. Next time it can access the number directly:

--------------------------------------------------
Before I do anything
--------------------------------------------------
SV = NULL(0x0) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY)

--------------------------------------------------
After I assign a string
--------------------------------------------------
SV = PV(0x1800908) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10

--------------------------------------------------
After I use it as a number
--------------------------------------------------
SV = PVIV(0x1800c20) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK)
  IV = 123456789
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10
Just from that I can see that Perl is doing a lot of work. Each Perl variable has some overhead even if it doesn’t have a defined value. That’s okay, because Perl’s variables do a lot more for me in exchange for that overhead.
The Devel::Size module can tell me how much memory my variable takes up. I have to remember, though, that the actual memory is probably a little bit more since Perl has to align the low-level values at the appropriate byte boundaries. It can’t just store a value starting anywhere it likes:

#!/usr/bin/perl
use Devel::Size qw(size);

my $n;
print_message( "Before I do anything" );
print "Size is ", size( \$n );

print_message( "After I assign a string" );
$n = '1';
print "Size is ", size( \$n );

print_message( "After I use it as a number" );
my $m = $n + 1;
print "Size is ", size( \$n );

sub print_message {
    print "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n";
    }
I see that even before I do anything, my scalar $n takes up 12 bytes, at least. When I assign it a string, the size of the scalar is larger, and by more than just the number of characters in the string. Perl tacks on a null byte to terminate the string and might have some extra space allocated in case the string gets bigger. When I use the string as a number, Perl stores the numeric version too, so the scalar gets even larger. Every one of these things can use a bit more memory than I expect:

--------------------------------------------------
Before I do anything
--------------------------------------------------
Size is 12
--------------------------------------------------
After I assign a string
--------------------------------------------------
Size is 26
--------------------------------------------------
After I use it as a number
--------------------------------------------------
Size is 31
What about references, which are also scalars? They only need to know where to find the value, but they don’t store values themselves. They stay the same size even when the values change. The size of a reference doesn’t change. I have to be careful with Devel::Size, though. If I give it a reference, it finds the size of the thing at which the reference points. That’s a good thing, as I’ll show when I try it with arrays or hashes. However, if I have a reference pointing at a reference, the size of that second reference is the size of the thing at which it points, which is just a reference:

#!/usr/bin/perl
use LWP::Simple;
use Devel::Size qw(size);

# watch out! This is 50+ MB big!
my $data = get( "http://www.cpan.org/src/stable.tar.gz" );

print "The size of the data is " , size( $data ), "\n";

my $ref = \$data;
print "The size of the reference is " , size( $ref ), "\n";

my $ref2 = \$ref;
print "The size of the second reference is " , size( $ref2 ), "\n";
The output shows that the second reference is just 16 bytes. It doesn’t include all of the data stored in the ultimate scalar. I’ll show in a moment why I need to know that, but I have to look at Perl’s containers first:
The size of the data is 12829217
The size of the reference is 12829217
The size of the second reference is 16
The situation for Perl’s containers is different. Arrays are collections of scalars, and hashes have scalar keys and scalar values. Those scalars can be the normal variety that hold values or they can be references. The size function from Devel::Size tells us the size of the data structure. Remember, references may point to big values, but they don’t take up that much space themselves:

#!/usr/bin/perl
use Devel::Size qw(size);

my @array = ( 1 ) x 500;

print "The size of the array is ", size( \@array ), "\n";
I can see how much space the array takes up. The Devel::Size documentation is careful to note that this doesn’t count the size of the things in the array, just the size of the array. Notice that the size of my 500-element array is much less than 500 times the 16 bytes my individual scalars used; it only accounts for the array’s own structure:
The size of the array is 2052
That number isn’t counting the contents though. The array takes up the same size no matter what the scalars hold:
#!/usr/bin/perl
use Devel::Size qw(size);

my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";

my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
I created a scalar with 500 characters, and the entire scalar including the overhead takes up 525 bytes. The array takes up the same space as it did previously:
The size of the scalar is 525
The size of the array is 2052
Devel::Size has a fix for this. To get around this, I need to look at each of the scalars in the container and find their sizes. The reference values may point to other containers, which may have more references. My array might look like it’s really small until I try to make a deep copy, store it, or anything else where I have to get all the data in one place, reference or not:

#!/usr/bin/perl
use Devel::Size qw(size total_size);

my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";
print "The total size of the scalar is ", total_size( $data ), "\n";

print "\n";

my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
print "The total size of the array is ", total_size( \@array ), "\n";
Using total_size, the scalar size stays the same, and the array size now includes all the scalar sizes. The number, 264,552, is 500 times 525 (the aggregate size of the scalars) added to 2,052 (the array size):

The size of the scalar is 525
The total size of the scalar is 525

The size of the array is 2052
The total size of the array is 264552
I have to remember what this number actually is, though. It’s just the aggregate size of all the data to which the array eventually points. If I did this for all of my data structures, I do not get the program memory size because those structures might contain references to the same data.
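Here is a small sketch of why that adding-up goes wrong. Two arrays that both hold references to the same large scalar each report a total_size that includes that scalar, so summing the two totals counts the data twice (the variable names here are just illustrative):

#!/usr/bin/perl
use Devel::Size qw(total_size);

my $big    = '-' x 100_000;   # one large scalar
my @first  = ( \$big );       # both arrays refer to the
my @second = ( \$big );       #   same underlying data

# Each total includes the 100,000-plus bytes of $big, so the
# sum of the two totals overstates the real memory in use.
print "First:  ", total_size( \@first  ), "\n";
print "Second: ", total_size( \@second ), "\n";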
The same code can perform differently on different perl binaries, which might differ in their compilation options, compilers used, features included, and so on. For instance, threaded versions of Perl are slightly slower, as are shared library versions. It’s not necessarily bad to be slower if you want the other benefits, but you don’t always get to choose beforehand. For instance, the stock Perl on some Linux distributions is compiled for threads. If you think you might get a speedup with a nonthreaded interpreter, find out before you reconfigure your system!
To compare different perl interpreters, Gisle Aas wrote perlbench. Its perlbench-run program takes the locations of the perl interpreters I want to test, runs a series of benchmarks on each of them, and produces a report. The program normalizes the numbers to the time for the first interpreter I specify:
perlbench-run /usr/local/bin/perl5*
The output first shows the details for each interpreter I’m testing and assigns them letters that correspond to a column in the table that it’s about to output. Especially interesting is the ccflags information. In this run, I’m using a perl-5.8.7 I compiled with -DDEBUGGING_MSTATS, for instance. Also interesting is the compiler information. It looks like I’ve got a pretty old version of gcc. That might be a good or bad thing. Different versions, or even different compilers, do better or worse jobs optimizing the code. These numbers only have relative meaning on the same machine:
A) perl-5.6.1
    version     = 5.006001
    path        = /usr/local/bin/perl5.6.1
    ccflags     = -fno-strict-aliasing -I/usr/local/include
    gccversion  = 2.95.2 19991024 (release)
    optimize    = -O
    usemymalloc = n

B) perl-5.8.0
    version     = 5.008
    path        = /usr/local/bin/perl5.8.0
    ccflags     = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing
    gccversion  = 2.95.2 19991024 (release)
    optimize    = -O
    usemymalloc = n

C) perl-5.8.7
    version     = 5.008007
    path        = /usr/local/bin/perl5.8.7
    ccflags     = -DDEBUGGING_MSTATS
    gccversion  = 2.95.4 20020320 [FreeBSD]
    optimize    = -g
    usemymalloc = y

D) perl-5.8.8
    version     = 5.008008
    path        = /usr/local/bin/perl5.8.8
    ccflags     = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing
    gccversion  = 2.95.4 20020320 [FreeBSD]
    optimize    = -O
    usemymalloc = n
After perlbench-run reports the details of the interpreters, it runs a series of Perl programs with each of them. It measures the time to execute, much like Benchmark’s timethese. Once it tries the program with all of the interpreters, it normalizes the results so that the first interpreter (that’s the one labeled with “A”) is 100. Lower numbers in the other columns mean that interpreter is slower for that test. Higher numbers (they can be above 100) mean that interpreter is faster for that test. The number only has meaning for that test, and I can’t compare them to a different test, even in the same run.
I’ve cut out some of the output from these results, but this chart gives you the flavor of the comparisons. Interpreter C, the one compiled with -DDEBUGGING_MSTATS, is consistently slower than all of the other interpreters. For the other tests, sometimes Perl 5.6.1 is faster and sometimes Perl 5.8.8 is faster. That’s not a general observation since it only applies to the ones I’ve compiled. Overall, it looks as if my Perl 5.6.1 is the fastest. That still doesn’t mean I should choose it, though, because for a slight penalty I get all the nice features of Perl 5.8:

                        A    B    C    D
                      ---  ---  ---  ---
arith/mixed           100   85   73   79
arith/trig            100   87   82   81
array/copy            100   99   81   92
array/foreach         100   93   87   99
array/shift           100  100   94   91
array/sort-num        100   89   92  151
array/sort            100   95   80   94
call/0arg             100  107   79   91
call/1arg             100   92   69   78
call/wantarray        100   95   76   80
hash/copy             100  130   94  124
hash/each             100  119   90  110
hash/foreach-sort     100  103   78  102
loop/for-c            100  102   88  104
loop/for-range-const  100  101   94  106
loop/for-range        100  100   94  104
re/const              100   92   81   88
string/base64         100   86   67   72
string/htmlparser     100   91   75   74
string/tr             100  105   51  111

AVERAGE               100   97   80   91

Results saved in file:///home/brian/perlbench-0.93/benchres-002/index.html
If I have something special to test, I can add my own test files. Most of the infrastructure is already in place. The README from the perlbench distribution gives the basic format of a test file. I create my test and put it in perlbench’s benchmark directory. The distribution gives an example file:

# Name: My name goes here
# Require: 4

require 'benchlib.pl';

# YOUR SETUP CODE HERE
$a = 0;

&runtest(100, <<'ENDTEST');
   # YOUR TESTING CODE HERE
ENDTEST
Benchmarking is a tricky subject. It involves a lot of numbers and requires a good understanding of what’s actually going on. Not only do I have to look at my Perl program, but I should consider other factors, such as my choice in operating system, the Perl interpreter I’m using and how I compiled it, and anything else that interacts with my program. It’s not all about speed, either. I might want to compare the memory use of two approaches, or see which one takes up less bandwidth. Different situations have different constraints. No matter what I’m doing, I need to do my best to find out what’s really going on before I make any conclusions about how to make it better.
The Benchmark module’s documentation provides all of the details of its use. The module comes with Perl, so you should already have it.
In “A Timely Start,” Jean-Louis Leroy finds that his Perl program was slow because of the time it took to find the modules it needed to load: http://www.perl.com/lpt/a/2005/12/21/a_timely_start.html.
In “When Perl Isn’t Quite Fast Enough,” Perl developer Nick Clark talks about why programs, in general, are slow, and which Perl design decisions can make Perl slow. The best part of his talk, which he originally gave at YAPC::EU 2002, is his solutions to speed up his programs. I heard his talk on PerlWhirl 2005, and he related most of his discussion to his work to make Perl’s Unicode handling faster. If you get a chance to see his talk, take it! I think you’ll be entertained as well as educated.
I originally wrote the parts about benchmarking the Schwartzian Transform for Perlmonks in a node titled “Wasting Time Thinking about Wasted Time.” I nicked it almost verbatim from my original post: http://www.perlmonks.org/index.pl?node=393128. I still use that post in Stonehenge’s Perl classes to show that even “experts” can mess up benchmarks.
The second Perl article I ever wrote was “Benchmarking Perl” for The Perl Journal number 11, in which I show some of the other functions in Benchmark: http://www.pair.com/comdog/Articles/benchmark.1_4.txt.
The perlbench distribution wasn’t indexed in CPAN when I wrote this, but you can still find it through CPAN Search: http://search.cpan.org. Check the README file for its documentation.
[27] Sea level isn’t a good benchmark either, because there is really no such thing. Not only do tides affect the height of water, but the oceans tend to tilt against the earth’s rotation. Sea level is actually different around the world because the level of the sea is different.
[28] “A Timely Start” (http://www.perl.com/lpt/a/2005/12/21/a_timely_start.html).
[29] If you don’t have this tool, the Perl Power Tools Project (http://search.cpan.org/dist/ppt/) has a Perl implementation of it, and in a moment I’ll implement my own.
[30] The gory details of Perl’s memory management could take up a whole book. I’ll cover the general idea here and leave it to you to go through perldebguts.