Tony Hoare’s famous quote—“Premature optimization is the root of all evil”—usually doesn’t come with its setup: “We should forget about small efficiencies, say about 97% of the time.” That is, don’t sweat the small stuff until you need to. In this chapter, I show how I can look into my Perl programs to see where the slow parts are. Before I start working to improve the performance of my program, I should check to see what the program is actually doing. Once I know where the slow parts are, I concentrate on those.
The term benchmark comes from surveyors. They create a physical mark in something to denote a known elevation and use that mark to determine other elevations. Those computed elevations can only be right if the original mark is right. Even if the original mark started off right, maybe it changed because it sunk into the ground, the ground moved because of an earthquake, or global warming redefined the ultimate benchmark we call sea level.[27] Benchmarks are comparisons, not absolutes.
For computers, a benchmark compares the performance of one system against another. They measure in many dimensions, including time to completion, resource use, network activity, or memory use. Several tools already exist for measuring the parts outside of Perl so I won’t cover those here. I want to look inside Perl to see what I can find. I want to know if one bit of code is faster or uses less memory.
Measuring things and extracting numbers is easy, and it’s often easy for us to believe the numbers that computers give us. This makes benchmarking dangerous. Unlike those surveyors, we can’t stand on a hill and know if we are higher or lower than the next hill by just looking. We have to carefully consider not only the numbers that we get from benchmarks, but the method we use to generate the numbers.
Benchmarking isn’t as popular as it used to be. The speed and storage of computers and the bandwidth of networks are not as limiting as they used to be, so we don’t feel like we have to work hard to conserve them. We also don’t have to pay (as in money, literally) for CPU cycles (in most cases), so we don’t care how many we actually use. At least, we don’t care as much as programmers used to care. After all, you’re using Perl, aren’t you?
Any measurement comes with risk. If I don’t understand what I’m measuring, what affects the measurement, or what the numbers actually mean, I can easily misinterpret the results. If I’m not careful about how I measure things, my numbers may be meaningless. I can let the computer do the benchmarking work for me, but I shouldn’t let it do the thinking for me.
A Perl program doesn’t run on its own. It depends on a perl interpreter, an operating system, and hardware. Each of those things depends on other things. Even if I use the same machine, different perl interpreters, even of the same version of Perl, may give different results. I could have compiled them with different C compilers that have different levels of optimization, I could have included different features in one interpreter, and so on. I’ll talk about this more toward the end of the chapter when I discuss perlbench.
You probably don’t have to imagine a situation where you develop on one platform but deploy on another. I get to visit many companies in my travels as a consultant with Stonehenge, so I’ve been able to see a lot of different setups. Often, teams develop on one system that only they use, and then deploy the result to a busy server that has a different version of Perl, a different version of the operating system, and a completely different load profile. What was quite speedy on a lightly used machine becomes unbearably slow when people start to use it. A good example of this is CGI programs, which become quite slow with increased load, versus speedy mod_perl programs, which scale quite nicely.
Any benchmark only applies to its situation. Extrapolating my results might not get me in trouble, but the extrapolations aren’t really valid either. The only way for me to really know what will happen in a particular situation is to test that situation. Along with my numbers, I have to report the details. It’s not enough to say, for instance, that I’m writing this on a Powerbook G4 running Mac OS 10.4.4. I have to tell you the details of my perl interpreter, how I compiled it (that’s just perl -V), and how I’ve tuned my operating system.
Also, I can’t measure something without interacting with it, and that interaction changes the situation. If I want to watch the details of Perl’s memory management, for instance, I can compile Perl with -DDEBUGGING_MSTATS, but then it’s not the same Perl interpreter. Although I can now watch the memory, I’ve probably slowed the entire program down (and I verify that at the end of this chapter when I show perlbench). If I add code to time the program, I have to execute that code, which means my program takes longer. In any case, I might have to use additional modules, which means that Perl has to find, load, and compile more code.
To measure the time it takes my program to run, I could just look at the clock on the wall when I start the program and look again when the program finishes. That’s the simplest way, and the most naive, too. This method might work in extreme circumstances. If I can reduce the run time of my program from an entire workday to a couple of minutes, then I don’t care that the wallclock method might be a bit inaccurate.
I don’t have to really look at my watch, though. I can time my program directly in my program if I like:
#!/usr/bin/perl

my $start = time;

#... the meat of my program

my $end = time;

print "The total time is ", $end - $start;
For a short-running program, this method only tests a portion of the runtime. What about all that time Perl spent compiling the code? If I used a lot of modules, a significant part of the time the whole process takes might be in the parts before Perl even starts running the program.
Jean-Louis Leroy wrote an article for Perl.com[28] about slow startup times in a Perl FTP program because Perl had to look through 23 different directories to find everything Net::FTP needed to load. The runtime portion is still pretty speedy, but the startup time was relatively long. Remember that Perl has to compile the program every time I run it (forgetting about things like mod_perl for the moment). If I use many modules, I make a lot of work for Perl to find them and compile them every time I run my program.
If I want to time the whole process, compile time and runtime, I can create a wrapper around the program to do the wallclock timing. I could take this number and compare it to the runtime numbers to estimate the compilation times:
#!/usr/bin/perl

my $start = time;

system( "@ARGV" );

my $end = time;

printf "The whole time was %d seconds", $end - $start;
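If I save that wrapper as whole-time.pl (just my name for it here), I can point it at any command. Loading a module from the command line with -M is a quick way to see how much the compile phase adds, for instance with the Net::FTP module I just mentioned:

$ perl whole-time.pl perl -MNet::FTP -e 1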
The wallclock method breaks down, though, because the operating system can switch between tasks, or even run different tasks at the same time. I can’t tell how much time the computer worked on my program by only looking at my watch. The situation is even worse when my program has to wait for resources that might be busy or for network latency. I can’t really blame my program in those cases.
The time program (not the Perl built-in) that comes with most unix-like systems solves this by reporting only the time that the operating system thinks about my program. Your particular shell may even have a built-in command for it.[29]
From the command line, I tell the time command what it should measure. It runs the command and reports its results. It breaks the runtime down into the real time, the user time, and the system time. The real time is the wallclock time. The other two deal with how the operating system divides tasks between the system and my process. Mostly I don’t care about that distinction; only their sum matters to me.
When I time the sleep program (not the Perl built-in), the real time is the time I told it to sleep, but since that’s all that program does, the user and system times are minuscule. The output for your particular version of time may be different:

$ time sleep 5

real    0m5.094s
user    0m0.002s
sys     0m0.011s
Behind the scenes, the time program just uses the times function from the standard C library, and that carries along accounting information (although we’re fortunate that we don’t have to pay for clock cycles anymore). The times Perl built-in does the same thing. In list context, it returns four times: the total user and system time, and the user and system time for the children of the process. I take the end times and subtract the starting times to get the real times:
#!/usr/bin/perl

my @start = times;

#... the meat of my program

my @end = times;

my @diffs = map { $end[$_] - $start[$_] } 0 .. $#end;

print "The total time is @diffs";
I don’t have to do those calculations myself, though, because the Benchmark module, which comes with Perl, already does it for me. Again, this approach only measures the runtime:
#!/usr/bin/perl
use Benchmark;

my $start = Benchmark->new;

#... the meat of my program

my $end = Benchmark->new;

my $diff = timediff( $end, $start );
print "My program took: " . timestr( $diff ) . "\n";

my( $real, $child_user, $child_system ) = @$diff[0,3,4];

# I'm pretty sure this is POSIX format
printf STDERR "\nreal\t%.3f\nuser\t%.3f\nsys\t%.3f\n",
    $real, $child_user, $child_system;
The output looks like the times output I showed previously, but now it comes completely from within my Perl program and just for the parts of the program inside of the calls to Benchmark->new. Instead of timing the entire program, I can focus on the part I want to examine.
This is almost the same thing David Kulp did to create the Perl Power Tools version of time. Take a benchmark, run the command of interest using system (so those are the children times), and then take another benchmark once system returns. Since this version of time is pure Perl, it runs anywhere that Perl runs:

#!/usr/bin/perl
use Benchmark;

$t0 = Benchmark->new;

$rc = system( @ARGV );

$t1 = Benchmark->new;

$diffs = timediff( $t1, $t0 );

printf STDERR "\nreal %.2f\nuser %.2f\nsys %.2f\n", @$diffs[0,3,4];

$rc &= 0xffff;
if ($rc == 0xff00) { exit 127; }
else               { exit ($rc >> 8); }
There’s a big problem with measuring CPU times and comparing them to program performance: they only measure the time my program used the CPU. They don’t include the time that my program waits to get input, to send output, or to get control of some other resource. Those times might be much more important than the CPU time.
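To see the difference for myself, I can time something that spends almost all of its time waiting. This is only a sketch, with a sleep standing in for any blocked I/O; the point is that the wallclock number grows while the CPU numbers stay near zero:

#!/usr/bin/perl
use Benchmark;

my $t0 = Benchmark->new;

sleep 5;    # stands in for waiting on input, a network, a lock...

my $t1 = Benchmark->new;

# timestr() reports wallclock seconds along with user and system
# CPU; the CPU numbers stay tiny because the program just waited.
print timestr( timediff( $t1, $t0 ) ), "\n";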
Benchmarks by themselves aren’t very useful. I file them under the heading of “decision support.” I might be able to use them to decide that I need to change a program to improve a number, but the number itself doesn’t tell me what to do. Sure, I know how long it takes to run my program, but it doesn’t tell me if I can make it any faster. I need to compare one implementation to another.
I could compare entire programs to each other, but that’s not very useful. If I’m trying to speed up a program, for instance, I’m going to change the parts that I think are slow. Most of the other parts will be the same, and the time to run all of those same parts ends up in the total time. I really just want to compare the bits that are different. The times for the rest of the code skew the results, so I need to isolate the parts that I want to compare.
If I extract the different parts, I can create small programs with just those. Most of the time the sample program takes to run then only applies to the interesting bits. I’ll talk more about that later, but as I go through this next section, remember that anything I do has some overhead and every measurement changes the situation a bit, so I should think about the numbers before I accept them. For now, I’ll go back to the Benchmark module.
If I want to compare two small bits of code instead of entire programs, I can use some of the functions from Benchmark. I can compare either by running a certain number of iterations and comparing the total time, or the inverse of that: running for a total time and comparing the total number of iterations.
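The second form looks like this sketch: a negative first argument tells timethese to run each snippet for at least that many CPU seconds and to report iterations per second (the snippets here are just placeholders I made up):

#!/usr/bin/perl
use Benchmark qw(timethese);

# Run each piece of code for at least 2 CPU seconds and report
# how many times Benchmark managed to run it in that time.
timethese( -2, {
    'Placeholder A' => 'my $x = 2 ** 10',
    'Placeholder B' => 'my @x = (1) x 10',
    } );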
In the timethese function from Benchmark, I give it a number of iterations as the first argument. The second argument is an anonymous hash where the keys are labels I give the snippets and the hash values represent the code I want to compare, in this case as string values that Perl will eval. In this sample program, I want to compare the speed of opendir and glob for getting a list of files:

#!/usr/bin/perl
use Benchmark;

my $iterations = 10_000;

timethese( $iterations, {
    'Opendir' => 'opendir my( $dh ), "."; my @f = readdir( $dh )',
    'Glob'    => 'my @f = glob("*")',
    } );
The timethese function prints a nice report that shows me the three times I discussed earlier:

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  6 wallclock secs ( 2.12 usr +  3.47 sys =  5.59 CPU) @ 1788.91/s (n=10000)
   Opendir:  3 wallclock secs ( 0.85 usr +  1.70 sys =  2.55 CPU) @ 3921.57/s (n=10000)
These aren’t “The Numbers,” though. People try to get away with running the measurement once. Try it again. Then again. The results vary a little bit every time you run it; certainly some of this is merely round-off error:
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  6 wallclock secs ( 2.10 usr +  3.47 sys =  5.57 CPU) @ 1795.33/s (n=10000)
   Opendir:  3 wallclock secs ( 0.86 usr +  1.70 sys =  2.56 CPU) @ 3906.25/s (n=10000)

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  7 wallclock secs ( 2.11 usr +  3.51 sys =  5.62 CPU) @ 1779.36/s (n=10000)
   Opendir:  3 wallclock secs ( 0.87 usr +  1.71 sys =  2.58 CPU) @ 3875.97/s (n=10000)

$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
      Glob:  7 wallclock secs ( 2.11 usr +  3.47 sys =  5.58 CPU) @ 1792.11/s (n=10000)
   Opendir:  3 wallclock secs ( 0.85 usr +  1.69 sys =  2.54 CPU) @ 3937.01/s (n=10000)
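The Benchmark module can also summarize a comparison for me. Its cmpthese function takes the same arguments as timethese but prints a table of rates and relative speeds, which makes differences like this easier to see at a glance. Here is a minimal sketch reusing the same snippets:

#!/usr/bin/perl
use Benchmark qw(cmpthese);

# cmpthese reports each snippet's rate and how the snippets
# compare to each other in percent.
cmpthese( 10_000, {
    'Opendir' => 'opendir my( $dh ), "."; my @f = readdir( $dh )',
    'Glob'    => 'my @f = glob("*")',
    } );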
Benchmarking can be deceptive if I let the computer do the thinking for me. The Benchmark module can spit out numbers all day long, but if I don’t think about what I’m doing and what those numbers actually mean, they aren’t useful. They may even lead me to believe something that isn’t true, and I have a nice example from my personal experience of mistrusting a benchmark.
Part of Stonehenge’s Intermediate Perl course covers the Schwartzian Transform, which uses a cached sort-key to avoid duplicating work during a sort. The Schwartzian Transform should be faster, especially for more elements and more complicated sort-key computations. We covered this in Chapter 9 of Intermediate Perl.
In one of the course exercises, to prove to our students that the transform actually boosts performance, we ask them to sort a bunch of filenames in order of their modification date. Looking up the modification time is an expensive operation, especially when I have to do it N*log(N) times. Since we got the answer we wanted, we didn’t investigate as fully as we should have.
The answer we used to give in the course materials was not the best answer. It is short so it fits on one slide, but it makes things seem worse than they really are. The Schwartzian Transform comes out ahead, as it should, but I always thought it should be faster.
Our example used Benchmark’s timethese to compare two methods to sort filenames by their modification age. The “Ordinary” sort computes the file modification age, -M $a, every time it needs to make a comparison. The “Schwartzian” method uses the Schwartzian Transform to compute the modification age once per file and store it with the filename. It’s a cached-key sort:

use Benchmark qw{ timethese };

timethese( -2, {
    Ordinary => q{
        my @results = sort { -M $a <=> -M $b } glob "/bin/*";
        },
    Schwartzian => q{
        map $_->[0],
            sort { $a->[1] <=> $b->[1] }
            map [$_, -M], glob "/bin/*";
        },
    });
This code has a number of problems. If I am going to compare two things, they need to be as alike as I can make them. Notice that in the “Ordinary” case I assign to @results and in the “Schwartzian” case I use map() in a void context. They do different things: one sorts and stores, and one just sorts. To compare them, they need to produce the same thing. In this case, they both need to store their result.
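One minimal fix is to make the “Schwartzian” string store its result too, just as the “Ordinary” one does; this is the same change that shows up later as schwartz_orig_assign:

use Benchmark qw{ timethese };

timethese( -2, {
    Ordinary => q{
        my @results = sort { -M $a <=> -M $b } glob "/bin/*";
        },
    Schwartzian => q{
        my @results =
            map  $_->[0],
            sort { $a->[1] <=> $b->[1] }
            map  [ $_, -M ],
            glob "/bin/*";
        },
    });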
Also, I need to isolate the parts that are different and abstract the parts that are the same. In each code string, I do a glob(), which I already know is an expensive operation. The glob() taints the results because it adds to the time for the two sorts of, um, sorts.
During one class, while the students were doing their lab exercises, I did my own homework by rewriting our benchmark, following the same process I should follow in any benchmark.
I broke up the task into parts and timed the different bits to see how they impact the overall task. I identified three major parts to benchmark: creating a list of files, sorting the files, and assigning the sorted list. I want to time each of those individually, and I also want to time the bigger task. This seems like such a simple task, comparing two bits of code, but I can mess up in several ways if I’m not careful.
I also want to see how much the numbers improve from the example we have in the course slides, so I use the original code strings, too. I try a bunch of different snippets to see how each part of the task contributes to the final numbers. How much of it comes from the list assignment, or from the filename generation through glob()? I build up a bunch of code strings from the various common parts.
First, I create some package variables. Benchmark turns my code strings into subroutines, and I want those subroutines to find these variables. They have to be global (package) variables. Although I know Benchmark puts these subroutines in the main:: package, I use L::*, which is short for Local. It’s not important that I do it in this particular way so much as that I abstract the common parts so they have as little effect as possible on the results.
The $L::glob variable is just the pattern I want glob to use, and I get that from @ARGV so I can run this program over different directories to see how the times change with different numbers of files. I specify it once and use it everywhere I use glob(). That way, every code string gets the same list of files. I expect the Schwartzian Transform to get better and better as the number of files increases.
I also want to run some code strings that don’t use glob(), so I pre-glob the directory and store the list in @L::files. I think glob() is going to significantly affect the times, so I want to see the results with and without it.
The $code anonymous hash has the code strings. I want to test the pieces as well as the whole thing, so I start off with control strings to assign the list of files to a variable and to run a glob().
Benchmark also runs an empty subroutine behind the scenes so it can adjust its time for that overhead too. I expect the “assign” times to be insignificant and the glob() times to be a big deal. At the outset, I suspect the glob() may be as much as a third of the time of the benchmarks, but that’s just a guess.
The next set of code strings measure the sort. The sort_names string tries it in void context, and the sort_names_assign does the same thing but assigns its result to an array. I expect a measurable difference, and the difference to be the same as the time for the assign string.
Then I try the original code strings from our exercise example, and call that ordinary_orig. That one uses a glob(), which I think inflates the time significantly. The ordinary_mod string uses the list of files in @L::files, which is the same thing as the glob() without the glob(). I expect these two to differ by the time of the glob code string.
The last set of strings compare three things. The schwartz_orig string is the one I started with. In schwartz_orig_assign, I fix that to assign to an array, just like I did with the other original code string. If I want to compare them, they have to do the same thing. The final code string, schwartz_mod, gets rid of the glob():

#!/usr/bin/perl
# schwartzian-benchmark.pl
use strict;
use Benchmark;

$L::glob  = $ARGV[0];
@L::files = glob $L::glob;

print "Testing with " . @L::files . " files\n";

my $transform = q|map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, -M ]|;
my $sort      = q|sort { -M $a <=> -M $b }|;

my $code = {
    assign               =>  q| my @r = @L::files |,
    'glob'               =>  q| my @files = glob $L::glob |,

    sort_names           =>  q| sort { $a cmp $b } @L::files |,
    sort_names_assign    =>  q| my @r = sort { $a cmp $b } @L::files |,
    sort_times_assign    => qq| my \@r = $sort \@L::files |,

    ordinary_orig        => qq| my \@r = $sort glob \$L::glob |,
    ordinary_mod         => qq| my \@r = $sort \@L::files |,

    schwartz_orig        => qq| $transform, glob \$L::glob |,
    schwartz_orig_assign => qq| my \@r = $transform, glob \$L::glob |,
    schwartz_mod         => qq| my \@r = $transform, \@L::files |,
    };

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
print "Timing for 2 CPU seconds...\n";
timethese( -2, $code );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
my $iterations = 1_000;

print "\n", "-" x 73, "\n\n";
print "Timing for $iterations iterations\n";
timethese( $iterations, $code );
The Benchmark module provides the report, which I reformatted to make it a bit easier to read (so some of the output is missing and some lines are shorter). The results are not surprising, although I like to show the students that they didn’t waste an hour listening to me talk about how wonderful the transform is:

$ perl benchmark
Testing with 380 files
Timing for 2 CPU seconds...
Benchmark: running assign, glob, ordinary_mod, ordinary_orig, schwartz_mod,
    schwartz_orig, schwartz_orig_assign, sort_names, sort_names_assign
    for at least 2 CPU seconds...
              assign: (2.03 usr + 0.00 sys = 2.03 CPU) (n=   6063)
                glob: (0.81 usr + 1.27 sys = 2.08 CPU) (n=    372)
        ordinary_mod: (0.46 usr + 1.70 sys = 2.16 CPU) (n=     80)
       ordinary_orig: (0.51 usr + 1.64 sys = 2.15 CPU) (n=     66)
        schwartz_mod: (1.54 usr + 0.51 sys = 2.05 CPU) (n=    271)
       schwartz_orig: (1.06 usr + 1.03 sys = 2.09 CPU) (n=    174)
schwartz_orig_assign: (1.20 usr + 0.87 sys = 2.07 CPU) (n=    156)
          sort_names: (2.09 usr + 0.01 sys = 2.10 CPU) (n=3595626)
   sort_names_assign: (2.16 usr + 0.00 sys = 2.16 CPU) (n=   5698)

-------------------------------------------------------------------------

Timing for 1000 iterations
Benchmark: timing 1000 iterations of assign, glob, ordinary_mod, ordinary_orig,
    schwartz_mod, schwartz_orig, schwartz_orig_assign, sort_names,
    sort_names_assign ...
              assign:  1 secs ( 0.33 usr +  0.00 sys =  0.33 CPU)
                glob:  6 secs ( 2.31 usr +  3.30 sys =  5.61 CPU)
        ordinary_mod: 28 secs ( 5.57 usr + 21.49 sys = 27.06 CPU)
       ordinary_orig: 34 secs ( 7.86 usr + 24.74 sys = 32.60 CPU)
        schwartz_mod:  8 secs ( 5.12 usr +  2.47 sys =  7.59 CPU)
       schwartz_orig: 12 secs ( 6.63 usr +  5.52 sys = 12.15 CPU)
schwartz_orig_assign: 14 secs ( 7.76 usr +  5.41 sys = 13.17 CPU)
          sort_names:  0 secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
   sort_names_assign:  0 secs ( 0.39 usr +  0.00 sys =  0.39 CPU)
The sort_names result stands out. It ran almost two million times a second. It also doesn’t do anything since it’s in a void context. It runs really fast, and it runs just as fast no matter what I put in the sort() block. A sort() in void context will always be the fastest. The difference between the sort() and the map() in void context is not as pronounced in schwartz_orig and schwartz_orig_assign because only the last map is in void context. Both still have the rightmost map() and the sort() to compute before it can optimize for void context. There is an approximately 10 percent difference in the number of extra iterations the schwartz_orig can go through, so the missing assignment gave it an apparent but unwarranted boost in our original example.
I like to look at the second set of results for the comparisons, and use the wallclock seconds even though they are not as exact as the CPU seconds. Remember that the CPU times are only measuring time spent in the CPU, and that I’m really doing a lot of filesystem stuff here. The CPU times aren’t any more accurate than the wallclock times.
The glob code string took about 6 seconds, and the schwartz_orig_assign code string took 14 seconds. If I subtract those extra 6 seconds from the 14, I get the wallclock time for schwartz_mod, just like I expected. That’s over a third of the time! The ordinary_* times drop 6 seconds, too, but from 34 to 28 seconds, so the percent difference is not as alarming.
I try the same benchmark with more and more files, and I should see the Schwartzian Transform doing even better as the number of files grows. For the rest of the comparisons, I’ll use the actual CPU time since the round-off error is a lot higher now.
- 873 files
I’ll try 873 files because I have a directory with that many files in it. Notice that the glob() still has a significant effect on the times and that the original transform that was in the void context is still shaving off about 10 percent of the real time. The quotient between ordinary_mod and schwartz_mod is 73 / 20 = 3.7, which is a little bit higher than before. That’s much better than the 2.9 I get for ordinary_orig and schwartz_orig:

Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod...
                glob: 14 secs ( 6.28 usr +  8.00 sys = 14.28 CPU)
        ordinary_mod: 73 secs (14.25 usr + 57.05 sys = 71.30 CPU)
       ordinary_orig: 93 secs (20.83 usr + 66.14 sys = 86.97 CPU)
        schwartz_mod: 20 secs (14.06 usr +  5.52 sys = 19.58 CPU)
       schwartz_orig: 32 secs (17.38 usr + 13.59 sys = 30.97 CPU)
schwartz_orig_assign: 34 secs (19.95 usr + 13.60 sys = 33.55 CPU)
- 3162 files
Idle CPUs are wasted CPUs, but I think I’d rather have an idle CPU instead of one doing this benchmark again. My disk was spinning quite a bit as I ran it. The quotient between ordinary_mod and schwartz_mod is 675 / 151 = 4.5, so the Schwartzian Transform is doing even better, but ordinary_orig and schwartz_orig give me the ratio 2.8, less than before, so the incorrect benchmark has the gap closing. That’s not what should be happening!
Look at the huge penalty from the glob()! Now the glob() takes almost as much time as the transform itself. If I stuck with the original solution, students might think that the transform wasn’t so hot:

Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod...
                glob: 148 secs (  31.26 usr + 102.59 sys = 133.85 CPU)
        ordinary_mod: 675 secs (  86.64 usr + 517.19 sys = 603.83 CPU)
       ordinary_orig: 825 secs ( 116.55 usr + 617.62 sys = 734.17 CPU)
        schwartz_mod: 151 secs (  68.88 usr +  67.32 sys = 136.20 CPU)
       schwartz_orig: 297 secs (  89.33 usr + 174.51 sys = 263.84 CPU)
schwartz_orig_assign: 294 secs (  96.68 usr + 168.76 sys = 265.44 CPU)
When a programmer talks about benchmarking, she’s probably talking about speed. After all, that’s what the Benchmark Perl module measures and what most articles on the subject discuss. Time is an easy thing to measure, so it’s understandable, though not necessarily right, that people measure what they can. Sometimes time is not the major constraint; something else, such as memory use, is causing the problem.
The perldebguts documentation says:
There is a saying that to estimate memory usage of Perl, assume a reasonable algorithm for memory allocation, multiply that estimate by 10, and while you still may miss the mark, at least you won’t be quite so astonished.
Perl trades memory for processing speed. Instead of doing a lot of computation, Perl does a lot of lookup. Higher level languages handle memory management so the developer can think more about the task at hand than about getting more memory, releasing memory, or creating memory management bugs.[30]
This ease of use comes at an expense, though. Since I don’t control the memory and Perl doesn’t know what I plan to do ahead of time, Perl has to guess. When Perl needs more memory, it grabs a big chunk of it. If I need memory now, I’ll probably need more later too, so Perl gets more than I need immediately. If I watch the memory use of my program carefully, I’ll see it jump in big amounts, stay that way for a bit, then jump again. Perl doesn’t give memory back to the operating system, either. It needed the memory before, so it might need it again. It tries to reuse the space it doesn’t need anymore, but it’s going to keep what it’s got.
Also, Perl is built on top of C, but it doesn’t have C’s data types. Perl has scalars, arrays, hashes, and a couple of others. Perl doesn’t expose the actual storage to me, so I don’t have to think about it. Not only that, but Perl has to deal with context. Are those data strings, or numbers, or both? Where, in memory, are all of the elements of the array? What other variables does this thing reference?
That’s the long way to say that a number in Perl is more than a number. It’s really a movie star with a big entourage. It may be a 32-bit integer, but it’s really 12 bytes. The Devel::Peek module lets me see what’s going on by letting me inspect the variable’s data structure to see how Perl stores it:
#!/usr/bin/perl
use Devel::Peek;

my $a;
print_message( "Before I do anything" );
Dump( $a );

print_message( "After I assign a string" );
$a = '123456789';
Dump( $a );

print_message( "After I use it as a number" );
$b = $a + 1;
Dump( $a );

sub print_message {
    print STDERR "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n";
    }
The output shows me what Perl is tracking for that scalar at each point in the program. When I first create the variable, it doesn’t have a value. I can see that Perl created the scalar (in internals parlance, the SV, or “scalar value”), it has a reference count of 1, and that it has some flags set. The SV doesn’t have anything in it (that’s the NULL(0x0)), but it has an address, 0x1808248, because the scalar infrastructure is set up and ready to go when I’m ready to give it a value.
When I assign a string to $a, it has more flags set and now has a PV, a “pointer value,” which really means it’s just a string (or char * for you C people). Now the scalar value points to the string data.
When I use this scalar as a number for the first time, Perl has to convert the string to a number. Once it does that, it stores the number value too, turning my scalar into a PVIV, meaning that it has a pointer value and an integer value. Perl sets more flags to indicate that it’s done the conversion and it has both values. Next time it can access the number directly:

--------------------------------------------------
Before I do anything
--------------------------------------------------
SV = NULL(0x0) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY)

--------------------------------------------------
After I assign a string
--------------------------------------------------
SV = PV(0x1800908) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10

--------------------------------------------------
After I use it as a number
--------------------------------------------------
SV = PVIV(0x1800c20) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK)
  IV = 123456789
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10
Just from that I can see that Perl is doing a lot of work. Each Perl variable has some overhead even if it doesn’t have a defined value. That’s okay, because Perl’s variables do a lot more for me in exchange for that overhead.
The Devel::Size module can tell me how much memory my variable takes up. I have to remember, though, that the actual memory is probably a little bit more since Perl has to align the low-level values at the appropriate byte boundaries. It can’t just store a value starting anywhere it likes:

#!/usr/bin/perl
use Devel::Size qw(size);

my $n;
print_message( "Before I do anything" );
print "Size is ", size( \$n );

print_message( "After I assign a string" );
$n = '1';
print "Size is ", size( \$n );

print_message( "After I use it as a number" );
my $m = $n + 1;
print "Size is ", size( \$n );

sub print_message {
    print "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n";
    }
I see that even before I do anything, my scalar $n takes up 12 bytes, at least. When I assign it a string, the size of the scalar is larger, and by more than just the number of characters in the string. Perl tacks on a null byte to terminate the string and might have some extra space allocated in case the string gets bigger. When I use the string as a number, Perl stores the numeric version too, so the scalar gets even larger. Every one of these things can use a bit more memory than I expect:

--------------------------------------------------
Before I do anything
--------------------------------------------------
Size is 12
--------------------------------------------------
After I assign a string
--------------------------------------------------
Size is 26
--------------------------------------------------
After I use it as a number
--------------------------------------------------
Size is 31
What about references, which are also scalars? They only need to know where to find the value, but they don’t store values themselves. They stay the same size even when the values change. The size of a reference doesn’t change. I have to be careful with Devel::Size, though. If I give it a reference, it finds the size of the thing at which the reference points. That’s a good thing, as I’ll show when I try it with arrays or hashes. However, if I have a reference pointing at a reference, the size of that second reference is the size of the thing at which it points, which is just a reference:

#!/usr/bin/perl
use LWP::Simple;
use Devel::Size qw(size);

# watch out! This is 50+ MB big!
my $data = get( "http://www.cpan.org/src/stable.tar.gz" );

print "The size of the data is " , size( $data ), "\n";

my $ref = \$data;
print "The size of the reference is " , size( $ref ), "\n";

my $ref2 = \$ref;
print "The size of the second reference is " , size( $ref2 ), "\n";
The output shows that the second reference is just 16 bytes. It doesn’t include all of the data stored in the ultimate scalar. I’ll show in a moment why I need to know that, but I have to look at Perl’s containers first:
The size of the data is 12829217
The size of the reference is 12829217
The size of the second reference is 16
The situation for Perl’s containers is different. Arrays are collections of scalars, and hashes have scalar keys and scalar values. Those scalars can be the normal variety that hold values or they can be references. The size function from Devel::Size tells us the size of the data structure. Remember, references may point to big values, but they don’t take up that much space themselves:

#!/usr/bin/perl
use Devel::Size qw(size);

my @array = ( 1 ) x 500;

print "The size of the array is ", size( \@array ), "\n";
I can see how much space the array takes up. The Devel::Size documentation is careful to note that this doesn’t count the size of the things in the array, just the size of the array. Notice that the size of my 500-element array is much less than 500 times the 16 bytes my individual scalars used; it only accounts for the array’s own structure:
The size of the array is 2052
That number isn’t counting the contents though. The array takes up the same size no matter what the scalars hold:
#!/usr/bin/perl
use Devel::Size qw(size);

my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";

my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
I created a scalar with 500 characters, and the entire scalar including the overhead takes up 525 bytes. The array takes up the same space as it did previously:
The size of the scalar is 525
The size of the array is 2052
Devel::Size has a fix for this. To get around this, I need to look at each of the scalars in the container and find their sizes. The reference values may point to other containers, which may have more references. My array might look like it’s really small until I try to make a deep copy, store it, or anything else where I have to get all the data in one place, reference or not:

#!/usr/bin/perl
use Devel::Size qw(size total_size);

my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";
print "The total size of the scalar is ", total_size( $data ), "\n";

print "\n";

my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
print "The total size of the array is ", total_size( \@array ), "\n";
Using total_size, the scalar size stays the same, and the array size now includes all the scalar sizes. The number, 264,552, is 500 times 525 (the aggregate size of the scalars) added to 2,052 (the array size):

The size of the scalar is 525
The total size of the scalar is 525

The size of the array is 2052
The total size of the array is 264552
I have to remember what this number actually is, though. It’s just the aggregate size of all the data to which the array eventually points. If I did this for all of my data structures, I do not get the program memory size because those structures might contain references to the same data.
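Here is a small sketch of why that adding-up goes wrong. Two arrays that both hold references to the same large scalar each report a total_size that includes that scalar, so summing the two totals counts the data twice (the variable names here are just illustrative):

#!/usr/bin/perl
use Devel::Size qw(total_size);

my $big    = '-' x 100_000;   # one large scalar
my @first  = ( \$big );       # both arrays refer to the
my @second = ( \$big );       #   same underlying data

# Each total includes the 100,000-plus bytes of $big, so the
# sum of the two totals overstates the real memory in use.
print "First:  ", total_size( \@first  ), "\n";
print "Second: ", total_size( \@second ), "\n";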
The same code can perform differently on different perl binaries, which might differ in their compilation options, compilers used, features included, and so on. For instance, threaded versions of Perl are slightly slower, as are shared library versions. It’s not necessarily bad to be slower if you want the other benefits, but you don’t always get to choose beforehand. For instance, the stock Perl on some Linux distributions is compiled for threads. If you think you might get a speedup with a nonthreaded interpreter, find out before you reconfigure your system!
To compare different perl interpreters, Gisle Aas wrote perlbench. Its perlbench-run program takes the locations of the perl interpreters I want to test, runs a series of benchmarks on each of them, and produces a report. The program normalizes the numbers to the time for the first interpreter I specify:
perlbench-run /usr/local/bin/perl5*
The output first shows the details for each interpreter I’m testing and assigns them letters that correspond to a column in the table that it’s about to output. Especially interesting is the ccflags information. In this run, I’m using a perl-5.8.7 I compiled with -DDEBUGGING_MSTATS, for instance. Also interesting is the compiler information. It looks like I’ve got a pretty old version of gcc. That might be a good or bad thing. Different versions, or even different compilers, do better or worse jobs optimizing the code. These numbers only have relative meaning on the same machine:
A) perl-5.6.1
    version     = 5.006001
    path        = /usr/local/bin/perl5.6.1
    ccflags     = -fno-strict-aliasing -I/usr/local/include
    gccversion  = 2.95.2 19991024 (release)
    optimize    = -O
    usemymalloc = n

B) perl-5.8.0
    version     = 5.008
    path        = /usr/local/bin/perl5.8.0
    ccflags     = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing
    gccversion  = 2.95.2 19991024 (release)
    optimize    = -O
    usemymalloc = n

C) perl-5.8.7
    version     = 5.008007
    path        = /usr/local/bin/perl5.8.7
    ccflags     = -DDEBUGGING_MSTATS
    gccversion  = 2.95.4 20020320 [FreeBSD]
    optimize    = -g
    usemymalloc = y

D) perl-5.8.8
    version     = 5.008008
    path        = /usr/local/bin/perl5.8.8
    ccflags     = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing
    gccversion  = 2.95.4 20020320 [FreeBSD]
    optimize    = -O
    usemymalloc = n
After perlbench-run reports the details of the interpreters, it runs a series of Perl programs with each of them. It measures the time to execute, much like Benchmark’s timethese. Once it tries the program with all of the interpreters, it normalizes the results so that the first interpreter (that’s the one labeled with “A”) is 100. Lower numbers in the other columns mean that interpreter is slower for that test. Higher numbers (they can be above 100) mean that interpreter is faster for that test. The number only has meaning for that test, and I can’t compare them to a different test, even in the same run.
I’ve cut out some of the output from these results, but this chart gives you the flavor of the comparisons. Interpreter C, the one compiled with -DDEBUGGING_MSTATS, is consistently slower than all of the other interpreters. For the other tests, sometimes Perl 5.6.1 is faster and sometimes Perl 5.8.8 is faster. That’s not a general observation since it only applies to the ones I’ve compiled. Overall, it looks as if my Perl 5.6.1 is the fastest. That still doesn’t mean I should choose it, though, because for a slight penalty I get all the nice features of Perl 5.8:

                        A    B    C    D
                      ---  ---  ---  ---
arith/mixed           100   85   73   79
arith/trig            100   87   82   81
array/copy            100   99   81   92
array/foreach         100   93   87   99
array/shift           100  100   94   91
array/sort-num        100   89   92  151
array/sort            100   95   80   94
call/0arg             100  107   79   91
call/1arg             100   92   69   78
call/wantarray        100   95   76   80
hash/copy             100  130   94  124
hash/each             100  119   90  110
hash/foreach-sort     100  103   78  102
loop/for-c            100  102   88  104
loop/for-range-const  100  101   94  106
loop/for-range        100  100   94  104
re/const              100   92   81   88
string/base64         100   86   67   72
string/htmlparser     100   91   75   74
string/tr             100  105   51  111

AVERAGE               100   97   80   91

Results saved in file:///home/brian/perlbench-0.93/benchres-002/index.html
If I have something special to test, I can add my own test files. Most of the infrastructure is already in place. The README from the perlbench distribution gives the basic format of a test file. I create my test and put it in perlbench’s benchmark directory. The distribution gives an example file:

# Name: My name goes here
# Require: 4

require 'benchlib.pl';

# YOUR SETUP CODE HERE
$a = 0;

&runtest(100, <<'ENDTEST');
   # YOUR TESTING CODE HERE
ENDTEST
Benchmarking is a tricky subject. It involves a lot of numbers and requires a good understanding of what’s actually going on. Not only do I have to look at my Perl program, but I should consider other factors, such as my choice in operating system, the Perl interpreter I’m using and how I compiled it, and anything else that interacts with my program. It’s not all about speed, either. I might want to compare the memory use of two approaches, or see which one takes up less bandwidth. Different situations have different constraints. No matter what I’m doing, I need to do my best to find out what’s really going on before I make any conclusions about how to make it better.
The Benchmark module’s documentation provides all of the details of its use. The module comes with Perl, so you should already have it.
In “A Timely Start,” Jean-Louis Leroy finds that his Perl program was slow because of the time it took to find the modules it needed to load: http://www.perl.com/lpt/a/2005/12/21/a_timely_start.html.
In “When Perl Isn’t Quite Fast Enough,” Perl developer Nick Clark talks about why programs, in general, are slow, and which Perl design decisions can make Perl slow. The best part of his talk, which he originally gave at YAPC::EU 2002, is his solutions to speed up his programs. I heard his talk on PerlWhirl 2005, and he related most of his discussion to his work to make Perl’s Unicode handling faster. If you get a chance to see his talk, take it! I think you’ll be entertained as well as educated.
I originally wrote the parts about benchmarking the Schwartzian Transform for Perlmonks in a node titled “Wasting Time Thinking about Wasted Time.” I nicked it almost verbatim from my original post: http://www.perlmonks.org/index.pl?node=393128. I still use that post in Stonehenge’s Perl classes to show that even “experts” can mess up benchmarks.
The second Perl article I ever wrote was “Benchmarking Perl” for The Perl Journal number 11, in which I show some of the other functions in Benchmark: http://www.pair.com/comdog/Articles/benchmark.1_4.txt.
The perlbench distribution wasn’t indexed in CPAN when I wrote this, but you can still find it through CPAN Search: http://search.cpan.org. Check the README file for its documentation.
[27] Sea level isn’t a good benchmark either, because there is really no such thing. Not only do tides affect the height of water, but the oceans tend to tilt against the earth’s rotation. Sea level is actually different around the world because the level of the sea is different.
[28] “A Timely Start” (http://www.perl.com/lpt/a/2005/12/21/a_timely_start.html).
[29] If you don’t have this tool, the Perl Power Tools Project (http://search.cpan.org/dist/ppt/) has a Perl implementation of it, and in a moment I’ll implement my own.
[30] The gory details of Perl’s memory management could take up a whole book. I’ll cover the general idea here and leave it to you to go through perldebguts.