NAME
Dumbbench  More reliable benchmarking with the least amount of thinking
SYNOPSIS
Command line interface: (See dumbbench help
)
dumbbench p 0.005  ./testprogram testprogramoption
This will start churning for a while and then prints something like:
Ran 23 iterations of the command.
Rejected 3 samples as outliers.
Rounded run time per iteration (seconds): 9.519e01 +/ 3.7e03 (0.4%)
As a module:
use Dumbbench;
my $bench = Dumbbench>new(
target_rel_precision => 0.005, # seek ~0.5%
initial_runs => 20, # the higher the more reliable
);
$bench>add_instances(
Dumbbench::Instance::Cmd>new(name => 'fork', command => [qw(perl e 'something')]),
Dumbbench::Instance::PerlEval>new(name => 'eval', code => 'for(1..1e7){something}'),
Dumbbench::Instance::PerlSub>new(name => 'sub', code => sub {for(1..1e7){something}}),
);
# (Note: Comparing the run of externals commands with
# evals/subs probably isn't reliable)
$bench>run;
$bench>report;
DESCRIPTION
This module attempts to implement reasonably robust benchmarking with little extra effort and expertise required from the user. That is to say, benchmarking using this module is likely an improvement over
time somecommand to benchmark
or
use Benchmark qw/timethis/;
timethis(1000, 'system("somecommand", ...)');
The module currently works similar to the former command line, except (in layman terms) it will run the command many times, estimate the uncertainty of the result and keep iterating until a certain userdefined precision has been reached. Then, it calculates the resulting uncertainty and goes through some pain to discard bad runs and subtract overhead from the timings. The reported timing includes an uncertainty, so that multiple benchmarks can more easily be compared.
Please note that Dumbbench
works entirely with wallclock time as reported by Time::HiRes
' time()
function.
METHODS
In addition to the methods listed here, there are readonly accessors for all named arguments of the constructor (which are also object attributes).
new
Constructor that takes the following arguments (with defaults):
verbosity => 0, # 0, 1, or 2
target_rel_precision => 0.05, # 5% target precision
target_abs_precision => 0, # no target absolute precision (in s)
intial_runs => 20, # no. of guaranteed initial runs
max_iterations => 10000, # hard max. no of iterations
variability_measure => 'mad', # method for calculating uncertainty
outlier_rejection => 3, # no. of "sigma"s for the outlier rejection
variability_measure
and outlier_rejection
probably make sense after reading HOW IT WORKS
below. Setting outlier_rejection
to 0 will turn it off entirely.
add_instances
Takes one or more instances of subclasses of Dumbbench::Instance as argument. Each of those is one benchmark, really. They are run in sequence and reported separately.
Right now, there are the following Dumbbench::Instance
implementations:
Dumbbench::Instance::Cmd for running/benchmarking external commands.
Dumbbench::Instance::PerlEval for running/benchmarking Perl code in this same process using
eval
.Dumbbench::Instance::PerlSub for running/benchmarking Perl code in this same process using a subroutine reference.
You probably want to pass a name
parameter to the new
method of each Dumbbench::Instance subclasses. That will make clearer in the report
output which line corresponds to which instance passed.
run
Runs the dryrun and benchmark run.
report
Prints a short report about the benchmark results.
instances
Returns a list of all instance objects in this benchmark set. The instance objects each have a result()
and dry_result()
method for accessing the numeric benchmark results.
box_plot
Returns a Dumbbench::BoxPlot instance. Note that you need SOOT installed to use that module, but this does not require it as a prerequisite since it's not a trivial installation.
A Dumbbench::BoxPlot is a nice an easy way to get a graphic chart if you're in the mood instead of getting the same results from report
.
If you don't want to get into the details of Dumbbench::BoxPlot, you can do:
# $bench is your Dumbbench instance
eval { $bench>box_plot>show };
HOW IT WORKS AND WHY IT DOESN'T
Why it doesn't work and why we try regardless
Recall that the goal is to obtain a reliable estimate of the runtime of a certain operation or command. Now, please realize that this is impossible since the runtime of an operation may depend on many things that can change rapidly: Modern CPUs change their frequency dynamically depending on load. CPU caches may be invalidated at odd moments and page faults provide less finegrained distration. Naturally, OS kernels will do weird things just to spite you. It's almost hopeless.
Since people (you, I, everybody!) insist on benchmarking anyway, this is a besteffort at estimating the runtime. Naturally, it includes estimating the uncertainty of the run time. This is extremely important for comparing multiple benchmarks and that is usually the ultimate goal. In order to get an estimate of the expectation value and its uncertainty, we need a model of the underlying distribution:
A model for timing results
Let's take a step back and think about how the runtime of multiple invocations of the same code will be distributed. Having a qualitative idea what the distribution of many (MANY) measurements looks like is extremely important for estimating the expectation value and uncertainty from a sample of few measurements.
In a perfect, deterministic, singletasking computer, we will get N times the exact same timing. In the real world, there are at least a million ways that this assumption is broken on a small scale. For each run, the load of the computer will be slightly different. The content of main memory and CPU caches may differ. All of these small effects will make a given run a tiny bit slower or faster than any other. Thankfully, this is a case where statistics (more precisely the Central Limit Theorem) provides us with the qualitative result: The measurements will be normally distributed (i.e. following a Gaussian distribution) around some expectation value (which happens to be the mean in this case). Good. Unfortunately, benchmarks are more evil than that. In addition to the smallscale effects that smear the result, there are things that (at the given run time of the benchmark) may be large enough to cause a large jump in run time. Assuming these are comparatively rare and typically cause extraordinarily long runtimes (as opposed to extraordinarily low runtimes), we arrive at an overall model of having a central, smoothish normal distribution with a few outliers towards long runtimes.
So in this model, if we perform N
measurements, almost all N
times will be close to the expectation value and a fraction will be significantly higher. This is troublesome because the outliers create a bias in the uncertainty estimation and the asymmetry of the overall distribution will bias a simple calculation of the mean.
What we would like to report to the user is the mean and uncertainty of the main distribution while ignoring the outliers.
A robust estimation of the expectation value
Given the previously discussed model, we estimate the expectation value with the following algorithm:
 1)

Calculate the median of the whole distribution. The median is a fairly robust estimator of the expectation value with respect to outliers (assuming they're comparatively rare).
 2)

Calculate the medianabsolutedeviation from the whole distribution (MAD, see wikipedia). The MAD needs rescaling to become a measure of variability. The MAD will be our initial guess for an uncertainty. Like the median, it is quite robust against outliers.
 3)

We use the median and MAD to remove the tails of our distribution. All timings that deviate from the median by more than
$X
times the MAD are rejected. This measure should cut outliers without introducing much bias both in symmetric and asymmetric source distributions.An alternative would be to use an ordinary truncated mean (that is the mean of all timings while disregarding the
$N
largest and$N
smallest results). But the truncated mean can produce a biased result in asymmetric source distributions. The resulting expectation value would be artificially increased.In summary: Using the median as the initial guess for the expectation value and the MAD as the guess for the variability keeps the bias down in the general case.
 4)

Finally, the use the mean of the truncated distribution as the expectation value and the MAD of the truncated distribution as a measure of variability. To get the uncertainty on the expectation value, we take
MAD / sqrt($N)
where$N
is the number of remaining measurements.
Conclusion
I hope I could convince you that interpreting less sophisticated benchmarks is a dangerous if not futile exercise. The reason this module exists is that not everybody is willing to go through such contortions to arrive at a reliable conclusion, but everybody loves benchmarking. So let's at least get the basics right. Do not compare raw timings of meaningless benchmarks but robust estimates of the run time of meaningless benchmarks instead.
SEE ALSO
Number::WithError does the Gaussian error propagation.
AUTHOR
Steffen Mueller, <smueller@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.