
NAME

Statistics::Distribution::Generator - A way to compose complicated probability functions

VERSION

Version 1.003

SYNOPSIS

  use Statistics::Distribution::Generator qw( :all );
  my $g = gaussian(3, 1);
  say $g; # something almost certainly between -3 and 9, but probably about 2 .. 4-ish
  my $cloud = gaussian(0, 1) x gaussian(0, 1) x gaussian(0, 1);
  say @$cloud; # a 3D vector almost certainly within (+/- 6, +/- 6, +/- 6) and probably within (+/- 2, +/- 2, +/- 2)
  my $combo = gaussian(100, 15) | uniform(0, 200); # one answer with an equal chance of being picked from either distribution

A NOTE ON TEST FAILURES

The test suite for this module is imperfect, precisely because it's hard to predict what random numbers will do, so you may run into unrepeatable test "failures" that are merely the result of something very unlikely happening, rather than something being broken. I can only recommend retrying the test suite / installation process, though if it takes more than a couple of tries, you've probably found a real bug, which should be reported through CPAN RT.

DESCRIPTION

This module allows you to bake together multiple "simple" probability distributions into a more complex random number generator. It does this lazily: when you call one of the PDF generating functions, it makes an object, the value of which is not calculated at creation time, but rather re-calculated each and every time you try to read the value of the object. If you are familiar with Functional Programming, you can think of the exported functions returning functors with their "setup" values curried into them.
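
The "re-calculated on every read" behaviour can be illustrated with Perl's core overload pragma. What follows is only a minimal sketch under stated assumptions, not this module's source: the package name Lazy::Sample and its constructor are hypothetical, invented for the example.

```perl
use strict;
use warnings;

# Hypothetical package, for illustration only (not part of this module).
package Lazy::Sample {
    use overload
        '0+'     => sub { $_[0]{code}->() },   # numeric context: draw a fresh value
        '""'     => sub { $_[0]{code}->() },   # string context: draw a fresh value
        fallback => 1;

    sub new {
        my ($class, $code) = @_;
        return bless { code => $code }, $class;
    }
}

my $calls = 0;
my $lazy  = Lazy::Sample->new(sub { $calls++; return rand() });

# Nothing has been generated so far; each read below runs the callback again.
my $first  = $lazy + 0;
my $second = $lazy + 0;
print "first=$first second=$second calls=$calls\n";
```

Creating the object does no work at all; the callback only runs when the object is used as a number or a string, which is why two reads normally give two different values.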

To this end, two of Perl's operators (x and |) have been overloaded with special semantics.

The x operator composes multiple distributions at once, giving an ARRAYREF of "answers" when interrogated, which is designed primarily to be interpreted as a vector in N-dimensional space (where N is the number of elements in the ARRAYREF).

The | operator composes multiple distributions into a single value, giving a SCALAR "answer" when interrogated. It does this by picking at random between the composed distributions (which may be weighted to give some higher precedence than others).

The first thing to note is that x and | keep their normal Perl precedence and associativity: x binds more tightly than |, and both are left-associative. Parens are therefore strongly advised to make your code more readable. This may be fixed in later versions of this module, by messing about with the B modules, but that would still not make parens a bad idea.
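
To see how Perl groups these two operators, here is a small sketch that records the parse instead of generating numbers. The Shape class is hypothetical and exists only for this demonstration; it is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper class: records how Perl groups overloaded x and |.
package Shape {
    use overload
        'x'      => sub { Shape->new('(' . $_[0]{s} . ' x ' . $_[1]{s} . ')') },
        '|'      => sub { Shape->new('(' . $_[0]{s} . ' | ' . $_[1]{s} . ')') },
        '""'     => sub { $_[0]{s} },
        fallback => 1;

    sub new {
        my ($class, $s) = @_;
        return bless { s => $s }, $class;
    }
}

my ($p, $q, $r) = map { Shape->new($_) } qw( p q r );

my $mixed   = $p | $q x $r;     # x binds tighter than |
my $forced  = ($p | $q) x $r;   # parens override the default grouping
my $chained = $p x $q x $r;     # left-associative

print "$mixed\n";     # (p | (q x r))
print "$forced\n";    # ((p | q) x r)
print "$chained\n";   # ((p x q) x r)
```

The first line is the one most likely to surprise: without parens, the alternative on the left of | is a scalar while the one on the right is a vector.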

The second thing to note is that x and | may be "nested" arbitrarily many levels deep (within the usual memory & CPU limits of your computer, of course). You could, for instance, compose multiple "vectors" of different sizes using x to form each one, and select between them at random with |, e.g.

  my $forwards = gaussian(0, 0.5) x gaussian(3, 1) x gaussian(0, 0.5);
  my $backwards = gaussian(0, 0.5) x gaussian(-3, 1) x gaussian(0, 0.5);
  my $left = gaussian(-3, 1) x gaussian(0, 0.5) x gaussian(0, 0.5);
  my $right = gaussian(3, 1) x gaussian(0, 0.5) x gaussian(0, 0.5);
  my $up = gaussian(0, 0.5) x gaussian(0, 0.5) x gaussian(3, 1);
  my $down = gaussian(0, 0.5) x gaussian(0, 0.5) x gaussian(-3, 1);
  my $direction = $forwards | $backwards | $left | $right | $up | $down;
  $robot->move(@$direction);

You are strongly encouraged to seek further elucidation at Wikipedia or any other available reference site / material.

EXPORTABLE FUNCTIONS

gaussian(MEAN, SIGMA)

Gaussian (Normal) Distribution. This is the classic "bell curve" shape. Numbers close to the MEAN are more likely to be selected, and the value of SIGMA is used to determine how likely more-distant values are. For instance, about 68% of the "answers" will be in the range (MEAN - SIGMA) <= N <= (MEAN + SIGMA), and around 99.7% of the "answers" will be in the range (MEAN - 3 * SIGMA) <= N <= (MEAN + 3 * SIGMA). "Answers" as far away as 6 * SIGMA are roughly a 1 in 500 million long shot.
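
If you want to check the coverage of the one-sigma and three-sigma ranges for yourself, the sketch below does so empirically. It uses the Box-Muller transform, which is one standard way to produce normally-distributed numbers and not necessarily what this module does internally; sketch_gaussian is a hypothetical name.

```perl
use strict;
use warnings;

# Box-Muller transform: turns two uniform samples into one normal sample.
# Illustrative only; not necessarily this module's algorithm.
sub sketch_gaussian {
    my ($mean, $sigma) = @_;
    my $pi = 4 * atan2(1, 1);
    my $u1 = 1 - rand();    # in (0, 1], so log() never sees zero
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * $pi * $u2);
    return $mean + $sigma * $z;
}

my ($mean, $sigma, $n) = (3, 1, 20_000);
my ($within1, $within3) = (0, 0);
for (1 .. $n) {
    my $distance = abs(sketch_gaussian($mean, $sigma) - $mean);
    $within1++ if $distance <= $sigma;
    $within3++ if $distance <= 3 * $sigma;
}
printf "within 1 sigma: %.3f, within 3 sigma: %.3f\n",
    $within1 / $n, $within3 / $n;
```

The printed fractions vary a little from run to run, as you would expect from a random sample.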

uniform(MIN, MAX)

A Uniform Distribution, with equal chance of any N where MIN <= N < MAX. This is equivalent to Perl's standard rand() function, except you supply the MIN and MAX instead of allowing them to fall at 0 and 1 respectively. Any value within the range should be equally likely to be chosen, provided you have a "good" random number generator in your computer.
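
The relationship to rand() can be written out in one line. This is a sketch of the idea only, with sketch_uniform as a hypothetical name.

```perl
use strict;
use warnings;

# Shift and scale Perl's rand() so that MIN <= value < MAX.
sub sketch_uniform {
    my ($min, $max) = @_;
    return $min + rand($max - $min);    # rand(N) yields 0 <= value < N
}

my @samples = map { sketch_uniform(-5, 5) } 1 .. 1000;
my @outside = grep { $_ < -5 || $_ >= 5 } @samples;
printf "%d samples, %d outside the range\n", scalar @samples, scalar @outside;
```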

logistic

The Logistic Distribution is used descriptively in a wide variety of fields from market research to the design of neural networks, and is also known as the hyperbolic secant squared distribution.
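
Since logistic takes no arguments, the sketch below assumes the standard form (location 0, scale 1); that is an assumption on my part, as is the sampling method shown (inverting the cumulative distribution function). The name sketch_logistic is hypothetical.

```perl
use strict;
use warnings;

# Standard logistic variate via the inverse CDF: x = log(u / (1 - u)).
# Assumes location 0 and scale 1.
sub sketch_logistic {
    my $u = rand();
    $u = rand() while $u == 0;    # avoid log(0)
    return log($u / (1 - $u));
}

my $n = 20_000;
my ($sum, $sum_sq) = (0, 0);
for (1 .. $n) {
    my $x = sketch_logistic();
    $sum    += $x;
    $sum_sq += $x * $x;
}
my $mean     = $sum / $n;
my $variance = $sum_sq / $n - $mean ** 2;
printf "mean %.3f, variance %.3f (theory: 0 and pi^2 / 3)\n", $mean, $variance;
```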

supplied(VALUE)
supplied(CALLBACK)

Allows the caller to supply either a constant VALUE which will always be returned as is, or a coderef CALLBACK that may use any algorithm you like to generate a suitable random number. For now, this is the main plugin methodology for this module. The supplied CALLBACK is given no arguments, and SHOULD return a numeric answer. If it returns something non-numeric, you are entirely on your own in how to interpret that, and you are probably doing it wrongly.
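
The "constant or callback" behaviour boils down to a simple dispatch on the argument type. Here is a minimal sketch of that idea; sketch_supplied is a hypothetical stand-in that returns a plain coderef, whereas the real function returns one of this module's objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in: wrap a constant in a closure, pass a coderef through.
sub sketch_supplied {
    my ($thing) = @_;
    return ref($thing) eq 'CODE' ? $thing : sub { $thing };
}

my $constant = sketch_supplied(42);
my $triangle = sketch_supplied(sub { rand() + rand() });    # peaks near 1

my $fixed  = $constant->();    # always 42
my $random = $triangle->();    # somewhere in 0 <= value < 2
print "fixed=$fixed random=$random\n";
```

The callback takes no arguments and returns a number, matching the contract described above.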

gamma(ORDER, SCALE)

The Gamma Distribution function is a generalization of the chi-squared and exponential distributions, and may be given by

  p(x) dx = {1 \over \Gamma(a) b^a} x^{a-1} e^{-x/b} dx
  for x > 0.

The ORDER argument corresponds to what is also known as the "shape parameter" k (written a in the formula above), and the SCALE argument corresponds to the "scale parameter" theta (written b in the formula above).

If k is an integer, the Gamma Distribution is equivalent to the sum of k exponentially-distributed random variables, each of which has a mean of theta.
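
The integer case can be sketched directly from that statement. Note that this is not the implementation used by this module (which also handles non-integer ORDER); the function names are hypothetical.

```perl
use strict;
use warnings;

# One exponential variate with mean $theta, via inverse-transform sampling.
sub sketch_exponential_mean {
    my ($theta) = @_;
    return -$theta * log(1 - rand());    # 1 - rand() is in (0, 1]
}

# Gamma variate for integer order $k only: the sum of $k exponentials.
sub sketch_gamma_int {
    my ($k, $theta) = @_;
    my $sum = 0;
    $sum += sketch_exponential_mean($theta) for 1 .. $k;
    return $sum;
}

my ($k, $theta, $n) = (3, 2, 20_000);
my $total = 0;
$total += sketch_gamma_int($k, $theta) for 1 .. $n;
my $sample_mean = $total / $n;
printf "sample mean %.3f (theory: k * theta = %d)\n", $sample_mean, $k * $theta;
```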

exponential(LAMBDA)

The Exponential Distribution function is often useful when modeling / simulating the time between events in certain types of system. It is also used in reliability theory and the Barometric formula in physics.
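
As an illustration of the "time between events" use, the sketch below simulates arrivals and counts how many fall within a time window. It assumes LAMBDA is the rate parameter (so the mean gap is 1 / LAMBDA), which this documentation does not state explicitly; sketch_exponential is a hypothetical name.

```perl
use strict;
use warnings;

# Exponential variate by inverse-transform sampling.
# Assumption: $lambda is the rate, so the mean gap is 1 / $lambda.
sub sketch_exponential {
    my ($lambda) = @_;
    return -log(1 - rand()) / $lambda;
}

my ($lambda, $horizon) = (2, 5_000);
my ($clock, $events) = (0, 0);
while (1) {
    $clock += sketch_exponential($lambda);    # wait for the next event
    last if $clock > $horizon;
    $events++;
}
my $observed_rate = $events / $horizon;
printf "%d events in %d time units (about %.2f per unit)\n",
    $events, $horizon, $observed_rate;
```

With gaps drawn this way, the observed event rate should settle close to LAMBDA over a long enough window.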

dice(COUNT, SIDES)

The dice distribution mimics what you'd get when rolling COUNT dice, each with SIDES sides (numbered sequentially starting at 1), and summing the result.
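
A minimal sketch of that description, assuming fair dice; sketch_dice is a hypothetical name and not this module's code.

```perl
use strict;
use warnings;

# Roll $count dice with faces 1 .. $sides and return the total.
sub sketch_dice {
    my ($count, $sides) = @_;
    my $total = 0;
    $total += 1 + int(rand($sides)) for 1 .. $count;
    return $total;
}

my %seen;
$seen{ sketch_dice(3, 6) }++ for 1 .. 10_000;
my ($lowest, $highest) = (sort { $a <=> $b } keys %seen)[0, -1];
print "3d6 totals ranged from $lowest to $highest\n";
```

Totals always fall between COUNT and COUNT * SIDES, and the middle values turn up far more often than the extremes.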

OVERLOADED OPERATORS

x

Allows you to compose multi-dimensional random vectors.

  $randvect = $foo x $bar x $baz; # generate a three-dimensional vector

|

Allows you to pick a single (optionally weighted) generator from some set of generators.

  $cointoss = supplied(0) | supplied(1); # fair 50:50 result of either 0 or 1

OBJECT ATTRIBUTES

$distribution->{ weight }

This setting may be used to make |-based selections favor one or more outcomes more (or less) than the remaining outcomes. The default weight for all outcomes is 1. Weights are relative, not absolute, so may be scaled however you need.

  $foo = exponential 1.5;
  $bar = gaussian 20, 1.25;
  $foo->{ weight } = 6;
  $quux = $foo | $bar; # 6:1 chance of picking $foo instead of $bar
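
Relative weighting of this kind is commonly done by rolling against the sum of the weights. The sketch below shows that general technique with placeholder labels instead of generators; it is not taken from this module's source, and sketch_weighted_pick is a hypothetical name.

```perl
use strict;
use warnings;

# Each candidate is [ label, weight ]; the labels are placeholders.
sub sketch_weighted_pick {
    my @candidates = @_;
    my $total = 0;
    $total += $_->[1] for @candidates;
    my $roll = rand($total);
    for my $candidate (@candidates) {
        return $candidate->[0] if $roll < $candidate->[1];
        $roll -= $candidate->[1];
    }
    return $candidates[-1][0];    # guard against floating-point edge cases
}

my $n = 14_000;
my %tally;
$tally{ sketch_weighted_pick([ foo => 6 ], [ bar => 1 ]) }++ for 1 .. $n;
printf "foo: %d, bar: %d\n", $tally{foo} // 0, $tally{bar} // 0;
```

Because only the ratio matters, weights of 6 and 1 behave the same as 60 and 10, or 3 and 0.5.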

AUTHOR

The main body of this work is by Paul W Bennett

The idea of composing probabilities together is inspired by work done by Sooraj Bhat, Ashish Agarwal, Richard Vuduc, and Alexander Gray at Georgia Tech and NYU, published around the end of 2011.

The implementation of the Gamma Distribution is by NWETTERS and is used with permission.

CAVEATS

Almost no error checking is done. Garbage in will result in garbage out.

Although this has finally reached a version >= 1.x, and I will keep the API as backwards-compatible as possible, be aware that the internals are still open for change, and subsequent changes may introduce regression failures.

WISHLIST

If you know statistics and you know how to model distributions not listed herein, you're encouraged to open a CPAN RT ticket describing how to do those probability distribution functions.

ULTRA-WISHLIST

If you know how to take the first k moments (and/or the first l L-moments) of a distribution, and reverse engineer it into a generator, please, please get in touch via CPAN RT. I am painfully aware that this is something sorely missing from this module.

TODO

More PDFs

Build in more probability density functions.

More Tests

Lots of very clever tests. Probably something with Data::IEEE754::Tools and more use of Statistics::Descriptive.

Give it some Acme?

Enhance the dice() distribution to take the same args as Acme::Dice?

LICENSE

Artistic 2.0