
NAME

Random::Skew::Test - Handy means for testing (and fine tuning) Random::Skew.

SYNOPSIS

  use Random::Skew::Test;
  my @results = Random::Skew::Test->sample(
    iter => 2_500_000,
    skew => {
        huge  => 5000,
        mid   => 121,
        teeny => 3,
    },
    grain => [ qw/10 27 293/ ],
    round => [ qw/0 .5/ ],
  );
  print @results;

DESCRIPTION

Tests Random::Skew algorithm and generates printable results. Can be useful for learning which granularity values ($Random::Skew::GRAIN) and what rounding values ($Random::Skew::ROUNDING) are best for your uses.

The sample() method takes these parameters:

iter
    iter => 5_000_000,

This integer is how many iterations to run for the test, where Random::Skew returns this many weighted-random items. It's quite fast: you can run 10_000_000 iterations across many configurations in just a few seconds.

skew

This hashref represents your weighted scale of items to return. The values in the skew hash represent how likely the keys are to be returned randomly.

    skew => {
        Ubiquitous => 39_999,
        Mucho      => 1962,
        Sometimes  => 19,
        Unusual    => 4,
    }

grain

This arrayref sets the max size (how many buckets) of the sampling set, which determines how much 'rounding' you might experience. It runs a separate test for each 'grain'.

    grain => [ qw/24 75 159 890/ ]

The idea is, $GRAIN establishes how coarse the buckets are for your set of items. Example: if you have skew values of 40, 30, 20, 10 you can scale those down with perfect fidelity with a grain of 10 buckets (4 tens, 3 tens, 2 tens, 1 ten is represented exactly, proportionally, by 4, 3, 2, 1). With $GRAIN=8 buckets you'd have those items squeezed into 3, 2, 1 with a smaller sub-set for the tiny item, whereas with $GRAIN=13 buckets you'd have 5, 3, 2, 1. In these cases some items will be slightly over-represented and others slightly under-represented due to rounding.
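The bucket arithmetic above can be sketched like this (a hypothetical helper for illustration, not Random::Skew's actual internals):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch (not Random::Skew's actual internals): scale each
# weight down to whole buckets for a given grain. Items that truncate to
# zero buckets would be handled by a recursive sub-set in the real module.
sub bucket_counts {
    my ( $grain, %weights ) = @_;
    my $tot = 0;
    $tot += $_ for values %weights;
    my %buckets;
    for my $item ( keys %weights ) {
        $buckets{$item} = int( $weights{$item} * $grain / $tot );
    }
    return %buckets;
}

# With a grain of 13, the weights 40, 30, 20, 10 become 5, 3, 2, 1 buckets:
my %b = bucket_counts( 13, a => 40, b => 30, c => 20, d => 10 );
print map { "$_ => $b{$_}\n" } sort keys %b;
```

Running it with a grain of 10 instead reproduces the perfect-fidelity 4, 3, 2, 1 case.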

Astonishingly, for $Random::Skew::GRAIN, small values (13, 28, 41, etc.) work amazingly well in many cases, but you could use a ridiculously high number (2500? 50_000?) if you have the RAM and want to give it a try. Take it out for a spin.

You can't have a $GRAIN less than 2, and 2 won't be too useful in most cases. You'll likely want to use values 10 or more. Experiment.

Note that if you have $Random::Skew::ROUNDING greater than zero (it should only be between zero and one), then it's possible you actually wind up with a few more buckets than $GRAIN.
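One plausible way to picture that effect (this arithmetic is an assumption for illustration, not the module's actual code): if the rounding value is added to each scaled weight before truncation, several items can round up at once, pushing the bucket total past $GRAIN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustration only -- assumed arithmetic, not Random::Skew's actual code.
# With a grain of 10 and rounding of 0.5, weights of 45, 35 and 20
# (total 100) scale to 4.5, 3.5 and 2.0; adding the rounding before int()
# yields 5 + 4 + 2 = 11 buckets, one more than the grain.
my ( $grain, $rounding, $tot ) = ( 10, 0.5, 100 );
my $total_buckets = 0;
for my $weight ( 45, 35, 20 ) {
    my $buckets = int( $weight * $grain / $tot + $rounding );
    $total_buckets += $buckets;
    print "weight $weight => $buckets buckets\n";
}
print "total: $total_buckets buckets for a grain of $grain\n";
```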

round
    round => [ qw/0.25 0.5 0.75/ ]

This arrayref specifies various values to try for $Random::Skew::ROUNDING. They can be between 0.0 and 1.0.

For each setting of $Random::Skew::GRAIN it runs $iter tests and generates output. Each test has two sections: structure and results.

EXAMPLE

For the example below, we are using these skew=>{} weights:

    skew => {
        bigone => 500,
        bigtwo => 400,
        smone  => 50,
        smtwo  => 40,
        tiny1  => 5,
        tiny2  => 4,
        nano   => 1,
    }

Here, the total population $tot requested is 500+400+50+40+5+4+1, or 1000.

Data Structure

The sample output below shows three levels of Random::Skew. The large items are 'bigone', 'bigtwo' and 'smone'. For the middle set the items are 'smtwo', 'tiny1' and 'tiny2'. For the third and smallest set, there's only 'nano'.

        Grain=20 (Rounding=+0): -=-=-=-=-=-=-=-=-=-
        bigone  10
        bigtwo  8
        smone   1
        ...and smaller (rand(0..20) < 1):
          smtwo    16
          tiny1    2
          tiny2    1
          ...and smaller (rand(0..20) < 0.4):
                nano      1

Here we see $Random::Skew::GRAIN is 20 and $Random::Skew::ROUNDING is zero. The top set includes 'bigone', 'bigtwo' and 'smone'. Then the indentation indicates there's a smaller set for 'smtwo', 'tiny1' and 'tiny2'. And finally a third set for the smallest items, containing only 'nano'.

Given that the $tot total population is 1000, we $scale everything down by multiplying by 20/1000, or 0.02. In the top-level set, we have 'bigone' with a weight of 500, scaled down to 10 buckets; 'bigtwo' with a weight of 400, scaled down to 8 buckets; and 'smone' with a weight of 50, scaled down to 1 bucket -- the remaining items have weights so minuscule that none of them is big enough to be represented by a whole bucket at this scale. The smaller items are represented by recursion into a "sub-set" of their own with a different, appropriate $scale. When picking a random item, we grab a random floating-point number (between 0.0 and 20.0). When that number is < 1.0 it calls on the middle set, via recursion. Otherwise we quickly return whichever item is in the array at that offset ('bigone' or 'bigtwo' or 'smone').

The middle set is similar, and has its own scale. The items we are working with now are 'smtwo' (40), 'tiny1' (5), 'tiny2' (4) and 'nano' (1), since 'bigone', 'bigtwo' and 'smone' are already handled by the top-level set. For this smaller set, our $tot is 50 instead of 1000, which we $scale down to 20 using a multiplier of 20/50, or 0.4. Item 'smtwo' with a weight of 40 is scaled down to 16 buckets; 'tiny1' with a weight of 5 gets 2 buckets; and 'tiny2' with a weight of 4 gets 1 bucket. At this scale, 'nano' would only be 4/10 of a bucket, which is represented by the "(rand(0..20) < 0.4)" in the output. That is, when the random number (between 0.0 and 20.0) is < 0.4 it calls upon the third level for a random item; when it is between 1.0 and 20.0 it returns 'smtwo' (16 times out of 19), or 'tiny1' (2 times out of 19), or 'tiny2' (1 time out of 19). Did you notice the gap? There's a gap, between 0.4 and 1.0. If the random number falls between 0.4 and 1.0 it picks another random number between 0.0 and 20.0 and tries again.

The third set has the tiniest bits from our original population. In this case it's only 'nano', with a weight of 1. The whole set is just the one item 'nano', with no further recursion needed (there's nothing smaller than 'nano' in our weighted items). So when we get to this point we pick an item at random, which is always 'nano', all day, every day, since there's only one. Note that to get here, the top-level (big pieces) set needs a random number < 1.0 out of 20.0 to reach the middle set, and then the middle-level set needs a random number < 0.4 out of 20.0. That comes to a likelihood of 1/20 x 0.4/20 = 0.001, which exactly matches the weighted proportion of 'nano' (1 out of 1000) in our original population.
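The three-level walk above can be sketched in code. This is a hypothetical rendition of the algorithm as described, not the module's actual source; the set layout mirrors the Grain=20 example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch of the pick logic described above (not the module's
# actual source). Each set is an array of whole buckets plus a cutoff for
# recursing into the next-smaller set; the region between the cutoff and
# 1.0 is the "gap", which triggers a redraw.
my $third  = { bucket => ['nano'],                                     cutoff => 0,   sub => undef   };
my $middle = { bucket => [ ('smtwo') x 16, ('tiny1') x 2, 'tiny2' ],   cutoff => 0.4, sub => $third  };
my $top    = { bucket => [ ('bigone') x 10, ('bigtwo') x 8, 'smone' ], cutoff => 1,   sub => $middle };

sub pick {
    my ($set) = @_;

    # Deepest level: no smaller set, just pick a bucket at random.
    return $set->{bucket}[ int rand @{ $set->{bucket} } ]
        unless $set->{sub};

    while (1) {
        # 19 buckets plus 1 unit for the sub-set/gap => rand(0..20).
        my $r = rand( @{ $set->{bucket} } + 1 );
        return $set->{bucket}[ int($r) - 1 ] if $r >= 1;     # fast path
        return pick( $set->{sub} )           if $r < $set->{cutoff};
        # gap between cutoff and 1.0: redraw and try again
    }
}

my %count;
$count{ pick($top) }++ for 1 .. 100_000;
printf "%-8s %6d\n", $_, $count{$_}
    for sort { $count{$b} <=> $count{$a} } keys %count;
```

Over enough iterations 'bigone' comes back about half the time and 'nano' about once per thousand, matching the original weights.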

If you have a large, varied population and a small $Random::Skew::GRAIN then you could have a structure that's pretty deep. If you have a large $Random::Skew::GRAIN or a small variance in a small population you might have a very shallow structure.

Test Results

The columns displayed show the actual random-skew counts generated, the random-skew scale requested, and their ratio.

        --1000000x:
                        Returned        Requested       Ratio
                        ========        =========       =====
        bigone: 500k (50.025)   500 (   50)     1.0005
        bigtwo: 399k (39.978)   400 (   40)     0.9995
        smone:  50k (5.0166)    50 (    5)      1.0033
        smtwo:  41k (4.1194)    40 (    4)      1.0298
        tiny1:  4982 (0.4982)   5 (  0.5)       0.9964
        tiny2:  2587 (0.2587)   4 (  0.4)       0.6467 <-- low
        nano:   1041 (0.1041)   1 (  0.1)       1.0410

This test ran a million iterations (as shown by "1000000x").

Let's start with the middle column first.

Middle column: Weights REQUESTED. In this example they are 500, 400, 50, 40, 5, 4, 1 (which makes the total population $tot=1000 in this case). 500 is 50% of the total, 400 is 40%, and so on down to 1 being 0.1% of the total.

Left column: Weights RETURNED from actually requesting the randomized items. In this example of 1000000 iterations, we saw 'bigone' returned 50.025% of the time, which is really close to the 50% requested.

Right column: RATIO (returned % / requested %). So in this case, 'bigone' showed up 50.025% of the time; in our specification we requested it 50% of the time, and the ratio of the two is 1.0005 which is close to spot-on.
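The Ratio column can be reproduced with a one-liner (a hypothetical helper that just restates the arithmetic above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper restating the arithmetic above: the Ratio column is
# (returned count / iterations) divided by (requested weight / total weight).
sub ratio {
    my ( $returned, $iters, $weight, $tot ) = @_;
    return ( $returned / $iters ) / ( $weight / $tot );
}

# 'bigone' came back 500_250 times out of 1_000_000 against a requested
# weight of 500 out of a total population of 1000:
printf "%.4f\n", ratio( 500_250, 1_000_000, 500, 1000 );   # 1.0005
```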

Here's the output with the actual counts omitted, only showing the percents:

        bigone: (50.025)   (   50)     1.0005
        bigtwo: (39.978)   (   40)     0.9995
        smone:  (5.0166)   (    5)     1.0033
        smtwo:  (4.1194)   (    4)     1.0298
        tiny1:  (0.4982)   (  0.5)     0.9964
        tiny2:  (0.2587)   (  0.4)     0.6467 <-- low
        nano:   (0.1041)   (  0.1)     1.0410

If the rand() function returns a homogeneous (uniform) spread of values, we expect the values in the third column to be close to 1.0... closer and closer, the more items we request.

For large values of $Random::Skew::GRAIN the items on the big side of any one set will typically be very close to the proportions requested, and the items on the small side can be slightly over- or under-represented. You typically want a small tolerance for variation in the third column -- say, from 0.9 to 1.1 (your tolerance will depend on your requirements).

In our illustration above, 'tiny2' showed up only 0.2587% of the time when we were hoping for 0.4% of the time, which brings the comparison/ratio column to 0.6467. This might be adequate for your requirements, or it may not. If it's critical that each segment of a set be spot-on, it's worth tinkering with $Random::Skew::GRAIN (perhaps SMALLER! no, really! recursion often nails it when a large-grain set won't) and/or $Random::Skew::ROUNDING to get your third-column results closer to 1.0.

SEE ALSO

Random::Skew

AUTHOR

If you find this library useful, I'd like to hear about it. :)

will@serensoft.com

COPYRIGHT AND LICENSE

Copyright (C) 2022 by Will Trillich

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.28.1 or, at your option, any later version of Perl 5 you may have available.
