NAME

Statistics::Descriptive::LogScale - Memory-efficient approximate univariate descriptive statistics class.

VERSION

Version 0.10

SYNOPSIS

Basic usage

The basic usage is roughly the same as that of Statistics::Descriptive::Full.

    use Statistics::Descriptive::LogScale;
    my $stat = Statistics::Descriptive::LogScale->new ();

    while(<>) {
        chomp;
        $stat->add_data($_);
    };

    # This can also be done in O(1) memory, precisely
    printf "Mean: %f +- %f\n", $stat->mean, $stat->standard_deviation;
    # This requires storing actual data, or approximating
    printf "25%%  : %f\n", $stat->percentile(25);
    printf "Median: %f\n", $stat->median;
    printf "75%%  : %f\n", $stat->percentile(75);

Save/load

This is not present in Statistics::Descriptive::Full. The save/load interface is designed to be compatible with JSON::XS. However, any other serializer can be used. The TO_JSON method is guaranteed to return an unblessed hashref with enough information to restore the original object.

    use Statistics::Descriptive::LogScale;
    my $stat = Statistics::Descriptive::LogScale->new ();

    # ..... much later
    # Save
    print $fd encoder_of_choice( $stat->TO_JSON )
        or die "Failed to save: $!";

    # ..... and even later
    # Load
    my $plain_hash = decoder_of_choice( $raw_data );
    my $copy_of_stat = Statistics::Descriptive::LogScale->new( %$plain_hash );

    # Import into existing LogScale instance
    # (a separate variable avoids redeclaring $plain_hash in the same scope)
    my $more_hash = decoder_of_choice( $more_raw_data );
    $copy_of_stat->add_data_hash( $more_hash->{data} );

Histograms

Both Statistics::Descriptive::Full and Statistics::Descriptive::LogScale offer a frequency_distribution_ref method for querying data point counts. However, there is also a histogram method for making pretty pictures. Here's a simple text-based histogram. A proper GD example was too long to fit into this margin.

    use strict;
    use warnings;

    use Statistics::Descriptive::LogScale;
    my $stat = Statistics::Descriptive::LogScale->new ();

    # collect/load data ...
    my $re_float = qr([-+]?(?:\d+\.?\d*|\.\d+)(?:[Ee][-+]?\d+)?);
    while (<>) {
        $stat->add_data($_) for m/($re_float)/g;
    };
    die "Empty set"
        unless $stat->count;

    # get data in [ count, lower_bound, upper_bound ] format as arrayref
    my $hist = $stat->histogram( count => 20 );

    # find maximum value to use as a scale factor
    my $scale = $hist->[0][0];
    $scale < $_->[0] and $scale = $_->[0] for @$hist;

    foreach (@$hist) {
        printf "%10f %s\n", $_->[1], '#' x ($_->[0] * 68 / $scale);
    };
    printf "%10f\n", $hist->[-1][2];

DESCRIPTION

This module aims at providing some advanced statistical functions without storing all data in memory, at the cost of certain (predictable) precision loss.

Data is represented by a set of bins that only store counts of fitting data points. Most bins are logarithmic, i.e. the lower end / upper end ratio is constant. However, around zero a linear approximation may be used instead (see the "linear_width" and "linear_thresh" parameters in new()).

All operations are then performed on the bins, introducing relative error which does not, however, exceed the bins' relative width ("base").
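
The idea of logarithmic binning can be sketched in plain Perl. This is an illustration only, not the module's internal code: bin_center is a hypothetical helper, positive values only are handled, and the geometric center of the bin is assumed as its representative value.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Assumed bin ratio (upper bound / lower bound of every bin)
my $base = 10 ** ( 1 / 232 );

# Hypothetical helper: replace a positive value with the geometric
# center of the logarithmic bin it falls into.
sub bin_center {
    my ($x) = @_;
    die "this sketch handles positive values only\n" unless $x > 0;
    # bin number $index covers [ base**$index, base**($index+1) )
    my $index = floor( log($x) / log($base) );
    return $base ** ( $index + 0.5 );
}

my @samples = ( 0.003, 1.5, 42, 98765.4321 );
for my $x (@samples) {
    my $center = bin_center($x);
    printf "%14.6f -> %14.6f  (relative error %.3f%%)\n",
        $x, $center, 100 * abs( $center - $x ) / $x;
}
```

Whatever the magnitude of the input, the relative error stays below base - 1, which is the property the paragraph above relies on.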

METHODS

new( %options )

%options may include:

  • base - ratio of adjacent bins. Default is 10^(1/232), which gives 1% precision and exact decimal powers. This value represents acceptable relative error in analysis results.

    NOTE The actual value may be slightly less than the requested one. This is done to avoid trouble with future rounding in (de)serialization.

  • linear_width - width of linear bins around zero. This value represents precision of incoming data. Default is zero, i.e. we assume that the measurement is precise.

    NOTE Actual value may be less (by no more than a factor of base) so that borders of linear and logarithmic bins fit nicely.

  • linear_thresh - where to switch to linear approximation. If only one of linear_thresh and linear_width is given, the other will be calculated. However, user may want to specify both in some cases.

    NOTE Actual value may be less (by no more than a factor of base) so that borders of linear and logarithmic bins fit nicely.

  • only_linear = 1 (EXPERIMENTAL) - throw away log approximation and become a discrete statistics class with fixed precision. linear_width must be given in this case.

    NOTE This obviously kills memory efficiency, unless one knows beforehand that all values come from a finite pool.

  • data - hashref of { value => weight } for initializing data. Used for cloning. See add_data_hash().

  • zero_thresh - absolute value threshold below which everything is considered zero. DEPRECATED, linear_width and linear_thresh override this if given.
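
The default base can be checked with a few lines of arithmetic. This is a standalone calculation, not a call into the module; $bins_per_decade is a name introduced here for illustration.

```perl
use strict;
use warnings;

# 232 bins per decade, so that every power of 10 is a bin border
my $bins_per_decade = 232;
my $base = 10 ** ( 1 / $bins_per_decade );

printf "base            = %.6f\n",  $base;
printf "relative width  = %.4f%%\n", 100 * ( $base - 1 );
printf "base ** 232     = %.6f\n",  $base ** $bins_per_decade;
```

The relative width comes out just under 1%, and raising the base to the 232nd power gives back 10, which is what "exact decimal powers" refers to.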

General statistical methods

These methods are used to query the distribution properties. They generally follow the interface of Statistics::Descriptive and co, with minor additions.

All methods return undef on empty data set, except for count, sum, sumsq, standard_deviation and variance which all return 0.

NOTE This module caches whatever it calculates very aggressively. Don't hesitate to use statistical functions (except for sum_of/mean_of) more than once. The cache is deleted upon data entry.

clear

Destroy all stored data.

add_data( @data )

Add numbers to the data pool.

Returns self, so that methods can be chained.

If incorrect data is given (i.e. non-numeric, undef), an exception is thrown and only partial data gets inserted. The state of the object is guaranteed to remain consistent in such a case.

NOTE Cache is reset, even if no data was actually inserted.

NOTE It is possible to add infinite values to data pool. The module will try and calculate whatever can still be calculated. However, no portable way of serializing such values is implemented yet.

count

Returns number of data points.

min

max

Values of minimal and maximal bins.

NOTE: Due to rounding, some of the actual inserted values may fall outside of the min..max range. This may change in the future.

sample_range

Return sample range of the dataset, i.e. max() - min().

sum

Return sum of all data points.

sumsq

Return sum of squares of all data points.

mean

Return mean, or average value, i.e. sum()/count().

variance

variance( $correction )

Return data variance, i.e. E((x - E(x)) ** 2).

Bessel's correction (division by n-1 instead of n) is used by default. This may be changed by specifying $correction explicitly.

NOTE The binning strategy used here should also introduce variance bias. This is not yet accounted for.
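
For comparison, exact variance needs only three running totals. The sketch below is plain Perl and does not use the module; variance_from_sums is a hypothetical helper, and treating $correction as the number subtracted from n (1 for Bessel's correction, 0 for none) is an assumption about how such a parameter is usually interpreted.

```perl
use strict;
use warnings;

# Running totals: no individual data points are kept
my ( $n, $sum, $sumsq ) = ( 0, 0, 0 );
for my $x ( 2, 4, 4, 4, 5, 5, 7, 9 ) {
    $n++;
    $sum   += $x;
    $sumsq += $x * $x;
}

# Hypothetical helper: variance from count, sum and sum of squares.
# $correction is assumed to be subtracted from the count (default 1).
sub variance_from_sums {
    my ( $count, $total, $total_sq, $correction ) = @_;
    $correction = 1 unless defined $correction;
    return 0 if $count <= $correction;
    my $mean = $total / $count;
    return ( $total_sq - $count * $mean**2 ) / ( $count - $correction );
}

printf "with Bessel's correction: %f\n", variance_from_sums( $n, $sum, $sumsq );
printf "without correction:       %f\n", variance_from_sums( $n, $sum, $sumsq, 0 );
```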

standard_deviation

standard_deviation( $correction )

std_dev

stdev

Return standard deviation, i.e. square root of variance.

Bessel's correction (division by n-1 instead of n) is used by default. This may be changed by specifying $correction explicitly.

NOTE The binning strategy used here should also introduce variance bias. This is not yet accounted for.

cdf ($x)

Cumulative distribution function. Returns estimated probability of random data point from the sample being less than $x.

As a special case, cdf(0) accounts for half of zeroth bin count (if any).

Not present in Statistics::Descriptive::Full, but appears in Statistics::Descriptive::Weighted.

cdf ($x, $y)

Returns probability of a value being between $x and $y ($x <= $y). This is essentially cdf($y)-cdf($x).

percentile( $n )

Find $n-th percentile, i.e. a value below which lies $n % of the data.

0-th percentile is by definition -inf and is returned as undef (see Statistics::Descriptive).

$n is a real number, not necessarily integer.
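
A percentile can be estimated from bin counts alone by walking the cumulative count and interpolating inside the bin where the target is reached. The following is a sketch under assumptions, not the module's algorithm: the bin table layout, the linear interpolation and the name percentile_from_bins are all introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical bin table: [ lower_bound, upper_bound, count ], sorted
my @bins = (
    [ 1, 2,  10 ],
    [ 2, 4,  30 ],
    [ 4, 8,  40 ],
    [ 8, 16, 20 ],
);

sub percentile_from_bins {
    my ( $p, @table ) = @_;
    my $total = 0;
    $total += $_->[2] for @table;
    return undef if !$total or $p <= 0;

    my $need = $total * $p / 100;    # points that must lie below the answer
    my $seen = 0;
    for my $bin (@table) {
        my ( $lo, $hi, $count ) = @$bin;
        if ( $seen + $count >= $need ) {
            # assume points are spread evenly inside the bin
            return $lo + ( $hi - $lo ) * ( $need - $seen ) / $count;
        }
        $seen += $count;
    }
    return $table[-1][1];
}

printf "%3d%%: %g\n", $_, percentile_from_bins( $_, @bins ) for 25, 50, 75, 100;
```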

quantile( 0..4 )

From Statistics::Descriptive manual:

  0 => zero quartile (Q0) : minimal value
  1 => first quartile (Q1) : lower quartile = lowest cut off (25%) of data = 25th percentile
  2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
  3 => third quartile (Q3) : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
  4 => fourth quartile (Q4) : maximal value

median

Return median of data, a value that divides the sample in half. Same as percentile(50).

trimmed_mean( $ltrim, [ $utrim ] )

Return mean of sample with $ltrim and $utrim fraction of data points removed from lower and upper ends respectively.

$ltrim defaults to 0, and $utrim to $ltrim.
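
On raw data the same quantity can be computed exactly, which may help when checking the approximation. This sketch does not use the module; trimmed_mean_exact is a hypothetical name, and rounding the number of dropped points down is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical reference implementation working on the full data set
sub trimmed_mean_exact {
    my ( $data, $ltrim, $utrim ) = @_;
    $ltrim = 0      unless defined $ltrim;
    $utrim = $ltrim unless defined $utrim;

    my @sorted  = sort { $a <=> $b } @$data;
    my $skip_lo = int( $ltrim * @sorted );    # points dropped from the bottom
    my $skip_hi = int( $utrim * @sorted );    # points dropped from the top
    my @kept    = @sorted[ $skip_lo .. $#sorted - $skip_hi ];

    return undef unless @kept;
    return sum(@kept) / @kept;
}

my @data = ( 1, 2, 3, 4, 5, 6, 7, 100 );
printf "plain mean:   %g\n", trimmed_mean_exact( \@data );
printf "trimmed 25%%:  %g\n", trimmed_mean_exact( \@data, 0.25 );
```

With a quarter of the points removed from each end, the outlier 100 no longer affects the result.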

harmonic_mean

Return harmonic mean of the data, i.e. 1/E(1/x).

Return undef if division by zero occurs (see Statistics::Descriptive).

geometric_mean

Return geometric mean of the data, that is, exp(E(log x)).

Dies unless all data points are of the same sign.
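
Both the harmonic and the geometric mean formulas are short enough to write out directly. The snippet below works on a plain list of positive numbers and does not call the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @data = ( 1, 2, 4, 8 );

# harmonic mean: 1 / E(1/x)
my $harmonic = @data / sum( map { 1 / $_ } @data );

# geometric mean: exp( E(log x) ), positive data only
my $geometric = exp( sum( map { log $_ } @data ) / @data );

printf "harmonic:  %f\n", $harmonic;
printf "geometric: %f\n", $geometric;
```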

skewness

Return skewness of the distribution, calculated as n/((n-1)(n-2)) * E((x-E(x))**3)/std_dev**3 (this is consistent with Excel).

kurtosis

Return kurtosis of the distribution, that is 4-th standardized moment - 3. The exact formula used here is consistent with that of Excel and Statistics::Descriptive.

central_moment( $n )

Return $n-th central moment, that is, E((x - E(x))^$n).

Not present in Statistics::Descriptive::Full.

std_moment( $n )

Return $n-th standardized moment, that is, E((x - E(x))**$n) / std_dev(x)**$n.

Not present in Statistics::Descriptive::Full.
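
The two moment definitions can be illustrated on a plain list. This is a reference sketch, not the module's code; the helper names are hypothetical, and the standard deviation used for scaling is assumed to be the uncorrected one (square root of the second central moment).

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { return sum(@_) / @_ }

# E( (x - E(x)) ** $n )
sub central_moment_exact {
    my ( $n, @x ) = @_;
    my $mean = mean(@x);
    return mean( map { ( $_ - $mean )**$n } @x );
}

# central moment divided by std_dev ** $n (uncorrected std_dev assumed)
sub std_moment_exact {
    my ( $n, @x ) = @_;
    my $sigma = sqrt( central_moment_exact( 2, @x ) );
    return central_moment_exact( $n, @x ) / $sigma**$n;
}

my @data = ( 2, 4, 4, 4, 5, 5, 7, 9 );
printf "central moment %d: %g, standardized: %g\n",
    $_, central_moment_exact( $_, @data ), std_moment_exact( $_, @data )
    for 2 .. 4;
```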

abs_moment( $power, [$offset] )

Return $power-th moment of absolute value, that is, E(|x - offset|^$power).

Default value for offset is E(x). Power may be fractional.

NOTE Experimental. Not present in Statistics::Descriptive::Full.

std_abs_moment( $power, [$offset] )

Returns standardized absolute moment - like above, but scaled down by a factor of standard deviation to the $power-th power.

That is, E(|x - offset|^$power) / E(|x - offset|^2)^($power/2)

Default value for offset is E(x). Power may be fractional.

NOTE Experimental. Not present in Statistics::Descriptive::Full.
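
The absolute moment formulas above can be written out for a plain list as follows. The helper names are hypothetical and the offset is passed explicitly here rather than defaulting to the mean; the module is not used.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# E( |x - offset| ** power )
sub abs_moment_exact {
    my ( $power, $offset, @x ) = @_;
    return sum( map { abs( $_ - $offset )**$power } @x ) / @x;
}

# E( |x - offset| ** power ) / E( |x - offset| ** 2 ) ** ( power / 2 )
sub std_abs_moment_exact {
    my ( $power, $offset, @x ) = @_;
    return abs_moment_exact( $power, $offset, @x )
         / abs_moment_exact( 2, $offset, @x )**( $power / 2 );
}

my @data = ( 1, 2, 3, 4, 5 );
printf "power %.1f: %f (standardized %f)\n",
    $_, abs_moment_exact( $_, 3, @data ), std_abs_moment_exact( $_, 3, @data )
    for 0.5, 1, 2;
```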

mode

Mode of a distribution is the most common value for a discrete distribution, or maximum of probability density for continuous one.

For now we assume that the distribution IS discrete, and return the bin with the biggest hit count.

NOTE A better algorithm is still wanted. Experimental. Behavior may change in the future.

frequency_distribution_ref( \@index )

frequency_distribution_ref( $n )

frequency_distribution_ref

Return counts of data points below each number in @index, as a hashref.

If a number is given instead of arrayref, @index is created by dividing [min, max] into $n intervals.

If no parameters are given, return previous result, if any.

Specific methods

The following methods only apply to this module, or are experimental.

bucket_width

Get bin width (relative to center of bin). Percentiles are off by no more than half of this. DEPRECATED.

log_base

Get upper/lower bound ratio for logarithmic bins. This represents relative precision of sample.

linear_width

Get width of linear buckets. This represents absolute precision of sample.

linear_threshold

Get absolute value threshold below which interpolation is switched to linear.

add_data_hash ( { value => weight, ... } )

Add values with counts/weights. This can be used to import data from another Statistics::Descriptive::LogScale object.

Returns self, so that methods can be chained.

Negative counts are allowed and treated as "forgetting" data. If a bin count goes below zero, such bin is simply discarded. Minus infinity weight is allowed and has the same effect. Data is guaranteed to remain consistent.

If incorrect data is given (i.e. non-numeric, undef, or +infinity), an exception is thrown and nothing changes.

NOTE Cache may be reset, even if no data was actually inserted.

NOTE It is possible to add infinite values to data pool. The module will try and calculate whatever can still be calculated. However, no portable way of serializing such values is implemented yet.
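
The "forgetting" semantics of negative weights can be pictured with an ordinary hash of counts. This is a toy model, not the module's storage: merge_counts is a hypothetical helper, and dropping a bin whose count reaches exactly zero is an assumption made here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the bin table: { value => count }
sub merge_counts {
    my ( $bins, $delta ) = @_;
    my %result = %$bins;
    for my $value ( keys %$delta ) {
        $result{$value} = ( $result{$value} || 0 ) + $delta->{$value};
        # a bin that is no longer positive is discarded
        delete $result{$value} if $result{$value} <= 0;
    }
    return \%result;
}

my $bins = merge_counts(
    { 1 => 5,  10 => 2 },
    { 1 => -2, 10 => -7, 100 => 3 },
);
printf "%s => %s\n", $_, $bins->{$_} for sort { $a <=> $b } keys %$bins;
```

Subtracting more than a bin holds removes the bin instead of leaving a negative count behind.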

get_data_hash( %options )

Return distribution hashref {value => number of occurrences}.

This is the inverse of add_data_hash.

Options may include:

  • min - ignore values below this. (See find_boundaries)

  • max - ignore values above this. (See find_boundaries)

  • ltrim - ignore this % of values on lower end. (See find_boundaries)

  • rtrim - ignore this % of values on upper end. (See find_boundaries)

  • noise_thresh - strip bins with count below this.

TO_JSON()

Return enough data to recreate the whole object as an unblessed hashref.

This routine conforms with JSON::XS, hence the name. Can be called as

    my $str = JSON::XS->new->allow_blessed->convert_blessed->encode( $this );

NOTE This module DOES NOT require JSON::XS or serialize to JSON. It just deals with data. Use JSON::XS, YAML::XS, Data::Dumper or any serializer of choice.

    my $raw_data = $stat->TO_JSON;
    Statistics::Descriptive::LogScale->new( %$raw_data );

Would generate an exact copy of $stat object (provided it's S::D::L and not a subclass).
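
The TO_JSON convention itself can be tried with the core JSON::PP module, which follows the same interface as JSON::XS here. My::Counter below is a hypothetical toy class, used only so that the example runs without this distribution installed.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical toy class following the same save/load convention:
# TO_JSON returns a plain hashref that new() accepts back.
package My::Counter {
    sub new {
        my ( $class, %opt ) = @_;
        return bless { hits => $opt{hits} || 0 }, $class;
    }
    sub TO_JSON {
        my $self = shift;
        return { hits => $self->{hits} };
    }
}

my $codec = JSON::PP->new->canonical->convert_blessed;

# Save: the encoder calls TO_JSON on the blessed object
my $json = $codec->encode( My::Counter->new( hits => 3 ) );
print "$json\n";    # prints {"hits":3}

# Load: feed the decoded plain hash back into the constructor
my $copy = My::Counter->new( %{ $codec->decode($json) } );
print "restored hits: $copy->{hits}\n";
```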

clone( [ %options ] )

Copy constructor - returns copy of an existing object. Cache is not preserved.

Constructor options may be given to override existing data. See new().

Trim options may be given to get partial data. See get_data_hash().

scale_sample( $scale )

Multiply all bins' counts by given value. This can be used to adjust significance of previous data before adding new data (e.g. gradually "forgetting" past data in a long-running application).
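
The gradual "forgetting" pattern looks roughly like this on a plain hash of counts. The hash, the decay factor of 0.5 and the batches are made up for illustration; with the module one would presumably call scale_sample before each add_data instead.

```perl
use strict;
use warnings;

# Toy bin table: value => count
my %bins  = ( 1 => 100, 10 => 40 );
my $decay = 0.5;    # assumed weight given to everything seen so far

for my $batch ( [ 1, 1, 10 ], [ 10, 10 ] ) {
    # scale down the significance of old data ...
    $_ *= $decay for values %bins;
    # ... then add the new batch at full weight
    $bins{$_}++ for @$batch;
}

printf "%2d => %g\n", $_, $bins{$_} for sort { $a <=> $b } keys %bins;
```

After two rounds the original counts contribute only a quarter of their weight, while the latest batch counts in full.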

mean_of( $code, [$min, $max] )

Return expectation of $code over sample within given range.

$code is expected to be a pure function (i.e. depending only on its input value, and having no side effects).

The underlying integration mechanism only calculates $code once per bin, so $code should be stable, i.e. it should not vary wildly over small intervals.

Experimental methods

These methods may be subject to change in the future, or stay, if they are good.

sum_of ( $code, [ $min, $max ] )

Integrate arbitrary function over the sample within the [ $min, $max ] interval. Default values for both limits are infinities of appropriate sign.

Values in the edge bins are cut using interpolation if needed.

NOTE: sum_of(sub{1}, $a, $b) would return rough number of data points between $a and $b.

EXPERIMENTAL. The method name may change in the future.
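
Integration over bins with clipped edge bins can be sketched as follows. This is an illustration under assumptions, not the module's implementation: the bin table layout, the uniform spread of points inside a bin, the evaluation at the midpoint of the clipped bin and the name sum_over_bins are all introduced here.

```perl
use strict;
use warnings;

# Hypothetical bin table: [ lower_bound, upper_bound, count ]
my @bins = ( [ 0, 1, 4 ], [ 1, 2, 6 ], [ 2, 4, 10 ] );

sub sum_over_bins {
    my ( $code, $min, $max ) = @_;
    my $total = 0;
    for my $bin (@bins) {
        my ( $lo, $hi, $count ) = @$bin;
        # clip the bin to [ $min, $max ]; undef means no limit
        my $from = ( defined $min && $min > $lo ) ? $min : $lo;
        my $to   = ( defined $max && $max < $hi ) ? $max : $hi;
        next if $from >= $to;
        # share of the bin's points inside the range (uniform spread assumed)
        my $share = ( $to - $from ) / ( $hi - $lo );
        # $code is evaluated once per bin, at the midpoint of the clipped part
        $total += $count * $share * $code->( ( $from + $to ) / 2 );
    }
    return $total;
}

printf "points in total:        %g\n", sum_over_bins( sub { 1 } );
printf "points within [0.5, 3]: %g\n", sum_over_bins( sub { 1 }, 0.5, 3 );
printf "estimated mean:         %g\n",
    sum_over_bins( sub { $_[0] } ) / sum_over_bins( sub { 1 } );
```

Dividing the integral of a function by the integral of the constant 1 over the same range gives the expectation, which is the relation between sum_of and mean_of described above.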

histogram ( %options )

Returns array of form [ [ count0_1, x0, x1 ], [count1_2, x1, x2 ], ... ] where countX_Y is number of data points between X and Y.

Options may include:

  • count (+) - number of intervals to divide sample into.

  • index (+) - interval borders as array. Will be sorted before processing.

  • min - ignore values below this. Default = $self->min - epsilon.

  • max - ignore values above this. Default = $self->max + epsilon.

  • ltrim - ignore this % of values on lower end.

  • rtrim - ignore this % of values on upper end.

  • normalize_to <nnn> - adjust counts so that max number becomes nnn. This may be useful if one intends to draw pictures.

Either count or index must be present.

NOTE: this is equivalent to frequency_distribution_ref but better suited for omitting sample tails and outputting pretty pictures.

find_boundaries( %opt )

Return ($min, $max) of part of sample denoted by options.

Options may include:

  • min - ignore values below this. default = min() - epsilon.

  • max - ignore values above this. default = max() + epsilon.

  • ltrim - ignore this % of values on lower end.

  • rtrim - ignore this % of values on upper end.

If no options are given, the whole sample is guaranteed to reside between returned values.

format( "printf-like expression", ... )

Returns a summary as requested by format string. Just as with printf and sprintf, a placeholder starts with a %, followed by formatting options and a placeholder letter.

The following placeholders are supported:

  • % - a literal %

  • s, f, g - a normal printf acting on an extra argument. The number of extra arguments MUST match the number of such placeholders, or this function dies.

  • n - count,

  • m - min,

  • M - max,

  • a - mean,

  • d - standard deviation,

  • S - skewness,

  • K - kurtosis,

  • q(x) - x-th quantile (requires argument),

  • p(x) - x-th percentile (requires argument),

  • P(x) - cdf - the inferred cumulative distribution function (x) (requires argument),

  • e(n) - central_moment - central moment of n-th power (requires argument),

  • E(n) - std_moment - standardized moment of n-th power (requires argument),

  • A(n) - abs_moment - absolute moment of n-th power (requires argument).

For example,

    $stat->format( "99%% results lie between %p(0.5) and %p(99.5)" );

Or

    for( my $i = 0; $i < @stats; $i++ ) {
        print $stats[$i]->format( "%s-th average value is %a +- %d", $i );
    };

AUTHOR

Konstantin S. Uvarin, <khedin at gmail.com>

BUGS

The module is currently under development. There may be bugs.

mode() only works for discrete distributions, and simply returns the first bin with largest bin count. A better algorithm is wanted.

sum_of() should have been made a private method. Its signature and/or name may change in the future.

See the TODO file in the distribution package.

Please feel free to post bugs and/or feature requests to github: https://github.com/dallaylaen/perl-Statistics-Descriptive-LogScale/issues/new

Alternatively, you can use CPAN RT via e-mail bug-statistics-descriptive-logscale at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Descriptive-LogScale.

Your contribution is appreciated.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Descriptive::LogScale

ACKNOWLEDGEMENTS

This module was inspired by a talk that Andrew Aksyonoff, author of Sphinx search software, gave at the HighLoad++ conference in Moscow, 2012.

Statistics::Descriptive was and is used as reference when in doubt. Several code snippets were shamelessly stolen from there.

linear_width and linear_threshold parameter names were suggested by CountZero from http://perlmonks.org

LICENSE AND COPYRIGHT

Copyright 2013-2015 Konstantin S. Uvarin.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.