Konstantin S. Uvarin

NAME

Statistics::Descriptive::LogScale - Memory-efficient approximate descriptive statistics class.

VERSION

Version 0.06

SYNOPSIS

The basic usage is roughly the same as that of Statistics::Descriptive::Full.

    use Statistics::Descriptive::LogScale;
    my $stat = Statistics::Descriptive::LogScale->new ();

    while (<>) {
        chomp;
        $stat->add_data($_);
    }

    # This can also be done in O(1) memory, precisely
    printf "Mean: %f +- %f\n", $stat->mean, $stat->standard_deviation;
    # This requires storing actual data, or approximating
    printf "Median: %f\n", $stat->median;

DESCRIPTION

This module aims to provide some advanced statistical functions without storing all data in memory, at the cost of a certain (predictable) loss of precision.

Data is represented by a set of logarithmic buckets that store only counters. Data with an absolute value below a certain threshold (which may be zero) is stored in a special zero counter.

All operations are performed on the buckets, which introduces a relative error that does not, however, exceed the bucket width ("base").
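The bucketing idea described above can be sketched in plain Perl. This is only an illustration under stated assumptions, not the module's actual code: the helper name bucket_center, the rounding to the nearest power of the base, and the threshold value 1e-9 are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: map a value to the
# center of its logarithmic bucket.
my $base        = 10 ** (1/48);   # ratio of adjacent buckets
my $zero_thresh = 1e-9;           # assumed threshold for this sketch

sub bucket_center {
    my ($x) = @_;
    return 0 if abs($x) < $zero_thresh;      # counted in the zero counter
    my $sign  = $x < 0 ? -1 : 1;
    # index of the nearest power of $base on a logarithmic scale
    my $index = sprintf "%.0f", log(abs $x) / log($base);
    return $sign * $base ** $index;
}

# Only one counter per bucket is kept, never the raw values.
my %count;
$count{ bucket_center($_) }++ for 100, 101, 102, 150, 1e-12, -100;

printf "%g => %d\n", $_, $count{$_} for sort { $a <=> $b } keys %count;
```

In this sketch 100, 101 and 102 share one counter, while 1e-12 is counted as zero. Memory use depends on the number of distinct buckets, not on the number of data points.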

METHODS

new( %options )

%options may include:

  • base - ratio of adjacent buckets. Default is 10^(1/48), which gives 5% precision and exact decimal powers.

  • zero_thresh - absolute value threshold below which everything is considered zero.
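The arithmetic behind the default base can be checked with a few lines of plain Perl. This is a hedged sketch: the range 1e-6 .. 1e6 is an arbitrary example, and the variable names are not part of the module.

```perl
use strict;
use warnings;

my $base = 10 ** (1/48);            # default ratio of adjacent buckets

# Relative width of one bucket, in percent (roughly 5%).
my $width_pct = ($base - 1) * 100;

# 48 consecutive buckets multiply up to one decade, so the
# bucket grid repeats at every power of ten.
my $decade = $base ** 48;

# Buckets needed per sign to span 1e-6 .. 1e6 (12 decades),
# regardless of how many data points are added.
my $buckets = sprintf "%.0f", log(1e6 / 1e-6) / log($base);

printf "base = %.5f, width = %.2f%%, base**48 = %.6f, buckets = %d\n",
    $base, $width_pct, $decade, $buckets;
```

A smaller base gives better precision at the cost of more buckets over the same range.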

General statistical methods

These methods are used to query the properties of the distribution. They generally follow the interface of Statistics::Descriptive and related modules, with minor additions.

clear

Destroy all stored data.

add_data( @data )

Add numbers to the data pool.

count

Returns number of data points.

min

max

Values of minimal and maximal buckets.

NOTE: Due to rounding, some of the actual inserted values may fall outside of the min..max range. This may change in the future.

sample_range

Return sample range of the dataset, i.e. max() - min().

sum

Return sum of all data points.

sumsq

Return sum of squares of all data points.

mean

Return mean, which is sum()/count().

variance

Return data variance, i.e. E((x - E(x)) ** 2).

standard_deviation

std_dev

Return standard deviation (square root of variance).
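As a worked example, sum, sumsq, mean, variance and standard deviation can all be derived from per-bucket counters. The sketch below uses a made-up { value => count } hash and follows the formulas exactly as written above; whether the module applies any further normalization (such as n-1) is not assumed here.

```perl
use strict;
use warnings;

# Made-up distribution in { value => count } form.
my %hist = ( 1 => 2, 2 => 5, 4 => 3 );

my ($n, $sum, $sumsq) = (0, 0, 0);
while ( my ($value, $count) = each %hist ) {
    $n     += $count;                  # count
    $sum   += $count * $value;         # sum
    $sumsq += $count * $value ** 2;    # sumsq
}

my $mean     = $sum / $n;                  # sum()/count()
my $variance = $sumsq / $n - $mean ** 2;   # E(x**2) - E(x)**2 = E((x - E(x))**2)
my $std_dev  = sqrt $variance;

printf "n=%d mean=%g variance=%g std_dev=%g\n", $n, $mean, $variance, $std_dev;
```

For this data the mean is 2.4 and the variance is 1.24.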

cdf ($x)

Cumulative distribution function. Returns the estimated probability of a random data point from the sample being less than $x.

As a special case, cdf(0) accounts for half of the zero bucket's count (if any).

Not present in Statistics::Descriptive::Full, but appears in Statistics::Descriptive::Weighted.

percentile( $n )

Find the $n-th percentile, i.e. a value below which $n% of the data lies.

0-th percentile is by definition -inf and is returned as undef (see Statistics::Descriptive).

$n is a real number, not necessarily integer.
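A percentile can be estimated from bucket counters by walking the buckets in ascending order until the requested share of the data has been seen. The following is a minimal sketch with a hypothetical helper percentile_of and made-up data; it does not interpolate inside a bucket, which the real implementation may well do.

```perl
use strict;
use warnings;

# Hypothetical helper: percentile from a { value => count } hash.
sub percentile_of {
    my ($hist, $p) = @_;
    return if $p <= 0;                     # 0-th percentile: undef

    my $total = 0;
    $total += $_ for values %$hist;
    my $need = $total * $p / 100;          # data points that must lie at or below

    my $seen = 0;
    for my $value ( sort { $a <=> $b } keys %$hist ) {
        $seen += $hist->{$value};
        return $value if $seen >= $need;   # first bucket reaching the share
    }
    return;                                # empty sample
}

my %hist = ( 1 => 2, 2 => 5, 4 => 3 );
printf "p%g = %s\n", $_, percentile_of(\%hist, $_) for 20, 50, 99.5;
```

Since $n need not be an integer, a call such as percentile_of(\%hist, 99.5) is valid in this sketch as well.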

quantile( 0..4 )

From Statistics::Descriptive manual:

  0 => zero quartile (Q0) : minimal value
  1 => first quartile (Q1) : lower quartile = lowest cut off (25%) of data = 25th percentile
  2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
  3 => third quartile (Q3) : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
  4 => fourth quartile (Q4) : maximal value

median

Return median of data, a value that divides the sample in half. Same as percentile(50).

trimmed_mean( $ltrim, [ $utrim ] )

Return mean of sample with $ltrim and $utrim fraction of data points removed from lower and upper ends respectively.

$ltrim defaults to 0, and $utrim to $ltrim.

harmonic_mean

Return harmonic mean of the data, i.e. 1/E(1/x).

Return undef if division by zero occurs (see Statistics::Descriptive).

geometric_mean

Return geometric mean of the data, that is, exp(E(log x)).

Dies unless all data points are of the same sign.
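Both the harmonic and the geometric mean follow directly from the definitions above. The sketch below computes them over a made-up { value => count } hash of positive values; it does not reproduce the module's error handling (undef on division by zero, dying on mixed signs).

```perl
use strict;
use warnings;

my %hist = ( 1 => 2, 2 => 5, 4 => 3 );   # { value => count }, made-up data

my ($n, $sum_inv, $sum_log) = (0, 0, 0);
for my $value (keys %hist) {
    my $count = $hist{$value};
    $n       += $count;
    $sum_inv += $count / $value;         # would divide by zero for a 0 value
    $sum_log += $count * log($value);    # defined for positive values only
}

my $harmonic  = $n / $sum_inv;           # 1 / E(1/x)
my $geometric = exp( $sum_log / $n );    # exp(E(log x))

printf "harmonic = %g, geometric = %g\n", $harmonic, $geometric;
```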

skewness

Return skewness of the distribution, calculated as n/((n-1)(n-2)) * E((x-E(x))**3)/std_dev**3 (this is consistent with Excel).

kurtosis

Return kurtosis of the distribution, that is, the 4th standardized moment minus 3. The exact formula used here is consistent with that of Excel and Statistics::Descriptive.

central_moment( $n )

Return $n-th central moment, that is, E((x - E(x))**$n).

Not present in Statistics::Descriptive::Full.

std_moment( $n )

Return $n-th standardized moment, that is, E((x - E(x))**$n) / std_dev(x)**$n.

Not present in Statistics::Descriptive::Full.

mode

The mode of a distribution is the most common value for a discrete distribution, or the maximum of the probability density for a continuous one. We assume the distribution IS continuous, as we're already approximating.

We therefore estimate the probability density by smoothing hit counts over the nearest nonempty intervals to stabilize it a little.

NOTE: A better algorithm is wanted. Experimental.

NOTE: Testing shows that mode is fairly unstable around zero, e.g. a normal distribution (10, 10) returns a mode close to 0.

frequency_distribution_ref( \@index )

frequency_distribution_ref( $n )

frequency_distribution_ref

Return, as a hashref, the counts of data points below each number in @index.

If a number is given instead of an arrayref, @index is created by dividing [min, max] into $n intervals.

If no parameters are given, return the previous result, if any.

Specific methods

The following methods only apply to this module, or are experimental.

bucket_width

Get bucket width (relative to center of bucket). Percentiles are off by no more than half of this.

zero_threshold

Get zero threshold. Numbers with absolute value below this are considered zeroes.

add_data_hash ( { value => weight, ... } )

Add values with weights. This can be used to import data from another Statistics::Descriptive::LogScale object.

get_data_hash

Return distribution hashref {value => number of occurrences}.

This is the inverse of add_data_hash.

scale_sample( $scale )

Multiply all buckets' counts by given value. This can be used to adjust significance of previous data before adding new data (e.g. gradually "forgetting" past data in a long-running application).
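The "forgetting" pattern can be illustrated on a bare { value => count } hash. This is a sketch only: scale_counts is a hypothetical stand-in for the idea behind scale_sample, and the data and the factor 0.5 are made up.

```perl
use strict;
use warnings;

my %hist = ( 10 => 100, 20 => 50 );   # { value => count }, made-up data

# Hypothetical helper mirroring the idea of scale_sample().
sub scale_counts {
    my ($hist, $scale) = @_;
    $_ *= $scale for values %$hist;   # values() aliases the stored counts
    return $hist;
}

# One "tick": halve the weight of everything seen so far, then add new data.
scale_counts(\%hist, 0.5);
$hist{10} += 10;

printf "%s => %g\n", $_, $hist{$_} for sort { $a <=> $b } keys %hist;
```

After the tick the counters are 60 and 25, so the ten new points weigh as much as twenty old ones. Repeating this at regular intervals yields an exponentially decaying window.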

mean_of( $code, [$min, $max] )

Return expectation of $code over sample within given range.

$code is expected to be a pure function (i.e. depending only on its input value, and having no side effects).

The underlying integration mechanism evaluates $code only once per bucket, so $code should be stable, i.e. it should not vary wildly over small intervals.
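The once-per-bucket evaluation can be sketched as a weighted average over a { value => count } hash. The helper name expectation and the data are hypothetical. Two simplifications are assumptions of this sketch rather than documented behaviour: buckets are either fully inside or fully outside the range (no interpolation at the edges), and the result is normalized by the number of points inside the range.

```perl
use strict;
use warnings;

# Hypothetical helper: expectation of $code over a { value => count } hash.
sub expectation {
    my ($hist, $code, $min, $max) = @_;
    my ($weight, $acc) = (0, 0);
    for my $value (keys %$hist) {
        next if defined $min && $value < $min;
        next if defined $max && $value > $max;
        my $count = $hist->{$value};
        $weight += $count;
        $acc    += $count * $code->($value);   # one call per bucket, not per point
    }
    return $weight ? $acc / $weight : undef;
}

my %hist   = ( 1 => 2, 2 => 5, 4 => 3 );
my $square = sub { $_[0] ** 2 };               # pure function of its argument

printf "E(x**2) = %g\n",           expectation(\%hist, $square);
printf "E(x**2) on [2, 4] = %g\n", expectation(\%hist, $square, 2, 4);
```

Because $code is called with one representative value per bucket, a function that changes sharply within the width of a bucket would be sampled too coarsely.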

Experimental methods

These methods may be subject to change in the future, or stay, if they are good.

sum_of ( $code, [ $min, $max ] )

Integrate arbitrary function over the sample within the [ $min, $max ] interval. Default values for both limits are infinities of appropriate sign.

Values in the edge buckets are cut using interpolation if needed.

NOTE: sum_of(sub{1}, $a, $b) would return the approximate number of data points between $a and $b.

EXPERIMENTAL. The method name may change in the future.

histogram ( %options )

Returns array of form [ [ count0_1, x0, x1 ], [count1_2, x1, x2 ], ... ] where countX_Y is number of data points between X and Y.

Options may include:

  • count (+) - number of intervals to divide sample into.

  • index (+) - interval borders as array. Will be sorted before processing.

  • min - ignore values below this. default = min - epsilon.

  • max - ignore values above this. default = max + epsilon.

  • ltrim - ignore this % of values on lower end.

  • rtrim - ignore this % of values on upper end.

Either count or index must be present.

NOTE: this is equivalent to frequency_distribution_ref but better suited for omitting sample tails and outputting pretty pictures.

find_boundaries( %opt )

Return ($min, $max) of part of sample denoted by options.

Options may include:

  • min - ignore values below this. default = min - epsilon.

  • max - ignore values above this. default = max + epsilon.

  • ltrim - ignore this % of values on lower end.

  • rtrim - ignore this % of values on upper end.

If no options are given, the whole sample is guaranteed to reside between returned values.

AUTHOR

Konstantin S. Uvarin, <khedin at gmail.com>

BUGS

The module is currently in alpha stage. There may be bugs.

mode() is unstable around zero, better algorithm wanted.

sum_of() requires more extensive unit testing.

Adding linear interpolation could result in precision gains at a small performance cost.

Please report any bugs or feature requests to bug-statistics-descriptive-logscale at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Descriptive-LogScale. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Descriptive::LogScale

ACKNOWLEDGEMENTS

This module was inspired by a talk that Andrew Aksyonoff, author of the Sphinx search software, gave at the HighLoad++ conference in Moscow in 2012.

Statistics::Descriptive was and is used as reference when in doubt. Several code snippets were shamelessly stolen from there.

LICENSE AND COPYRIGHT

Copyright 2013 Konstantin S. Uvarin.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.