
NAME

Statistics::Descriptive::Weighted - Module of basic descriptive statistical functions for weighted variates.

SYNOPSIS

  use Statistics::Descriptive::Weighted;

  $stat  = Statistics::Descriptive::Weighted::Full->new();
  
  $stat->add_data([1,2,3,4],[0.1,1,10,100]); ## weights are in second argument
  $mean  = $stat->mean();                    ## weighted mean
  $var   = $stat->variance();                ## weighted sample variance (unbiased estimator)
  $var   = $stat->biased_variance();         ## weighted sample variance (biased)
  
  $stat->add_data([3],[10]);                 ## statistics are updated as variates are added
  $vwt   = $stat->weight(3);                 ## returns 20, the weight of 3
  $wt    = $stat->weight();                  ## returns sum of weights, 121.1
  $ct    = $stat->count(3);                  ## returns 2, the number of times 3 was observed
  $ct    = $stat->count();                   ## returns 5, the total number of observations

  $med   = $stat->median();                  ## weighted sample median
  $mode  = $stat->mode();                    ## returns 4, value with the most weight
  $ptl   = $stat->quantile(.01);             ## returns 3, smallest value with cdf >= 1st %ile 
  $ptl   = $stat->percentile(1);             ## returns about 2.06, obtained by interpolation
  $cdf   = $stat->cdf(3);                    ## returns ECDF of 3   (about 17.4%)
  $cdf   = $stat->cdf(3.5);                  ## returns ECDF of 3.5 (about 17.4%, same as ECDF of 3)
  $sf    = $stat->survival(3);               ## returns complement of ECDF(3)   (about 82.6%)
  $pval  = $stat->rtp(4);                    ## returns right tail probability of 4 (100 / 121.1, about 82.6%)

  $min  = $stat->min();                      ## returns 1, the minimum
  $max  = $stat->max();                      ## returns 4, the maximum

  $unweighted  = Statistics::Descriptive::Full->new();
  $weighted    = Statistics::Descriptive::Weighted::Full->new();

  $unweighted->add_data(1,1,1,1,7,7,7,7);
  $weighted->add_data([1,7],[4,4]);

  $ct = $unweighted->count();                ## returns 8 
  $ct = $weighted->count();                  ## returns 2 

  print "false, variances unequal!\n" unless 
         ( abs($unweighted->variance() - $weighted->variance()) < 1e-12 );
 
  ## the above statement will print; the variances truly are unequal:
  ## the unweighted variance is corrected in terms of sample size,
  ## while the weighted variance is corrected in terms of the sum of
  ## squared weights

  $data = $weighted->get_data();     

  ## the above statement returns a hashref with keys:
  ## 'vars','weights','uniqvars','counts','sumweights','cdfs','rtps','percentiles','order'

  $weighted->print();                 

  ## prints the hashref above with Data::Dumper

DESCRIPTION

This module partially extends the module Statistics::Descriptive to handle weighted variates. Like that module, this module has an object-oriented design and supports two different types of data storage and calculation objects: sparse and full. With the sparse object representation, none of the data is stored and only a few statistical measures are available. Using the full object representation, complete information about the dataset (including order of observation) is retained and additional functions are available.

This module represents numbers in the same way Perl does on your architecture, relying on Perl's own warnings and assertions regarding underflow and overflow errors, division by zero, etc. The constant $Statistics::Descriptive::Tolerance is not used. Caveat programmor.

Variance calculations, however, are designed to avoid numerical problems: "online" (running-sum) approaches are used to avoid catastrophic cancellation and related issues. New in versions 0.4 and up, I have corrected the "variance" and "standard_deviation" functions to use standard definitions. This module now models the same weighted sample variance calculation as, e.g., the "corpcor" package in R. Following the convention of Statistics::Descriptive, "variance" and "standard_deviation" return unbiased "sample" estimators. Also new in v0.4, I now provide "biased_variance" and "biased_standard_deviation" functions to return the biased estimators. Please see below for full definitions.

As in Statistics::Descriptive, the methods (both Sparse and Full) cache values so that subsequent calls with the same arguments are faster.

Be warned that this is not a drop-in replacement for Statistics::Descriptive. The interfaces are different for adding data, and also for retrieving data with get_data. Certain functions from Statistics::Descriptive have been dropped, specifically:

Statistics::Descriptive::Sparse::mindex()
Statistics::Descriptive::Sparse::maxdex()
Statistics::Descriptive::Full::sort_data()
Statistics::Descriptive::Full::presorted()
Statistics::Descriptive::Full::harmonic_mean()
Statistics::Descriptive::Full::geometric_mean()
Statistics::Descriptive::Full::trimmed_mean()
Statistics::Descriptive::Full::frequency_distribution()
Statistics::Descriptive::Full::least_squares_fit()

Calling these functions on Statistics::Descriptive::Weighted objects will cause programs to die with a stack backtrace.

With this module, get_data() returns the data already sorted: data is sorted automatically on insertion.

The main extension and focus of this module was to implement a cumulative distribution function and a right-tail probability function with efficient search performance, even if the data added is already sorted. This is achieved using a partially randomized self-balancing tree to store data. The implementation uses Tree::Treap v. 0.02 written by Andrew Johnson.

METHODS

Sparse Methods

$stat = Statistics::Descriptive::Weighted::Sparse->new();

Create a new sparse statistics object.

$stat->add_data([1,2,3],[11,9,2]);

Adds data to the statistics object. The cached statistical values are updated automatically.

This function expects one or two array references: the first points to the variates and the second to their corresponding weights. The referenced arrays must be of equal length. All weights are expected to be positive; if any weights are not positive, the module will carp (complain to standard error) and the corresponding variates will be skipped.

If the weights array is omitted, all weights for the values added are assumed to be 1.

Variates may be added to a statistics object over multiple calls, and their summaries are calculated "on-line," that is, updated incrementally with each addition.
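
For example, a minimal sketch of both behaviours (default weights of 1 and incremental updating), assuming the module is installed as shown in the synopsis:

  my $stat = Statistics::Descriptive::Weighted::Sparse->new();

  $stat->add_data([1,2,3]);            ## weights omitted, so each weight is 1
  print $stat->weight(), "\n";         ## 3

  $stat->add_data([4,5],[2,2]);        ## summaries are updated "on-line"
  print $stat->count(), "\n";          ## 5
  print $stat->weight(), "\n";         ## 7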

$stat->count();

Returns the number of variates that have been added.

$stat->weight();

Returns the sum of the weights of the variates.

$stat->sum();

Returns the sum of the variates multiplied by their weights.

$stat->mean();

Returns the weighted mean of the data. This is the sum of the weighted data divided by the sum of weights.
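
As a sketch of the definition (with a fresh object), the weighted mean can be checked against a hand calculation:

  my $stat = Statistics::Descriptive::Weighted::Sparse->new();
  $stat->add_data([2,4,10],[1,1,3]);

  ## by hand: (2*1 + 4*1 + 10*3) / (1 + 1 + 3) = 36 / 5 = 7.2
  print $stat->mean(), "\n";           ## prints 7.2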

$stat->variance();

Returns the unbiased weighted sample variance of the data. The "on-line" weighted incremental algorithm for variance is based on D. H. D. West (1979), "Updating Mean and Variance Estimates: An Improved Method", Communications of the ACM, 22(9), 532-535. However, instead of dividing by (n-1) as in that paper, the bias correction used is:

    1 / (1 - (sum_i ((w_i)^2) / (sum_i w_i)^2)),

where w_i is the ith weight. This bias correction factor multiplies the biased estimator of the variance defined below.
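
This correction factor can be computed directly from the weights. The following standalone sketch (not part of the module's interface; the helper name correction_factor is hypothetical) illustrates the definition and its reduction to the familiar n / (n - 1) when all weights are equal:

  ## bias correction factor for a list of weights:
  ##   1 / (1 - sum(w_i^2) / (sum(w_i))^2)
  sub correction_factor {
      my @w = @_;
      my ($sum, $sumsq) = (0, 0);
      for my $wi (@w) {
          $sum   += $wi;
          $sumsq += $wi ** 2;
      }
      return 1 / (1 - $sumsq / $sum ** 2);
  }

  ## with four equal weights this reduces to the familiar n / (n - 1):
  ## correction_factor(1, 1, 1, 1) returns 4/3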

$stat->standard_deviation();

Returns the square root of the unbiased weighted sample variance of the data.

$stat->biased_variance();

Returns the biased weighted sample variance of the data. The same "on-line" weighted incremental algorithm for variance is used. The definition of the biased weighted variance estimator is:

    sum_i (w_i * (x_i - mean_x)^2) / sum_i (w_i),

where w_i is the weight of the ith variate x_i, and mean_x is the weighted mean of the variates. To reproduce the variance calculation of earlier versions of this module, multiply the biased variance by ($stat->count() / ($stat->count() - 1)).
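
For example, a minimal sketch of recovering the pre-0.4 variance from the biased estimator:

  my $n       = $stat->count();
  my $old_var = $stat->biased_variance() * $n / ($n - 1);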

$stat->biased_standard_deviation();

Returns the square root of the biased weighted sample variance of the data.

$stat->min();

Returns the minimum value of the data set.

$stat->max();

Returns the maximum value of the data set.

$stat->sample_range();

Returns the sample range (max - min) of the data set.

Full Methods

Similar to the Sparse Methods above, any Full Method that is called caches the current result so that it doesn't have to be recalculated.

$stat = Statistics::Descriptive::Weighted::Full->new();

Create a new statistics object that inherits from Statistics::Descriptive::Sparse so that it contains all the methods described above.

$stat->add_data([1,2,4,5],[2,2,2,5]);

Adds weighted data to the statistics object. All of the sparse statistical values are updated and cached. Cached values from Full methods are deleted since they are no longer valid.

Note: Calling add_data with an empty array will delete all of your Full method cached values! Cached values for the Sparse methods are not changed.

$stat->mode();

Returns the data value with the most weight. If a data value is observed multiple times, its successive weights are summed before the mode is determined.

$stat->maxweight();

Returns the weight of the mode.
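
For example, a minimal sketch (with a fresh Full object) showing that repeated observations pool their weight before the mode is chosen:

  my $stat = Statistics::Descriptive::Weighted::Full->new();
  $stat->add_data([5,2,5],[3,10,4]);   ## 5 is observed twice, with total weight 7

  print $stat->mode(), "\n";           ## 2, the value carrying the most weight
  print $stat->maxweight(), "\n";      ## 10, the weight of the mode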

$stat->count(10);

Returns the number of observations of a particular data value.

$stat->weight(10);

Returns the total weight of a particular data value.

$x = $stat->cdf(4);

Returns the value of the weighted empirical cumulative distribution function (ECDF) at the given argument.

  • For example, given the 6 measurements:

    -2, 7, 7, 4, 18, -5

    with weights:

    2, 1, 1, 2, 2, 2

    Let F(x) be the ECDF of x, which is defined as the sum of all normalized weights of all observed variates less than or equal to x.

    Then F(-8) = 0, F(-5.0001) = 0, F(-5) = 1/5, F(-4.999) = 1/5, F(7) = 4/5, F(18) = 1, F(239) = 1.

    Note that we can recover the different measured values and how many times each occurred from F(x) -- no information regarding the range in values is lost. Summarizing measurements using histograms, on the other hand, in general loses information about the different values observed, so the ECDF is preferred.

    Using either the ECDF or a histogram, however, we do lose information regarding the order in which the values were observed. Whether this loss is potentially significant will depend on the metric being measured.

(Modified from: pod from Statistics::Descriptive, itself taken from RFC2330 - Framework for IP Performance Metrics, Section 11.3. Defining Statistical Distributions. RFC2330 is available from: http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html.)
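
For instance, the example above can be reproduced with a short sketch (a fresh Full object; the printed values follow from the normalized weights):

  my $stat = Statistics::Descriptive::Weighted::Full->new();
  $stat->add_data([-2,7,7,4,18,-5],[2,1,1,2,2,2]);

  print $stat->cdf(-5), "\n";          ## 0.2, i.e. F(-5) = 1/5
  print $stat->cdf(7),  "\n";          ## 0.8, i.e. F(7)  = 4/5
  print $stat->cdf(18), "\n";          ## 1,   i.e. F(18) = 1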

$x = $stat->survival(8);

Returns the complement of the weighted ECDF, also known as the weighted survival function. The weighted survival function S(x) is the sum of all normalized weights of all observed variates greater than x.

  • For example, given the 6 measurements:

    -2, 7, 7, 4, 18, -5

    with weights:

    2, 1, 1, 2, 2, 2

    Then S(-8) = 1, S(-5.0001) = 1, S(-5) = 4/5, S(-4.999) = 4/5, S(7) = 1/5, S(18) = 0, S(239) = 0.

$x = $stat->rtp(8);

Returns the weighted right tail probability function P(x): the sum of all normalized weights of all observed variates greater than or equal to x. This may be useful for Monte Carlo estimation of P-values.

  • For example, given the 6 measurements:

    -2, 7, 7, 4, 18, -5

    with weights:

    2, 1, 1, 2, 2, 2

    Then P(-8) = 1, P(-5.0001) = 1, P(-5) = 1, P(-4.999) = 4/5, P(7) = 2/5, P(18) = 1/5, P(239) = 0.
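
Continuing the sketch above with the same data and weights, a minimal example contrasting survival() and rtp() at an observed value:

  print $stat->survival(7), "\n";      ## 0.2, weight strictly greater than 7 (only 18)
  print $stat->rtp(7),      "\n";      ## 0.4, weight greater than or equal to 7 (7, 7 and 18)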

$x = $stat->quantile(0.25);

Returns the weighted quantile. This is the inverse of the weighted ECDF function. It is only defined for arguments between 0 and 1 inclusively. If F(x) is the ECDF, then the weighted quantile function G(y) returns the smallest variate x whose weighted ECDF F(x) is greater than or equal to y.

  • For example, given the 6 measurements:

    -2, 7, 7, 4, 18, -5

    with weights:

    2, 1, 1, 2, 2, 2

    Then G(0) = -5, G(0.1) = -5, G(0.2) = -5, G(0.25) = -2, G(0.4) = -2, G(0.8) = 7, G(1) = 18.
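
Continuing the same sketch, the quantile function inverts the ECDF:

  print $stat->quantile(0.25), "\n";   ## -2, the smallest variate x with F(x) >= 0.25
  print $stat->cdf(-2), "\n";          ## 0.4, confirming F(-2) >= 0.25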

$x = $stat->percentile(25);

Returns the weighted percentile. It is only defined for arguments between 0 and 100 inclusively. Unlike the quantile function above, the percentile function performs weighted linear interpolation between variates unless the argument exactly equals the computed percentile of one of the variates.

  • Define p_n to be the percentile of the nth sorted variate, written v_n, like so:

    p_n = 100/S_N * (S_n - (w_n / 2)),

    where S_N is the sum of all weights, S_n is the partial sum of weights up to and including the nth variate, and w_n is the weight of the nth variate.

    Given a percent value 0 <= y <= 100, find an integer k such that:

    p_k <= y <= p_(k+1).

    The percentile function P(y) may now be defined:

    P(y) = v_k + {[(y - p_k) / (p_(k+1) - p_k)] * (v_(k+1) - v_k)}

This definition of weighted percentile was taken from: http://en.wikipedia.org/wiki/Percentile on Dec 15, 2008.
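
The following standalone sketch (not the module's internal code; the helper name weighted_percentile is hypothetical) illustrates the interpolation formula, using the unique variates and their summed weights from the example data above:

  ## hypothetical illustration of the interpolation above;
  ## $vars must be the sorted unique variates, $weights their total weights
  sub weighted_percentile {
      my ($y, $vars, $weights) = @_;
      my $S_N = 0;
      $S_N += $_ for @$weights;

      my @p;
      my $S_n = 0;
      for my $i (0 .. $#$vars) {
          $S_n += $weights->[$i];
          push @p, 100 / $S_N * ($S_n - $weights->[$i] / 2);
      }

      ## arguments outside [p_1, p_n] are clamped here (a simplification)
      return $vars->[0]  if $y <= $p[0];
      return $vars->[-1] if $y >= $p[-1];

      for my $k (0 .. $#p - 1) {
          next unless $p[$k] <= $y && $y <= $p[$k + 1];
          return $vars->[$k]
               + ($y - $p[$k]) / ($p[$k + 1] - $p[$k])
               * ($vars->[$k + 1] - $vars->[$k]);
      }
  }

  ## weighted_percentile(50, [-5,-2,4,7,18], [2,2,2,2,2]) returns 4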

$stat->median();

This is calculated as $stat->percentile(50) and cached as necessary.

$stat->get_data();

Returns a data structure that reconstitutes the original data added to the object, supplemented by some of the distributional summaries. Returns a reference to a hash, with the following keys, each pointing to a reference to an array containing the indicated data.

vars

The observed variates, sorted.

weights

The weights of the variates (in corresponding order to the value of 'vars').

order

The order of addition of the variates (in corresponding order to the value of 'vars').

uniqvars

The uniquely observed variates, sorted.

counts

The numbers of times each variate was observed (in corresponding order to the value of 'uniqvars').

sumweights

The total weight of each unique variate (in corresponding order to the value of 'uniqvars').

cdfs

The cdf of each unique variate (in corresponding order to the value of 'uniqvars').

rtps

The right tail probabilities of each unique variate (in corresponding order to the value of 'uniqvars').

percentiles

The percentiles of each unique variate (see "percentile" above for definition, given in corresponding order to the value of 'uniqvars').
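
For instance, a brief sketch of walking the returned structure using some of the keys listed above:

  my $data = $stat->get_data();

  ## print each unique variate with its total weight and ECDF
  for my $i (0 .. $#{ $data->{'uniqvars'} }) {
      printf "%s\tweight=%s\tcdf=%s\n",
          $data->{'uniqvars'}[$i],
          $data->{'sumweights'}[$i],
          $data->{'cdfs'}[$i];
  }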

$stat->print();

Prints a Data::Dumper dump of the hashref returned by get_data().

REPORTING ERRORS

When reporting errors, please include the following to help me out:

  • Your version of perl. This can be obtained by typing perl -v at the command line.

  • Which versions of Statistics::Descriptive and Statistics::Descriptive::Weighted you're using.

  • Details about what the error is. Try to narrow down the scope of the problem and send me code that I can run to verify and track it down.

NOTES

I use a running sum approach for the bias correction factor. We may write this factor as (1 / (1 - H)),

where

    H = sum_i ((w_i)^2) / (sum_i w_i)^2.

The update I use for the (n+1)th value of H, upon encountering the (n+1)th variate, is:

    H_(n+1) = ((sum_i^n w_i)^2 * H_n + w_(n+1)^2) / (sum_i^(n+1) w_i)^2

together with initial value:

    H_0 = 0.
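
As a sketch (standalone, not the module's internal code), the recurrence applied to the synopsis weights 0.1, 1, 10, 100:

  ## running-sum update of H = sum(w_i^2) / (sum(w_i))^2
  my ($H, $sumw) = (0, 0);
  for my $w (0.1, 1, 10, 100) {
      $H     = ($sumw ** 2 * $H + $w ** 2) / ($sumw + $w) ** 2;
      $sumw += $w;
  }
  ## $H now equals (0.01 + 1 + 100 + 10000) / 111.1**2,
  ## and the bias correction factor is 1 / (1 - $H)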

AUTHOR

David H. Ardell

dhard@cpan.org (or just ask Google).

THANKS

Florent Angly

who contributed bug fixes, added features and tests, and improved installation statistics (Oct 2009).

REFERENCES

COPYRIGHT

Copyright (c) 2008,2009 David H. Ardell.

Copyright (c) 2009 Florent Angly.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

Portions of this code are from Statistics::Descriptive which is under the following copyrights.

Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright (c) 1994,1995 Jason Kastner. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

REVISION HISTORY

v.0.5

October 2009. Fixed installation/test errors. Weights array made optional.

v.0.4

January 2009. Redefinition of variance and standard_deviation to standard definitions; introduction of the biased_variance and biased_standard_deviation functions.

v.0.2-v.0.3

December 2008. Corrections made to installation package.

v.0.1

December 2008. Initial release under perl licensing.