NAME

Statistics::KernelEstimation - Kernel Density Estimates and Histograms

SYNOPSIS

  use Statistics::KernelEstimation;

  $s = Statistics::KernelEstimation->new();

  for $x ( @data ) {
    $s->add_data( $x );
  }

  $w = $s->default_bandwidth();
  ( $min, $max ) = $s->extended_range();
  for( $x=$min; $x<=$max; $x+=($max-$min)/100 ) {
    print $x, "\t", $s->pdf( $x, $w ), "\t", $s->cdf( $x, $w ), "\n";
  }

  # Alternatively:
  $histo = $s->histogram( 10 );            # 10 bins, returned as an array ref
  for( @$histo ) {
    print $_->{pos}, "\t", $_->{cnt}, "\n";
  }

  $cumul = $s->distribution_function();
  for( @$cumul ) {
    print $_->{pos}, "\t", $_->{cnt}, "\n";
  }

DESCRIPTION

This module calculates Kernel Density Estimates and related quantities for a collection of random points.

A Kernel Density Estimate (KDE) is similar to a histogram, but improves on two known problems of histograms: it is smooth (whereas a histogram is ragged) and does not suffer from ambiguity in regards to the placement of bins.

In a KDE, a smooth, strongly peaked function is placed at the location of each point in the collection, and the contributions from all points are summed. The resulting function is a smooth approximation to the probability density from which the set of points was drawn. The smoothness of the resulting curve can be controlled through a bandwidth parameter. (More information can be found in the books listed below.)
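
To make the idea concrete, here is a minimal, self-contained sketch of a Gaussian KDE written by hand. It illustrates the concept only; it is not the module's internal implementation, and the sample points and bandwidth are made up.

  my @data = ( 1.2, 1.9, 2.1, 3.7 );   # made-up sample points
  my $bw   = 0.5;                      # bandwidth: larger values give a smoother curve
  my $pi   = 3.14159265358979;

  my $x   = 2.0;                       # evaluation point
  my $kde = 0;
  for my $m ( @data ) {                # one Gaussian bump centered on each data point
    $kde += exp( -0.5*(($x-$m)/$bw)**2 ) / sqrt( 2*$pi*$bw**2 );
  }
  $kde /= scalar @data;                # normalize so the estimate integrates to 1
  print "$x\t$kde\n";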

This module calculates KDEs as well as cumulative distribution functions (CDF). Three different kernels are available (Gaussian, Box, Epanechnikov). The module also offers limited support for bandwidth optimization.

Finally, the module can generate "classical" histograms and distribution functions.

Limitations

This module is intended for small to medium-sized data sets (up to a few hundred or a few thousand points). It is not intended for very large data sets or for multi-dimensional data sets. (Although nothing prevents applying it to huge data sets, performance is likely to be poor.)

METHODS

Instantiation

A calculator object must be instantiated before it can be used. The choice of kernel is fixed at instantiation time and can not be changed.

$s = Statistics::KernelEstimation->new()
$s = Statistics::KernelEstimation->new_gauss()

Both create a calculator object using the Gaussian kernel:

  exp( -0.5*((x-m)/s)**2 ) / sqrt( 2*pi*s**2 )

$s = Statistics::KernelEstimation->new_box()

Creates a calculator object using the box kernel:

  1/s if abs( (x-m)/s ) < 1
  0 otherwise

$s = Statistics::KernelEstimation->new_epanechnikov()

Creates a calculator object using the Epanechnikov kernel:

  0.75*(1-((x-m)/s)**2)/s if abs( (x-m)/s ) < 1
  0 otherwise
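
For reference, the three kernel formulas above transcribe directly into Perl. This is an illustrative transcription only, not the module's internal code; here $m is the location of a data point and $s the bandwidth.

  my $pi = 3.14159265358979;

  sub gauss_kernel {
    my ( $x, $m, $s ) = @_;
    return exp( -0.5*(($x-$m)/$s)**2 ) / sqrt( 2*$pi*$s**2 );
  }

  sub box_kernel {
    my ( $x, $m, $s ) = @_;
    return abs( ($x-$m)/$s ) < 1 ? 1/$s : 0;
  }

  sub epanechnikov_kernel {
    my ( $x, $m, $s ) = @_;
    my $u = ($x-$m)/$s;
    return abs( $u ) < 1 ? 0.75*(1-$u**2)/$s : 0;
  }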

Adding Data

$s->add_data( $x )

Add a single data point at position $x. $x must be a number.

$s->add_data( $x, $y )

Add a data point at position $x with weight $y. $y must be a non-negative number. The following two statements are (almost) identical:

  for( 1..5 ) { $s->add_data( $x ); }
  $s->add_data( $x, 5 );

$s->add_data( $x, $y, $w )

Add a data point at position $x with weight $y and a bandwidth $w specific to this data point. $w must be a positive number. The data-specific bandwidth is only meaningful if used with the functions pdf_width_from_data() and cdf_width_from_data() (cf. below).
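
A short usage sketch (the values are made up):

  my $s = Statistics::KernelEstimation->new();

  $s->add_data( 2.5 );            # single point, weight defaults to 1
  $s->add_data( 4.0, 3 );         # (almost) the same as adding 4.0 three times
  $s->add_data( 7.5, 1, 0.8 );    # weight 1, point-specific bandwidth 0.8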

Accessing Information

$n = $s->count()

Returns the sum of the weights of all data points. If using the default weights (i.e., if only using $s->add_data( $x )), this is equal to the number of data points.

$max = $s->range()
( $min, $max ) = $s->range()
$emax = $s->extended_range()
( $emin, $emax ) = $s->extended_range()

The $s->range() function returns the minimum and maximum of all data points. For the box and the Epanechnikov kernel, the $s->extended_range() function is identical to range(), but for the Gaussian kernel, the extended range is padded on both sides by 5 * default_bandwidth. The extended range is chosen such that the kernel density estimate will have fallen to near zero at the ends of the extended range.

Both functions return an array containing ( min, max ) in array context, and only the max in scalar context.
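
For example:

  my ( $min, $max )   = $s->range();            # array context: ( min, max )
  my $top             = $s->range();            # scalar context: max only
  my ( $emin, $emax ) = $s->extended_range();   # padded range for the Gaussian kernel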

$bw = $s->default_bandwidth()

The "default bandwidth" is the bandwidth that would be optimal if the data set was taken from a Normal distribution. It is equal to sigma * ( 4/(3*n) )**(1/5), where n is the weighted number of data points (as returned by $s->count()) and sigma is the standard deviation for the set of points.

For most data sets, the default bandwidth is too wide, leading to an oversmoothed kernel estimate.
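
For example, one might start from the default bandwidth and also try a narrower one (a sketch; the sample data and the factor of 0.5 are made up):

  my $s = Statistics::KernelEstimation->new();
  $s->add_data( $_ ) for 1.1, 1.4, 2.0, 2.2, 3.9, 4.1;   # made-up sample

  my $default = $s->default_bandwidth();
  my $narrow  = 0.5 * $default;            # try a narrower bandwidth as well

  for my $x ( 1 .. 4 ) {
    printf "%g\t%g\t%g\n", $x, $s->pdf( $x, $default ), $s->pdf( $x, $narrow );
  }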

Kernel Estimates

Probability density functions (PDF) are normalized such that the integral of a PDF over all of space equals 1. Cumulative distribution functions (CDF) are normalized such that as x -> infty, cdf -> 1. A CDF is the integral of a PDF from -infty to x.
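
A quick numerical check of these normalizations (a sketch using the Gaussian kernel and made-up data):

  my $s = Statistics::KernelEstimation->new_gauss();
  $s->add_data( $_ ) for 0.5, 1.0, 1.5, 3.0;             # made-up sample

  my ( $min, $max ) = $s->extended_range();
  printf "cdf at left edge:  %.6f\n", $s->cdf( $min );   # close to 0
  printf "cdf at right edge: %.6f\n", $s->cdf( $max );   # close to 1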

$s->pdf( $x, $w )

Calculates the kernel density estimate (probability density function) at position $x and for kernel bandwidth $w. If $w is omitted, the default bandwidth is used.

$s->cdf( $x, $w )

Calculates the cumulative distribution function based on the kernel estimate at position $x with kernel bandwidth $w. If $w is omitted, the default bandwidth is used.

$s->pdf_width_from_data( $x )

Calculates the kernel density estimate (probability density function) at position $x, using the bandwidth that was specified in the call to $s->add_data( $x, $y, $w ). If no bandwidth was specified for a data point, a warning is issued and the default bandwidth is used instead. Any additional arguments (besides $x) to $s->pdf_width_from_data( $x ) are ignored.

$s->cdf_width_from_data( $x )

Calculates the cumulative distribution function based on the kernel estimate at position $x, using the bandwidth that was specified in the call to $s->add_data( $x, $y, $w ). If no bandwidth was specified for a data point, a warning is issued and the default bandwidth is used instead. Any additional arguments (besides $x) to $s->cdf_width_from_data( $x ) are ignored.
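
A usage sketch with point-specific bandwidths (all values made up), giving a tight cluster narrow kernels and an isolated point a wide one:

  my $s = Statistics::KernelEstimation->new();

  $s->add_data( 1.0, 1, 0.2 );    # dense region: narrow bandwidth
  $s->add_data( 1.1, 1, 0.2 );
  $s->add_data( 1.2, 1, 0.2 );
  $s->add_data( 5.0, 1, 1.5 );    # isolated point: wide bandwidth

  my ( $min, $max ) = $s->extended_range();
  for( my $x=$min; $x<=$max; $x+=($max-$min)/50 ) {
    print $x, "\t", $s->pdf_width_from_data( $x ), "\n";
  }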

Bandwidth Optimization

This module contains limited support for optimal bandwidth selection. The $s->optimal_bandwidth() function returns a value for the kernel bandwidth which is optimal in the AMISE (Asymptotic Mean Integrated Squared Error) sense. Check the literature listed below for details.

Bandwidth optimization is an expensive process! Even for moderately sized data sets (a few hundred points), it can take a while (5-30 seconds), and longer for larger data sets.

The optimal bandwidth is the solution to a self-consistent equation. This module's implementation solves this equation using an iterative method (secant method). The usual warnings about possible non-convergence apply.

Bandwidth optimization is only available for the Gaussian kernel.

$s->optimal_bandwidth()

Returns the optimal bandwidth in an AMISE sense in scalar context. In array context, returns a two element array ( $bw, $n ), where the first element is the optimal bandwidth, and the second element is the number of steps required to achieve convergence in the secant method.

$s->optimal_bandwidth( $n, $eps )

The same as $s->optimal_bandwidth(), but with the maximum number of iteration steps in the secant method $n (default: $n=25) and the relative final residual $eps (default: $eps=1e-3) set explicitly.
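
A usage sketch (the Gaussian kernel is required; the sample data and the explicit limits in the second call are made up):

  my $s = Statistics::KernelEstimation->new_gauss();
  $s->add_data( $_ ) for 0.8, 1.0, 1.3, 2.1, 2.4, 3.0, 4.2;   # made-up sample

  my ( $bw, $steps ) = $s->optimal_bandwidth();      # array context
  print "optimal bandwidth: $bw (after $steps secant steps)\n";

  my $bw2 = $s->optimal_bandwidth( 50, 1e-4 );       # scalar context, explicit limits
  print $s->pdf( 2.0, $bw2 ), "\n";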

Classical Histograms and Distribution Functions

This module contains basic support for "classical" histograms and distribution functions.

Histograms are normalized in such a way that the sum over all bins equals the return value of $s->count(). Distribution functions are normalized in such a way that the right-most value equals $s->count().

$r = $s->histogram( $bins )

Given a number of bins $bins, this function returns a histogram of the data set as a ref to an array of hashes. Each element in the array is a two-element hash: { pos => $x, cnt => $y }. Here, the value of the pos element is the position of the center of the bin, whereas the value of the cnt element is the weighted number of points in the bin. (If only the default weight has been used for all points, i.e. $s->add_data( $x ) without explicit weight, then the value of the cnt element is the number of data points in this bin.)

The returned array is sorted in ascending order of bin positions.

The first bin is centered at the smallest data point, the last bin is centered at the greatest data point. All bins have the same width, namely ($max-$min)/($bins-1). If $bins==1, a single bin is returned, centered at the half-way point between $min and $max.

There is no support for histograms with variable bin widths, nor for any choice in the placement of bins (flush left/right vs. centered).
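
A quick text rendering of the returned histogram might look like this (a sketch; the bin count and the use of one asterisk per unit of weight are arbitrary choices):

  my $r = $s->histogram( 10 );                     # ref to an array of hashes
  for my $bin ( @$r ) {
    printf "%8.3f | %s\n", $bin->{pos}, '*' x int( $bin->{cnt} );
  }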

$r = $s->distribution_function()

Returns the cumulative distribution function as a ref to an array of hashes. Each element in the array is a two-element hash: { pos => $x, cnt => $y }. Here, the value of the pos element gives the x-coordinate and the value of the cnt element gives the corresponding y-value of the cumulative distribution function.

The returned array is sorted in ascending order of x-coordinates.

The cumulative distribution function is obtained by first sorting all data points according to their location, and then summing their weights from left to right. As a consequence, the number of elements in the array returned from this function equals the number of calls to $s->add_data( $x ). If multiple calls have been made to $s->add_data( $x ) with the same value of $x, the distribution function will have multiple entries with the same x-coordinate, but increasing y-value.
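
As an illustration, an approximate (weighted) median can be read off the returned structure (a sketch):

  my $r    = $s->distribution_function();   # ref to an array of hashes
  my $half = 0.5 * $s->count();

  my $median;
  for my $p ( @$r ) {                        # sorted by position, cnt is cumulative
    if( $p->{cnt} >= $half ) { $median = $p->{pos}; last; }
  }
  print "approximate median: $median\n";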

SEE ALSO

For descriptive summary statistics, check out the Statistics::Descriptive module on CPAN.

Good information on Kernel Density Estimation can be found in the following books (in descending order of detail):

Kernel Smoothing

by M.P. Wand and M.C. Jones, Chapman and Hall, 1995

Applied Smoothing Techniques for Data Analysis

by A. W. Bowman and A. Azzalini, Oxford University Press, 1997

All of Statistics

by Larry Wasserman, Springer, 2004

AUTHOR

Philipp K. Janert, <janert at ieee dot org>, http://www.beyondcode.org

COPYRIGHT AND LICENSE

Copyright (C) 2008 by Philipp K. Janert

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.