Philipp K. Janert

# NAME

Statistics::KernelEstimation - Kernel Density Estimates and Histograms

# SYNOPSIS

``````  use Statistics::KernelEstimation;

\$s = Statistics::KernelEstimation->new();

for \$x ( @data ) {
}

\$w = \$s->default_bandwidth();
( \$min, \$max ) = \$s->extended_range();
for( \$x=\$min; \$x<=\$max; \$x+=(\$max-\$min)/100 ) {
print \$x, "\t", \$s->pdf( \$x, \$w ), "\t", \$s->cdf( \$x, \$w ), "\n";
}

# Alternatively:
@histo = \$s->histogram( 10 );            # 10 bins
for( @histo ) {
print \$_->{pos}, "\t", \$_->{cnt}, "\n";
}

@cumul = \$s->distribution_function();
for( @cumul ) {
print \$_->{pos}, "\t", \$_->{cnt}, "\n";
}``````

# DESCRIPTION

This modules calculates Kernel Density Estimates and related quantities for a collection of random points.

A Kernel Density Estimate (KDE) is similar to a histogram, but improves on two known problems of histograms: it is smooth (whereas a histogram is ragged) and does not suffer from ambiguity in regards to the placement of bins.

In a KDE, a smooth, strongly peaked function is placed at the location of each point in the collection, and the contributions from all points is summed. The resulting function is a smooth approximation to the probability density from which the set of points was drawn. The smoothness of the resulting curve can be controlled through a bandwidth parameter. (More information can be found in the books listed below.)

This module calculates KDEs as well as Cumulative Density Functions (CDF). Three different kernels are available (Gaussian, Box, Epanechnikov). The module also offers limited support for bandwidth optimization.

Finally, the module can generate "classical" histograms and distribution functions.

## Limitations

This module is intended for small to medium-sized data sets (up to a few hundreds or thousands of points). It is not intended for very large data sets, or for multi-dimensional data sets. (Although nothing prevents applications to huge data sets, performance is likely to be poor.)

# METHODS

## Instantiation

A calculator object must be instantiated before it can be used. The choice of kernel is fixed at instantiation time and can not be changed.

\$s = Statistics::KernelEstimation->new()
\$s = Statistics::KernelEstimation->new_gauss()

Both create a calculator object using the Gaussian kernel:

``  exp(-0.5 ((x-m)/s))**2 )/sqrt(2 pi s**2).``
\$s = Statistics::KernelEstimation->new_box()

Creates a calculator object using the box kernel:

``````  1/s if abs( (x-m)/s ) < 1
0 otherwise``````
\$s = Statistics::KernelEstimation->new_epanechnikov()

Creates a calculator object using the Epanechnikov kernel:

``````  0.75*(1-((x-m)/s)**2)/s if abs( (x-m)/s ) < 1
0 otherwise``````

Add a single data point at position \$x. \$x must be a number.

Add a data point at position \$x with weight \$y. \$y must be a non-negative number. The following two statements are ((almost)) identical:

``````  for( 1..5 ) { \$s->add_data( \$x ); }

Add a data point at position \$x with weight \$y and a bandwidth \$w, which is specific to this data point. \$w must be a positive number. The data specific bandwidth is only meaningful if used with the functions pdf_width_from_data() and cdf_width_from_data() (cf. below).

## Accessing Information

\$n = \$s->count()

Returns the sum of the weights for all data points. If using the default weights (ie, if only using \$s->add_data( \$x ) ), this is equal to the number of data points.

\$max = \$s->range()
( \$min, \$max ) = \$s->range()
\$emax = \$s->extended_range()
( \$emin, \$emax ) = \$s->extended_range()

The \$s->range() function returns the minimum and maximum of all data points. For the box and the Epanechnikov kernel, the \$s->extended_range() function is identical to range(), but for the Gaussian kernel, the extended range is padded on both sides by 5 * default_bandwidth. The extended range is chosen such that the kernel density estimate will have fallen to near zero at the ends of the extended range.

Both functions return an array containing ( min, max ) in array context, and only the max in scalar context.

\$bw = \$s->default_bandwidth()

The "default bandwidth" is the bandwidth that would be optimal if the data set was taken from a Normal distribution. It is equal to sigma * ( 4/(3*n) )**(1/5), where n is the weighted number of data points (as returned by \$s->count()) and sigma is the standard deviation for the set of points.

For most data sets, the default bandwidth is too wide, leading to an oversmoothed kernel estimate.

## Kernel Estimates

Probability density functions (PDF) are normalized, such that the integral over all of space for a pdf equals 1. Cumulative density functions (CDF) are normalized such that as x -> infty, cdf -> 1. A CDF is the integral of a PDF from -infty to x.

\$s->pdf( \$x, \$w )

Calculates the kernel density estimate (probability density function) at position \$x and for kernel bandwidth \$w. If \$w is omitted, the default bandwidth is used.

\$s->cdf( \$x, \$w )

Calculates the cumulative distribution function based on the kernel estimate at position \$x with kernel bandwidth \$w. If \$w is omitted, the default bandwidth is used.

\$s->pdf_width_from_data( \$x )

Calculates the kernel density estimate (probability density function) at position \$x, using the bandwidth that was specified in the call to \$s->add_data( \$x, \$w ). If no bandwidth was specified for a data point, a warning is issued and the default bandwidth is used instead. Any additional arguments (besides \$x) to \$s->pdf_width_from_data( \$x ) are ignored.

\$s->cdf_width_from_data( \$x )

Calculates the cumulative distribution function based on the kernel estimate at position \$x, using the bandwidth that was specified in the call to \$s->add_data( \$x, \$w ). If no bandwidth was specified for a data point, a warning is issued and the default bandwidth is used instead. Any additional arguments (besides \$x) to \$s->cdf_width_from_data( \$x ) are ignored.

## Bandwidth Optimization

This module contains limited support for optimal bandwidth selection. The \$s->optimal_bandwidth() function returns a value for the kernel bandwidth which is optimal in the AMISE (Asymptotic Mean Square Error) sense. Check the literature listed below for details. Bandwidth optimization is only available for the Gaussian kernel.

Bandwidth optimization is an expensive process! Even for moderately sized data sets (a few hundred points), the process can take a while (5-30 seconds), more for larger data sets.

The optimal bandwidth is the solution to a self-consistent equation. This module provides two different algorithms to solve this equation: one fast, the other one guaranteed safe.

The fast algorithm uses the secant method and should be tried first, in particular for larger data sets (hundreds of points). As with all iterative non-linear equation solvers, it is not guaranteed to converge.

The safe algorithm uses the bisection method. With properly chosen end-points, the bisection method is guaranteed to converge. It is slower (by about a factor of 3), compared to the secant method. For smaller data sets (less than hundred points), the difference in speed is imperceptible.

The bandwidth optimization routines in this module use iterative algorithms to solve a non-linear equation. Several parameters are provided which can be used to fine-tune the behavior of these routines in case the defaults are not sufficient to achieve the desired convergence. You can consult any standard reference on numerical analysis for the meaning of these parameters and how to use them. (The chapter on non-linear equations in "Numerical Recipes" by Press, Teukolsky, Vetterling, Flannery is sufficient.)

\$s->optimal_bandwidth()

Finds the optimal bandwidth in an AMISE sense using the secant method. Returns the value for the optimal bandwidth in scalar context. In array context, returns a two element array ( \$bw, \$n ), where the first element is the optimal bandwidth, and the second element is the number of steps required to achieve convergence in the secant method.

\$s->optimal_bandwidth( \$n, \$eps )

The same as \$s->optimal_bandwidth(), but setting explicitly the maximum number of iteration steps in the secant method \$n (default: \$n=25), and the relative final residual \$eps (default: \$eps=1e-3).

\$s->optimal_bandwidth_safe()

Finds the optimal bandwidth in an AMISE sense using the bisection method. Returns the value for the optimal bandwidth in scalar context. In array context, returns a two element array ( \$bw, \$n ), where the first element is the optimal bandwidth, and the second element is the number of steps required to achieve convergence in the bisection method.

\$s->optimal_bandwidth_safe( \$x1, \$x2, \$eps )

The same as \$s->optimal_bandwidth_safe(), but setting explicitly the desired relative accuracy of the result (default: \$eps=1e-3), and the two end-points of the bisection interval (defaults: \$x1=default_bandwidth/count, \$x2=2*default_bandwidth). The endpoints must bracket a root.

## Classical Histograms and Distribution Functions

This module contains basic support for "classical" histograms and distribution functions.

Histograms are normalized in such a way that the sum over all bins equals the return value of \$s->count(). Distribution functions are normalized in such a way that the right-most value equals \$s-count().

\$r = \$s->histogram( \$bins )

Given a number \$bins of bins, this function returns a histogram of the data set as a ref to an array of hashes. Each element in the array is a two element hash: { pos => \$x, cnt => \$y }. Here, the value of the pos element is the position of the center of the bin, whereas the value of the cnt element is the weighted number of points in the bin. (If only the default weight has been used for all points, (ie. \$s->add_data( \$x ) without explicit weight), then the value of the cnt element is the number of data points in this bin.

The returned array is sorted in ascending order of bin positions.

The first bin is centered at the smallest data point, the last bin is centered at the greatest data point. All bins have the same width, namely (\$max-\$min)/(\$bins-1). If \$bins==1, a single bin is returned, centered at the half-way point between \$min and \$max.

There is no support for histograms with variable bin widths, nor for any choice in the placement of bins (flush left/right vs. centered).

\$r = \$s->distribution_function()

Returns the cumulative distribution function as a ref to an array of hashes. Each element in the array is a two element hash: { pos => \$x, cnt => \$y }. Here, the value of the pos element gives the x-coordinate and the value of the cnt element gives the corresponding y-value of the cumulative distribution function.

The returned array is sorted in ascending order of x-coordinates.

The cumulative distribution function is obtained by first sorting all data points according to their location, and then summing their weights from left to right. As a consequence, the number of elements in the array returned from this function equals the number of calls to \$s->add_data( \$x ). If multiple calls have been made to \$s->add_data( \$x ) with the same value of \$x, the distribution function will have multiple entries with the same x-coordinate, but increasing y-value.

For descriptive summary statistics, check out the Statistics::Descriptive module on CPAN.

Good information on Kernel Density Estimation can be found in the following books (in descending order of detail):

Kernel Smoothing

by M.P. Wand and M.C. Jones, Chapman and Hall, 1995

Applied Smoothing Techniques for Data Analysis

by A. W. Bowman and A. Azzalini, Oxford University Press, 1997

All of Statistics

by Larry Wasserman, Springer, 2004

# AUTHOR

Philipp K. Janert, <janert at ieee dot org>, http://www.beyondcode.org