NAME

Statistics::Autocorrelation - Coefficients for any lag, as correlogram, with significance tests

VERSION

Version 0.06

SYNOPSIS

 use Statistics::Autocorrelation 0.06;
 $acorr = Statistics::Autocorrelation->new();
 # lag is any integer from 1 to N-1:
 $coeff = $acorr->coefficient(data => \@data, lag => 1, exact => 0, unbias => 1);
 # or load one or more data arrays, optionally update them, and test each separately:
 $acorr->load(\@data1, \@data2);
 $coeff = $acorr->coeff(index => 0, lag => 1); # default lag => 0

DESCRIPTION

Calculates autocorrelation coefficients for a single series of numerical data, for any valid length of lag.

SUBROUTINES/METHODS

new

 $acorr = Statistics::Autocorrelation->new();

Returns a new class object for accessing its methods. The object ISA Statistics::Data object, so all the methods in that package for loading, adding, saving, dumping, etc., data are available here.

coefficient

 $coeff = $autocorr->coefficient(data => \@data, lag => integer (from 1 to N-1), exact => 0|1, unbias => 1|0, circular => 1|0);
 $coeff = $autocorr->coefficient(lag => 1); # using loaded data, and default args (exact = 0, unbias = 1, circular = 0)

Alias: coeff, acf

Returns the autocorrelation coefficient, the ratio of the autocovariance to variance of a sequence at any particular lag, ranging from -1 to +1, as in Chatfield (1975) and Kendall (1973). Specifically,

$$\rho_k = \frac{\gamma_k}{\sigma^2_k}$$

where k is the lag (see below).

Data can be previously loaded or sent directly here (see Statistics::Data). There must be at least two elements in the data array. A croak will be heard if no data have been loaded or given here.

Options are:

lag

An integer defining how many indices ahead or behind to start correlating the data with itself; that is, how many time-intervals separate one value from another. If lag is greater than or equal to the number of observations, an empty string is returned. If the value of lag is less than zero, the calculation is made with its absolute value, given that

$$\rho_{-k} = \rho_k$$

for all k (so that a coefficient for a lag of -k is equal in magnitude and sign to that for +k). If a value is not given for lag, it is set to the default value of 0.

exact

Boolean value, default = 0. In calculating the autocorrelation coefficient, the convention (as in commercial statistics programs such as SPSS/PASW, published examples of autocorrelation such as those at nist.gov, and texts such as Chatfield (1975) and Box and Jenkins (1976)) is to calculate (1) the sum-of-products for the autocovariance (the numerator term in the autocorrelation coefficient) from the residuals of each observation x, from trial t = 1 (index = 0) to N - k (where k is the lag), relative to the mean of the whole sequence:

$$\gamma_k = \frac{1}{N} \sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})$$

rather than the means for each sub-sequence as lagged, and (2) the sum-of-squares for the variance in the denominator as that of the whole sequence:

$$\sigma^2_k = \frac{1}{N} \sum_{t=1}^{N-k} (x_t - \bar{x})^2$$

instead of using completely pairwise products. This convention assumes that the series is stationary (has no linear or curvilinear trend, no periodicity), and that the number of observations, N, in the sample is "reasonably large". These assumptions and formulations apply by default. If you specify exact => 1, the coefficient is instead calculated as in Kendall (1973) Eq. 3.35, where the sums use not the overall sample mean but the mean of the first N - k elements and the mean of the last N - k elements (elements k + 1 to N):

$$\bar{x}_k = \frac{1}{N-k} \sum_{t=1}^{N-k} x_t, \quad \text{and} \quad \bar{x}'_k = \frac{1}{N-k} \sum_{t=1}^{N-k} x_{t+k}$$

Taking each observation relative to these means, the autocovariance in the numerator, and variance in the denominator, are calculated as follows to give the autocorrelation coefficient:

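$$\rho_k = \frac{\sum_{t=1}^{N-k} (x_t - \bar{x}_k)(x_{t+k} - \bar{x}'_k)}{\left[\sum_{t=1}^{N-k} (x_t - \bar{x}_k)^2\right]^{1/2} \left[\sum_{t=1}^{N-k} (x_{t+k} - \bar{x}'_k)^2\right]^{1/2}}$$
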
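
As a rough illustration of the exact => 1 calculation described above, the following pure-Perl sketch correlates the first N - k elements with the last N - k elements, each taken relative to its own sub-sequence mean. The helper name exact_coeff_sketch is hypothetical; this is not the module's own code, only a minimal rendering of the formula under the definitions given here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of Statistics::Autocorrelation:
# lag-k coefficient using separate means for the two lagged sub-sequences.
sub exact_coeff_sketch {
    my ( $data, $k ) = @_;
    my $n = scalar @{$data};
    $k = abs($k);
    my $m = $n - $k;    # number of (x_t, x_{t+k}) pairs
    return q{} if $m < 2;

    my @head = @{$data}[ 0 .. $m - 1 ];     # x_1 .. x_{N-k}
    my @tail = @{$data}[ $k .. $n - 1 ];    # x_{k+1} .. x_N
    my $mean_head = sum(@head) / $m;
    my $mean_tail = sum(@tail) / $m;

    my ( $sum_prod, $ss_head, $ss_tail ) = ( 0, 0, 0 );
    for my $t ( 0 .. $m - 1 ) {
        my $dev_head = $head[$t] - $mean_head;
        my $dev_tail = $tail[$t] - $mean_tail;
        $sum_prod += $dev_head * $dev_tail;
        $ss_head  += $dev_head**2;
        $ss_tail  += $dev_tail**2;
    }
    return q{} if $ss_head == 0 || $ss_tail == 0;    # constant sub-sequence
    return $sum_prod / sqrt( $ss_head * $ss_tail );
}

printf "lag 1: %.4f\n", exact_coeff_sketch( [ 1, 3, 2, 4 ], 1 );
```

Because each sub-sequence is centred on its own mean, this form is an ordinary product-moment correlation between the series and its lagged copy.
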
unbias

Boolean, default = 1. In calculating the approximate autocovariance, it is conventional to divide the sum-product of residuals (as given above) by N, but some sources divide by N - lag for less biased estimation, so that

$$\gamma_k = \frac{1}{N-k} \sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})$$

For the latter, set unbias => 0. This is only effective where circular => 0 and exact => 0.
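
To make the difference between the two divisors concrete, here is a small pure-Perl sketch of the approximate autocovariance about the whole-series mean. The helper autocov_sketch and its $divide_by_n flag are hypothetical names introduced only for this illustration; they are not the module's interface, and the flag is deliberately not named after the unbias option.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of Statistics::Autocorrelation:
# lag-k autocovariance about the mean of the whole series.
# $divide_by_n true: divide by N; false: divide by N - k.
sub autocov_sketch {
    my ( $data, $k, $divide_by_n ) = @_;
    my $n = scalar @{$data};
    $k = abs($k);
    return q{} if $k >= $n;    # no pairs available at this lag

    my $mean = sum( @{$data} ) / $n;
    my $sum  = 0;
    for my $t ( 0 .. $n - $k - 1 ) {
        $sum += ( $data->[$t] - $mean ) * ( $data->[ $t + $k ] - $mean );
    }
    return $sum / ( $divide_by_n ? $n : $n - $k );
}

my @series = ( 1, 2, 3, 4, 5 );
printf "divide by N:     %.4f\n", autocov_sketch( \@series, 1, 1 );
printf "divide by N - k: %.4f\n", autocov_sketch( \@series, 1, 0 );
```

The two results differ only by the factor N / (N - k), so the gap widens as the lag approaches the length of the series.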

circular

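Boolean value, default = 0. For circularized lagging, in which the series is treated as wrapping around so that values at its end are paired with values at its beginning, set circular => 1.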
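
As an illustration of what circularized lagging usually means, the sketch below lists the zero-based index pairs (t, t + k) that would enter the sum of products with and without wrap-around. It assumes the conventional modulo-N definition of circular lagging; the helper lag_pairs_sketch is hypothetical and is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Autocorrelation:
# zero-based index pairs (t, t + k) used at lag k for a series of length n.
sub lag_pairs_sketch {
    my ( $n, $k, $circular ) = @_;
    # Without wrap-around only the first n - k values have a partner;
    # with wrap-around every value does, found by modulo-n indexing.
    my $last = $circular ? $n - 1 : $n - $k - 1;
    return map { [ $_, ( $_ + $k ) % $n ] } 0 .. $last;
}

for my $circular ( 0, 1 ) {
    my @pairs = lag_pairs_sketch( 5, 2, $circular );
    printf "circular => %d: %s\n", $circular,
        join q{ }, map { "($_->[0],$_->[1])" } @pairs;
}
```

For five observations at a lag of 2, the non-circular case yields three pairs, while the circular case yields five, the last two pairing the end of the series with its start.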

autocovariance

 $covar = $autocorr->autocovariance(data => \@data, lag => integer (from 1 to N-1), exact => 0|1, unbias => 1|0, circular => 1|0);
 $covar = $autocorr->autocovariance(lag => 1); # using loaded data, and default args (exact = 0, unbias = 1, circular = 0)

Alias: autocov, acvf

Returns the autocovariance; see coefficient for definition and options.

correlogram

 $href = $autocorr->correlogram(nlags => integer, exact => 1|0, unbias => 1|0, circular => 1|0); # assuming data are loaded
 $href = $autocorr->correlogram(); # use defaults, with loaded data
 $href = $autocorr->correlogram(data => \@data); # same as either of above, but give data here
 ($lags, $coeffs) = $autocorr->correlogram(); # with args as for either of the above 

Alias: coeff_list

Returns the autocorrelation coefficients for lags from 0 to a limit, or (by default) over all possible lags, from 0 to N - 1. If called in array context, returns two references: to an array of the lags, and to an array of their respective coefficients. Otherwise, returns a hash-reference of the coefficients keyed by their respective lags. The limit is set by the argument nlags, which gives the number of lags to return, including the zero lag, as far as the data permit. Options are exact, unbias and circular, as defined above for coefficient. Because the autocorrelation function is symmetric about lag zero, the correlogram is based only on positive lags.
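
The two return forms described above can be pictured with a short sketch built on Perl's wantarray. The helper correlogram_forms_sketch is hypothetical and takes ready-made coefficients rather than computing them; it only shows the shape of what the caller would receive in each context.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Autocorrelation:
# takes the coefficients for lags 0, 1, 2, ... and returns them either as
# two array references (list context) or as a hash reference keyed by lag.
sub correlogram_forms_sketch {
    my @coeffs = @_;
    my @lags   = ( 0 .. $#coeffs );
    my %by_lag;
    @by_lag{@lags} = @coeffs;
    return wantarray ? ( \@lags, \@coeffs ) : \%by_lag;
}

my $href = correlogram_forms_sketch( 1, 0.4, -0.2 );
my ( $lags, $coeffs ) = correlogram_forms_sketch( 1, 0.4, -0.2 );

print "lag $_ => $href->{$_}\n" for sort { $a <=> $b } keys %{$href};
print "lags: @{$lags}; coefficients: @{$coeffs}\n";
```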

correlogram_chart

Experimental method to print a .png file of the correlogram.

ctest_bartlett

 $bool = $acorr->ctest_bartlett(lag => integer, tails => 1|2); # assuming data are loaded, or see above for alternative and extra options
 ($crit, $coeff, $bool) = $acorr->ctest_bartlett(lag => integer, tails => 1|2);

Performs a 95% confidence test of the null hypothesis of no autocorrelation, assuming that the series was generated by a Gaussian white noise process. Following Bartlett (1946), it compares the value of a single correlation coefficient for a given lag with the critical values given tails => 2 (default) or 1:

$$r_{k,.95} = \frac{s}{N^{1/2}}$$

where s is a constant equalling 1.96 for a two-tailed, or 1.645 for a one-tailed test. If the absolute value of the sample correlation coefficient falls beyond this critical value, the null hypothesis is rejected at the 95% level.

Returns, if called in array context, a list comprising the critical value, the sample coefficient, and a boolean as to whether the null hypothesis is rejected; otherwise, just the latter boolean.

Accepts all the options as given for coefficient. Note that the critical value is not calculated with respect to the particular value of lag - see ctest_anderson for this.
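
The comparison described above can be sketched in a few lines of plain Perl. The helper bartlett_test_sketch is hypothetical and takes an already computed coefficient; it mirrors the documented return order (critical value, coefficient, boolean) but is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Autocorrelation:
# 95% critical value s / sqrt(N), compared with the absolute coefficient.
sub bartlett_test_sketch {
    my ( $coeff, $n, $tails ) = @_;
    my $s      = ( defined $tails && $tails == 1 ) ? 1.645 : 1.96;
    my $crit   = $s / sqrt($n);
    my $reject = abs($coeff) > $crit ? 1 : 0;
    return ( $crit, $coeff, $reject );
}

my ( $crit, $coeff, $reject ) = bartlett_test_sketch( 0.25, 100, 2 );
printf "critical = %.4f, coefficient = %.2f, reject = %d\n",
    $crit, $coeff, $reject;
```

With N = 100 the two-tailed critical value is 0.196, so a coefficient of 0.25 would lead to rejection whatever the lag.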

ctest_anderson

 $bool = $acorr->ctest_anderson(lag => integer, tails => 1|2); # assuming data are loaded, or see above for alternative and extra options
 ($crit, $coeff, $bool) = $acorr->ctest_anderson(lag => integer, tails => 1|2);

Performs a 95% confidence test of the null hypothesis of no autocorrelation, assuming that the series was generated by a Gaussian white noise process. Following Anderson (1941), it compares the value of a single correlation coefficient for a given lag with the critical values given tails => 2 (default) or 1:

$$r_{k,.95}(\text{2-tailed}) = \frac{-1 \pm 1.96\,(N - k - 1)^{1/2}}{N - k}$$

$$r_{k,.95}(\text{1-tailed}) = \frac{-1 + 1.645\,(N - k - 1)^{1/2}}{N - k}$$

If the sample correlation coefficient falls outside these bounds, the null hypothesis is rejected at the 95% level.

Returns, if called in array context, a list comprising the critical value, the sample coefficient, and a boolean as to whether the null hypothesis is rejected; otherwise, just the latter boolean.

Accepts all the options as given for coefficient. Note that the critical value is calculated with respect to the particular value of lag - unlike ztest_bartlett.
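
To show how these bounds depend on the lag, the following sketch evaluates the formulas above directly. The helper anderson_bounds_sketch is hypothetical: it returns the lower and upper bounds for a two-tailed test, or the single upper bound for a one-tailed test, and is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Autocorrelation:
# 95% bounds (-1 +/- s * sqrt(N - k - 1)) / (N - k).
sub anderson_bounds_sketch {
    my ( $n, $k, $tails ) = @_;
    my $m = $n - abs($k);
    die "lag is too large for the series\n" if $m < 2;
    my $root = sqrt( $m - 1 );
    if ( defined $tails && $tails == 1 ) {
        return ( ( -1 + 1.645 * $root ) / $m );
    }
    return ( ( -1 - 1.96 * $root ) / $m, ( -1 + 1.96 * $root ) / $m );
}

for my $lag ( 1, 10, 50 ) {
    my ( $lower, $upper ) = anderson_bounds_sketch( 100, $lag, 2 );
    printf "lag %2d: %.4f to %.4f\n", $lag, $lower, $upper;
}
```

The interval is not symmetric about zero, and it widens as the lag grows, since fewer pairs of observations remain.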

ztest_bartlett

 $p_value = $acorr->ztest_bartlett(lag => integer, tails => 1|2); # assuming data are loaded, or see above for alternative and extra options
 ($z_value, $p_value) = $acorr->ztest_bartlett(lag => integer, tails => 1|2);

Returns the 2- or 1-tailed probability, given tails => 2 (default) or 1, respectively, for the deviation of the observed autocorrelation coefficient at the given lag from the expected value of zero, relative to the variance 1 / N, assuming that the series was generated by a Gaussian white noise process. If called in array context, returns both the actual Z-value and then the p-value. Other options, and methods of assigning the data to test, are as for coefficient.
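
As a minimal sketch of the deviation being tested, the Z-value follows from dividing the observed coefficient (minus its expected value of zero) by the square root of the variance 1 / N. The helper bartlett_z_sketch is hypothetical; conversion of the Z-value to a p-value is left out here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Autocorrelation:
# z = (r_k - 0) / sqrt(1 / N)
sub bartlett_z_sketch {
    my ( $coeff, $n ) = @_;
    return $coeff / sqrt( 1 / $n );
}

printf "z = %.3f\n", bartlett_z_sketch( 0.25, 100 );
```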

qtest, boxpierce

 $p_value = $acorr->qtest(nlags => integer); # assuming data are loaded, or see above for alternative and extra options
 ($q_value, $df, $p_value) = $acorr->qtest(nlags => integer);

Returns the Q statistic for testing whether a range of autocorrelation coefficients differs from zero, and so whether the series was produced by a random process (Box & Pierce, 1970). If called in array context, returns a list giving the value of Q and, assuming a chi-square distribution, its degrees of freedom (= nlags) and p-value; returns the p-value only if called in scalar context. Other options, and methods of assigning the data to test, are as for coefficient. The range is (by default) over all possible lags from 1 to N - 1. The statistic is defined as follows:

$$Q = N \sum_{k=1}^{M} \rho_k^2$$

where M is the largest lag-value to test (= nlags).
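
The statistic itself is simple to compute once the coefficients are known, as this sketch shows. The helper box_pierce_q_sketch is hypothetical and takes the series length and the coefficients for lags 1 to M; it returns Q and its degrees of freedom, leaving the chi-square p-value aside.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of Statistics::Autocorrelation:
# Q = N * (sum of squared coefficients for lags 1 .. M), with df = M.
sub box_pierce_q_sketch {
    my ( $n, @coeffs ) = @_;
    die "no coefficients given\n" if !@coeffs;
    my $q = $n * sum( map { $_**2 } @coeffs );
    return ( $q, scalar @coeffs );
}

my ( $q, $df ) = box_pierce_q_sketch( 50, 0.5, -0.3, 0.1 );
printf "Q = %.2f, df = %d\n", $q, $df;
```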

REFERENCES

Anderson, R.L. (1941). Distribution of the serial correlation coefficients. Annals of Mathematical Statistics, 8, 1-13.

Bartlett, M.S. (1946). On the theoretical specification of sampling properties of autocorrelated time series. Journal of the Royal Statistical Society, 27.

Box, G.E., & Jenkins, G. (1976). Time series analysis: Forecasting and control. San Francisco, US: Holden-Day.

Box, G.E., & Pierce, D. (1970). Distribution of residual autocorrelations in ARIMA time series models. Journal of the American Statistical Association, 65, 1509-1526.

Chatfield, C. (1975). The analysis of time series: Theory and practice. London, UK: Chapman and Hall.

Kendall, M. G. (1973). Time-series. London, UK: Griffin.

SEE ALSO

Statistics::SerialCorrelation (on CPAN): returns a single autocorrelation coefficient which, with the present module, would be given by coefficient with lag => 1, circular => 1, exact => 0 and unbias => 0.

AUTHOR

Roderick Garton, <rgarton at cpan.org>

DIAGNOSTICS

No data are available

Croaked by most methods if they do not receive data as given in the call by an array ref, or as pre-loaded as per Statistics::Data.

Value given for argument 'nlags' is not valid

Croaked by correlogram when the nlags is not valid: should be no more than the number of data elements less 1.

file opening/printing errors

Croaked by correlogram_chart when it tries to print the chart.

DEPENDENCIES

Statistics::Data - used: base

Statistics::Lite - used: mean

List::AllUtils - used: mesh

Statistics::Zed - required if calling ztest_bartlett

Math::Cephes - required for its igamc method if calling qtest

BUGS AND LIMITATIONS

Report to bug-statistics-autocorrelation-0.06 at rt.cpan.org or http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Autocorrelation-0.06.

To do: rho_ctest, rho_ztest

SUPPORT

Find documentation for this module with the perldoc command:

    perldoc Statistics::Autocorrelation

LICENSE AND COPYRIGHT

Copyright 2011-2014 Roderick Garton.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.