NAME

Statistics::Sequences::Runs - The Runs Test: Wald-Wolfowitz runs test descriptives, deviation and combinatorics

VERSION

This is documentation for Version 0.22 of Statistics::Sequences::Runs.

SYNOPSIS

 use strict;
 use Statistics::Sequences::Runs 0.22;
 my $runs = Statistics::Sequences::Runs->new();

 # Data are a sequence of dichotomous strings: 
 my @data = (qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/);
 my $val;

 # - Pre-load data to use for all methods:
 $runs->load(\@data);
 $val = $runs->observed();
 $val = $runs->expected();

 # - or give data as "data => $aref" to each method:
 $val = $runs->observed(data => \@data);
 
 # - or give frequencies of the 2 "states" in a sequence:
 $val = $runs->expected(freqs => [11, 9]); # works with other methods except observed()

 # Deviation ratio:
 $val = $runs->z_value(ccorr => 1);

 # Probability of deviation from expectation:
 my ($z, $p) = $runs->z_value(ccorr => 1, tails => 1); # dev. ratio with p-value
 $val = $runs->p_value(tails => 1); # normal dist. p-value itself
 $val = $runs->p_value(exact => 1, tails => 1); # p-value by combinatorics

 # Keyed list of descriptives etc.:
 my $href = $runs->stats_hash(values => [qw/observed p_value/], exact => 1);

 # Print descriptives etc. in the same way:
 $runs->dump(
  values => [qw/observed expected p_value/],
  exact => 1,
  flag => 1,
  precision_s => 3,
  precision_p => 7
 );
 # prints: observed = 11.000, expected = 10.900, p_value = 0.5700167

DESCRIPTION

The module returns statistical information about Wald-type runs across a sequence of dichotomous events on one or more consecutive trials. For example, given an accuracy-based sequence composed of matches (H) and misses (M) like (H, H, M, H, M, M, M, M, H), there are 5 runs: 3 for Hs, 2 for Ms. This observed number of runs can be compared with the number expected to occur by chance over the number of trials, relative to the expected variance. More runs than expected ("negative serial dependence") can denote irregularity, instability, mixing up of alternatives. Fewer runs than expected ("positive serial dependence") can denote cohesion, insulation, isolation of alternatives. Both can indicate sequential dependency: either negative (an alternation bias), or positive (a repetition bias).

The distribution of runs is asymptotically normal. A deviation-based test of extra-chance occurrence is conventionally considered reliable when at least one alternative has more than 20 occurrences (Siegel's rule), or when both alternatives occur more than 10 times (Kelly, 1982); otherwise, the module provides an "exact test" based on combinatorics.

For non-dichotomous, continuous or multinomial data, see Statistics::Data::Dichotomize for potentially transforming them for runs descriptives/tests.

SUBROUTINES/METHODS

Data-handling

new

 $runs = Statistics::Sequences::Runs->new();

Returns a new Runs object. Expects/accepts no arguments but the classname.

load

 $runs->load(ARRAY);
 $runs->load(AREF);
 $runs->load(foodat => AREF); # named whatever

Loads a sequence anonymously or by name - see load in the Statistics::Data manpage for details on the various ways data can be loaded, updated and then retrieved. Every load unloads all previous loads and any updates to them.

Alternatively, skip this action; data don't always have to be loaded to use the stats methods here. The sequence can be provided with each method call, as shown below, or most methods can simply be given the frequencies of the two states (and, where relevant, the observed number of runs). Only calculating the observed counts themselves requires a specific sequence.

add, access, unload

See Statistics::Data for these additional operations on data that have been loaded.

Descriptives

observed

 $v = $runs->observed(); # use the data loaded anonymously
 $v = $runs->observed(name => 'foodat'); # ... or the name given on loading
 $v = $runs->observed(data => AREF); # ... or just give the data now

Returns the total observed number of runs in the loaded or given data. For example,

 $v = $runs->observed(data => [qw/H H H T T H H/]);

returns 3 (for the runs 'HHH', 'TT' and 'HH').
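As a rough illustration of what is being counted (a minimal sketch in plain Perl, not the module's own code; count_runs is a hypothetical helper name), a run starts at the first element and again wherever an element differs from its predecessor:

```perl
use strict;
use warnings;

# Hypothetical helper: count runs in a flat list of states.
# A new run begins at the first element and at every change of state.
sub count_runs {
    my @seq = @_;
    return 0 if !@seq;    # no data, no runs
    my $runs = 1;
    for my $i (1 .. $#seq) {
        $runs++ if $seq[$i] ne $seq[$i - 1];
    }
    return $runs;
}

print count_runs(qw/H H H T T H H/), "\n";    # prints 3
```

States are compared as strings here, on the assumption that they are labels such as 'H' and 'T' or '1' and '0'.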

observed_per_state

 @freq = $runs->observed_per_state(data => AREF);
 $href = $runs->observed_per_state(data => AREF);

Returns the number of runs per state, as a two-element array in which the first element gives the count for the first state in the data, and the second element the count for the second state. If not called in list context, a hashref is returned, with the counts keyed by state. For example:

 @ari = $runs->observed_per_state(data => [qw/H H H T T H H/]); # returns (2, 1)
 $ref = $runs->observed_per_state(data => [qw/H H H T T H H/]); # returns { H => 2, T => 1}

Exceptions: If there is only one state in the loaded/given sequence (e.g., data => [qw/H H H/]), there is only one run, so the returned array has a single element, i.e., (1), and the returned hashref has only a single key (for this example: { H => 1 }). If there are no states, as when an empty array is loaded/given for the sequence, then the same applies, except the returned array is (0) and the returned hashref has the empty string as its single key ( q{} => 0 ).
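The per-state counting can be sketched in plain Perl as follows. This is only an illustration under assumptions: runs_per_state is a hypothetical name, and the sketch does not reproduce the module's handling of an empty sequence (it simply returns an empty list or empty hashref in that case).

```perl
use strict;
use warnings;

# Hypothetical helper: number of runs for each state, in order of first appearance.
# Returns a list of counts in list context, otherwise a hashref keyed by state.
sub runs_per_state {
    my @seq = @_;
    my (%runs, @order, $prev);
    for my $el (@seq) {
        if (!defined $prev or $el ne $prev) {    # a new run starts here
            push @order, $el if !exists $runs{$el};
            $runs{$el}++;
        }
        $prev = $el;
    }
    return wantarray ? (map { $runs{$_} } @order) : \%runs;
}

my @counts = runs_per_state(qw/H H H T T H H/);    # (2, 1)
my $counts = runs_per_state(qw/H H H T T H H/);    # { H => 2, T => 1 }
print "@counts\n";                                 # prints "2 1"
```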

expected

 $v = $runs->expected(); # or specify loaded data by "name", or give as "data"
 $v = $runs->expected(data => AREF); # use these data
 $v = $runs->expected(freqs => [POS_INT, POS_INT]); # no actual data; calculate from these two Ns

Returns the expected number of runs across the loaded data. Expectation is given as follows:

  E[R] = ( (2n1n2) / (n1 + n2) ) + 1

where n(i) is the number of observations of each element in the data.
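For a quick check of the formula by hand, the expectation can be computed directly from the two frequencies. The following is a minimal sketch (expected_runs is a hypothetical name, and it assumes at least one observation so that the denominator is not zero):

```perl
use strict;
use warnings;

# Hypothetical helper: E[R] = 2*n1*n2 / (n1 + n2) + 1
sub expected_runs {
    my ($n1, $n2) = @_;
    return 2 * $n1 * $n2 / ($n1 + $n2) + 1;
}

printf "%.3f\n", expected_runs(11, 9);    # prints 10.900
```

The frequencies 11 and 9 are those of the sequence in the SYNOPSIS, whose dumped expected value is also 10.900.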

variance

 $v = $runs->variance(); # use data already loaded - anonymously; or specify its "name" 
 $v = $runs->variance(data => AREF); # use these data
 $v = $runs->variance(freqs => [POS_INT, POS_INT]); # use these counts - not any particular sequence of data

Returns the variance in the number of runs for the given data.

  V[R] = ( (2n1n2)([2n1n2] - [n1 + n2]) ) / ( ((n1 + n2)^2)((n1 + n2) - 1) )

defined as above for expected.

The data to test can already have been loaded, or you send it directly as a flat referenced array keyed as data.
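The variance can likewise be worked out from the two frequencies alone. A minimal sketch, assuming the total number of observations is at least 2 (runs_variance is a hypothetical name, not a method of this module):

```perl
use strict;
use warnings;

# Hypothetical helper:
# V[R] = 2*n1*n2 * (2*n1*n2 - n) / (n^2 * (n - 1)), where n = n1 + n2
sub runs_variance {
    my ($n1, $n2) = @_;
    my $n     = $n1 + $n2;
    my $prod2 = 2 * $n1 * $n2;
    return $prod2 * ($prod2 - $n) / ($n**2 * ($n - 1));
}

printf "%.4f\n", runs_variance(5, 11);          # prints 2.6927
printf "%.4f\n", sqrt runs_variance(5, 11);     # standard deviation: prints 1.6409
```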

observed_deviation

 $v = $runs->obsdev(); # use data already loaded - anonymously; or specify its "name"
 $v = $runs->obsdev(data => AREF); # use these data

Returns the deviation of (difference between) observed and expected runs for the loaded/given sequence (O - E).

Alias: obsdev

standard_deviation

 $v = $runs->stdev(); # use data already loaded - anonymously; or specify its "name"
 $v = $runs->stdev(data => AREF);
 $v = $runs->stdev(freqs => [POS_INT, POS_INT]); # don't use actual data; calculate from these two Ns

Returns square-root of the variance.

Alias: stdev, stddev

skewness

 $v = $runs->skewness(); # use data already loaded - anonymously; or specify its "name"
 $v = $runs->skewness(data => AREF); # use these data

Returns run skewness as given by Barton & David (1958) based on the frequencies of the two different elements in the sequence.

kurtosis

 $v = $runs->kurtosis(); # use data already loaded - anonymously; or specify its "name"
 $v = $runs->kurtosis(data => AREF); # use these data

Returns run kurtosis as given by Barton & David (1958) based on the frequencies of the two different elements in the sequence.

Distribution and tests

pmf

 $p = $runs->pmf(data => AREF); # or no args to use last pre-loaded data
 $p = $runs->pmf(observed => POS_INT, freqs => [POS_INT, POS_INT]);

Implements the runs probability mass function, returning the probability for a particular number of runs given so many dichotomous events (e.g., as in Swed & Eisenhart, 1943, p. 66); i.e., for u' the observed number of runs, P{u = u'}. The required function parameters are the observed number of runs, and the frequencies (counts) of each state in the sequence, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here or as pre-loaded). For derivation, see its public internal methods n_max_seq and m_seq_k, which make use of the choose() method from Orwant et al. (1999).
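To make the combinatorics concrete, here is a sketch of the standard runs probability mass function (the form tabulated by Swed & Eisenhart) in plain Perl. It is not the module's implementation: choose and runs_pmf are hypothetical stand-ins, and both states are assumed to be present. For an even number of runs u = 2k there are 2 * C(n1-1, k-1) * C(n2-1, k-1) favourable sequences; for odd u = 2k + 1 there are C(n1-1, k) * C(n2-1, k-1) + C(n1-1, k-1) * C(n2-1, k).

```perl
use strict;
use warnings;

# Binomial coefficient "n choose k"; 0 when k is out of range.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}

# P{u = u'} for u' runs, given the counts n1 and n2 of the two states.
sub runs_pmf {
    my ($u, $n1, $n2) = @_;
    my $k = int($u / 2);
    my $ways;
    if ($u % 2 == 0) {    # u = 2k
        $ways = 2 * choose($n1 - 1, $k - 1) * choose($n2 - 1, $k - 1);
    }
    else {                # u = 2k + 1
        $ways = choose($n1 - 1, $k) * choose($n2 - 1, $k - 1)
              + choose($n1 - 1, $k - 1) * choose($n2 - 1, $k);
    }
    return $ways / choose($n1 + $n2, $n1);
}

printf "%.7f\n", runs_pmf(11, 5, 11);    # prints 0.0576923
```

With 5 and 11 observations of the two states, 11 runs is the maximum possible, and 252 of the 4368 possible sequences produce it.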

cdf

 $p = $runs->cdf(data => AREF); # or no args to use last pre-loaded data
 $p = $runs->cdf(observed => POS_INT, freqs => [POS_INT, POS_INT]);

Implements the cumulative distribution function for runs, returning the probability of obtaining the observed number of runs or fewer, down to the minimum possible number of 2 (assuming that the two possible events are actually represented in the data), as per Swed & Eisenhart (1943), p. 66; i.e., for u' the observed number of runs, P{u <= u'}. The summation is over the probability mass function pmf. The function parameters are the observed number of runs, and the frequencies (counts) of the two events, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here or as pre-loaded).

cdfi

 $p = $runs->cdfi(data => AREF); # or no args for last pre-loaded data
 $p = $runs->cdfi(observed => POS_INT, freqs => [POS_INT, POS_INT]);

Implements the (inverse) cumulative distribution function for runs, returning the probability of obtaining the observed number of runs or more (assuming that the two possible events are actually represented in the data), as per Swed & Eisenhart (1943), p. 66; i.e., for u' the observed number of runs, P{u >= u'} = 1 - P{u <= u' - 1}. The summation is over the probability mass function pmf. The function parameters are the observed number of runs, and the frequencies (counts) of the two events, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here as data or as pre-loaded).
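The two cumulative functions can be sketched as sums over the probability mass function. The code below is an illustration only, with hypothetical helper names (choose, runs_pmf, runs_cdf, runs_cdfi) standing in for the module's methods, and assuming both states occur so that the minimum number of runs is 2:

```perl
use strict;
use warnings;

# Binomial coefficient "n choose k"; 0 when k is out of range.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}

# P{u = u'}: standard runs pmf for even (2k) and odd (2k + 1) numbers of runs.
sub runs_pmf {
    my ($u, $n1, $n2) = @_;
    my $k = int($u / 2);
    my $ways = $u % 2 == 0
      ? 2 * choose($n1 - 1, $k - 1) * choose($n2 - 1, $k - 1)
      : choose($n1 - 1, $k) * choose($n2 - 1, $k - 1)
        + choose($n1 - 1, $k - 1) * choose($n2 - 1, $k);
    return $ways / choose($n1 + $n2, $n1);
}

# P{u <= u'}: sum the pmf from the minimum of 2 runs up to the observed number.
sub runs_cdf {
    my ($u, $n1, $n2) = @_;
    my $p = 0;
    $p += runs_pmf($_, $n1, $n2) for 2 .. $u;
    return $p;
}

# P{u >= u'} = 1 - P{u <= u' - 1}
sub runs_cdfi {
    my ($u, $n1, $n2) = @_;
    return 1 - runs_cdf($u - 1, $n1, $n2);
}

printf "%.7f\n", runs_cdfi(11, 5, 11);    # prints 0.0576923
```

The printed value is the upper-tail probability for 11 runs with state frequencies of 5 and 11, the same figures as in the "Seating at the diner" example.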

z_value

 $v = $runs->z_value(ccorr => BOOL); # use data already loaded - anonymously; or specify its "name"
 $v = $runs->z_value(data => AREF, ccorr => BOOL);
 ($zvalue, $pvalue) = $runs->z_value(data => AREF, ccorr => BOOL, tails => 1|2); # wanting an array, get p-value too

Returns the normal deviate from a test of run-count deviation: the expected run count is subtracted from the observed run count, and the difference is divided by the root of the variance, by default with a continuity correction to expectation. Called in list context, returns the z-value with its p-value for the number of tails (1 or 2) given. The returned value is an empty string if the variance is undefined, empty or equal to 0 (as when there is only one state in the sequence).

The data to test can already have been loaded, or sent directly as an aref keyed as data.

Other options are precision_s (for the z_value) and precision_p (for the p_value).

Aliases: zscore, zvalue
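The deviation ratio can be sketched from the two state frequencies and the observed run count. This is a hedged illustration rather than the module's code: runs_z is a hypothetical name, and the continuity correction is assumed to take the common form of shrinking the absolute deviation by 0.5 (the module's exact handling may differ).

```perl
use strict;
use warnings;

# Hypothetical helper: z = (observed - expected) / sqrt(variance),
# optionally with a continuity correction of 0.5 toward zero.
sub runs_z {
    my ($obs, $n1, $n2, $ccorr) = @_;
    my $n = $n1 + $n2;
    return q{} if $n < 2;
    my $prod2 = 2 * $n1 * $n2;
    my $exp   = $prod2 / $n + 1;
    my $var   = $prod2 * ($prod2 - $n) / ($n**2 * ($n - 1));
    return q{} if !$var;    # e.g., only one state in the sequence
    my $dev = $obs - $exp;
    if ($ccorr) {
        my $adj = abs($dev) - 0.5;
        $adj = 0 if $adj < 0;
        $dev = $dev < 0 ? -$adj : $adj;
    }
    return $dev / sqrt $var;
}

printf "%.4f\n", runs_z(11, 5, 11, 1);    # prints 1.5997
```

A z of about 1.60 corresponds to a one-tailed normal probability of about .055, which is in line with the normal-approximation p-value reported for these counts in the EXAMPLE section.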

p_value

 $p = $runs->p_value(); # using loaded data and default args
 $p = $runs->p_value(ccorr => BOOL, tails => 1|2); # normal-approx. for last-loaded data
 $p = $runs->p_value(exact => BOOL); # calc combinatorially for observed >= or < than expectation
 $p = $runs->p_value(data => AREF, exact => BOOL); #  given data
 $p = $runs->p_value(observed => POS_INT, freqs => [POS_INT, POS_INT]); # no data sequence, specify known params

Returns the probability of getting the observed number of runs or a smaller number given the number of each of the two events. By default, a large sample is assumed, and the probability is obtained from the normalized deviation, as given by the z_value method.

If the option exact is defined and not zero, then the probability is worked out combinatorially, as per Swed & Eisenhart (1943), Eq. 1, p. 66 (and also Siegel, 1956, Eqs. 6.12a and 6.12b, p. 138). This is only implemented as a one-tailed test; the tails option has no effect. The test is of the hypothesis that there are either too many or too few runs relative to chance expectation. Which of these hypotheses is tested depends on the value returned by the expected method: cdfi is used if there are more runs than expected, and cdf if there are fewer runs than expected. Use these functions themselves to specify the hypothesis to be tested.

If there is only one state/event in the sequence, then the variance from the expected value of 1 is 0, and this method returns 1 (however long this single event sequence is, the observed number of runs cannot differ from the expected number of runs). If the sequence is empty, an empty string is returned.

Output from these tests has been checked against the tables and examples in Swed & Eisenhart (given to 7 decimal places), and found to agree.

The option precision_p gives the returned p-value to so many decimal places.

Aliases: pvalue

ztest_ok

 $bool = $runs->ztest_ok(); # use data already loaded - anonymously; or specify its "name"
 $bool = $runs->ztest_ok(data => AREF);

Returns true for the loaded sequence if its constituent sample numbers are sufficient for their expected runs to be normally approximated, using Siegel's (1956, p. 140) rule: ok if either of the two Ns is greater than 20.
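The rule itself amounts to a single comparison on the two frequencies. A minimal sketch, with large_sample_ok as a hypothetical stand-in for the method:

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if either state count exceeds 20, else 0.
sub large_sample_ok {
    my ($n1, $n2) = @_;
    return ($n1 > 20 || $n2 > 20) ? 1 : 0;
}

print large_sample_ok(5, 11), "\n";     # prints 0
print large_sample_ok(25, 11), "\n";    # prints 1
```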

Utils

Methods used internally, or for returning/printing descriptives, etc., in a bunch.

bi_frequency

 @freq = $runs->bi_frequency(data => AREF); # or no args if using last pre-loaded data

Returns frequency of the two elements - or croaks if there are more than 2, and gives zero for any absent.

n_max_seq

 $n = $runs->n_max_seq(); # loaded data
 $n = $runs->n_max_seq(data => AREF); # this sequence
 $n = $runs->n_max_seq(observed => POS_INT, freqs => [POS_INT, POS_INT]); # these counts

Returns the number of possible sequences for the two given state frequencies. So the urn contains N1 black balls and N2 white balls, well mixed; taking N1 + N2 drawings from it without replacement, any sequence has the same probability of occurring; how many different sequences of black and white balls are possible? For the two counts, this is "sum of N1 + N2 choose N1", or:

   Nmax = ( N1 + N2 )! / ( N1! N2! )

This is the denominator term in the runs probability mass function (pmf). It does not take into account the probability of obtaining so many of each event, i.e., the proportion of black and white balls in the urn.
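For small counts, the factorial formula can be evaluated directly. The sketch below uses hypothetical helper names and assumes counts small enough for the factorials to stay exact (roughly, totals below 19 when numbers are held as doubles); a multiplicative binomial routine would be the safer choice for larger counts.

```perl
use strict;
use warnings;

# Hypothetical helper: n! by simple iteration (exact only for small n).
sub factorial {
    my ($n) = @_;
    my $f = 1;
    $f *= $_ for 2 .. $n;
    return $f;
}

# Number of distinct sequences of N1 events of one kind and N2 of the other.
sub n_sequences {
    my ($n1, $n2) = @_;
    return factorial($n1 + $n2) / (factorial($n1) * factorial($n2));
}

print n_sequences(5, 11), "\n";    # prints 4368
```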

m_seq_k

 $n = $runs->m_seq_k(); # loaded data
 $n = $runs->m_seq_k(data => AREF); # this sequence
 $n = $runs->m_seq_k(observed => POS_INT, freqs => [POS_INT, POS_INT]); # these counts

Returns the number of sequences that can produce k runs from m elements of a single kind, with all other kinds of elements in the sequence assumed to be of a single kind, under the conditions of n_max_seq. See Swed and Eisenhart (1943), or Barton and David (1958, p. 253). With the frequentist probability M / N, this is the numerator term in the runs probability mass function (pmf).

stats_hash

 $href = $runs->stats_hash(values => [qw/observed expected z_value/], precision_s => POS_INT, ccorr => BOOL); # among other values/options
 $href = $runs->stats_hash(values =>
  {
   observed => BOOL,
   expected => BOOL,
   variance => BOOL,
   z_value => BOOL,
   p_value => BOOL,
  },
  exact => BOOL,    # for p_value
  ccorr => BOOL # for z_value
 );

Returns a hashref for the counts and stats as specified in its "values" argument, and with any options for calculating them (e.g., exact for p_value). See "stats_hash" in Statistics::Sequences for details. If calling via a "runs" object, the option "stat => 'runs'" is not needed (unlike when using the parent "sequences" object).

dump

 $runs->dump(values => [qw/observed expected z_value/], precision_s => POS_INT, ccorr => BOOL); # among other values/options
 $runs->dump(values =>
  {
   observed => BOOL,
   expected => BOOL,
   variance => BOOL,
   z_value => BOOL,
   p_value => BOOL,
  },
  precision_s => POS_INT,
  precision_p => POS_INT,   # for p_value
  flag  => BOOL,    # for p_value
  exact => BOOL,    # for p_value
  ccorr => BOOL # for z_value
 );

Prints Runs-test results to STDOUT, including the stats given a true value under their method names in a referenced hash of values, with options relevant to these methods (see the template above). The default values to dump are observed() and p_value(). Optionally, also give the data directly.

dump_data

 $runs->dump_data(delim => "\n"); # print whatever's loaded (or specify by name, or as "data")

See "dump_data" in Statistics::Sequences for details.

EXAMPLE

Seating at the diner

Swed and Eisenhart (1943) list the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis?

 use Statistics::Sequences::Runs;
 my $runs = Statistics::Sequences::Runs->new();
 my @seating = (qw/E O E E O E E E O E E E O E O E/); # data already form a single sequence with dichotomous observations
 $runs->dump(data => \@seating, tails => 1); # normal approximation (exact not set)

Suggesting some non-random basis for people taking their seats, this prints:

 observed = 11, p_value = 0.054834

But these data would fail Siegel's rule (ztest_ok = 0), as neither state has more than 20 observations. So check the exact probability of the hypothesis that the observed deviation is greater than zero (1-tailed):

 $runs->dump(data => \@seating, values => {p_value => 1}, exact => 1, tails => 1);

This prints a p-value of .0576923 (so the normal approximation seems good in any case).

These data are also used in an example of testing for Vnomes.

Runs in multinomial matching

In a single run of a classic ESP test, there are 25 trials, each composed of a randomly generated event (typically, one of 5 possible geometric figures), and a human-generated event arbitrarily drawn from the same pool of alternatives. Tests of the match between the random and human data are typically for number of matches observed versus expected. The runs of matches and misses can be tested by dichotomizing the data on the basis of the match of the random "targets" with the human "responses", as described by Kelly (1982):

 use Statistics::Sequences::Runs;
 use Statistics::Data::Dichotomize;
 my @targets = (qw/p c p w s p r w p c r c r s s s s r w p r w c w c/);
 my @responses = (qw/p c s c s s p r w r w c c s s r w s w p c r w p r/);

 # Test for runs of matches between targets and responses:
 my $runs = Statistics::Sequences::Runs->new();
 my $ddat = Statistics::Data::Dichotomize->new();
 $runs->load($ddat->match(data => [\@targets, \@responses]));
 $runs->dump_data(delim => ' '); # have a look at the match sequence; prints "1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0\n"
 print "Probability of these many runs vs expectation: ", $runs->test(), "\n"; # 0.51436
 # or test for runs in matching when responses are matched to targets one trial behind:
 print $runs->test(data => $ddat->match(data => [\@targets, \@responses], lag => -1)), "\n"; # 0.73766
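To show what the dichotomized sequence looks like without either module installed, the trial-by-trial matching (no lag) can be sketched in plain Perl. This is an assumption-laden illustration: match_sequence is a hypothetical helper, not the match method of Statistics::Data::Dichotomize, and it assumes the two lists are of equal length.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 where target and response agree on a trial, else 0.
sub match_sequence {
    my ($targets, $responses) = @_;
    return map { $targets->[$_] eq $responses->[$_] ? 1 : 0 } 0 .. $#{$targets};
}

my @targets   = qw/p c p w s p r w p c r c r s s s s r w p r w c w c/;
my @responses = qw/p c s c s s p r w r w c c s s r w s w p c r w p r/;
my @match     = match_sequence(\@targets, \@responses);

print "@match\n";
# prints "1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0"

# Count the runs of matches and misses in the dichotomized sequence.
my $n_runs = 1;
$n_runs += $match[$_] != $match[$_ - 1] ? 1 : 0 for 1 .. $#match;
print "$n_runs\n";    # prints 10
```

The resulting sequence has 8 matches and 17 misses, and it is these two frequencies, with the 10 observed runs, that a runs test on the matching data works from.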

DEPENDENCIES

List::AllUtils : uses methods mesh, sum and uniq

Number::Misc : uses method is_even in the probability-mass-function computation

Statistics::Sequences : base module

Statistics::Zed : for normality-wise statistical testing

SEE ALSO

Statistics::Sequences : for other tests of sequences, for sharing data between these tests, such as ...

Statistics::Sequences::Pot : another test of sequential structure, assessing exponential clustering of events.

REFERENCES

These papers provide the implemented algorithms and/or the sample data used in examples and tests.

Barton, D. E., & David, F. N. (1958). Non-randomness in a sequence of two alternatives: II. Runs test. Biometrika, 45, 253-256. doi: 10.2307/2333062

Kelly, E. F. (1982). On grouping of hits in some exceptional psi performers. Journal of the American Society for Psychical Research, 76, 101-142.

Orwant, J., Hietaniemi, J., & Macdonald, J. (1999). Mastering algorithms with Perl. Sebastopol, CA, US: O'Reilly.

Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York, NY, US: McGraw-Hill.

Swed, F., & Eisenhart, C. (1943). Tables for testing randomness of grouping in a sequence of alternatives. Annals of Mathematical Statistics, 14, 66-87. doi: 10.1214/aoms/1177731494

Wald, A., & Wolfowitz, J. (1940). On a test whether two samples are from the same population. Annals of Mathematical Statistics, 11, 147-162. doi: 10.1214/aoms/1177731909

Wolfowitz, J. (1943). On the theory of runs with some applications to quality control. Annals of Mathematical Statistics, 14, 280-288. doi: 10.1214/aoms/1177731421

The test scripts also implement the example data from www.reiter1.com.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Sequences::Runs

AUTHOR

Roderick Garton, <rgarton at cpan.org>

LICENSE AND COPYRIGHT

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).

Disclaimer

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.