NAME
Statistics::Sequences::Vnomes - Serial Test psi-square for equiprobability of v-nomes (or N-grams) (Good's and Kendall-Babington Smith's tests)
SYNOPSIS
use Statistics::Sequences::Vnomes 0.20;
my $vnomes = Statistics::Sequences::Vnomes->new();
$vnomes->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/); # ordered data, numerical or other, binary or what
my $freq_href = $vnomes->observed(length => 3); # returns hashref of frequencies for trinomes in the sequence
my @freq = $vnomes->observed(length => 3); # returns only observed trinome frequencies (not keyed by trinomes themselves)
my $val = $vnomes->expected(length => 3); # mean chance expectation for the frequencies (2.5)
$val = $vnomes->psisq(length => 3); # Good's "second backward differences" psi-square (3.161); see option 'delta'
$val = $vnomes->p_value(length => 3); # for psi-square (0.206)
my $href = $vnomes->stats_hash(length => 3, values => {psisq => 1, p_value => 1}); # include any method (& their options)
$vnomes->dump(length => 3,
values => {observed => 1, expected => 1, psisq => 1, p_value => 1}, # what stats to show (or not if => 0)
format => 'table', flag => 1, precision_s => 3, precision_p => 7, verbose => 1);
# prints:
# Vnomes (3) statistics
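#.-----------+-----------+-----------+-----------.
#| observed  | expected  | p_value   | psisq     |
#+-----------+-----------+-----------+-----------+
#| '011' = 4 | 2.500     | 0.2058681 | 3.161     |
#| '010' = 1 |           |           |           |
#| '111' = 2 |           |           |           |
#| '000' = 1 |           |           |           |
#| '101' = 2 |           |           |           |
#| '001' = 3 |           |           |           |
#| '100' = 3 |           |           |           |
#| '110' = 4 |           |           |           |
#'-----------+-----------+-----------+-----------'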
# psisq is Good's delta^2psi^2, calculated with second backward differences, and has 2 degrees of freedom.
# following and other methods inherited from Statistics::Sequences and its parent, Statistics::Data:
$vnomes->dump_data(delim => ','); # comma-separated single line of the loaded data
$vnomes->save_to_file(path => 'seq.csv'); # for retrieval by load_from_file() method
DESCRIPTION
This module implements the Kendall-Babington Smith and Good's serial tests of the independence of successive elements within a sequence. At the least, it can be used to test that the individual elements in the sequence are uniformly distributed, by giving a length of 1 to the function psisq. More generally, an array of data is given as an ordered categorical ("stringy") series of events, and a description of this array is wanted in terms of the frequency of its constituent "blocks" of length v. A test is then made of how these counts differ from the counts expected for a sequence generated by the same process, for as many events, of the same length.
Successive elements are defined as v-nomes, a.k.a. N-grams, v-plets or v-bits (for binary data); that is, as mononomes, monograms, singletons, or monobits of discrete events; or dinomes, bigrams, doublets, or so on for immediate repetitions and alternations; or as trinomes, trigrams, triplets, etc., and so on, for higher-order sequences.
The test implemented here is an alternative to the monobit "frequency" test, and the chi-square and likelihood ratio (G-square) tests of independence. These tests generally produce values that are only asymptotically distributed as chi-square, a problem solved by Good in offering a chi-square informed by backward differencing.
Take the following, for example. A sequence of heads and tails (H and T) is generated by coin-flipping. A serial test of the randomness (informativeness, predictability, instability, etc.) of this sequence might involve testing for equal representation of, say, the trinomes HTH, HTT, TTT, THT, and so on. Counting up these trinomes at overlapping points in the sequence (with a moving time window) yields a statistic, psi-square, that is approximately distributed as chi-square: the Kendall-Babington Smith statistic. However, these counts are not independent (given overlaps). Good's Generalized Serial Test (the default test-statistic returned by this module's psisq routine) computes psi-square by differencing, taking into account not only the psi-square value for the specified length v but also its values for the two prior lengths (dinomes and mononomes, when testing trinomes). This yields the statistic delta-square-psi-square (the "second backward difference" measure) that is exactly distributed as chi-square. (Anything less than a test for trinomes naturally has to fall back on first backward differencing, or on no differencing at all, which, again, is the Kendall-Babington Smith test.)
Note also that (1) Good's serial test, as here implemented, is suitable for multi-state, multinomial data, and not only for binary, dichotomous sequences (to which, say, the runs or joins tests are limited); that (2) Good's serial test is not the same as the so-called "serial test" described by Knuth (1998, Sec. 3.3.2), which involves non-overlapping pairs of events within a sequence; and that (3) Good's original paper for this test (i.e., Good, 1953) misprinted the crucial calculation (i.e., without squaring the observation-less-expectation difference), so that versions of this module before 0.20 (which implemented the misprinted formula) give results that are not compatible (to say the least) with those of version 0.20 and later.
METHODS
This module is a "child" of Statistics::Sequences, and so of Statistics::Data. So it offers generic methods, as follows, aside from the specific methods for describing a sequence as per Good's test.
new
$vnomes = Statistics::Sequences::Vnomes->new();
Returns a new Vnomes object. Expects/accepts no arguments but the classname.
load
$vnomes->load(@data); # anonymously
$vnomes->load(\@data);
$vnomes->load('sample1' => \@data); # labelled whatever
Loads data anonymously or by name; see load in Statistics::Data for ways data can be loaded and retrieved (more than shown here). Every load unloads all previous loads and any additions to them.
Data for this test of sequences can be categorical or numerical, all being treated categorically. Also, the data do not have to be dichotomous (unlike in tests of runs and joins).
add, access, unload
See Statistics::Data for these and other operations on loaded data.
observed
$href = $vnomes->observed(length => INT > 0, circularize => BOOL); # returns keyed distribution; assumes data have already been loaded
@ari = $vnomes->observed(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL); # returns frequencies only
Returns the frequency distribution for the observed number of v-nomes of the given length in a sequence. These are counted up as overlapping. Called in array context, returns an array of the counts for each possible pattern; otherwise, returns a hash reference in which the counts are keyed by the pattern. For descriptives of the observed frequencies, try calling Statistics::Lite methods with this array.
A value for length greater than zero is required, and must be no more than the sample-size. So for the sequence
1 0 0 0 1 0 0 1 0 1 1 0
there are, counting without circularization, 12 v-nomes of length 1 (mononomes, the number of elements) from the two (0 and 1) that are possible; 11 dinomes from the four (00, 01, 10, 11) that are possible; and 10 trinomes from eight (100, 000, 001, 010, etc.) that are possible.
By default, the sequence is counted up for v-nomes as a cyclic sequence, treating the first element of the sequence as following the last one. So the count loops to the beginning of the sequence until all elements from the end are included, and instead of ending the count for trinomes in this sequence at '110', the count includes '101' and '010', increasing the observed sum of trinomes to 12. Set circularize => 0 if the count is not to be made cyclically.
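The counting rule just described can be sketched in plain Perl, independently of this module. The sub name count_vnomes is hypothetical (it is not part of the module's interface), and the sketch assumes single-character states so that joined keys are unambiguous.

```perl
use strict;
use warnings;

# Count overlapping v-nomes of length $v in @$seq. When $circular is true,
# the window wraps from the end of the sequence back to its start.
sub count_vnomes {
    my ($seq, $v, $circular) = @_;
    my $n      = scalar @{$seq};
    my $starts = $circular ? $n : $n - $v + 1;    # number of windows counted
    my %freq;
    for my $i (0 .. $starts - 1) {
        my $key = join q{}, map { $seq->[ ($i + $_) % $n ] } 0 .. $v - 1;
        $freq{$key}++;
    }
    return \%freq;
}

my @seq  = qw/1 0 0 0 1 0 0 1 0 1 1 0/;
my $open = count_vnomes(\@seq, 3, 0);    # not circularized
my $ring = count_vnomes(\@seq, 3, 1);    # circularized
my ($n_open, $n_ring) = (0, 0);
$n_open += $_ for values %{$open};
$n_ring += $_ for values %{$ring};
print "uncircularized: $n_open trinomes; circularized: $n_ring trinomes\n";
# prints: uncircularized: 10 trinomes; circularized: 12 trinomes
```

The two extra trinomes in the circularized count are the '101' and '010' that wrap around the end of the sequence.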
The data to test can already have been loaded, or you send it directly keyed as data.
In the code for this and/or other methods, following the work of Good (1953, 1957; Good & Gover, 1967), v is used to define variables referring to the length of the subsequence (mononome, dinome, trinome, etc.), t defines the number of states (events, letters, etc.) in the sequence, and r identifies the number of each possible subsequence of length v.
expected
$count = $vnomes->expected(length => INT > 0, circularize => BOOL); # assumes data have already been loaded
$count = $vnomes->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);
Returns the expected number of observations for each v-nome of the given length in a sequence; i.e., the mean chance expectation, assuming that each event is generated by a random uniform process. Options are as for observed. This expected frequency is given by:
E[V] = Nt^{–v}
where t is the number of possible states (alternatives; as read from the data or as explicitly given as an array in states), v is the v-nome length, and N is the length of the sequence, replaced by N – v + 1 if the count is not to be circularized (Good, 1953, Eq. 12).
Another way to think of this is as the number of mononome observations in the sequence divided by the number of possible permutations of its states for the given length. So, for a sequence made up of 0s and 1s, there are four possible variations of length 2 (00, 10, 01 and 11), so that the expected frequency for each of these variations in a sequence of 20 values is 20 / 4, i.e., 5.
This is a theoretical (deductive) estimate based on the given states, not an empirical (inductive) estimate based on the given data. So all the expected frequencies for each combination are equal, unlike what you might get by way of conditional probabilities for combinations of states from the given data.
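As a worked illustration of the formula above, the following plain-Perl sketch computes the expected frequency for a few cases. The sub name expected_freq is hypothetical and is not the module's own method; it only restates E[V] = Nt^{–v}, with N – v + 1 in place of N for an uncircularized count.

```perl
use strict;
use warnings;

# Expected frequency per v-nome: N * t**(-v), with N replaced by
# N - v + 1 when the sequence is not circularized.
sub expected_freq {
    my ($n, $t, $v, $circular) = @_;
    my $windows = $circular ? $n : $n - $v + 1;
    return $windows / $t**$v;
}

printf "%.3f\n", expected_freq(20, 2, 3, 1);    # prints 2.500
printf "%.3f\n", expected_freq(20, 2, 2, 1);    # prints 5.000
printf "%.3f\n", expected_freq(20, 2, 3, 0);    # prints 2.250
```

The first value is the 2.5 shown in the SYNOPSIS for trinomes in a binary sequence of 20 elements; the second is the 20 / 4 example given above for dinomes.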
variance
$var = $vnomes->variance(length => INT > 0, circularize => BOOL); # assumes data have already been loaded
$var = $vnomes->variance(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);
Returns the variance in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).
stdev
$sd = $vnomes->stdev(length => INT > 0, circularize => BOOL); # assumes data have already been loaded
$sd = $vnomes->stdev(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);
Returns the standard deviation in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).
psisq
$psisq = $vnomes->psisq(length => INT > 0, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);
($psisq, $df, $p_value) = $vnomes->psisq(length => INT > 0, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);
Performs Good's Generalized Serial Test (by default) of v-nomes on the given or named distribution, yielding a psi-square statistic. Returns the statistic itself and, if called in array context, also the degrees of freedom and the p-value. The option delta specifies backward differencing.
 delta => 0

The raw psi-square value for subsequences of length v (Kendall-Babington Smith statistic), i.e., without backward differencing.
Ψ²_{v} = ∑_{r} ( n_{r} – (N – v + 1) / t^{v} )² / ( (N – v + 1) / t^{v} )
for uncircularized sequences (Good, 1953, Eq. 1), and
\bar{Ψ}²_{v} = ∑_{r} ( \bar{n}_{r} – Nt^{–v} )² / Nt^{–v}
for circularized sequences (Good, 1953, Eq. 2), where an overline (written here as \bar{}) indicates circularization, and where n_{r} is the observed count of a given v-nome, N is the length of the sequence, v is the v-nome length, t is the number of unique states that each element of the sequence can take, and r is the number of variations of length v for the given number of unique states.
This statistic is only asymptotically distributed as chi-square if length (v) = 1, and is therefore not used as the default.
 delta => 1

The "first backward differences" of psi-square, which is the difference between the psi-square values for subsequences of length v and length v – 1.
ΔΨ²_{v} = Ψ²_{v} – Ψ²_{v–1} (v ≥ 1)
While it is chi-square distributed, counts of first-differences are not statistically independent (Good, 1953; Good & Gover, 1967), and "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104).
 delta => 2 (or undef)

This method returns by default (without specifying a value for delta), or if delta is not 1 or 0, the "second backward differences" psi-square (delta^2psi^2). This incorporates psi-square values for backwardly adjacent values of length (v), i.e., for subsequences of length v, v – 1, and v – 2.
Δ²Ψ²_{v} = Ψ²_{v} – 2Ψ²_{v–1} + Ψ²_{v–2} (v ≥ 2)
It is not only asymptotically chi-square distributed, but uses statistically independent counts of all the possible variations of sequences of the given length (Good, 1953).
See also Good's algorithm in individual papers describing application of the Serial Test (e.g., Davis & Akers, 1974).
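The three values of delta can be illustrated with a plain-Perl sketch that follows the circularized formula for psi-square and the backward differences as defined above. It is not the module's own code: the sub names psisq_raw and psisq_delta are hypothetical, the sequence is always treated as circular, single-character states are assumed, and psi-square for a length below 1 is taken to be zero.

```perl
use strict;
use warnings;

# Raw psi-square for v-nomes of length $v over a circularized sequence;
# $states lists the t possible states. Taken as 0 for $v < 1.
sub psisq_raw {
    my ($seq, $v, $states) = @_;
    return 0 if $v < 1;
    my $n = scalar @{$seq};
    my $t = scalar @{$states};
    my %freq;
    for my $i (0 .. $n - 1) {
        my $key = join q{}, map { $seq->[ ($i + $_) % $n ] } 0 .. $v - 1;
        $freq{$key}++;
    }
    my $exp = $n / $t**$v;    # expected frequency, N * t**(-v)
    my $sum = 0;
    $sum += ($_ - $exp)**2 / $exp for values %freq;
    # Each v-nome that never occurred contributes (0 - exp)**2 / exp = exp.
    $sum += ($t**$v - scalar keys %freq) * $exp;
    return $sum;
}

# delta = 0: raw value; 1: first backward difference; otherwise: second.
sub psisq_delta {
    my ($seq, $v, $states, $delta) = @_;
    my @psi = map { psisq_raw($seq, $v - $_, $states) } 0 .. 2;
    return $psi[0]           if $delta == 0;
    return $psi[0] - $psi[1] if $delta == 1;
    return $psi[0] - 2 * $psi[1] + $psi[2];
}

my @seating = qw/E O E E O E E E O E E E O E O E/;
printf "%.3f\n", psisq_delta(\@seating, 3, [qw/E O/], $_) for 0, 1, 2;
# prints 15.000, 9.500 and 6.250, one per line
```

For the Swed and Eisenhart seating data used in the EXAMPLE section, and with the sequence treated as circular, the second-difference value from this sketch (6.250) agrees with the psisq value printed there.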
p_value
$p = $vnomes->p_value(length => INT > 0); # using loaded data and default args
$p = $vnomes->p_value(length => INT > 0, data => [1, 0, 1, 1, 0], exact => 1); # using given data (bypassing load and access)
$p = $vnomes->p_value(length => INT > 0, trials => 20, observed => 10); # without using data
Returns the probability of obtaining the psisq value for data already loaded, or directly keyed as data. The p-value is read off the complemented chi-square distribution (incomplete gamma integral) using Math::Cephes igamc.
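For readers without Math::Cephes to hand, the upper-tail chi-square probability has a simple closed form when the degrees of freedom are even. The sketch below uses that closed form as a stand-in for igamc; it is not how this module computes the value, the sub name chisq_upper_even_df is hypothetical, and the degrees of freedom are assumed to be known (2 in the SYNOPSIS example).

```perl
use strict;
use warnings;

# Upper-tail chi-square probability for an EVEN number of degrees of
# freedom, by the closed form
#   Q(x; 2k) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)**j / j!
sub chisq_upper_even_df {
    my ($x, $df) = @_;
    die "df must be a positive even integer\n" if $df < 2 || $df % 2;
    my ($term, $sum) = (1, 1);    # the j = 0 term
    for my $j (1 .. $df / 2 - 1) {
        $term *= ($x / 2) / $j;   # (x/2)**j / j!, built up incrementally
        $sum  += $term;
    }
    return exp(-$x / 2) * $sum;
}

printf "%.3f\n", chisq_upper_even_df(3.161, 2);    # prints 0.206
printf "%.3f\n", chisq_upper_even_df(6.250, 2);    # prints 0.044
```

With 2 degrees of freedom the sum has a single term, so the p-value reduces to exp(-psisq / 2), which matches the rounded values shown in the SYNOPSIS and EXAMPLE sections.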
dump
$vnomes->dump(length => 3, values => {psisq => 1, p_value => 1}, format => 'table|labline|csv', flag => 1, precision_s => 3, precision_p => 7, verbose => 1);
Print Vnome-test results to STDOUT. See dump in the Statistics::Sequences manpage for details. If verbose => 1, then you get (1) the actual test-statistic depending on the value of delta tested (delta^2psi^2 for the second difference measure (default), deltapsi^2 for the first difference measure, and psi^2 for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your length value might be too large with respect to the sample size (see the NIST reference in the discussion of the length option). If text => 1, you just get the average observed and expected frequencies for each v-nome, the Z-value, and its associated p-value.
nnomes
$r = $vnomes->nnomes(length => INT > 0, data => AREF); # supply the data directly, and assume all possible states are in the given sequence
$r = $vnomes->nnomes(length => INT > 0); # assuming data have been "loaded" and all possible states are in the loaded sequence
$r = $vnomes->nnomes(length => INT > 0, states => AREF); # ... or specify the states in case not all in the sequence
Returns the number of possible subsequences of the given length (v) for the given number of states (t). This is the quantity denoted as r in Good's (1953, 1957) papers; i.e.,
r(v) = t^{v}
The method needs to have, of course, a sequence to test. This can be directly given to the method (as the referenced array for the named argument data) or it can be pre-loaded, as by using $vnomes->load(AREF).
Then, the method needs to know two things: the "v" value itself, i.e., the length of the possible subsequences to test (mononomes, dinomes, trinomes, etc.), and the number of states (events, letters, etc.) that the process generating the data could take (from 1 to whatever). The "v" value is always required to be specified by the named argument length, and it should be a positive integer value no greater than the length of the (loaded/given) sequence. The states can be directly given in the named argument states (recommended), or from the states that "empirically" exist in the loaded/given data.
prob_r
$Pr = $vnomes->prob_r(length => $v); # length is 1 (mononomes) by default
Returns the probability of the occurrence of any of the individual elements ("digits") in the sequence (v = 1), or of the given length, assuming they are equally likely and independent.
P_{r} = t^{–v}
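The two quantities r(v) = t^{v} and P_{r} = t^{–v} are simple enough to check by hand; the plain-Perl sketch below does so for a four-state alphabet. The sub names n_possible and prob_each are hypothetical and stand outside the module's interface.

```perl
use strict;
use warnings;

# Number of possible v-nomes for t states: r(v) = t**v
sub n_possible { my ($t, $v) = @_; return $t**$v; }

# Probability of any one v-nome under equiprobability: t**(-v)
sub prob_each { my ($t, $v) = @_; return 1 / $t**$v; }

my @states = qw/A C G T/;
my $t      = scalar @states;
printf "r(3) = %d, P = %.6f\n", n_possible($t, 3), prob_each($t, 3);
# prints: r(3) = 64, P = 0.015625
```

The two values are reciprocals of each other, so the expected frequency of a v-nome can be read either as N divided by r(v) or as N multiplied by P_{r}.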
OPTIONS
Options common to the above stats methods.
length
This is currently a required "option", giving the length of the v-nome of interest, i.e., the value of v: an integer greater than or equal to 1, and smaller than the sample-size.
What is a meaningful maximal value of length? As a chi-square test, it is conventionally required that the expected frequency is at least 5 for each v-nome (Knuth, 1998). This can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2010) recommends that length should be less than the floored value of log2 of the sample-size, minus 2. No tests are here made of these recommendations.
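Since the module does not test these recommendations itself, they can be checked beforehand. The following plain-Perl sketch is one way of doing so; the sub names are hypothetical, and the two rules are coded only as they are stated above (expected frequency of at least 5, and length less than floor(log2(N)) minus 2).

```perl
use strict;
use warnings;

# Integer floor of log2(n), found by doubling to avoid rounding error.
sub floor_log2 {
    my ($n) = @_;
    my $k = 0;
    $k++ while 2**($k + 1) <= $n;
    return $k;
}

# NIST guideline as stated above: length should be LESS than this bound.
sub nist_length_bound {
    my ($n) = @_;
    return floor_log2($n) - 2;
}

# Largest length for which the expected frequency N / t**v is still >= 5
# (circularized count); returns 0 if even mononomes fall short.
sub max_length_expected5 {
    my ($n, $t) = @_;
    die "need at least two states\n" if $t < 2;
    my $v = 0;
    $v++ while $n / $t**($v + 1) >= 5;
    return $v;
}

for my $n (20, 1000) {
    printf "N = %d: NIST bound %d, expected >= 5 up to length %d\n",
        $n, nist_length_bound($n), max_length_expected5($n, 2);
}
# prints:
# N = 20: NIST bound 2, expected >= 5 up to length 2
# N = 1000: NIST bound 7, expected >= 5 up to length 7
```

By either rule, a test of trinomes on a binary sequence of only 20 elements (as in the SYNOPSIS) is beyond the recommended length, which is the situation the verbose warning in dump is meant to flag.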
circularize
By default, observed and expected counts, and the value of psisq, are made by treating the sequence as a cyclic one, where the first element of the sequence follows the last one. This affects (and simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Also, circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) state, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103). These methods, however, can also treat the sequence non-cyclically by calling them with circularize => 0.
states
Optionally send a referenced array listing the unique states (or 'events', 'letters') in the population from which the sequence was sampled, e.g., states => [qw/A C G T/]. This is useful if the sequence itself might not include all the possible states. If this is not specified, the states are identified from the sequence itself. If giving a list of states, a check in each test is made to ensure that the sequence contains only those elements in the list.
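The kind of check described here can be sketched in plain Perl as follows. This is only an illustration of the idea, not the module's own validation code, and the sub name unknown_states is hypothetical.

```perl
use strict;
use warnings;

# Return the distinct elements of @$seq that are not among @$states,
# in order of first appearance; an empty list means the sequence is valid.
sub unknown_states {
    my ($seq, $states) = @_;
    my %ok = map { $_ => 1 } @{$states};
    my %seen;
    return grep { !$ok{$_} && !$seen{$_}++ } @{$seq};
}

my @dna     = qw/A C G T T N A C N/;
my @unknown = unknown_states(\@dna, [qw/A C G T/]);
print @unknown
    ? "unlisted states: @unknown\n"
    : "all elements are listed states\n";
# prints: unlisted states: N
```

Giving the states explicitly also fixes t in the formulas above, so a sequence that happens to lack one of the possible states is still tested against the full set.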
EXAMPLE
Seating at the diner
This is the data from Swed and Eisenhart (1943) also given as an example for the Runs test and Turns test. It lists the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis, or do they show some social phobia (more sparsely seated than "chance"), or are they trying to pick up (more compactly seated than "chance")? What does Good's test of Vnomes reveal?
use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/); # data from Swed & Eisenhart (1943)
$vnomes->load(\@seating); # as per Statistics::Data
$vnomes->dump_vals(delim => q{,}); # via Statistics::Data; prints E,O,E,E,O,E,E,E,O,E,E,E,O,E,O,E
$vnomes->dump(
length => 3,
values => { psisq => 1, p_value => 1 },
format => 'labline',
flag => 1,
precision_s => 3,
precision_p => 3,
circularize => 1,
verbose => 1,
);
This prints:
Vnomes (3): p_value = 0.044*, psisq = 6.250
That is, the observed frequency of each possible trio of seating arrangements (the trinomes OOO, OOE, OEE, EEE, etc.) differed significantly from that expected. Look up the observed frequencies for each possible trinome to see if this is because there are more empty or occupied neighbouring seats ("phobia" or "philia"):
$vnomes->dump(length => 3, values => {observed => 1}, format => 'labline');
This prints:
observed = ('EEO' = 4,'EOO' = 0,'OEO' = 1,'OOO' = 0,'EOE' = 5,'EEE' = 2,'OEE' = 4,'OOE' = 0)
As the chance-expected frequency is 2 (16 / 2^3, from the expected method), there are clearly more trinomes than expected that mostly involve empty seats, and none that involve adjacent occupied seats, suggesting that a non-random factor like social phobia (or body odour?) is at work in sequencing people's seating here. Noting that the sequencing isn't significant for dinomes (with length => 2) might also tell us something about what's going on. What happens for v-nomes of 4 or more in length? Maybe the runs or pot test might be a better summary of what's going on.
DIAGNOSTICS
 vnome length needs to be defined and greater than zero

croaked from various methods (including observed, expected, variance and psisq) if the argument length is not defined, or if it is defined but equals zero.
REFERENCES
Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. Journal of Parapsychology, 38, 393-407.
Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166-176. doi: 10.1037/0033-2909.94.1.166
Gatlin, L. L. (1979). A new measure of bias in finite sequences with applications to ESP data. Journal of the American Society for Psychical Research, 73, 29-43. (Used in reference tests.)
Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. Mathematical Proceedings of the Cambridge Philosophical Society, 49, 276-284.
Good, I. J. (1957). On the serial test for random sequences. Annals of Mathematical Statistics, 28, 262-264. doi: 10.1214/aoms/1177707053
Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of [square-root]2. Journal of the Royal Statistical Society A, 130, 102-107.
Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.
Knuth, D. E. (1998). The art of computer programming (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.
Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf, and July 17, 2013, from http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf (revised).
SEE ALSO (RELATION to NGRAMS)
Vnomes are much the same as N-grams (or Ngrams), except that N-gram analysis is usually defined for whole strings, i.e., where all elements in the string are "chunked" as a single unit of analysis, and then often without respect to how they occur within a broader sequence of events.
Conversely, v-nomes are based on respecting the frequency of each unit of analysis (each element in the "chunk"), and these units are analysed with respect to the sequence from which they're derived, as an ordered list of events.
There are several Perl modules for working with N-gram strings, and others that work with v-nome lists, for one or more orders of sequence (i.e., N or v lengths). For example ...
Algorithm::NGram works with space-delimited strings to produce an n-gram frequency table, among other things.
Lingua::EN::Ngram extracts n-grams from texts, and lists them according to frequency and/or T-Score.
Statistics::Frequency provides frequencies for mononomes (only) within a list.
Statistics::Gtest calculates the likelihood ratio (G-square) statistic (which is asymptotically distributed chi-square, whereas Good's delta^2psi^2 is calculated (by circularization and differencing) to be precisely chi^2 distributed).
Statistics::Lite provides frequencies for mononomes (only) via its frequencies method.
Statistics::Sequences has as its sub-modules other tests of sequences (e.g., Wald's Runs test, Schmidt's Pot test (of clustering, bunching of events in a sequence), Kendall's Turns test) and supports sharing sequences as data between these tests.
AUTHOR/LICENSE
 Copyright (c) 2006-2016 Roderick Garton

rgarton AT cpan DOT org
This program is free software. It may be used, redistributed and/or modified under the same terms as Perl5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).
DISCLAIMER
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.