NAME

Statistics::Gtest - calculate G-statistic for tabular data

SYNOPSIS

   use Statistics::Gtest;

   $gt = Statistics::Gtest->new($data);

   $degreesOfFreedom = $gt->getDF();
   $gstat = $gt->getG();

   $gt->setExpected($expectedvalues);
   $uncorrectedG = $gt->getRawG();

DESCRIPTION

Statistics::Gtest is a class that calculates the G-statistic for goodness of fit for frequency data. It can be used on simple frequency distributions (1-way tables) or for analyses of independence (2-way tables).

Note that Statistics::Gtest will not, by itself, perform the significance test for you -- it just provides the G-statistic that can then be compared with the chi-square distribution to determine significance.

OVERVIEW and EXAMPLES

A goodness of fit test attempts to determine whether an observed frequency distribution differs significantly from a hypothesized frequency distribution. From Statistics::Gtest's point of view, these tests come in two flavors: 1-way tests (where a single frequency distribution is tested against an expected distribution) and 2-way tests (where a matrix of observed values is tested for independence -- that is, the lack of interaction effects between the two axes being measured).

A simple example might help here. You've grown 160 plants from seed produced by a single parent plant. You observe that among the offspring plants, some have spiny leaves, some have hairy leaves, and some have smooth leaves. What is the likelihood that the distribution of this trait follows the expected values for simple Mendelian inheritance?

 Observed values:
   Spiny Hairy Smooth
     95    53    12

 Expected values (for a 9:6:1 ratio):
     90    60    10

If the observed and expected values are put into two files, Statistics::Gtest can create a G-statistic object that measures how far the observed distribution departs from the distribution expected under simple inheritance. (The value of G for this comparison is approximately 1.495, with 2 degrees of freedom; the observed results are not significantly different from expected at the .05 level, or even at the .1 level.)
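
The arithmetic behind that figure can be sketched in plain Perl. This is a minimal illustration that assumes the textbook definition G = 2 * sum(O * ln(O / E)) from Sokal and Rohlf; the helper name raw_g is hypothetical and is not part of Statistics::Gtest.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Statistics::Gtest): uncorrected
# G = 2 * sum( O * ln(O / E) ) over paired observed/expected counts.
sub raw_g {
    my ($observed, $expected) = @_;
    die "observed and expected must have the same length\n"
        unless @$observed == @$expected;
    my $sum = 0;
    for my $i (0 .. $#$observed) {
        $sum += $observed->[$i] * log($observed->[$i] / $expected->[$i]);
    }
    return 2 * $sum;
}

my @observed = (95, 53, 12);    # spiny, hairy, smooth
my @expected = (90, 60, 10);    # hypothesized counts for n = 160

printf "raw G = %.3f, df = %d\n",
    raw_g(\@observed, \@expected), scalar(@observed) - 1;
```

This prints the uncorrected statistic; the corrected value reported by getG should be a little smaller.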

2-way tests will usually not need a table of expected values, as the expected values are generated from the observed value sums. However, one can be loaded for 2-way tables as well.
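
As a rough illustration of how expected values follow from the observed sums, the sketch below applies the usual independence model, E[i][j] = (row total * column total) / grand total. The helper name expected_from_margins is hypothetical; the module's internal calculation is assumed, not confirmed, to be equivalent.

```perl
use strict;
use warnings;

# Hypothetical helper: expected counts under independence,
# E[i][j] = (row i total * column j total) / grand total.
sub expected_from_margins {
    my ($table) = @_;
    my (@row_sum, @col_sum);
    my $total = 0;
    for my $i (0 .. $#$table) {
        for my $j (0 .. $#{ $table->[$i] }) {
            my $count = $table->[$i][$j];
            $row_sum[$i] += $count;
            $col_sum[$j] += $count;
            $total       += $count;
        }
    }
    my @expected;
    for my $i (0 .. $#row_sum) {
        for my $j (0 .. $#col_sum) {
            $expected[$i][$j] = $row_sum[$i] * $col_sum[$j] / $total;
        }
    }
    return \@expected;
}

# Print the expected table for a small 2 x 2 example.
for my $row (@{ expected_from_margins([ [ 10, 20 ], [ 20, 15 ] ]) }) {
    print join('  ', map { sprintf '%.2f', $_ } @$row), "\n";
}
```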

To determine if the calculated G statistic indicates a statistically significant result, you will need to look up the values in a chi-square distribution on your own, or make use of the Statistics::Distributions module:

 use Statistics::Gtest;
 use Statistics::Distributions;

 ...

 my $gt = Statistics::Gtest->new($data);
 my $df = $gt->getDF();
 my $g = $gt->getG();
 my $sig = '.05';
 my $chis = Statistics::Distributions::chisqrdistr($df, $sig);
 if ($g > $chis) {
   print "$g: Sig. at the $sig level. ($chis cutoff)\n";
 }
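
For the special case of 2 degrees of freedom, as in the plant example, the chi-square upper tail has a closed form, P(X > x) = exp(-x / 2), so the comparison can be checked without a table. The sketch below is valid for df = 2 only, and both helper names are hypothetical.

```perl
use strict;
use warnings;

# Valid for df = 2 only: upper-tail probability P(X > x) = exp(-x / 2).
sub chisq2_pvalue {
    my ($x) = @_;
    return exp(-$x / 2);
}

# Inverse of the above: the cutoff that leaves $alpha in the upper tail.
sub chisq2_critical {
    my ($alpha) = @_;
    return -2 * log($alpha);
}

my $g = 1.495;    # approximate G quoted for the plant example
printf "cutoff at .05 = %.3f, p = %.3f\n",
    chisq2_critical(0.05), chisq2_pvalue($g);
```

For any other number of degrees of freedom, use a chi-square table or Statistics::Distributions as shown above.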

By default, Statistics::Gtest returns a G statistic that has been modified by Williams' correction (Williams 1976). This correction reduces the value of G for smaller sample sizes, and has progressively less effect as the sample size increases. The raw, uncorrected G statistic is also available.
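
To show how the correction fades with sample size, the sketch below uses the textbook form for a 1-way table given by Sokal and Rohlf, q = 1 + (a^2 - 1) / (6 * n * (a - 1)), where a is the number of classes and n the total count; the corrected statistic is G / q. That the module uses exactly this form is an assumption, and the helper name williams_q is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: Williams' correction for a 1-way table with
# $classes categories and $n total observations, assuming the textbook
# form q = 1 + (a^2 - 1) / (6 * n * (a - 1)).
sub williams_q {
    my ($classes, $n) = @_;
    return 1 + ($classes**2 - 1) / (6 * $n * ($classes - 1));
}

# The correction factor shrinks toward 1 as the sample grows.
for my $n (16, 160, 1600) {
    printf "n = %4d  q = %.5f\n", $n, williams_q(3, $n);
}
```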

References

  • Sokal, R.R., and F.J. Rohlf. 1981. Biometry. W.H. Freeman and Company, San Francisco.

  • Williams, D.A. 1976. Improved likelihood ratio test for complete contingency tables. Biometrika, 63:33-37.

Public Methods

Constructor

   $g = Statistics::Gtest->new($data);
   $g = new Statistics::Gtest($data);

$data can be in several formats. All of the following are valid:

 * whitespace-delimited string:          "95 53 12"  
 * reference to 1-dimensional array:     [ 95, 53, 12 ]   
 * reference to 2-dimensional array:     [ [ 10, 20 ], [ 20, 15 ] ]
 * external file or filehandle reference.

Data in files must be arranged into rows and columns, separated by whitespace. In all cases, there must be no non-numeric characters, no empty cells, and no zero counts. Plain arrays (as opposed to array references) are not valid input.
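
The constraints above can be checked before the data is handed to the constructor. The following is a hypothetical pre-check, not the module's own parser: parse_counts is an illustrative name, and treating a short row as an empty cell is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical pre-check, not the module's own parser: split text into
# rows and columns and enforce the constraints listed above.
sub parse_counts {
    my ($text) = @_;
    my @rows;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;    # skip blank lines
        my @cells = split ' ', $line;
        for my $cell (@cells) {
            die "non-numeric value '$cell'\n"
                unless $cell =~ /^\d+(?:\.\d+)?$/;
            die "zero count\n" if $cell == 0;
        }
        # A row shorter or longer than the first implies a missing cell.
        die "ragged row (empty cell?)\n"
            if @rows && @cells != @{ $rows[0] };
        push @rows, \@cells;
    }
    return \@rows;
}

my $table = parse_counts("10 20\n20 15");
printf "%d rows x %d columns\n", scalar @$table, scalar @{ $table->[0] };
```

The returned reference has the same shape as the 2-dimensional array reference the constructor accepts.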

getG

   $float = $g->getG();

Returns the corrected G-statistic for the current observed and expected frequency counts.

getRawG

   $float = $g->getRawG();

Returns the uncorrected G-statistic for the current observed and expected frequency counts. This value can be misleadingly large for small sample sizes (n < 200).

getQ

   $float = $g->getQ();

Returns Williams' correction (q) for this test. (See explanation in 'Overview and Examples'.)

getObserved

   $arrayref = $g->getObserved();

Returns an array reference containing the observed cell values. The array is formatted in the same row-column layout as the input data.

getExpected

   $arrayref = $g->getExpected();

Returns an array reference containing the expected cell values. The array is formatted in the same row-column layout as the observed data.

setExpected

   $g->setExpected($string);
   $g->setExpected($arrayref);
   $g->setExpected($filename);
   $g->setExpected($filehandle);

When testing against a specific hypothesized distribution, the expected frequency values for that distribution, given the total sample size, must be supplied to Statistics::Gtest. The input has the same constraints on format as the initial data.
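
A short end-to-end sketch for the plant example, using only the methods documented here. The wrapper plant_example and the require guard are illustrative additions, so that the snippet returns quietly when Statistics::Gtest is not installed.

```perl
use strict;
use warnings;

# Returns a summary hash, or nothing when Statistics::Gtest is not installed.
sub plant_example {
    return unless eval { require Statistics::Gtest; 1 };

    my $gt = Statistics::Gtest->new([ 95, 53, 12 ]);    # observed counts
    $gt->setExpected([ 90, 60, 10 ]);                   # hypothesized counts
    return {
        df    => $gt->getDF(),
        g     => $gt->getG(),       # with Williams' correction
        raw_g => $gt->getRawG(),    # uncorrected
    };
}

if (my $r = plant_example()) {
    printf "df = %s, G = %.3f, raw G = %.3f\n", @{$r}{qw(df g raw_g)};
}
```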

getDF

   $integer = $g->getDF();

Returns the current degrees of freedom for this distribution, which is calculated automatically from the observed data (rows - 1 for 1-way tests, (rows - 1) * (cols - 1) for 2-way tests).

setDF

   $g->setDF($integer);

Sets the degrees of freedom for this distribution. Sometimes this value needs to be modified beyond the standard rules used by Statistics::Gtest; setDF makes this possible.

getRow

   $rowref = $g->getRow($rownum);

Returns a row from the array of observed data. Row numbering is zero-based.

getCol

   $colref = $g->getCol($colnum);

Returns a column from the array of observed data. Column numbering is zero-based.

rowSum

   $integer = $g->rowSum($index);

Returns the sum of the requested row.

colSum

   $integer = $g->colSum($index);

Returns the sum of the requested column.

getRowNum

   $integer = $g->getRowNum();

Returns the number of rows in the data table.

getColNum

   $integer = $g->getColNum();

Returns the number of columns in the data table.

getSumTotal

   $integer = $g->getSumTotal();

Returns the total number of observations.