Jimi Carlo Wills
and 1 contributors

# NAME

Statistics::Reproducibility - Reproducibility measurement between multiple replicate experiments

Version 0.03

# SYNOPSIS

This module facilitates investigation of reproducbility between multiple replicates of quantitative experiments e.g. SILAC or microarray. Scatter plots are great, but only 2d. Some people use correlation as a proxy for reproducibility, but it's not right. This module helps you through the following items...

1) Summarize reproducibility across the replicates 2) Pick out replicates that agree more or less 3) Summarize reproducibility for individual proteins/genes/whatever 4) Set a cutoff for what you can call significant, based on precision 5) Deal with missing values (common in SILAC)

This works by going through the following steps:

(0) Choose a dataset to compare everything else to (the middlemost) 1) Put the middle of the data at (0,0,0,0...) by subtracting the median ... report the median 2) Rotate the data so the line x=y=z=... lies on a single axis. The data should be spread along this axis. 3) Do regression on the data and work out "wrongness" of each replicate (!) 4) Calculate and report ratio variance and imprecision variance 5) Report combined ratio and error for each protein/gene/whatever

Perhaps a little code snippet.

use Statistics::Reproducibility;

my \$r = Statistics::Reproducibility->new();

# SUBROUTINES/METHODS

## derive

derives a new object from an old one... some fields are conserved. Warning: references are copied, so m and c point to the same arrays! However, if you run regression() again, they will point to new arrays. Data is set up with k empty columns.

## data

Set the data. Should be rectangular, i.e. all columns the same length, and we'll check it is and croak if not... but can contain "empty" cells (empty string), which represent missing values in the data.

returns the object for chaining.

## run

runs a recommended workflow. it's a shortcut for:

my \$m = \$r->subtractMedian();
\$m->middlemostColumn();
my \$d = \$m->deDiagonalize();
\$d -> regression();
my \$e = \$d->rotateToRegressionLine();
\$e->variances();

It returns the last object. So you could do:

my \$results = Statistics::Reproducibility
->new()
->data(\$mydata)
->run()
->printableTable(\$depth);

## subtractMedian

calculates the median for each column, substracts from each column and returns the new object.

## middlemostColumn

Calculates which of the columns is middlemost and remembers it so all others are compared to it. This can be done instead of using a constructed median dataset as the comparator so that the constructed one does not add to the spread, and does not contribute to the observation count.

Note: the method of scoring the columns involves counting which has the most middlemost values. For two columns only, the result will always be the one with the least missing values. I don't think there's anything wrong with that, but just so you know!

## constructMedianLeft

Make a median column and pop it on the left. Note that the regular median is used here, not the Quick Median estimator. This means that for an even number of observations, the mean of the two middlemost is used, which is not the case for Quick Median.

## deDiagonalize

Replicated data with some spread will naturally lie along the diagonal line, y=x (in 2 dimensions, or z=y=x... in more). This function aligns the data along one axis by rotation. This is done so that (a) errors are measured approximately perpendicular to the spread of data and (b) unspread data (a ball of points) gives a gradient of zero in Theil Sen estimator, which is correct because if there's no experimental spread then there can be no evidence that the replicates disagree.

Note: at this point, any missing values are REPLACED BY ZEROS! This means that these data point will not disagree with any "unchanging" data, but they will not support the reproducibility of "changed" data (data for proteins/genes) that are regulated). The effect of this is that those points will not appear as extreme in the output and will also have a larger error associated with them.

A NEW object is returned! comparatorIndex is honoured and conserved, meaning that if you ran middlemostColumn, the result is the column used as the Y axis in all comparisons, and the column itself will contain the experimental variance.

## regression

Perform Theil Sen Estimator regression on the data. The regression is done with the comparator on the x axis, but the symmetric parameters are returned for the comparator on the y-axis.

do we need this?

## variances

Calculate variances... i.e. distances from the origin along the line of regression, and distances from the line of regression. This is just like deDiagonalise, except that only two columns are returned.

## printableTable, printTable

printableTable returns all available relevant info in a table printTable prints all available relevant info in a table

the firts element returned is a list of columns. the rest are the columns.

data stored are:

# scalars:
comparatorIndex             # index of column used to compare
k
n
vE                          # variance of "error" (imprecision)
vS                          # variable of experimental spread
sdE                         # s.d. error

# arrays (foreach column)
m                           # regression denominator
c                           # regression constant
# arrays (foreach row)
d                           # distance from regression line
pee                         # p-value of error
pes                         # p-value of error over spread (??)
pse                         # p-value of spread over error

# 2D array (LoL)
data

note that the distance from the center of the distribution
is given by the values in data[comparatorIndex]

These methods take a single argumen: depth. Every time an object is derived from another (subtractMedian, deDiagonalize and rotateToRegressionLine all do this) the old object is referenced, and you can include the last \$depth objects in the output. Set depth to -1 to include all objects.

# just some wee helper functions...

## median

yes this probably exists in other modules, but I didn't want to pull in a whole module for just one funciton. Anyway, this is an inefficient version for small numbers of data. It sorts the list and then uses middle() to find the middle of the sorted list.

## medianN

Like median, but for an even list is returns the two middlemost values. This version is used in medianI.

## medianI

This uses medianN to get the middlemost value(s) and then returns a list of column indices indicating which columns had a middlemost value. This is used in the medianLeft method when deciding which column is middlemost.

## middle

middle returns the middlemost item in a list, or the mean average of the two middlemost items. It doesn't sort the list first.

## middleN

middleN does like middle, but for even lists, it returns the two middlemost items as a list. This is used by medianN.

# AUTHOR

Jimi Wills, <jimi at webu.co.uk>

# BUGS

Please report any bugs or feature requests to bug-statistics-reproducibility at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Reproducibility. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

# SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Statistics::Reproducibility

You can also look for information at:

# ACKNOWLEDGEMENTS

This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at: