Jimi Carlo Wills


Bio::MaxQuant::Evidence::Statistics - Additional statistics on your SILAC evidence


Version 0.01


Read/convert your evidence file to a more rapidly processable format, and perform various operations and statistics across/between multiple experiments. Supports multidimensional experiments with replicate analyses.

    use Bio::MaxQuant::Evidence::Statistics;

    my $foo = Bio::MaxQuant::Evidence::Statistics->new();
    # get the essential data from an evidence file

    # store the essentials for later

    # load previously stored essentials



Create a new object:

    my $foo = Bio::MaxQuant::Evidence::Statistics->new();


Reads the essential data from an evidence file. Evidence files for large analyses can be very big and take a long time to process, so we only read what's necessary. The result can also be saved for convenience and speed, using writeEssentials().

The data are stored by Protein group IDs, i.e. one entry per protein group. Other data stored here are:

Protein group IDs
Modified -- is this actually the right name??
Leading Proteins
Ratio H/L
Intensity H
Intensity L

The column names used for storage are defined in the default option essential_column_names, and can be changed when you call new, or when you call parseEssentials. The option is a hash of column names; the truthiness of each value determines whether the column is kept, e.g.

        'id'  => 1, # kept
        'PEP' => 0, # discarded
        #foo  => ?, # discarded

If a named column doesn't exist in the file, the method does not complain!
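
The keep/discard rule can be sketched in plain Perl. This is only an illustration of the behaviour described above, not the module's own code: the row contents and the 'No such' column are made up, and the filtering expression is an assumption about how such a rule might be written.

```perl
use strict;
use warnings;

# Hypothetical option hash in the style described above.
my %essential_column_names = (
    'id'        => 1,    # kept
    'Ratio H/L' => 1,    # kept
    'PEP'       => 0,    # discarded (false value)
    'No such'   => 1,    # requested but absent from the row: no complaint
);

# One made-up evidence row, as column name => value.
my %row = ( 'id' => 17, 'Ratio H/L' => 1.8, 'PEP' => 0.001, 'Score' => 92 );

# Keep a column only if it is flagged true AND present in the row.
my %kept;
for my $col ( keys %essential_column_names ) {
    next unless $essential_column_names{$col};
    next unless exists $row{$col};
    $kept{$col} = $row{$col};
}

for my $col ( sort keys %kept ) {
    print "$col=$kept{$col}\n";
}
```

Columns that are not mentioned in the hash at all ('Score' here) are dropped in the same way as columns flagged false.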

The method takes a hash of options.


filename - path of the file to process
separator - passed to Text::CSV (default is tab)
key_column_name - change the column keyed on (default is id)
experiment_column_name - change the column the data are split on
list_column_names - change the columns stored as lists


Some columns are the same across all the evidence in a protein group, e.g. the id is obviously the same, as are Contaminant and Reverse, Protein IDs, and so on. The default, therefore, is to overwrite the column with the value seen in each evidence. BUT, some columns have a different value in each evidence, e.g. Ratio H/L or PEP. Whatever columns are given in list_column_names with true values will be appended as lists, so in the final data there will be one row per protein, and any column bearing multiple evidences for that protein will be a list.

If that makes no sense, write to me and I'll try to change it.
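
A small sketch may help here. It mimics the overwrite-or-append rule with made-up rows; the data structure and variable names are assumptions for illustration, and the split by experiment is left out to keep it short.

```perl
use strict;
use warnings;

# Columns flagged true are accumulated as lists; all others are overwritten.
my %list_column_names = ( 'Ratio H/L' => 1, 'PEP' => 1 );

# Made-up evidence rows; two of them belong to protein group 5.
my @evidences = (
    { 'Protein group IDs' => 5, 'Leading Proteins' => 'P12345', 'Ratio H/L' => 1.2, 'PEP' => 0.01 },
    { 'Protein group IDs' => 5, 'Leading Proteins' => 'P12345', 'Ratio H/L' => 0.9, 'PEP' => 0.02 },
    { 'Protein group IDs' => 8, 'Leading Proteins' => 'Q67890', 'Ratio H/L' => 2.4, 'PEP' => 0.03 },
);

my %by_group;
for my $ev (@evidences) {
    my $record = $by_group{ $ev->{'Protein group IDs'} } ||= {};
    for my $col ( keys %$ev ) {
        if ( $list_column_names{$col} ) {
            push @{ $record->{$col} }, $ev->{$col};    # one value per evidence
        }
        else {
            $record->{$col} = $ev->{$col};             # same in every evidence
        }
    }
}

print "group 5 ratios: @{ $by_group{5}{'Ratio H/L'} }\n";
print "group 5 leading proteins: $by_group{5}{'Leading Proteins'}\n";
```

Group 5 ends up with a two-element list of ratios but a single scalar for Leading Proteins.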


Returns a list of the experiments in the data.


Returns a list of the experiment names without the replicate portion.

The names are assumed to be Cell.Condition.Replicate, i.e. full-stop (period) separated.
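
As a minimal sketch of that naming assumption (the experiment names below are invented), the replicate portion can be dropped by removing everything from the last full stop onwards and keeping each remaining name once:

```perl
use strict;
use warnings;

# Hypothetical experiment names in Cell.Condition.Replicate form.
my @experiments = qw( A.X.1 A.X.2 A.Y.1 B.X.1 B.X.2 );

my ( %seen, @conditions );
for my $name (@experiments) {
    ( my $condition = $name ) =~ s/\.[^.]+\z//;    # drop the final ".Replicate"
    push @conditions, $condition unless $seen{$condition}++;
}

print "@conditions\n";
```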


Returns a list of sets of orthogonal experiments, that is, three experiments in which the first has one condition in common with each of the other two, but those two have nothing in common with each other.

e.g. A.X A.Y B.X

The rationale behind this is that quantitative differences across this set indicate mechanistic links between, for example, cell line and drug treatment. If a response is seen to a drug, and a different response is seen in a different cell type, this system will pick that up. The fourth member of the comparison (in the example that would be B.Y) could be anything, and the interpretation would still be that there is a differential response.
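
One way to enumerate such sets is sketched below. It assumes two-part Cell.Condition names with the replicate already removed, and it treats each experiment in turn as the first member; whether the module orders or de-duplicates its sets in this way is not stated above, so read it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical replicate-free experiment names (Cell.Condition).
my @conditions = qw( A.X A.Y B.X B.Y );

my @sets;
for my $pivot (@conditions) {
    my ( $cell, $treatment ) = split /\./, $pivot;
    for my $by_cell (@conditions) {
        my ( $c1, $t1 ) = split /\./, $by_cell;
        next unless $c1 eq $cell && $t1 ne $treatment;    # shares the cell only
        for my $by_treatment (@conditions) {
            my ( $c2, $t2 ) = split /\./, $by_treatment;
            next unless $t2 eq $treatment && $c2 ne $cell;    # shares the treatment only
            push @sets, [ $pivot, $by_cell, $by_treatment ];
        }
    }
}

print join( ' ', @$_ ), "\n" for @sets;
```

The second and third members differ in both cell and treatment by construction, which is the "nothing in common" requirement.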


Returns a list of pairs of replicated experiments (e.g. A.X A.Y, A.X B.X ...) that represents all possible comparisons.
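
A minimal sketch of "all possible comparisons", assuming each unordered pair should appear exactly once (the names are made up):

```perl
use strict;
use warnings;

my @conditions = qw( A.X A.Y B.X );

my @pairs;
for my $i ( 0 .. $#conditions - 1 ) {
    for my $j ( $i + 1 .. $#conditions ) {
        push @pairs, [ $conditions[$i], $conditions[$j] ];
    }
}

print join( ' ', @$_ ), "\n" for @pairs;
```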


Returns a list of evidence ids in the data.


Returns a list containing the ids of those evidences shared between protein groups.


Returns a list containing the ids of those evidences unique to one protein group.
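
The shared/unique distinction can be sketched as follows. The example assumes that an evidence belonging to several protein groups lists their ids separated by semicolons in its Protein group IDs value; the ids and the hash layout are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical map of evidence id => Protein group IDs value.
my %groups_of_evidence = (
    101 => '5',
    102 => '5;8',    # shared between groups 5 and 8
    103 => '8',
);

my ( @shared, @unique );
for my $id ( sort { $a <=> $b } keys %groups_of_evidence ) {
    my @groups = split /;/, $groups_of_evidence{$id};
    if ( @groups > 1 ) { push @shared, $id }
    else               { push @unique, $id }
}

print "shared: @shared\n";
print "unique: @unique\n";
```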


Save the essential data (quicker to read again in future)


Load up previously saved essentials






Logs ratios (base 2) throughout the dataset, and sets a flag so it can't get logged again.

Treatment of "special values": empty string, <= 0, NaN, and any other non-number are removed from the data!
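
A sketch of that treatment, under the assumption that "removed" means the value is simply dropped from the list. The helper name log2_ratios is hypothetical and is not part of this module.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: base-2 log of each usable ratio, dropping the rest.
sub log2_ratios {
    my @logged;
    for my $value (@_) {
        next unless defined $value && looks_like_number($value);    # '' and text
        next if $value != $value;                                   # NaN
        next if $value <= 0;                                        # zero and negatives
        push @logged, log($value) / log(2);
    }
    return @logged;
}

my @logged = log2_ratios( 1, 2, 4, '', 0, -1, 'NaN', 'abc' );
printf "%d values survive\n", scalar @logged;
```

Only the ratios 1, 2 and 4 survive in this example; the other five entries are discarded.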


Returns a set of protein records based on filter parameters:


experiment - regular expression to match experiment name
proteinGroupId - regular expression to match protein group id
leadingProteins - regular expression to match leading protein ids
notLeadingProteins - regular expression to not match leading protein ids

Returns a filtered object of the same type, with relevant flags set (e.g. whether data has been logged, etc).

Warning, intentionally does not perform a deep clone!


Options are passed to filter.


Returns a hashref with the following keys:

n - the number of items
sd - the standard deviation (from the mean)
mad - the median absolute deviation (from the median)
sd_via_mad - the standard deviation estimated from the median absolute deviation
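
The four quantities can be sketched in plain Perl as below. The helper names are hypothetical, and the scaling constant 1.4826 (the usual factor for normally distributed data) is an assumption; the text above does not say which factor the module uses.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: median of a list of numbers.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid    = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# Hypothetical helper: needs at least two values.
sub deviations_of {
    my @values = @_;
    my $n      = @values;
    my $mean   = sum(@values) / $n;
    my $usv    = sum( map { ( $_ - $mean )**2 } @values ) / ( $n - 1 );
    my $median = median_of(@values);
    my $mad    = median_of( map { abs( $_ - $median ) } @values );
    return {
        n          => $n,
        sd         => sqrt($usv),      # from the mean
        mad        => $mad,            # from the median
        sd_via_mad => 1.4826 * $mad,   # assumed normal-consistency factor
    };
}

my $stats = deviations_of( 1, 2, 3, 4, 100 );
printf "n=%d sd=%.2f mad=%.2f sd_via_mad=%.2f\n",
    @{$stats}{qw(n sd mad sd_via_mad)};
```

With the single outlier 100 in the example, sd is inflated to about 43.6 while sd_via_mad stays near 1.5, which is why the MAD-based estimate is attractive for peptide ratios.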


Given a list of values, returns the mean.

sd (unbiased standard deviation)

Given a list of values, returns a hash with keys mean and sd (standard deviation).


Given a list of values, returns the sum.


Given a list of values, returns the median absolute deviation.


Given options, experiment1, experiment2 and optional filters, returns a hash of statistics...

stats1 and stats2 are hashes of deviations: sd, mad, sd_via_mad, usv, n, values

ttest is a hash of Welch's t-test results: t, df, p

ttest_mad is like ttest but based on medians and median absolute deviations.

The p-values are derived using Welch's t-test and the t-distribution function from Statistics::Distributions.

MAD and medians are much more robust to outliers, which are significant in peptide ratios.


Performs Welch's t-test, given mean1, mean2, usv1, usv2, n1 and n2 in a hash.


    $o->welchs_ttest( mean1 => 4, mean2 => 3,   # sample means
                      usv1  => 1, usv2  => 1.1, # unbiased sample variances (returned as usv from $o->sd)
                      n1    => 4, n2    => 7 ); # numbers of observations

Also applies the Welch-Satterthwaite equation to calculate the degrees of freedom (to look up in a t-statistic table).

Returns hashref containing t and df.
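
For reference, the textbook calculation is sketched below with the same input keys. The helper name welch_sketch is hypothetical and this is not the module's source. It computes t as the difference of the means divided by sqrt(usv1/n1 + usv2/n2), and df from the Welch-Satterthwaite equation.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the calculation described above.
sub welch_sketch {
    my %o  = @_;
    my $q1 = $o{usv1} / $o{n1};    # squared standard error of mean 1
    my $q2 = $o{usv2} / $o{n2};    # squared standard error of mean 2

    my $t  = ( $o{mean1} - $o{mean2} ) / sqrt( $q1 + $q2 );

    # Welch-Satterthwaite approximation to the degrees of freedom.
    my $df = ( $q1 + $q2 )**2
           / ( $q1**2 / ( $o{n1} - 1 ) + $q2**2 / ( $o{n2} - 1 ) );

    return { t => $t, df => $df };
}

my $result = welch_sketch(
    mean1 => 4, mean2 => 3,
    usv1  => 1, usv2  => 1.1,
    n1    => 4, n2    => 7,
);
printf "t = %.3f, df = %.3f\n", $result->{t}, $result->{df};
```

The degrees of freedom are generally not a whole number, so a table lookup needs rounding or interpolation, whereas a t-distribution function can take the value directly.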


Logs data, if not already done, calculates median for each replicate, and subtracts median from each evidence in that replicate.
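
The normalisation step can be sketched as follows, assuming the ratios are already logged and are held as one list per replicate. The data layout and the helper name are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: median of a list of numbers.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid    = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# Made-up log2 ratios, one list per replicate.
my %log_ratios = (
    'A.X.1' => [ 0.9, 1.1, 1.3 ],
    'A.X.2' => [ -0.2, 0.0, 0.4, 0.6 ],
);

my %normalised;
for my $replicate ( sort keys %log_ratios ) {
    my $centre = median_of( @{ $log_ratios{$replicate} } );
    $normalised{$replicate} = [ map { $_ - $centre } @{ $log_ratios{$replicate} } ];
}

printf "%s median after normalisation: %.3f\n",
    $_, median_of( @{ $normalised{$_} } )
    for sort keys %normalised;
```

After the subtraction every replicate has a median log ratio of zero, so replicates with different overall offsets become comparable.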


Given a list of numbers, returns the median. Assumes all items are numbers!



Does a full comparison on a particular protein, i.e. compares all pairs of conditions, also does differential response analysis. Allows limitation of analysis to proteotypic peptides.


Does a full comparison for each protein. Returns hash of hashes.


Given two values, returns whether the difference between the first and second is positive or negative.

Returns '+' or '-'.


Given two directions, which could be '+', '-' or '', returns true if one is '+' and the other is '-'.
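
These two helpers can be sketched as below. The function names are hypothetical, the difference is assumed to be first minus second, and returning '' for equal values is an assumption based on the empty direction mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: sign of (first - second) as '+', '-' or ''.
sub direction_of {
    my ( $first, $second ) = @_;
    my $difference = $first - $second;
    return $difference > 0 ? '+'
         : $difference < 0 ? '-'
         :                   '';
}

# Hypothetical helper: true only for one '+' and one '-'.
sub are_opposite {
    my ( $d1, $d2 ) = @_;
    my $opposite = ( $d1 eq '+' && $d2 eq '-' )
                || ( $d1 eq '-' && $d2 eq '+' );
    return $opposite ? 1 : 0;
}

my $in_a = direction_of( 1.4, 0.2 );    # ratio goes one way in cell line A
my $in_b = direction_of( 0.1, 0.9 );    # and the other way in cell line B
print "opposite responses\n" if are_opposite( $in_a, $in_b );
```

An empty direction never counts as opposite to anything, so a protein that does not change in one comparison is not reported as a differential response by this test alone.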


jimi, <j at 0na.me>


Please report any bugs or feature requests to bug-bio-maxquant-evidence-statistics at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-MaxQuant-Evidence-Statistics. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.


You can find documentation for this module with the perldoc command.

    perldoc Bio::MaxQuant::Evidence::Statistics




Copyright 2014 jimi.

This program is released under the following license: artistic2