NAME
Bio::fastAPD -- rapid calculation of average pairwise difference (APD) for multiple sequence alignments
VERSION
Version 1.01
SYNOPSIS
use Bio::fastAPD;
my $file_name = "example_data.fasta";
open(my $input_fh, '<', $file_name) or croak ("Could not open $file_name\n");
chomp(my @fasta_lines = <$input_fh>);
close $input_fh;
# Create an array of aligned sequences
my @sequences;
my $curr_seq;
foreach my $line (@fasta_lines) {
if (substr($line, 0, 1) eq ">") {
if ($curr_seq) { push(@sequences, $curr_seq) }
$curr_seq = ();
}
else { $curr_seq .= $line }
}
if ($curr_seq) { push(@sequences, $curr_seq) }
my $fastAPD_obj = fastAPD->new();
$fastAPD_obj->initialize(seq_array_ref => \@sequences,
alphabet => 'dna');
my $apd = $fastAPD_obj->calculate_diversity(method => 'fast_apd',
compare => 'gap_base');
my $std_err = $fastAPD_obj->calculate_apd_std_err(method => 'fast_apd',
compare => 'gap_base');
my $num_reads = $fastAPD_obj->n_reads;
my $num_positions = $fastAPD_obj->width;
print join("\t", qw(File APD StdErr Positions Reads)), "\n";
print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";
# OR
use Bio::fastAPD;
use Bio::AlignIO;
my $file_name = 'example_data.fasta';
my $alignio_obj = Bio::AlignIO->new( -file => $file_name,
-format => 'fasta',
-alphabet => 'dna');
my $aln_obj = $alignio_obj->next_aln();
# Create an array of aligned sequences
my @sequences;
foreach my $seq_obj ($aln_obj->each_seq) { push(@sequences, $seq_obj->seq()) }
my $fastAPD_obj = fastAPD->new();
$fastAPD_obj->initialize(seq_array_ref => \@sequences,
alphabet => 'dna');
my $apd = $fastAPD_obj->calculate_diversity(method => 'fast_apd',
compare => 'gap_base');
my $std_err = $fastAPD_obj->calculate_apd_std_err(method => 'fast_apd',
compare => 'gap_base');
my $num_reads = $fastAPD_obj->n_reads;
my $num_positions = $fastAPD_obj->width;
print join("\t", qw(File APD StdErr Positions Reads)), "\n";
print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";
DESCRIPTION
The Bio::fastAPD module provides computationally efficient methods for the calculation of average pairwise difference (APD), a measure of nucleotide diversity, from multiple sequence alignment (MSA) data. This module also provides rapid standard error estimation of the APD using an efficient jackknife resampling method. Further description of the methods implemented in this module, including mathematical justification, will be available in an upcoming peer-reviewed journal article.
CONSTRUCTOR
my $fastAPD_obj = fastAPD->new();
INITIALIZER
The initialization subroutine accepts a reference to an array of sequence reads from a multiple sequence alignment. It accepts alphabet designations of 'dna', 'rna' or 'protein'. In order to ignore specific positions in the sequence alignment a binary mask variable may be supplied consisting of a string of 1's (evaluate) and 0's (ignore) of length equal to the length of the alignment. By default, all positions in the alignment are evaluated.
Acceptable characters -
'dna' - ACGT, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)
'rna' - ACGU, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)
'protein' - ACDEFGHIKLMNPQRSTVWY, '*' for stop codon, 'X' for missing amino acid, '-', '~', or '.' for gaps (alignment padding)
$fastAPD_obj->initialize(-seq_ref => \@sequences,
-alphabet => 'dna',
-mask => $mask);
OBJECT METHODS
Additional info in the Appendix.
fastAPD_obj->calculate_diversity(-method => 'fast_apd',
-compare => 'gap_base')
# Perform specified APD calculation.
fastAPD_obj->calculate_apd_std_err(-method => 'fast_apd',
-compare => 'gap_base')
# Estimate standard error of the specified APD calculation.
fastAPD_obj->gap_threshold()
# Set or get max proportion of gap symbols (-,~,.) allowable for a valid position
fastAPD_obj->null_threshold()
# Set or get max proportion of N symbols allowable for a valid position
fastAPD_obj->end_threshold()
# Set or get max proportion of ragged end (-,~,.) symbols allowable for a valid position
fastAPD_obj->n_reads()
# Get the number of reads
n_valid_positions
# Get the number of positions in the alignment which meet the analysis criteria
valid_positions
# Get the positions in the alignment which meet the analysis criteria
fastAPD_obj->width()
# Get the width of the alignment
fastAPD_obj->freqs()
# Get the frequencies of each symbol at each position
fastAPD_obj->consensus_alignment()
# Get the consensus sequence of the aligned input sequences
AUTHORS
Joseph D. Baugher, PhD <jbaughe2@jhmi.edu> Fernando J. Pineda, PhD <fernando.pineda@jhu.edu>
MAINTAINER
Joseph D. Baugher, PhD <jbaughe2@jhmi.edu>
LICENSE AND COPYRIGHT
Copyright (c) 2014 Joseph D. Baugher (<jbaughe2@jhmi.edu>). All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
APPENDIX
The following documentation describes the fastAPD module implementation.
new
Title : new
Usage : $fastAPD_obj->new();
Function: Creates a fastAPD object
Returns : A blessed reference
Args : No arguments
initialize
Title : initialize
Usage : $fastAPD_obj->initialize(seq_ref => \@sequences,
alphabet => 'dna',
mask => $mask);
Function: Initializes a fastAPD object.
1. Initializes internal variables.
2. Prepares sequence reads for analysis.
3. Counts symbol frequencies.
4. Performs quality checks for erroneous symbols
5. Defines valid positions for analysis
Returns :
Args : -seq_ref => \@sequences
a reference to an array of sequences from an MSA
-alphabet => 'dna'
Alphabet designations of 'dna', 'rna' or 'protein'.
Acceptable characters -
'dna' - 'ACGT', 'N' for missing base, '-', '~', or '.' for gaps(padding)
'rna' - 'ACGU', 'N' for missing base, '-', '~', or '.' for gaps(padding)
'protein' - 'ACDEFGHIKLMNPQRSTVWY', '*' for stop codon, 'X' for missing
amino acid, '-', '~', or '.' for gaps (alignment padding)
-mask => $mask
In order to ignore specific positions in the sequence alignment a mask
variable should be created consisting of a string of 1's (evaluate) and 0's
(ignore) of length equal to the length of the alignment. By default, all
positions in the alignment are evaluated.
Calls : _standardize_the_read, _accumulate_symbol_frequencies, _qc_symbols,
_define_valid_positions
_standardize_the_read
Title : _standardize_the_read
Usage : $standardized_seq_string = _standardize_the_read($seq_string);
Function: Replaces ragged end (-) padding symbols with (#) symbols to differentiate
between internal gaps. Converts all symbols to uppercase.
Returns : A variable containing the standardized sequence
Args : A variable containing the input sequence
_accumulate_symbol_frequencies
Title : _accumulate_symbol_frequencies
Usage : _accumulate_symbol_frequencies();
Function: Counts and stores frequencies of each symbol at each position.
Returns : 1
Args : None
_qc_symbols
Title : _qc_symbols
Usage : _qc_symbols();
Function: Performs error checking for acceptable symbols. May carp(warn) or croak(die).
Returns : 1
Args : None
_define_valid_positions
Title : _define_valid_positions
Usage : _define_valid_positions();
Function: Builds array of valid alignment positions for analysis (below the null, gap,
and end thresholds and are not masked).
Returns : 1 or undef if no positions are valid
Args : None
calculate_diversity
Title : calculate_diversity
Usage : $apd = $fastAPD_obj->calculate_diversity(method => 'fast_apd',
compare => 'gap_base');
Function: Returns appropriate diversity result
Returns : APD value ($_D)
Args : method => 'fast_apd' or 'apd_by_pos'
Default = 'fast_apd'.
compare => 'gap_base'
Default = 'gap_base'.
Defines acceptable pairwise comparisons
'gap_base' - pairwise deletion
- valid comparisons = base to base and gap to base
'base_base' - pairwise deletion
- valid comparisons = base to base only
'gap_gap' - pairwise deletion
- valid comparisons = base to base, gap to base, gap to gap
'complete_del' - complete deletion
- all positions containing any missing data ignored
Calls : _create_alpha_mask, _fast_apd or _apd_by_pos
calculate_apd_std_err
Title : calculate_apd_std_err
Usage : $se = $fastAPD_obj->calculate_apd_std_err(method => 'fast_apd',
compare => 'gap_base');
Function: Iterates over the number of reads ($_K) adjusting the original frequency
counts to ignore one read at each iteration. Calls an APD subroutine at each
iteration and stores the diversity values. Calls _jackknifeSE to calculate
the standard error.
Returns : The standard error of the APD ($_seD) as estimated by jackknife resampling.
Args : method => 'fast_apd' or 'apd_by_pos'
Default = 'fast_apd'.
compare => 'gap_base'
Default = 'gap_base'.
Calls : _create_alpha_mask, _fast_apd or _apd_by_pos, _mean, _jackknifeSE
_fast_apd
Title : _fast_apd
Usage : _fast_apd();
Function: Performs the default APD calculation - nucleotide diversity per pair of bases
(ratio of sums).
APD ($_d) = Total Mismatches ($_M) / Total Pairwise comparisons ($_P)
where $_M = $_P - Total Matches (%matches)
where matches = binomial coefficient(frequency of a given symbol, 2)
summed over appropriate symbols and valid positions
where $_P = binomial coefficient(number of rows at a given position, 2)
summed over valid positions
The results are corrected for gap-gap comparisons depending on the comparison
type input.
Returns : The diversity value ($_d)
Args : None
Calls : _choose_2
_apd_by_pos
Title : _apd_by_pos
Usage : _apd_by_pos();
Function: Performs an alternative fast APD calculation - nucleotide diversity per
position (sum of ratios).
APD ($_d) = $sum_d / number of valid positions
where $sum_d = (pairs ($p) - matches) / $p, summed over each position
where matches = binomial coefficient(frequency of a given symbol, 2)
at a given position
where $_p = binomial coefficient(number of rows at a given position, 2)
at a given position
The results are corrected for gap-gap comparisons depending on the comparison
type input.
Returns : The diversity value ($_d)
Args : None
Calls : _choose_2
_create_alpha_mask
Title : _create_alpha_mask
Usage : _create_alpha_mask();
Function: Creates a binary mask (1 = valid for pairwise comparisons; 0 = invalid)
corresponding to all symbols, dependent on the comparison_type input.
Calls _define_valid_positions in case the valid positions should be
refined for complete_del comparison type.
Returns : 1
Args : None
Calls : _choose_2, _define_valid_positions
_choose_2
Title : _choose_2
Usage : _choose_2($number_of_pairs);
Function: A simplified binomial coefficient calculation - the number of ways of
choosing k outcomes from n possibilities, where k = 2.
Returns : the value of the binomial coefficient
Args : the number of possibilities, n
_jackknifeSE
Title : _jackknifeSE
Usage : _jackknifeSE(@diversity_values);
Function: Calculates the standard error based on jackknife resampling
Returns : the standard error
Args : an array of values obtained from the iterative jackknife leave-one-out process
Calls : _mean
_mean
Title : _mean
Usage : _mean(@values);
Function: Calculates the mean of an array of values
Returns : the mean
Args : an array of values
gap_threshold
Title : gap_threshold
Usage : $fastAPD_obj->gap_threshold(0.1) or $fastAPD_obj->gap_threshold();
Function: If an argument is provided, this function sets the gap threshold.
Returns the gap threshold.
Returns : the gap threshold
Args : optional value for setting the gap threshold
null_threshold
Title : null_threshold
Usage : $fastAPD_obj->null_threshold(0.1) or $fastAPD_obj->null_threshold();
Function: If an argument is provided, this function sets the null threshold.
Returns the null threshold.
Returns : the null threshold
Args : optional value for setting the null threshold
end_threshold
Title : end_threshold
Usage : $fastAPD_obj->end_threshold(0.1) or $fastAPD_obj->end_threshold();
Function: If an argument is provided, this function sets the ragged end threshold.
Returns the ragged end threshold.
Returns : the ragged end threshold
Args : optional value for setting the ragged end threshold
n_reads
Title : n_reads
Usage : $fastAPD_obj->n_reads();
Function: Returns the number of reads in the input alignment
Returns : the number of reads
Args : none
n_valid_positions
Title : n_valid_positions
Usage : $fastAPD_obj->n_valid_positions();
Function: Returns the number of positions in the alignment which meet the analysis
criteria
Returns : the number of valid positions
Args : none
valid_positions
Title : valid_positions
Usage : $fastAPD_obj->valid_positions();
Function: Returns an array reference to the positions in the alignment which meet
the analysis criteria
Returns : an array reference to the valid positions
Args : none
width
Title : width
Usage : $fastAPD_obj->width();
Function: Returns the width of the alignment
Returns : the width of the alignment
Args : none
freqs
Title : freqs
Usage : $fastAPD_obj->freqs();
Function: Returns frequencies of each symbol at each position
Returns : a reference to an array containing frequencies of each symbol at each position
Args : none
consensus_alignment
Title : consensus_alignment
Usage : $fastAPD_obj->consensus_alignment();
Function: Returns a consensus sequence based on the aligned reads
Returns : a consensus sequence based on the aligned reads
Args : none
Calls : _calculate_aligned_consensus
_calculate_aligned_consensus
Title : _calculate_aligned_consensus
Usage : _calculate_aligned_consensus();
Function: Creates a consensus sequence by calling _argmax to detect the most
frequent symbol at each position in the alignment
Returns : a consensus sequence based on the aligned reads
Args : none
Calls : _argmax
_argmax
Title : _argmax
Usage : _argmax();
Function: Detects the most frequently occurring symbol at a position
Returns : the most frequently occurring symbol
Args : a hash containing the frequencies of each symbol at a given position