Bio::fastAPD -- rapid calculation of average pairwise difference (APD) for multiple sequence alignments
Version 1.10.0
use Bio::fastAPD; my $file_name = "example_data.fasta"; open(my $input_fh, '<', $file_name) or croak ("Could not open $file_name\n"); chomp(my @fasta_lines = <$input_fh>); close $input_fh; # Create an array of aligned sequences my @sequences; my $curr_seq; foreach my $line (@fasta_lines) { if (substr($line, 0, 1) eq ">") { if ($curr_seq) { push(@sequences, $curr_seq) } $curr_seq = (); } else { $curr_seq .= $line } } if ($curr_seq) { push(@sequences, $curr_seq) } my $fastAPD_obj = Bio::fastAPD->new(); $fastAPD_obj->initialize(seq_array_ref => \@sequences, alphabet => 'dna'); my $apd = $fastAPD_obj->apd('gap_base'); my $std_err = $fastAPD_obj->std_err('gap_base'); my $num_reads = $fastAPD_obj->n_reads; my $num_positions = $fastAPD_obj->width; print join("\t", qw(File APD StdErr Positions Reads)), "\n"; print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n"; # OR use Bio::fastAPD; use Bio::AlignIO; my $file_name = 'example_data.fasta'; my $alignio_obj = Bio::AlignIO->new( -file => $file_name, -format => 'fasta', -alphabet => 'dna'); my $aln_obj = $alignio_obj->next_aln(); # Create an array of aligned sequences my @sequences; foreach my $seq_obj ($aln_obj->each_seq) { push(@sequences, $seq_obj->seq()) } my $fastAPD_obj = Bio::fastAPD->new(); $fastAPD_obj->initialize(seq_array_ref => \@sequences, alphabet => 'dna'); my $apd = $fastAPD_obj->apd('gap_base'); my $std_err = $fastAPD_obj->std_err('gap_base'); my $num_reads = $fastAPD_obj->n_reads; my $num_positions = $fastAPD_obj->width; print join("\t", qw(File APD StdErr Positions Reads)), "\n"; print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";
The Bio::fastAPD module provides a computationally efficient method for the calculation of average pairwise difference (APD), a measure of nucleotide diversity, from multiple sequence alignment (MSA) data. This module also provides rapid standard error estimation of the APD using an efficient jackknife resampling method. Further description of the methods implemented in this module, including mathematical justification, will be available in an upcoming peer-reviewed journal article.
my $fastAPD_obj = Bio::fastAPD->new();
The initialization subroutine accepts a reference to an array of sequence reads from a multiple sequence alignment. It accepts alphabet designations of 'dna', 'rna' or 'protein'. In order to ignore specific positions in the sequence alignment a binary mask variable may be supplied consisting of a string of 1's (evaluate) and 0's (ignore) of length equal to the length of the alignment. By default, all positions in the alignment are evaluated.
Acceptable characters -
'dna' - ACGT, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)
'rna' - ACGU, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)
'protein' - ACDEFGHIKLMNPQRSTVWY, '*' for stop codon, 'X' for missing amino acid, '-', '~', or '.' for gaps (alignment padding)
$fastAPD_obj->initialize(-seq_ref => \@sequences, -alphabet => 'dna', -mask => $mask);
Additional info in the Appendix.
fastAPD_obj->apd('gap_base') # Perform rapid APD calculation. fastAPD_obj->std_err('gap_base') # Estimate standard error of the APD result. fastAPD_obj->gap_threshold() # Set or get max proportion of gap symbols (-,~,.) allowable for a valid position fastAPD_obj->null_threshold() # Set or get max proportion of N symbols allowable for a valid position fastAPD_obj->end_threshold() # Set or get max proportion of ragged end (-,~,.) symbols allowable for a valid position fastAPD_obj->n_reads() # Get the number of reads fastAPD_obj->n_valid_positions() # Get the number of positions in the alignment which meet the analysis criteria fastAPD_obj->valid_positions() # Get the positions in the alignment which meet the analysis criteria fastAPD_obj->width() # Get the width of the alignment fastAPD_obj->freqs() # Get the frequencies of each symbol at each position fastAPD_obj->consensus_alignment() # Get the consensus sequence of the aligned input sequences
Joseph D. Baugher, PhD <joebaugher@hotmail.com> Fernando J. Pineda, PhD <fernando.pineda@jhu.edu>
Joseph D. Baugher, PhD <joebaugher@hotmail.com>
Copyright (c) 2014 Joseph D. Baugher (<joebaugher@hotmail.com>). All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The following documentation describes the Bio::fastAPD module implementation.
new
Title : new Usage : my $fastAPD_obj = Bio::fastAPD->new() Function: Creates a Bio::fastAPD object Returns : A blessed reference Args : No arguments
initialize
Title : initialize Usage : $fastAPD_obj->initialize(seq_ref => \@sequences, alphabet => 'dna', mask => $mask); Function: Initializes a Bio::fastAPD object. 1. Initializes internal variables. 2. Prepares sequence reads for analysis. 3. Counts symbol frequencies. 4. Performs quality checks for erroneous symbols 5. Defines valid positions for analysis Returns : Args : -seq_ref => \@sequences a reference to an array of sequences from an MSA -alphabet => 'dna' Alphabet designations of 'dna', 'rna' or 'protein'. Acceptable characters - 'dna' - 'ACGT', 'N' for missing base, '-', '~', or '.' for gaps(padding) 'rna' - 'ACGU', 'N' for missing base, '-', '~', or '.' for gaps(padding) 'protein' - 'ACDEFGHIKLMNPQRSTVWY', '*' for stop codon, 'X' for missing amino acid, '-', '~', or '.' for gaps (alignment padding) -mask => $mask In order to ignore specific positions in the sequence alignment a mask variable should be created consisting of a string of 1's (evaluate) and 0's (ignore) of length equal to the length of the alignment. By default, all positions in the alignment are evaluated. Calls : _standardize_the_read, _accumulate_symbol_frequencies, _qc_symbols, _define_valid_positions
_standardize_the_read
Title : _standardize_the_read Usage : $standardized_seq_string = _standardize_the_read($seq_string); Function: Replaces ragged end (-) padding symbols with (#) symbols to differentiate between internal gaps. Converts all symbols to uppercase. Returns : A variable containing the standardized sequence Args : A variable containing the input sequence
_accumulate_symbol_frequencies
Title : _accumulate_symbol_frequencies Usage : _accumulate_symbol_frequencies(); Function: Counts and stores frequencies of each symbol at each position. Returns : 1 Args : None
_qc_symbols
Title : _qc_symbols Usage : _qc_symbols(); Function: Performs error checking for acceptable symbols. May carp(warn) or croak(die). Returns : 1 Args : None
_define_valid_positions
Title : _define_valid_positions Usage : _define_valid_positions(); Function: Builds array of valid alignment positions for analysis (below the null, gap, and end thresholds and are not masked). Returns : 1 or undef if no positions are valid Args : None
apd
Title : apd Usage : $apd = $fastAPD_obj->apd('gap_base'); Function: Returns the APD value Returns : APD Args : Comparison algorithm Default = 'gap_base'. Defines acceptable pairwise comparisons 'gap_base' - pairwise deletion - valid comparisons = base to base and gap to base 'base_base' - pairwise deletion - valid comparisons = base to base only 'complete_del' - complete deletion - all positions containing any missing data ignored Calls : _create_alpha_mask, _calculate_apd
_calculate_apd
Title : _calculate_apd Usage : _calculate_apd(); Function: Performs the fast APD calculation - nucleotide diversity per pair of bases (ratio of sums). APD ($d) = Total Mismatches ($m) / Total Pairwise comparisons ($p) where $m = $p - Total Matches (%matches) where matches = binomial coefficient(frequency of a given symbol, 2) summed over appropriate symbols and valid positions where $p = binomial coefficient(number of rows at a given position, 2) summed over valid positions Returns : The APD value ($d) Args : None Calls : _create_alpha_mask, choose_2
std_err
Title : std_err Usage : $se = $fastAPD_obj->std_err('gap_base'); Function: Iterates over the number of reads ($_K) adjusting the original frequency counts to ignore one read at each iteration. Calls an APD subroutine at each iteration and stores the diversity values. Calls _jackknifeSE to calculate the standard error. Returns : The standard error of the APD ($_seD) as estimated by jackknife resampling. Args : Comparison method Default = 'gap_base'. Calls : _calculate_apd, _jackknifeSE
_create_alpha_mask
Title : _create_alpha_mask Usage : _create_alpha_mask(); Function: Creates a binary mask (1 = valid for pairwise comparisons; 0 = invalid) corresponding to all symbols, dependent on the comparison_type input. Calls _define_valid_positions in case the valid positions should be refined for complete_del comparison type. Returns : 1 Args : None Calls : _define_valid_positions
_choose_2
Title : _choose_2 Usage : _choose_2($number_of_pairs); Function: A simplified binomial coefficient calculation - the number of ways of choosing k outcomes from n possibilities, where k = 2. Returns : the value of the binomial coefficient Args : the number of possibilities, n
_jackknifeSE
Title : _jackknifeSE Usage : _jackknifeSE(@diversity_values); Function: Calculates the standard error based on jackknife resampling Returns : the standard error Args : an array of values obtained from the iterative jackknife leave-one-out process Calls : _mean
_mean
Title : _mean Usage : _mean(@values); Function: Calculates the mean of an array of values Returns : the mean Args : an array of values
gap_threshold
Title : gap_threshold Usage : $fastAPD_obj->gap_threshold(0.1) or $fastAPD_obj->gap_threshold(); Function: If an argument is provided, this function sets the gap threshold. Returns the gap threshold. Returns : the gap threshold Args : optional value for setting the gap threshold
null_threshold
Title : null_threshold Usage : $fastAPD_obj->null_threshold(0.1) or $fastAPD_obj->null_threshold(); Function: If an argument is provided, this function sets the null threshold. Returns the null threshold. Returns : the null threshold Args : optional value for setting the null threshold
end_threshold
Title : end_threshold Usage : $fastAPD_obj->end_threshold(0.1) or $fastAPD_obj->end_threshold(); Function: If an argument is provided, this function sets the ragged end threshold. Returns the ragged end threshold. Returns : the ragged end threshold Args : optional value for setting the ragged end threshold
n_reads
Title : n_reads Usage : $fastAPD_obj->n_reads(); Function: Returns the number of reads in the input alignment Returns : the number of reads Args : none
n_valid_positions
Title : n_valid_positions Usage : $fastAPD_obj->n_valid_positions(); Function: Returns the number of positions in the alignment which meet the analysis criteria Returns : the number of valid positions Args : none
valid_positions
Title : valid_positions Usage : $fastAPD_obj->valid_positions(); Function: Returns an array reference to the positions in the alignment which meet the analysis criteria Returns : an array reference to the valid positions Args : none
width
Title : width Usage : $fastAPD_obj->width(); Function: Returns the width of the alignment Returns : the width of the alignment Args : none
freqs
Title : freqs Usage : $fastAPD_obj->freqs(); Function: Returns frequencies of each symbol at each position Returns : a reference to an array containing frequencies of each symbol at each position Args : none
consensus_alignment
Title : consensus_alignment Usage : $fastAPD_obj->consensus_alignment(); Function: Returns a consensus sequence based on the aligned reads Returns : a consensus sequence based on the aligned reads Args : none Calls : _calculate_aligned_consensus
_calculate_aligned_consensus
Title : _calculate_aligned_consensus Usage : _calculate_aligned_consensus(); Function: Creates a consensus sequence by calling _argmax to detect the most frequent symbol at each position in the alignment Returns : a consensus sequence based on the aligned reads Args : none Calls : _argmax
_argmax
Title : _argmax Usage : _argmax($_freq[$i]); Function: Detects the most frequently occurring symbol at a position Returns : the most frequently occurring symbol Args : a hash containing the frequencies of each symbol at a given position
To install Bio::fastAPD, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Bio::fastAPD
CPAN shell
perl -MCPAN -e shell install Bio::fastAPD
For more information on module installation, please visit the detailed CPAN module installation guide.