NAME

Bio::fastAPD -- rapid calculation of average pairwise difference (APD) for multiple sequence alignments

VERSION

Version 1.10.0

SYNOPSIS

    use Bio::fastAPD;
	
    my $file_name = "example_data.fasta";
    open(my $input_fh, '<', $file_name) or croak ("Could not open $file_name\n");
    chomp(my @fasta_lines = <$input_fh>);
    close $input_fh;

    # Create an array of aligned sequences
    my @sequences;
    my $curr_seq;
    foreach my $line (@fasta_lines) {
        if (substr($line, 0, 1) eq ">") { 
            if ($curr_seq) { push(@sequences, $curr_seq) }
            $curr_seq = ();
        }
        else { $curr_seq .= $line }
    }
    if ($curr_seq) { push(@sequences, $curr_seq) }       
    
    my $fastAPD_obj = Bio::fastAPD->new();
    $fastAPD_obj->initialize(seq_array_ref => \@sequences,
                             alphabet      => 'dna');

    my $apd = $fastAPD_obj->apd('gap_base');
                                            
    my $std_err = $fastAPD_obj->std_err('gap_base');

    my $num_reads     = $fastAPD_obj->n_reads;                                      
    my $num_positions = $fastAPD_obj->width;        
   
    print join("\t", qw(File APD StdErr Positions Reads)), "\n";      
    print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";  

    # OR

    use Bio::fastAPD;
    use Bio::AlignIO;

    my $file_name = 'example_data.fasta';
    my $alignio_obj = Bio::AlignIO->new( -file     => $file_name,
                                         -format   => 'fasta',
                                         -alphabet => 'dna');
    my $aln_obj = $alignio_obj->next_aln();
    
    # Create an array of aligned sequences
    my @sequences;  
    foreach my $seq_obj ($aln_obj->each_seq) { push(@sequences, $seq_obj->seq()) }
        
    my $fastAPD_obj = Bio::fastAPD->new();
    $fastAPD_obj->initialize(seq_array_ref => \@sequences,
                             alphabet      => 'dna');
        
    my $apd = $fastAPD_obj->apd('gap_base');
                                                
    my $std_err = $fastAPD_obj->std_err('gap_base');
   
    my $num_reads     = $fastAPD_obj->n_reads;                                      
    my $num_positions = $fastAPD_obj->width;        
          
    print join("\t", qw(File APD StdErr Positions Reads)), "\n";      
    print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";  
        

DESCRIPTION

The Bio::fastAPD module provides a computationally efficient method for the calculation of average pairwise difference (APD), a measure of nucleotide diversity, from multiple sequence alignment (MSA) data. This module also provides rapid standard error estimation of the APD using an efficient jackknife resampling method. Further description of the methods implemented in this module, including mathematical justification, will be available in an upcoming peer-reviewed journal article.

CONSTRUCTOR

my $fastAPD_obj = Bio::fastAPD->new();

INITIALIZER

The initialization subroutine accepts a reference to an array of sequence reads from a multiple sequence alignment. It accepts alphabet designations of 'dna', 'rna' or 'protein'. In order to ignore specific positions in the sequence alignment a binary mask variable may be supplied consisting of a string of 1's (evaluate) and 0's (ignore) of length equal to the length of the alignment. By default, all positions in the alignment are evaluated.

Acceptable characters -

'dna' - ACGT, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)

'rna' - ACGU, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)

'protein' - ACDEFGHIKLMNPQRSTVWY, '*' for stop codon, 'X' for missing amino acid, '-', '~', or '.' for gaps (alignment padding)

$fastAPD_obj->initialize(-seq_ref  => \@sequences,
                         -alphabet => 'dna',
                         -mask     => $mask);

OBJECT METHODS

Additional info in the Appendix.

   fastAPD_obj->apd('gap_base')  
                                       
       # Perform rapid APD calculation.   

   fastAPD_obj->std_err('gap_base')
                                      
       # Estimate standard error of the APD result.      
                                                  
   fastAPD_obj->gap_threshold()

       # Set or get max proportion of gap symbols (-,~,.) allowable for a valid position

   fastAPD_obj->null_threshold()

       # Set or get max proportion of N symbols allowable for a valid position

   fastAPD_obj->end_threshold()

       # Set or get max proportion of ragged end (-,~,.) symbols allowable for a valid position

   fastAPD_obj->n_reads()

       # Get the number of reads

   fastAPD_obj->n_valid_positions()

       # Get the number of positions in the alignment which meet the analysis criteria
          
   fastAPD_obj->valid_positions()

       # Get the positions in the alignment which meet the analysis criteria

   fastAPD_obj->width()

       # Get the width of the alignment

   fastAPD_obj->freqs()

       # Get the frequencies of each symbol at each position

   fastAPD_obj->consensus_alignment()

       # Get the consensus sequence of the aligned input sequences

AUTHORS

Joseph D. Baugher, PhD <joebaugher@hotmail.com> Fernando J. Pineda, PhD <fernando.pineda@jhu.edu>

MAINTAINER

Joseph D. Baugher, PhD <joebaugher@hotmail.com>

LICENSE AND COPYRIGHT

Copyright (c) 2014 Joseph D. Baugher (<joebaugher@hotmail.com>). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

APPENDIX

The following documentation describes the Bio::fastAPD module implementation.

new

Title   : new
Usage   : my $fastAPD_obj = Bio::fastAPD->new()
Function: Creates a Bio::fastAPD object
Returns : A blessed reference
Args    : No arguments

initialize

Title   : initialize
Usage   : $fastAPD_obj->initialize(seq_ref  => \@sequences,
                                   alphabet => 'dna',
                                   mask     => $mask);
Function: Initializes a Bio::fastAPD object.
             1. Initializes internal variables. 
             2. Prepares sequence reads for analysis. 
             3. Counts symbol frequencies.
             4. Performs quality checks for erroneous symbols
             5. Defines valid positions for analysis
Returns : 
Args    : -seq_ref => \@sequences
            a reference to an array of sequences from an MSA
          -alphabet => 'dna'
            Alphabet designations of 'dna', 'rna' or 'protein'.
            
            Acceptable characters - 
            'dna' - 'ACGT', 'N' for missing base, '-', '~', or '.' for gaps(padding)
            'rna' - 'ACGU', 'N' for missing base, '-', '~', or '.' for gaps(padding)
            'protein' - 'ACDEFGHIKLMNPQRSTVWY', '*' for stop codon, 'X' for missing 
                         amino acid, '-', '~', or '.' for gaps (alignment padding)
          -mask => $mask
            In order to ignore specific positions in the sequence alignment a mask 
            variable should be created consisting of a string of 1's (evaluate) and 0's 
            (ignore) of length equal to the length of the alignment. By default, all 
            positions in the alignment are evaluated.
Calls   : _standardize_the_read, _accumulate_symbol_frequencies, _qc_symbols,
          _define_valid_positions

_standardize_the_read

Title   : _standardize_the_read
Usage   : $standardized_seq_string = _standardize_the_read($seq_string);
Function: Replaces ragged end (-) padding symbols with (#) symbols to differentiate
            between internal gaps. Converts all symbols to uppercase.
Returns : A variable containing the standardized sequence
Args    : A variable containing the input sequence

_accumulate_symbol_frequencies

Title   : _accumulate_symbol_frequencies
Usage   : _accumulate_symbol_frequencies();
Function: Counts and stores frequencies of each symbol at each position.
Returns : 1
Args    : None

_qc_symbols

Title   : _qc_symbols
Usage   : _qc_symbols();
Function: Performs error checking for acceptable symbols. May carp(warn) or croak(die).
Returns : 1
Args    : None

_define_valid_positions

Title   : _define_valid_positions
Usage   : _define_valid_positions();
Function: Builds array of valid alignment positions for analysis (below the null, gap, 
          and end thresholds and are not masked).  
Returns : 1 or undef if no positions are valid
Args    : None

apd

Title   : apd
Usage   : $apd = $fastAPD_obj->apd('gap_base');
Function: Returns the APD value            
Returns : APD
Args    : Comparison algorithm
              Default = 'gap_base'.
              Defines acceptable pairwise comparisons
              'gap_base'     - pairwise deletion
                             - valid comparisons = base to base and gap to base
              'base_base'    - pairwise deletion
                             - valid comparisons = base to base only
              'complete_del' - complete deletion
                             - all positions containing any missing data ignored
Calls   :  _create_alpha_mask, _calculate_apd

_calculate_apd

Title   : _calculate_apd
Usage   : _calculate_apd();
Function: Performs the fast APD calculation - nucleotide diversity per pair of bases 
            (ratio of sums). 
            APD ($d) = Total Mismatches ($m) / Total Pairwise comparisons ($p)
              where $m = $p - Total Matches (%matches)
              where matches = binomial coefficient(frequency of a given symbol, 2) 
                              summed over appropriate symbols and valid positions
              where $p      = binomial coefficient(number of rows at a given position, 2)
                              summed over valid positions                 
Returns : The APD value ($d)
Args    : None
Calls   : _create_alpha_mask, choose_2

std_err

Title   : std_err
Usage   : $se = $fastAPD_obj->std_err('gap_base');
Function: Iterates over the number of reads ($_K) adjusting the original frequency
            counts to ignore one read at each iteration. Calls an APD subroutine at each
            iteration and stores the diversity values. Calls _jackknifeSE to calculate
            the standard error.
Returns : The standard error of the APD ($_seD) as estimated by jackknife resampling.
Args    : Comparison method
              Default = 'gap_base'.
Calls   : _calculate_apd, _jackknifeSE

_create_alpha_mask

Title   : _create_alpha_mask
Usage   : _create_alpha_mask();
Function: Creates a binary mask (1 = valid for pairwise comparisons; 0 = invalid) 
            corresponding to all symbols, dependent on the comparison_type input.
            Calls _define_valid_positions in case the valid positions should be
            refined for complete_del comparison type.                
Returns : 1
Args    : None
Calls   : _define_valid_positions

_choose_2

Title   : _choose_2
Usage   : _choose_2($number_of_pairs);
Function: A simplified binomial coefficient calculation - the number of ways of 
            choosing k outcomes from n possibilities, where k = 2.              
Returns : the value of the binomial coefficient
Args    : the number of possibilities, n

_jackknifeSE

Title   : _jackknifeSE
Usage   : _jackknifeSE(@diversity_values);
Function: Calculates the standard error based on jackknife resampling             
Returns : the standard error
Args    : an array of values obtained from the iterative jackknife leave-one-out process
Calls   : _mean

_mean

Title   : _mean
Usage   : _mean(@values);
Function: Calculates the mean of an array of values            
Returns : the mean
Args    : an array of values 

gap_threshold

Title   : gap_threshold
Usage   : $fastAPD_obj->gap_threshold(0.1) or $fastAPD_obj->gap_threshold();
Function: If an argument is provided, this function sets the gap threshold. 
           Returns the gap threshold.            
Returns : the gap threshold
Args    : optional value for setting the gap threshold 

null_threshold

Title   : null_threshold
Usage   : $fastAPD_obj->null_threshold(0.1) or $fastAPD_obj->null_threshold();
Function: If an argument is provided, this function sets the null threshold. 
           Returns the null threshold.            
Returns : the null threshold
Args    : optional value for setting the null threshold 

end_threshold

Title   : end_threshold
Usage   : $fastAPD_obj->end_threshold(0.1) or $fastAPD_obj->end_threshold();
Function: If an argument is provided, this function sets the ragged end threshold. 
           Returns the ragged end threshold.            
Returns : the ragged end threshold
Args    : optional value for setting the ragged end threshold 

n_reads

Title   : n_reads
Usage   : $fastAPD_obj->n_reads();
Function: Returns the number of reads in the input alignment            
Returns : the number of reads
Args    : none

n_valid_positions

Title   : n_valid_positions
Usage   : $fastAPD_obj->n_valid_positions();
Function: Returns the number of positions in the alignment which meet the analysis 
          criteria         
Returns : the number of valid positions
Args    : none

valid_positions

Title   : valid_positions
Usage   : $fastAPD_obj->valid_positions();
Function: Returns an array reference to the positions in the alignment which meet 
          the analysis criteria      
Returns : an array reference to the valid positions 
Args    : none

width

Title   : width
Usage   : $fastAPD_obj->width();
Function: Returns the width of the alignment   
Returns : the width of the alignment   
Args    : none

freqs

Title   : freqs
Usage   : $fastAPD_obj->freqs();
Function: Returns frequencies of each symbol at each position   
Returns : a reference to an array containing frequencies of each symbol at each position    
Args    : none

consensus_alignment

Title   : consensus_alignment
Usage   : $fastAPD_obj->consensus_alignment();
Function: Returns a consensus sequence based on the aligned reads  
Returns : a consensus sequence based on the aligned reads      
Args    : none
Calls   : _calculate_aligned_consensus

_calculate_aligned_consensus

Title   : _calculate_aligned_consensus
Usage   : _calculate_aligned_consensus();
Function: Creates a consensus sequence by calling _argmax to detect the most
            frequent symbol at each position in the alignment
Returns : a consensus sequence based on the aligned reads      
Args    : none
Calls   : _argmax

_argmax

Title   : _argmax
Usage   : _argmax($_freq[$i]);
Function: Detects the most frequently occurring symbol at a position
Returns : the most frequently occurring symbol      
Args    : a hash containing the frequencies of each symbol at a given position