Author image Joseph D Baugher
and 1 contributors

NAME

Bio::fastAPD -- rapid calculation of average pairwise difference (APD) for multiple sequence alignments

VERSION

Version 1.10.0

SYNOPSIS

    use Bio::fastAPD;
        
    my $file_name = "example_data.fasta";
    open(my $input_fh, '<', $file_name) or croak ("Could not open $file_name\n");
    chomp(my @fasta_lines = <$input_fh>);
    close $input_fh;

    # Create an array of aligned sequences
    my @sequences;
    my $curr_seq;
    foreach my $line (@fasta_lines) {
        if (substr($line, 0, 1) eq ">") { 
            if ($curr_seq) { push(@sequences, $curr_seq) }
            $curr_seq = ();
        }
        else { $curr_seq .= $line }
    }
    if ($curr_seq) { push(@sequences, $curr_seq) }       
    
    my $fastAPD_obj = Bio::fastAPD->new();
    $fastAPD_obj->initialize(seq_array_ref => \@sequences,
                             alphabet      => 'dna');

    my $apd = $fastAPD_obj->apd('gap_base');
                                            
    my $std_err = $fastAPD_obj->std_err('gap_base');

    my $num_reads     = $fastAPD_obj->n_reads;                                      
    my $num_positions = $fastAPD_obj->width;        
   
    print join("\t", qw(File APD StdErr Positions Reads)), "\n";      
    print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";  

    # OR

    use Bio::fastAPD;
    use Bio::AlignIO;

    my $file_name = 'example_data.fasta';
    my $alignio_obj = Bio::AlignIO->new( -file     => $file_name,
                                         -format   => 'fasta',
                                         -alphabet => 'dna');
    my $aln_obj = $alignio_obj->next_aln();
    
    # Create an array of aligned sequences
    my @sequences;  
    foreach my $seq_obj ($aln_obj->each_seq) { push(@sequences, $seq_obj->seq()) }
        
    my $fastAPD_obj = Bio::fastAPD->new();
    $fastAPD_obj->initialize(seq_array_ref => \@sequences,
                             alphabet      => 'dna');
        
    my $apd = $fastAPD_obj->apd('gap_base');
                                                
    my $std_err = $fastAPD_obj->std_err('gap_base');
   
    my $num_reads     = $fastAPD_obj->n_reads;                                      
    my $num_positions = $fastAPD_obj->width;        
          
    print join("\t", qw(File APD StdErr Positions Reads)), "\n";      
    print join("\t", $file_name, $apd, $std_err, $num_positions, $num_reads), "\n";  
        

DESCRIPTION

The Bio::fastAPD module provides a computationally efficient method for the calculation of average pairwise difference (APD), a measure of nucleotide diversity, from multiple sequence alignment (MSA) data. This module also provides rapid standard error estimation of the APD using an efficient jackknife resampling method. Further description of the methods implemented in this module, including mathematical justification, will be available in an upcoming peer-reviewed journal article.

CONSTRUCTOR

    my $fastAPD_obj = Bio::fastAPD->new();

INITIALIZER

The initialization subroutine accepts a reference to an array of sequence reads from a multiple sequence alignment. It accepts alphabet designations of 'dna', 'rna' or 'protein'. In order to ignore specific positions in the sequence alignment a binary mask variable may be supplied consisting of a string of 1's (evaluate) and 0's (ignore) of length equal to the length of the alignment. By default, all positions in the alignment are evaluated.

Acceptable characters -

'dna' - ACGT, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)

'rna' - ACGU, 'N' for missing base, '-', '~', or '.' for gaps (alignment padding)

'protein' - ACDEFGHIKLMNPQRSTVWY, '*' for stop codon, 'X' for missing amino acid, '-', '~', or '.' for gaps (alignment padding)

    $fastAPD_obj->initialize(-seq_ref  => \@sequences,
                             -alphabet => 'dna',
                             -mask     => $mask);

OBJECT METHODS

Additional info in the Appendix.

    fastAPD_obj->apd('gap_base')  
                                        
        # Perform rapid APD calculation.   
 
    fastAPD_obj->std_err('gap_base')
                                       
        # Estimate standard error of the APD result.      
                                                   
    fastAPD_obj->gap_threshold()

        # Set or get max proportion of gap symbols (-,~,.) allowable for a valid position

    fastAPD_obj->null_threshold()

        # Set or get max proportion of N symbols allowable for a valid position

    fastAPD_obj->end_threshold()

        # Set or get max proportion of ragged end (-,~,.) symbols allowable for a valid position

    fastAPD_obj->n_reads()

        # Get the number of reads

    fastAPD_obj->n_valid_positions()

        # Get the number of positions in the alignment which meet the analysis criteria
           
    fastAPD_obj->valid_positions()

        # Get the positions in the alignment which meet the analysis criteria

    fastAPD_obj->width()

        # Get the width of the alignment

    fastAPD_obj->freqs()

        # Get the frequencies of each symbol at each position

    fastAPD_obj->consensus_alignment()

        # Get the consensus sequence of the aligned input sequences

AUTHORS

Joseph D. Baugher, PhD <joebaugher@hotmail.com> Fernando J. Pineda, PhD <fernando.pineda@jhu.edu>

MAINTAINER

Joseph D. Baugher, PhD <joebaugher@hotmail.com>

LICENSE AND COPYRIGHT

Copyright (c) 2014 Joseph D. Baugher (<joebaugher@hotmail.com>). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

APPENDIX

The following documentation describes the Bio::fastAPD module implementation.

new

 Title   : new
 Usage   : my $fastAPD_obj = Bio::fastAPD->new()
 Function: Creates a Bio::fastAPD object
 Returns : A blessed reference
 Args    : No arguments

initialize

 Title   : initialize
 Usage   : $fastAPD_obj->initialize(seq_ref  => \@sequences,
                                    alphabet => 'dna',
                                    mask     => $mask);
 Function: Initializes a Bio::fastAPD object.
              1. Initializes internal variables. 
              2. Prepares sequence reads for analysis. 
              3. Counts symbol frequencies.
              4. Performs quality checks for erroneous symbols
              5. Defines valid positions for analysis
 Returns : 
 Args    : -seq_ref => \@sequences
             a reference to an array of sequences from an MSA
           -alphabet => 'dna'
             Alphabet designations of 'dna', 'rna' or 'protein'.
             
             Acceptable characters - 
             'dna' - 'ACGT', 'N' for missing base, '-', '~', or '.' for gaps(padding)
             'rna' - 'ACGU', 'N' for missing base, '-', '~', or '.' for gaps(padding)
             'protein' - 'ACDEFGHIKLMNPQRSTVWY', '*' for stop codon, 'X' for missing 
                          amino acid, '-', '~', or '.' for gaps (alignment padding)
           -mask => $mask
             In order to ignore specific positions in the sequence alignment a mask 
             variable should be created consisting of a string of 1's (evaluate) and 0's 
             (ignore) of length equal to the length of the alignment. By default, all 
             positions in the alignment are evaluated.
 Calls   : _standardize_the_read, _accumulate_symbol_frequencies, _qc_symbols,
           _define_valid_positions

_standardize_the_read

 Title   : _standardize_the_read
 Usage   : $standardized_seq_string = _standardize_the_read($seq_string);
 Function: Replaces ragged end (-) padding symbols with (#) symbols to differentiate
             between internal gaps. Converts all symbols to uppercase.
 Returns : A variable containing the standardized sequence
 Args    : A variable containing the input sequence

_accumulate_symbol_frequencies

 Title   : _accumulate_symbol_frequencies
 Usage   : _accumulate_symbol_frequencies();
 Function: Counts and stores frequencies of each symbol at each position.
 Returns : 1
 Args    : None

_qc_symbols

 Title   : _qc_symbols
 Usage   : _qc_symbols();
 Function: Performs error checking for acceptable symbols. May carp(warn) or croak(die).
 Returns : 1
 Args    : None

_define_valid_positions

 Title   : _define_valid_positions
 Usage   : _define_valid_positions();
 Function: Builds array of valid alignment positions for analysis (below the null, gap, 
           and end thresholds and are not masked).  
 Returns : 1 or undef if no positions are valid
 Args    : None

apd

 Title   : apd
 Usage   : $apd = $fastAPD_obj->apd('gap_base');
 Function: Returns the APD value            
 Returns : APD
 Args    : Comparison algorithm
               Default = 'gap_base'.
               Defines acceptable pairwise comparisons
               'gap_base'     - pairwise deletion
                              - valid comparisons = base to base and gap to base
               'base_base'    - pairwise deletion
                              - valid comparisons = base to base only
               'complete_del' - complete deletion
                              - all positions containing any missing data ignored
 Calls   :  _create_alpha_mask, _calculate_apd
 

_calculate_apd

 Title   : _calculate_apd
 Usage   : _calculate_apd();
 Function: Performs the fast APD calculation - nucleotide diversity per pair of bases 
             (ratio of sums). 
             APD ($d) = Total Mismatches ($m) / Total Pairwise comparisons ($p)
               where $m = $p - Total Matches (%matches)
               where matches = binomial coefficient(frequency of a given symbol, 2) 
                               summed over appropriate symbols and valid positions
               where $p      = binomial coefficient(number of rows at a given position, 2)
                               summed over valid positions                 
 Returns : The APD value ($d)
 Args    : None
 Calls   : _create_alpha_mask, choose_2
 

std_err

 Title   : std_err
 Usage   : $se = $fastAPD_obj->std_err('gap_base');
 Function: Iterates over the number of reads ($_K) adjusting the original frequency
             counts to ignore one read at each iteration. Calls an APD subroutine at each
             iteration and stores the diversity values. Calls _jackknifeSE to calculate
             the standard error.
 Returns : The standard error of the APD ($_seD) as estimated by jackknife resampling.
 Args    : Comparison method
               Default = 'gap_base'.
 Calls   : _calculate_apd, _jackknifeSE

_create_alpha_mask

 Title   : _create_alpha_mask
 Usage   : _create_alpha_mask();
 Function: Creates a binary mask (1 = valid for pairwise comparisons; 0 = invalid) 
             corresponding to all symbols, dependent on the comparison_type input.
             Calls _define_valid_positions in case the valid positions should be
             refined for complete_del comparison type.                
 Returns : 1
 Args    : None
 Calls   : _define_valid_positions
 

_choose_2

 Title   : _choose_2
 Usage   : _choose_2($number_of_pairs);
 Function: A simplified binomial coefficient calculation - the number of ways of 
             choosing k outcomes from n possibilities, where k = 2.              
 Returns : the value of the binomial coefficient
 Args    : the number of possibilities, n
 

_jackknifeSE

 Title   : _jackknifeSE
 Usage   : _jackknifeSE(@diversity_values);
 Function: Calculates the standard error based on jackknife resampling             
 Returns : the standard error
 Args    : an array of values obtained from the iterative jackknife leave-one-out process
 Calls   : _mean
 

_mean

 Title   : _mean
 Usage   : _mean(@values);
 Function: Calculates the mean of an array of values            
 Returns : the mean
 Args    : an array of values 
 

gap_threshold

 Title   : gap_threshold
 Usage   : $fastAPD_obj->gap_threshold(0.1) or $fastAPD_obj->gap_threshold();
 Function: If an argument is provided, this function sets the gap threshold. 
            Returns the gap threshold.            
 Returns : the gap threshold
 Args    : optional value for setting the gap threshold 
 

null_threshold

 Title   : null_threshold
 Usage   : $fastAPD_obj->null_threshold(0.1) or $fastAPD_obj->null_threshold();
 Function: If an argument is provided, this function sets the null threshold. 
            Returns the null threshold.            
 Returns : the null threshold
 Args    : optional value for setting the null threshold 
 

end_threshold

 Title   : end_threshold
 Usage   : $fastAPD_obj->end_threshold(0.1) or $fastAPD_obj->end_threshold();
 Function: If an argument is provided, this function sets the ragged end threshold. 
            Returns the ragged end threshold.            
 Returns : the ragged end threshold
 Args    : optional value for setting the ragged end threshold 
 

n_reads

 Title   : n_reads
 Usage   : $fastAPD_obj->n_reads();
 Function: Returns the number of reads in the input alignment            
 Returns : the number of reads
 Args    : none
 

n_valid_positions

 Title   : n_valid_positions
 Usage   : $fastAPD_obj->n_valid_positions();
 Function: Returns the number of positions in the alignment which meet the analysis 
           criteria         
 Returns : the number of valid positions
 Args    : none
 

valid_positions

 Title   : valid_positions
 Usage   : $fastAPD_obj->valid_positions();
 Function: Returns an array reference to the positions in the alignment which meet 
           the analysis criteria      
 Returns : an array reference to the valid positions 
 Args    : none
 

width

 Title   : width
 Usage   : $fastAPD_obj->width();
 Function: Returns the width of the alignment   
 Returns : the width of the alignment   
 Args    : none

freqs

 Title   : freqs
 Usage   : $fastAPD_obj->freqs();
 Function: Returns frequencies of each symbol at each position   
 Returns : a reference to an array containing frequencies of each symbol at each position    
 Args    : none

consensus_alignment

 Title   : consensus_alignment
 Usage   : $fastAPD_obj->consensus_alignment();
 Function: Returns a consensus sequence based on the aligned reads  
 Returns : a consensus sequence based on the aligned reads      
 Args    : none
 Calls   : _calculate_aligned_consensus

_calculate_aligned_consensus

 Title   : _calculate_aligned_consensus
 Usage   : _calculate_aligned_consensus();
 Function: Creates a consensus sequence by calling _argmax to detect the most
             frequent symbol at each position in the alignment
 Returns : a consensus sequence based on the aligned reads      
 Args    : none
 Calls   : _argmax

_argmax

 Title   : _argmax
 Usage   : _argmax($_freq[$i]);
 Function: Detects the most frequently occurring symbol at a position
 Returns : the most frequently occurring symbol      
 Args    : a hash containing the frequencies of each symbol at a given position