The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::SAGE::DataProcessing - Processes raw serial analysis of gene expression (SAGE) data.

SYNOPSIS

  use Bio::SAGE::DataProcessing;
  $sage = Bio::SAGE::DataProcessing->new();

  # open sequence and quality files
  open( READS, "library.fasta" );
  open( QUAL, "library.qual.fasta" );

  # collect ditags and statistics from reads
  $sage->process_library( *READS, *QUAL );

  # close files
  close( READS );
  close( QUAL );

  # output tags in descending order of expression
  my %tags = %{$sage->get_tagcounts()};
  open( TAGS, ">library.tags" );
  map { print TAGS join( "\t", $_, $tags{$_} ) . "\n" } sort { $tags{$b} <=> $tags{$a} } keys %tags;
  close( TAGS );

  # tag AAACCGGGTT matches two different genes
  # so 15th base pair may help resolve this
  $sage->print_extra_base_calculation( $sage->get_extract_base_calculation( "AAACCGGGTT" ) );

DESCRIPTION

This module provides several tools for processing and analyzing serial analysis of gene expression (SAGE) libraries.

README

BACKGROUND

Serial analysis of gene expression (SAGE) is a molecular technique for generating a near-global snapshot of a cell population’s transcriptome. Briefly, the technique extracts short sequences at defined positions of transcribed mRNA. These short sequences are then paired to form ditags. The ditags are concatamerized to form long sequences that are then cloned. The cloned DNA is then sequenced. Bioinformatic techniques are then employed to determine the original short tag sequences, and to derive their progenitor mRNA. The number of times a particular tag is observed can be used to quantitate the amount of a particular transcript. The original technique was described by Velculescu et al. (1995) and utilized an ~14bp sequence tag. A modified protocol was introduced by Saha et al. (2002) that produced ~21bp tags.

PURPOSE

This module facilitates the processing of SAGE data. Specifically:

  1. extracting ditags from raw sequence reads.
  2. extracting tags from ditags, with the option to
     exclude tags if the Phred scores (described by
     Ewing and Green, 1998a and Ewing et al., 1998b)
     do not meet a minimum cutoff value.
  3. calculating descriptive values
  4. statistical analysis to determine, where possible,
     additional nucleotides to extend the length of the
     SAGE tag (thus facilitating more accurate tag to
     gene mapping).

Both regular SAGE (14mer tag) and LongSAGE (21mer tag) are supported by this module. Future protocols should be configurable with this module.

REFERENCES

  Velculescu V, Zhang L, Vogelstein B, Kinzler KW. (1995)
  Serial analysis of gene expression. Science. 270:484-487.

  Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B,
  Kinzler KW, Velculescu V. (2002) Using the transcriptome
  to annotate the genome. Nat. Biotechnol. 20:508-512.

  Ewing B, Hillier L, Wendl MC, Green P. (1998a) Base-calling
  of automated sequencer traces using phred. I. Accuracy
  assessment. Genome Res. 8:175-185.

  Ewing B, Green P. (1998b) Base-calling of automated
  sequencer traces using phred. II. Error probabilities.
  Genome Res. 8:186-194.

INSTALLATION

Follow the usual steps for installing any Perl module:

  perl Makefile.PL
  make test
  make install

PREREQUISITES

None (this module used to require the Statistics::Distributions package).

CHANGES

  1.20 2004.10.15 - Minor spelling errors and misuse of terminology fixed in docs.
                  - Module now allows FASTA files with a blank line folling the
                    header (denoting an attempted read with no sequence), but prints
                    a warning to STDERR that this has happened. Module died previously.
  1.11 2004.06.20 - Added flag in constructor to keep duplicate ditags.
  1.10 2004.06.02 - Wrote new documentation and modified several methods to use the read-by-read
                    processing approach (see line below).
                  - Revamped the module to conserve memory. Reads are now processed one at a time
                    and then discarded. The memory requirements in the previous versions were
                    prohibitive to those with regular desktop machines.
                  - The Bio::SAGE::DataProcessing::Filter package can be subclassed to create
                    custom filters at the ditag and tag processing steps (previous versions only
                    allowed one approach to ditag/tag filtering).
  1.01 2004.05.27 - Fixed bug where extract_tag_counts didn't work with quality cutoff defined.
                  - extract_tags was not applying the get_quality_cutoff value (was returning all data)
                  - Duplicate ditags are now removed by default.
  1.00 2004.05.23 - Initial release.

VARIABLES

Globals

    $PROTOCOL_SAGE

      Hashref containing default protocol parameters for the
      regular/original SAGE protocol (see set_protocol
      documentation for more information).

    $PROTOCOL_LONGSAGE

      Hashref containing default protocol parameters for the
      LongSAGE protocol (see set_protocol documentation
      for more information).

    $ENZYME_NLAIII = 'CATG'

      Constant denoting the recognition sequence for NlaIII.

    $ENZYME_SAU3A = 'GATC'

      Constant denoting the recognition sequence for Sau3a.

    $DEFAULT_TAG_FILTER

      A default tag filter used when none is specified.
      This filter rejects tags that contain any base pair
      with a Phred quality score < 15 or an average
      Phred quality score over all bases < 30.

    $DEFAULT_DITAG_FILTER

      A default ditag filter used when none is specified.
      This filter rejects ditags that contain any base pair
      with a Phred quality score < 20.

Settings

    $DEBUG = 0

      Prints debugging output if value if >= 1.

CLASS METHODS

new <$enzyme>, <\%protocol>, <$bKeepDuplicates>

Constructor for a new Bio::SAGE::DataProcessing object.

Arguments

$enzyme (optional)

  The anchoring enzyme recognition site. This is
  typically NlaIII (CATG) or Sau3a (GATC). The
  default is the recognition sequence "CATG" (NlaIII).

\%protocol (optional)

  A hashref containing specifics of the protocol. Two
  pre-made parameter sets are available: $PROTOCOL_SAGE
  (regular SAGE) and $PROTOCOL_LONGSAGE (LongSAGE).

  The required hash keys:

    MINIMUM_DITAG_LENGTH | The shortest length a valid
                           ditag can be (the length
                           should include the anchoring
                           enzyme site sequences).
    MAXIMUM_DITAG_LENGTH | The longest length a valid
                           ditag can be (the length
                           should include the anchoring
                           enzyme site sequences).
    TAG_LENGTH           | The expected tag length (the
                           length should NOT include the
                           anchoring enzyme site sequence).
    IGNORE_TAGS          | An arrayref listing tag
                           sequences that should be
                           ignored during tag extraction
                           (i.e. linker-derived sequences).

  The parameters for the default configurations are:

                           +----------------+--------------------+
                           | $PROTOCOL_SAGE | $PROTOCOL_LONGSAGE |
    +----------------------+----------------+--------------------+
    | MINIMUM_DITAG_LENGTH |       29       |         40         |
    +----------------------+----------------+--------------------+
    | MAXIMUM_DITAG_LENGTH |       32       |         46         |
    +----------------------+----------------+--------------------+
    | TAG_LENGTH           |       10       |         17         |
    +----------------------+----------------+--------------------+
    | IGNORE_TAGS          |   TCCCTATTAA   | TCGGACGTACATCGTTA  |
    |                      |   TCCCCGTACA   | TCGGATATTAAGCCTAG  |
    +----------------------+----------------+--------------------+

  Once the Bio::SAGE::DataProcessing object has been instantiated,
  the enzyme and protocol CANNOT be altered.

$bKeepDuplicates (optional)

  If this argument is TRUE (non-zero) then ditags with
  identical sequence will be kept.  The default behavior is
  to discard these ditags as they likely represent
  preferential PCR amplification.

Usage

  my $sage = Bio::SAGE::DataProcessing->new( $Bio::SAGE::DataProcessing::ENZYME_NLAIII,
                                             $Bio::SAGE::DataProcessing::PROTOCOL_LONGSAGE );

  # alternatively:
  my $sage = Bio::SAGE::DataProcessing->new( "CATG", { 'MINIMUM_DITAG_LENGTH' => 31,
                                                       'MAXIMUM_DITAG_LENGTH' => 34,
                                                       'TAG_LENGTH'           => 11,
                                                       'IGNORE_TAGS'          => \( 'TCCCTATTAA', 'TCCCCGTACA' ) } );

INSTANCE METHODS

set_ditag_filter $filterObject

Sets a new filter object (a concrete subclass of Bio::SAGE::DataProcessing::Filter) that is applied to ditags as they're extracted from sequence reads.

The default filter uses a Bio::SAGE::DataProcessing::MinimumPhredFilter instance that is set to reject any ditags that do not have at least Phred 20 (p<=0.01) for all nucleotides.

Arguments

$filterObject

  An object ref to a concrete subclass of
  Bio::SAGE::DataProcessing::Filter.

set_tag_filter $filterObject

Sets a new filter object (a concrete subclass of Bio::SAGE::DataProcessing::Filter) that is applied to tags as they're extracted from ditags.

The default filter uses a Bio::SAGE::DataProcessing::AveragePhredFilter instance that is set to reject any tags that do not have an average Phred 30 score (p<=0.001) over all nucleotides and at least Phred 15 (p<=0.0316) at each nucleotide.

Arguments

$filterObject

  An object ref to a concrete subclass of
  Bio::SAGE::DataProcessing::Filter.

get_enzyme

Gets the current anchoring enzyme recognition site.

Arguments

  None.

Returns

  The current anchoring enzyme recognition site. By
  default, this will be 'CATG', the NlaIII recognition
  site.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  print "ENZYME_SITE: " . $sage->get_enzyme() . "\n";

get_protocol

Gets a copy (in other words, modifying the returned hash does not affect the object's settings) of the hash describing the protocol for this instance.

Arguments

  None.

Returns

  A hashref of the current protocol settings.  See the
  documentation for new (constructor) for more details
  on the contents of this hash.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  print "Default protocol:\n";
  my %protocol = %{$sage->get_protocol()};
  map { print $_ . "=" . $protocol{$_} . "\n" } keys %protocol;

process_library $sequence_handle, <$scores_handle>

Processes reads from a Perl handle to FASTA format sequence data. Phred scores in matching FASTA format can be read concurrently.

An example of FASTA format sequence:

  >SEQUENCE0001
  ACAGATAGACAGAGATATAGAGACATATTTAGAGACAAATCGCGCAGGCGCGCGACATA
  TGACTAGTTTATATCATCAGTATTAGCGATTATGACTATTATATATTACTGATTGATCT
  ATAGCGCGATTATATCTATCTATGCATTCGATCATGCTATTATCGTATACTACTGCTAG
  AGAGGACGACGCAGGCAGCGATTATATCTATTTATA
  >SEQUENCE0002
  CGCGACGCATGTCAGTAGCTAGCTGCGCCCGAATATATATCGTCATACGGATTCGTAGC
  CCCCCGCGGAGTCTGATTATATCTGATT

An example of FASTA format quality data:

  >SEQUENCE0001
  11 17 18 16 19 17 19 19 16 19 19 16 11 10 9 15 10 12 24 24 35 29 29 39 40 40 40 40 37 37 46 46 40 40 40 40 56 56 56 56 35 35 35 35 35 35 56 40 40 46 40 40 39 39 35 39 56 56 51 51
  51 51 51 56 35 35 35 35 35 35 40 40 51 45 51 51 39 39 39 39 39 39 40 40 40 40 40 40 56 56 56 56 56 46 46 40 39 39 39 45 45 45 56 56 56 56 56 56 56 56 40 39 39 39 39 35 39 39 39 39
  45 56 45 45 45 45 51 35 39 39 39 39 39 40 40 40 40 40 51 56 56 40 40 40 40 40 43 56 56 56 43 35 35 35 35 35 43 45 45 45 45 45 45 51 51 51 51 51 51 56 56 56 56 56 56 51 51 51 56 56
  7 7 9 9 11 10 13 11 10 8 10 10 8 8 8 10 10
  >SEQUENCE0002
  12 15 17 17 19 15 15 15 13 19 17 17 12 16 11 19 13 24 24 35 35 35 37 37 39 56 56 56 56 56 51 39 32 29 29 29 29 32 56 56 56 35 35 35 35 35 35 56 56 56 56 56 56 56 56 56 56 56 56 40
  40 40 46 46 40 51 40 40 40 40 40 40 51 39 39 35 35 35 35 40 40 51 45 45 45 45 51 51 56 56 56 56 56 45 45 45 45 51 51 45 45 45 40 40 40 40 40 40 40 40 40 40 56 56 56 56 56 56 51 51
  15 13 19 17 17 12 16 11 19 13

Arguments

$sequence_handle

  A Perl handle to sequence data in FASTA format.

$scores_handle (optional)

  A Perl handle to Phred scores in FASTA format. The order
  of entries must match the $sequence_handle data.  NOTE:
  many implementations of Bio::SAGE::DataProcessing::Filter
  require the scores to carry out their function.  In the
  default implementations, no filtering is done when scores
  are not defined.

Returns

  The number of sequences read.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  open( SEQDATA, "sequences.fasta" );
  my $nReads = $sage->process_library( *SEQDATA );
  print "NUMBER_READS: $nReads\n";

process_read $sequence, <$scores>

Extracts and filters ditags from the given sequence read (and optionally supplied Phred scores). The default ditag filter (or a filter supplied to set_ditag_filter) is applied during this procedure. The resulting ditags are added to the list of currently processed ditags collected from previous calls to this method.

Ditags with identical sequence (duplicate ditags) are considered the result of PCR artifacts and only the ditag with the highest "score" (as defined by the current ditag Bio::SAGE::DataProcessing::Filter) is retained.

Arguments

$sequence

  The nucleotide sequence of the sequence read to process.

$scores (optional)

  The Phred scores corresponding to the sequence read
  supplied to the method.  The method expects space-separated
  Phred scores (ie. "20 24 54 32" etc.).

Returns

  TRUE (1) if the method was successful, FALSE (0) otherwise.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  my $sequence = "ACGTACGT";
  my $scores = "20 25 34 12 23 45 51 23";
  if( $sage->process_read( $sequence, $scores ) ) {
    print "Successful!\n";
  }

get_ditags

Gets the list of currently extracted and valid ditags stored in this object through calls to process_read.

Arguments

  None.

Returns

  An array of ditag sequences.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  my $sequence = "ACGTACGT";
  my $scores = "20 25 34 12 23 45 51 23";
  if( $sage->process_read( $sequence, $scores ) ) {
    my @ditags = $sage->get_ditags();
    print "Current ditags:\n";
    map { print "  ".$_."\n" } @ditags;
  }

get_tags

Uses the list of currently extracted and valid ditags to generate a list of valid SAGE tags. The default tag filter (or a filter supplied to set_tag_filter) is applied during this procedure.

Arguments

  None.

Returns

  An array of tag sequences.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  my $sequence = "ACGTACGT";
  my $scores = "20 25 34 12 23 45 51 23";
  if( $sage->process_read( $sequence, $scores ) ) {
    my @tags = $sage->get_tags();
    print "Current tags:\n";
    map { print "  ".$_."\n" } @tags;
  }

get_tagcounts

Extracts valid tags from ditags and returns a hashref containing tag sequences paired with their respective counts.

Arguments

None.

Returns

  A hashref where the tag sequence is paired
  with its observed number.

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  my @reads = ( 'ACGTAGACATAGACAAGAGATATAGA',
                'GATAGACAAAGGAAGATTACAAGATTAT' );

  foreach my $read ( @reads ) {
    $sage->process_read( $read );
  }

  # get tag counts
  my %counts = %{$sage->get_tagcounts()};

  # print tag counts
  map { print join( "\t", $_, $counts{$_} ) . "\n" } keys %counts;

get_ditag_base_distribution [$minLength], [$maxLength]

Calculates the distribution of bases at each position and both orientations of a set of ditags. This distribution is used for calculating the 'expected' nucleotide count when determining extra bases using get_extra_base_calculation.

For example:

  CATGAAACCGTATGTAGAGAGGGACACATG
  CATGTAGACAGATAGACACAGATACCATG

has a distribution of:

          +---------------+---------------+
          |    forward    |    reverse    |
    +-----+---+---+---+---+---+---+---+---+
    | pos | A | C | G | T | A | C | G | T |
    +-----+---+---+---+---+---+---+---+---+
    |  0  | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 |
    |  1  | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
    |  2  | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 |
    |  3  | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 |
    |  4  | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
    |  5  | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
    |  6  | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
    |                    ...              |
    | 28  | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    | 29  | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
    +-----+---+---+---+---+---+---+---+---+

Arguments

$minLength (optional)

  Ignore ditags that are smaller than this minimum length.
  If the argument is not supplied, then the minimum
  ditag length for the currently selected protocol
  is used.

$maxLength (optional)

  Ignore ditags that are larger than this maximum length.
  If the argument is not supplied, then the maximum
  ditag length for the currently selected protocol
  is used.

Returns

  A hashref where the key is the zero-based base position
  index, and the value is a hashref where the key is the
  nucleotide and the value is a hashref where the key is
  either 'fwd' or 'rev' and the value is the count of that
  nucleotide (whew!).

  i.e. $HASH{$idx}->{'A'}->{'fwd'} = 23;

Usage

  my $sage = Bio::SAGE::DataProcessing->new();
  my @ditags = ( 'CATGAAACCGTATGTAGAGAGGGACACATG',
                 'CATGTAGACAGATAGACACAGATACCATG' );

  my %DIST = %{$sage->get_ditag_base_distribution()};

  # print distribution table
  foreach my $idx ( sort { $a <=> $b } keys %DIST ) {
      print $idx . " ";
      print join( " ", map { defined( $DIST{$idx}->{$_} ) ? $DIST{$idx}->{$_} : 0 } ( 'A','C','G','T' ) );
      print "\n";
  }

get_ditag_length_distribution

Calculates the distribution of ditag lengths for a set of ditags.

For example:

  CATGAAACCGTATGTAGAGAGGGACACATG
  CATGTAGACAGATAGACACAGATACCATG
  CATGTATCGCGGCATTACGATCTAGAACATG
  CATGACGACTATATCGATAGCTAACCATG

has a distribution of:

    +-----+---+
    | len | N |
    +-----+---+
    |  29 | 2 |
    |  30 | 1 |
    |  31 | 1 |
    +-----+---+

Arguments

None.

Returns

  A hashref where the key is the ditag length, and
  the value is the number of ditags that have this
  length.

  i.e. $HASH{$len} = 1024;

Usage

  my %DIST = %{$sage->get_ditag_length_distribution()};

  # print distribution table
  foreach my $idx ( sort { $a <=> $b } keys %DIST ) {
      print join( "\t", $idx, $DIST{$idx} ) . "\n";
  }

get_extra_base_calculation $tag

Calculates the number of times a given nucleotide is seen at additional positions of a SAGE tag. This method is only supported in methods where there the range of ditag size allows for extra bases to be included in some (or all) population of ditags.

For example, the ditag sequence:

  CATGTATCGCGGCATTACGATCTAGAACATG

becomes:

  CATGTATCGCGGCA and CATGTTCTAGATCG

but an additonal TTA sequence is stored in the middle. Some or all of these extra bases may arise from one or both of the tags.

Arguments

$tag

  The tag sequence that is the focus of the extra base
  calculation.  Only ditags that have this tag sequence
  are considered in the calculation.

  The method checks the length of specified tag and
  checks whether it begins with the expected anchoring
  enzyme site.  If the tag appears to be missing just
  the anchoring enzyme site, it will append this
  automatically.  Otherwise, the method will die.

Returns

  A hashref that is several keys deep in the order:
  extra base position, ditag length, nucleotide.  The
  key is the number of times the keyed event occured.

  In other words:
    $hash->{$position}->{$ditag_length}->{$nucleotide} = $obs;

Usage

  my $dist = $sage->get_extra_base_calculation( "AAACGAATTA" );

  # dump results
  foreach my $ditag_length ( keys %$dist ) {
    foreach my $position ( keys %{$dist->{$ditag_length}} ) {
      foreach my $nucleotide ( keys %{$dist->{$ditag_length}->{$position}} ) {
        print join( ",", $ditag_length, $position, $nucleotide ) .
              "\t" .
              $dist->{$ditag_length}->{$position}->{$nucleotide} .
              "\n";
      }
    }
  }

Prints a formatted report to the specified handle.

An example output looks like:

           +------+------+------+------+
           |   A  |   C  |   G  |   T  |
  +----+---+------+------+------+------+
  | 29 | 0 |  183 |   43 |   31 |   68 |
  | 30 | 0 | 2637 |   23 |   23 |   37 |
  | 31 | 0 | 2624 |    0 |   14 |    0 |
  | 32 | 0 |  665 |    0 |    1 |    0 |
  +----+---+------+------+------+------+
  | 30 | 1 |  639 |  784 |  435 |  862 |
  | 31 | 1 |  188 | 1875 |  198 |  377 |
  | 32 | 1 |    4 |  658 |    0 |    4 |
  +----+---+------+------+------+------+
  | 31 | 2 |  609 |  588 |  355 | 1086 |
  | 32 | 2 |  100 |  204 |  106 |  256 |
  +----+---+------+------+------+------+
  | 32 | 3 |  199 |   95 |   88 |  284 |
  +----+---+------+------+------+------+

The first two columns are the ditag size and the extra base's relative 0-indexed position, respectively. The remaining columns are the four nucleotides and the number of ditags that met the condition described in the first two columns.

In this example, the investigator could strongly reason that the extra nucleotides are AC.

Arguments

$resultRef

  A properly formed hashref containing the results of
  an extra base calculation. This data structure can
  be obtained with the get_extra_base_calculation method
  (see the documentation for that method for more details).

$handle (optional)

  A Perl handle to output the results to.  If this
  argument is not specified, STDOUT is used by default.

Usage

  my $dist = $sage->get_extra_base_calculation( "AAACGAATTA" );

  $sage->print_extra_base_calculation( $dist, *STDERR );

COPYRIGHT

Copyright(c)2004 Scott Zuyderduyn <scottz@bccrc.ca>. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Scott Zuyderduyn <scottz@bccrc.ca> BC Cancer Research Centre

ACKNOWLEDGEMENTS

  Greg Vatcher <gvatcher@bccrc.ca>
  Canada's Michael Smith Genome Sciences Centre <http://www.bcgsc.ca>

VERSION

  1.20

SEE ALSO

  Perl(1).

TODO

  - Add more debugging messages.
  - Method to dump/access statistics collected during processing.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 54:

Non-ASCII character seen before =encoding in 'population’s'. Assuming CP1252