Bio::SAGE::DataProcessing - Processes raw serial analysis of gene expression (SAGE) data.
use Bio::SAGE::DataProcessing; $sage = Bio::SAGE::DataProcessing->new(); # open sequence and quality files open( READS, "library.fasta" ); open( QUAL, "library.qual.fasta" ); # collect ditags and statistics from reads $sage->process_library( *READS, *QUAL ); # close files close( READS ); close( QUAL ); # output tags in descending order of expression my %tags = %{$sage->get_tagcounts()}; open( TAGS, ">library.tags" ); map { print TAGS join( "\t", $_, $tags{$_} ) . "\n" } sort { $tags{$b} <=> $tags{$a} } keys %tags; close( TAGS ); # tag AAACCGGGTT matches two different genes # so 15th base pair may help resolve this $sage->print_extra_base_calculation( $sage->get_extract_base_calculation( "AAACCGGGTT" ) );
This module provides several tools for processing and analyzing serial analysis of gene expression (SAGE) libraries.
BACKGROUND
Serial analysis of gene expression (SAGE) is a molecular technique for generating a near-global snapshot of a cell population’s transcriptome. Briefly, the technique extracts short sequences at defined positions of transcribed mRNA. These short sequences are then paired to form ditags. The ditags are concatamerized to form long sequences that are then cloned. The cloned DNA is then sequenced. Bioinformatic techniques are then employed to determine the original short tag sequences, and to derive their progenitor mRNA. The number of times a particular tag is observed can be used to quantitate the amount of a particular transcript. The original technique was described by Velculescu et al. (1995) and utilized an ~14bp sequence tag. A modified protocol was introduced by Saha et al. (2002) that produced ~21bp tags.
PURPOSE
This module facilitates the processing of SAGE data. Specifically:
1. extracting ditags from raw sequence reads. 2. extracting tags from ditags, with the option to exclude tags if the Phred scores (described by Ewing and Green, 1998a and Ewing et al., 1998b) do not meet a minimum cutoff value. 3. calculating descriptive values 4. statistical analysis to determine, where possible, additional nucleotides to extend the length of the SAGE tag (thus facilitating more accurate tag to gene mapping).
Both regular SAGE (14mer tag) and LongSAGE (21mer tag) are supported by this module. Future protocols should be configurable with this module.
REFERENCES
Velculescu V, Zhang L, Vogelstein B, Kinzler KW. (1995) Serial analysis of gene expression. Science. 270:484-487. Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu V. (2002) Using the transcriptome to annotate the genome. Nat. Biotechnol. 20:508-512. Ewing B, Hillier L, Wendl MC, Green P. (1998a) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185. Ewing B, Green P. (1998b) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.
Follow the usual steps for installing any Perl module:
perl Makefile.PL make test make install
None (this module used to require the Statistics::Distributions package).
Statistics::Distributions
1.20 2004.10.15 - Minor spelling errors and misuse of terminology fixed in docs. - Module now allows FASTA files with a blank line folling the header (denoting an attempted read with no sequence), but prints a warning to STDERR that this has happened. Module died previously. 1.11 2004.06.20 - Added flag in constructor to keep duplicate ditags. 1.10 2004.06.02 - Wrote new documentation and modified several methods to use the read-by-read processing approach (see line below). - Revamped the module to conserve memory. Reads are now processed one at a time and then discarded. The memory requirements in the previous versions were prohibitive to those with regular desktop machines. - The Bio::SAGE::DataProcessing::Filter package can be subclassed to create custom filters at the ditag and tag processing steps (previous versions only allowed one approach to ditag/tag filtering). 1.01 2004.05.27 - Fixed bug where extract_tag_counts didn't work with quality cutoff defined. - extract_tags was not applying the get_quality_cutoff value (was returning all data) - Duplicate ditags are now removed by default. 1.00 2004.05.23 - Initial release.
Globals
$PROTOCOL_SAGE
Hashref containing default protocol parameters for the regular/original SAGE protocol (see set_protocol documentation for more information).
$PROTOCOL_LONGSAGE
Hashref containing default protocol parameters for the LongSAGE protocol (see set_protocol documentation for more information).
$ENZYME_NLAIII = 'CATG'
Constant denoting the recognition sequence for NlaIII.
$ENZYME_SAU3A = 'GATC'
Constant denoting the recognition sequence for Sau3a.
$DEFAULT_TAG_FILTER
A default tag filter used when none is specified. This filter rejects tags that contain any base pair with a Phred quality score < 15 or an average Phred quality score over all bases < 30.
$DEFAULT_DITAG_FILTER
A default ditag filter used when none is specified. This filter rejects ditags that contain any base pair with a Phred quality score < 20.
Settings
$DEBUG = 0
Prints debugging output if value if >= 1.
Constructor for a new Bio::SAGE::DataProcessing object.
Arguments
$enzyme (optional)
The anchoring enzyme recognition site. This is typically NlaIII (CATG) or Sau3a (GATC). The default is the recognition sequence "CATG" (NlaIII).
\%protocol (optional)
A hashref containing specifics of the protocol. Two pre-made parameter sets are available: $PROTOCOL_SAGE (regular SAGE) and $PROTOCOL_LONGSAGE (LongSAGE). The required hash keys: MINIMUM_DITAG_LENGTH | The shortest length a valid ditag can be (the length should include the anchoring enzyme site sequences). MAXIMUM_DITAG_LENGTH | The longest length a valid ditag can be (the length should include the anchoring enzyme site sequences). TAG_LENGTH | The expected tag length (the length should NOT include the anchoring enzyme site sequence). IGNORE_TAGS | An arrayref listing tag sequences that should be ignored during tag extraction (i.e. linker-derived sequences). The parameters for the default configurations are: +----------------+--------------------+ | $PROTOCOL_SAGE | $PROTOCOL_LONGSAGE | +----------------------+----------------+--------------------+ | MINIMUM_DITAG_LENGTH | 29 | 40 | +----------------------+----------------+--------------------+ | MAXIMUM_DITAG_LENGTH | 32 | 46 | +----------------------+----------------+--------------------+ | TAG_LENGTH | 10 | 17 | +----------------------+----------------+--------------------+ | IGNORE_TAGS | TCCCTATTAA | TCGGACGTACATCGTTA | | | TCCCCGTACA | TCGGATATTAAGCCTAG | +----------------------+----------------+--------------------+ Once the Bio::SAGE::DataProcessing object has been instantiated, the enzyme and protocol CANNOT be altered.
$bKeepDuplicates (optional)
If this argument is TRUE (non-zero) then ditags with identical sequence will be kept. The default behavior is to discard these ditags as they likely represent preferential PCR amplification.
Usage
my $sage = Bio::SAGE::DataProcessing->new( $Bio::SAGE::DataProcessing::ENZYME_NLAIII, $Bio::SAGE::DataProcessing::PROTOCOL_LONGSAGE ); # alternatively: my $sage = Bio::SAGE::DataProcessing->new( "CATG", { 'MINIMUM_DITAG_LENGTH' => 31, 'MAXIMUM_DITAG_LENGTH' => 34, 'TAG_LENGTH' => 11, 'IGNORE_TAGS' => \( 'TCCCTATTAA', 'TCCCCGTACA' ) } );
Sets a new filter object (a concrete subclass of Bio::SAGE::DataProcessing::Filter) that is applied to ditags as they're extracted from sequence reads.
The default filter uses a Bio::SAGE::DataProcessing::MinimumPhredFilter instance that is set to reject any ditags that do not have at least Phred 20 (p<=0.01) for all nucleotides.
$filterObject
An object ref to a concrete subclass of Bio::SAGE::DataProcessing::Filter.
Sets a new filter object (a concrete subclass of Bio::SAGE::DataProcessing::Filter) that is applied to tags as they're extracted from ditags.
The default filter uses a Bio::SAGE::DataProcessing::AveragePhredFilter instance that is set to reject any tags that do not have an average Phred 30 score (p<=0.001) over all nucleotides and at least Phred 15 (p<=0.0316) at each nucleotide.
Gets the current anchoring enzyme recognition site.
None.
Returns
The current anchoring enzyme recognition site. By default, this will be 'CATG', the NlaIII recognition site.
my $sage = Bio::SAGE::DataProcessing->new(); print "ENZYME_SITE: " . $sage->get_enzyme() . "\n";
Gets a copy (in other words, modifying the returned hash does not affect the object's settings) of the hash describing the protocol for this instance.
A hashref of the current protocol settings. See the documentation for new (constructor) for more details on the contents of this hash.
my $sage = Bio::SAGE::DataProcessing->new(); print "Default protocol:\n"; my %protocol = %{$sage->get_protocol()}; map { print $_ . "=" . $protocol{$_} . "\n" } keys %protocol;
Processes reads from a Perl handle to FASTA format sequence data. Phred scores in matching FASTA format can be read concurrently.
An example of FASTA format sequence:
>SEQUENCE0001 ACAGATAGACAGAGATATAGAGACATATTTAGAGACAAATCGCGCAGGCGCGCGACATA TGACTAGTTTATATCATCAGTATTAGCGATTATGACTATTATATATTACTGATTGATCT ATAGCGCGATTATATCTATCTATGCATTCGATCATGCTATTATCGTATACTACTGCTAG AGAGGACGACGCAGGCAGCGATTATATCTATTTATA >SEQUENCE0002 CGCGACGCATGTCAGTAGCTAGCTGCGCCCGAATATATATCGTCATACGGATTCGTAGC CCCCCGCGGAGTCTGATTATATCTGATT
An example of FASTA format quality data:
>SEQUENCE0001 11 17 18 16 19 17 19 19 16 19 19 16 11 10 9 15 10 12 24 24 35 29 29 39 40 40 40 40 37 37 46 46 40 40 40 40 56 56 56 56 35 35 35 35 35 35 56 40 40 46 40 40 39 39 35 39 56 56 51 51 51 51 51 56 35 35 35 35 35 35 40 40 51 45 51 51 39 39 39 39 39 39 40 40 40 40 40 40 56 56 56 56 56 46 46 40 39 39 39 45 45 45 56 56 56 56 56 56 56 56 40 39 39 39 39 35 39 39 39 39 45 56 45 45 45 45 51 35 39 39 39 39 39 40 40 40 40 40 51 56 56 40 40 40 40 40 43 56 56 56 43 35 35 35 35 35 43 45 45 45 45 45 45 51 51 51 51 51 51 56 56 56 56 56 56 51 51 51 56 56 7 7 9 9 11 10 13 11 10 8 10 10 8 8 8 10 10 >SEQUENCE0002 12 15 17 17 19 15 15 15 13 19 17 17 12 16 11 19 13 24 24 35 35 35 37 37 39 56 56 56 56 56 51 39 32 29 29 29 29 32 56 56 56 35 35 35 35 35 35 56 56 56 56 56 56 56 56 56 56 56 56 40 40 40 46 46 40 51 40 40 40 40 40 40 51 39 39 35 35 35 35 40 40 51 45 45 45 45 51 51 56 56 56 56 56 45 45 45 45 51 51 45 45 45 40 40 40 40 40 40 40 40 40 40 56 56 56 56 56 56 51 51 15 13 19 17 17 12 16 11 19 13
$sequence_handle
A Perl handle to sequence data in FASTA format.
$scores_handle (optional)
A Perl handle to Phred scores in FASTA format. The order of entries must match the $sequence_handle data. NOTE: many implementations of Bio::SAGE::DataProcessing::Filter require the scores to carry out their function. In the default implementations, no filtering is done when scores are not defined.
The number of sequences read.
my $sage = Bio::SAGE::DataProcessing->new(); open( SEQDATA, "sequences.fasta" ); my $nReads = $sage->process_library( *SEQDATA ); print "NUMBER_READS: $nReads\n";
Extracts and filters ditags from the given sequence read (and optionally supplied Phred scores). The default ditag filter (or a filter supplied to set_ditag_filter) is applied during this procedure. The resulting ditags are added to the list of currently processed ditags collected from previous calls to this method.
Ditags with identical sequence (duplicate ditags) are considered the result of PCR artifacts and only the ditag with the highest "score" (as defined by the current ditag Bio::SAGE::DataProcessing::Filter) is retained.
$sequence
The nucleotide sequence of the sequence read to process.
$scores (optional)
The Phred scores corresponding to the sequence read supplied to the method. The method expects space-separated Phred scores (ie. "20 24 54 32" etc.).
TRUE (1) if the method was successful, FALSE (0) otherwise.
my $sage = Bio::SAGE::DataProcessing->new(); my $sequence = "ACGTACGT"; my $scores = "20 25 34 12 23 45 51 23"; if( $sage->process_read( $sequence, $scores ) ) { print "Successful!\n"; }
Gets the list of currently extracted and valid ditags stored in this object through calls to process_read.
An array of ditag sequences.
my $sage = Bio::SAGE::DataProcessing->new(); my $sequence = "ACGTACGT"; my $scores = "20 25 34 12 23 45 51 23"; if( $sage->process_read( $sequence, $scores ) ) { my @ditags = $sage->get_ditags(); print "Current ditags:\n"; map { print " ".$_."\n" } @ditags; }
Uses the list of currently extracted and valid ditags to generate a list of valid SAGE tags. The default tag filter (or a filter supplied to set_tag_filter) is applied during this procedure.
An array of tag sequences.
my $sage = Bio::SAGE::DataProcessing->new(); my $sequence = "ACGTACGT"; my $scores = "20 25 34 12 23 45 51 23"; if( $sage->process_read( $sequence, $scores ) ) { my @tags = $sage->get_tags(); print "Current tags:\n"; map { print " ".$_."\n" } @tags; }
Extracts valid tags from ditags and returns a hashref containing tag sequences paired with their respective counts.
A hashref where the tag sequence is paired with its observed number.
my $sage = Bio::SAGE::DataProcessing->new(); my @reads = ( 'ACGTAGACATAGACAAGAGATATAGA', 'GATAGACAAAGGAAGATTACAAGATTAT' ); foreach my $read ( @reads ) { $sage->process_read( $read ); } # get tag counts my %counts = %{$sage->get_tagcounts()}; # print tag counts map { print join( "\t", $_, $counts{$_} ) . "\n" } keys %counts;
Calculates the distribution of bases at each position and both orientations of a set of ditags. This distribution is used for calculating the 'expected' nucleotide count when determining extra bases using get_extra_base_calculation.
For example:
CATGAAACCGTATGTAGAGAGGGACACATG CATGTAGACAGATAGACACAGATACCATG
has a distribution of:
+---------------+---------------+ | forward | reverse | +-----+---+---+---+---+---+---+---+---+ | pos | A | C | G | T | A | C | G | T | +-----+---+---+---+---+---+---+---+---+ | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | | 6 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | | ... | | 28 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | | 29 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | +-----+---+---+---+---+---+---+---+---+
$minLength (optional)
Ignore ditags that are smaller than this minimum length. If the argument is not supplied, then the minimum ditag length for the currently selected protocol is used.
$maxLength (optional)
Ignore ditags that are larger than this maximum length. If the argument is not supplied, then the maximum ditag length for the currently selected protocol is used.
A hashref where the key is the zero-based base position index, and the value is a hashref where the key is the nucleotide and the value is a hashref where the key is either 'fwd' or 'rev' and the value is the count of that nucleotide (whew!). i.e. $HASH{$idx}->{'A'}->{'fwd'} = 23;
my $sage = Bio::SAGE::DataProcessing->new(); my @ditags = ( 'CATGAAACCGTATGTAGAGAGGGACACATG', 'CATGTAGACAGATAGACACAGATACCATG' ); my %DIST = %{$sage->get_ditag_base_distribution()}; # print distribution table foreach my $idx ( sort { $a <=> $b } keys %DIST ) { print $idx . " "; print join( " ", map { defined( $DIST{$idx}->{$_} ) ? $DIST{$idx}->{$_} : 0 } ( 'A','C','G','T' ) ); print "\n"; }
Calculates the distribution of ditag lengths for a set of ditags.
CATGAAACCGTATGTAGAGAGGGACACATG CATGTAGACAGATAGACACAGATACCATG CATGTATCGCGGCATTACGATCTAGAACATG CATGACGACTATATCGATAGCTAACCATG
+-----+---+ | len | N | +-----+---+ | 29 | 2 | | 30 | 1 | | 31 | 1 | +-----+---+
A hashref where the key is the ditag length, and the value is the number of ditags that have this length. i.e. $HASH{$len} = 1024;
my %DIST = %{$sage->get_ditag_length_distribution()}; # print distribution table foreach my $idx ( sort { $a <=> $b } keys %DIST ) { print join( "\t", $idx, $DIST{$idx} ) . "\n"; }
Calculates the number of times a given nucleotide is seen at additional positions of a SAGE tag. This method is only supported in methods where there the range of ditag size allows for extra bases to be included in some (or all) population of ditags.
For example, the ditag sequence:
CATGTATCGCGGCATTACGATCTAGAACATG
becomes:
CATGTATCGCGGCA and CATGTTCTAGATCG
but an additonal TTA sequence is stored in the middle. Some or all of these extra bases may arise from one or both of the tags.
$tag
The tag sequence that is the focus of the extra base calculation. Only ditags that have this tag sequence are considered in the calculation. The method checks the length of specified tag and checks whether it begins with the expected anchoring enzyme site. If the tag appears to be missing just the anchoring enzyme site, it will append this automatically. Otherwise, the method will die.
A hashref that is several keys deep in the order: extra base position, ditag length, nucleotide. The key is the number of times the keyed event occured. In other words: $hash->{$position}->{$ditag_length}->{$nucleotide} = $obs;
my $dist = $sage->get_extra_base_calculation( "AAACGAATTA" ); # dump results foreach my $ditag_length ( keys %$dist ) { foreach my $position ( keys %{$dist->{$ditag_length}} ) { foreach my $nucleotide ( keys %{$dist->{$ditag_length}->{$position}} ) { print join( ",", $ditag_length, $position, $nucleotide ) . "\t" . $dist->{$ditag_length}->{$position}->{$nucleotide} . "\n"; } } }
Prints a formatted report to the specified handle.
An example output looks like:
+------+------+------+------+ | A | C | G | T | +----+---+------+------+------+------+ | 29 | 0 | 183 | 43 | 31 | 68 | | 30 | 0 | 2637 | 23 | 23 | 37 | | 31 | 0 | 2624 | 0 | 14 | 0 | | 32 | 0 | 665 | 0 | 1 | 0 | +----+---+------+------+------+------+ | 30 | 1 | 639 | 784 | 435 | 862 | | 31 | 1 | 188 | 1875 | 198 | 377 | | 32 | 1 | 4 | 658 | 0 | 4 | +----+---+------+------+------+------+ | 31 | 2 | 609 | 588 | 355 | 1086 | | 32 | 2 | 100 | 204 | 106 | 256 | +----+---+------+------+------+------+ | 32 | 3 | 199 | 95 | 88 | 284 | +----+---+------+------+------+------+
The first two columns are the ditag size and the extra base's relative 0-indexed position, respectively. The remaining columns are the four nucleotides and the number of ditags that met the condition described in the first two columns.
In this example, the investigator could strongly reason that the extra nucleotides are AC.
$resultRef
A properly formed hashref containing the results of an extra base calculation. This data structure can be obtained with the get_extra_base_calculation method (see the documentation for that method for more details).
$handle (optional)
A Perl handle to output the results to. If this argument is not specified, STDOUT is used by default.
my $dist = $sage->get_extra_base_calculation( "AAACGAATTA" ); $sage->print_extra_base_calculation( $dist, *STDERR );
Copyright(c)2004 Scott Zuyderduyn <scottz@bccrc.ca>. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Scott Zuyderduyn <scottz@bccrc.ca> BC Cancer Research Centre
Greg Vatcher <gvatcher@bccrc.ca> Canada's Michael Smith Genome Sciences Centre <http://www.bcgsc.ca>
1.20
Perl(1).
- Add more debugging messages. - Method to dump/access statistics collected during processing.
1 POD Error
The following errors were encountered while parsing the POD:
Non-ASCII character seen before =encoding in 'population’s'. Assuming CP1252
To install Bio::SAGE::DataProcessing, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Bio::SAGE::DataProcessing
CPAN shell
perl -MCPAN -e shell install Bio::SAGE::DataProcessing
For more information on module installation, please visit the detailed CPAN module installation guide.