Bio::RNA::SpliceSites::Scoring::MaxEntScan - Perl module for pre-mRNA splice site scoring by the maxEntScan algorithm of Gene Yeo and Chris Burge.
use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ score5 /;
my $five_prime_splice_site = q/CAGGTTGGC/; #No padding inside the quotes: the string must be exactly 9 nucleotides.
my $five_prime_splice_site_score = score5( \$five_prime_splice_site ); #Return value is a scalar, not a reference.
use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ score3 /;
my $three_prime_splice_site = q/ctctactactatctatctagatc/; #Both scoring subroutines are case-insensitive.
my $three_prime_splice_site_score = score3( \$three_prime_splice_site ); #Returns 6.71.
use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ :all /; #Imports both subroutines.
This module scores 5' and 3' splice sites using the maxEntScan algorithm. See the original publication (citation below) for details on the scoring algorithm.
None by default. The following two functions are available for export:
score5 score3
Both of these functions emulate the original maxEntScan scripts of the same names, except that they do not return a sequence string, only the score. See below for descriptions.
The all tag:
:all
...imports both subroutines.
5' splice sites must be 9 nucleotides long and must contain the 3' (terminal) 3 nucleotides of the exon and the first 6 nucleotides of the 5' end of the intron. 3' splice sites must be 23 nucleotides long and must contain the 3' (terminal) 20 nucleotides of the intron and the first 3 nucleotides of the 5' end of the exon.
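As an illustration of these window definitions, the following minimal sketch cuts a 9-mer and a 23-mer out of a longer pre-mRNA string. The helper names five_prime_window and three_prime_window, the 0-based coordinate convention, and the toy sequence are assumptions made for this example; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helpers (not provided by the module). Offsets are 0-based.

# $intron_start: offset of the first nucleotide of the intron.
sub five_prime_window {
    my ( $seq, $intron_start ) = @_;
    my $start = $intron_start - 3;    # back up over the last 3 exon nucleotides
    return undef if $start < 0 || $start + 9 > length($seq);
    return substr( $seq, $start, 9 );
}

# $exon_start: offset of the first nucleotide of the downstream exon.
sub three_prime_window {
    my ( $seq, $exon_start ) = @_;
    my $start = $exon_start - 20;     # back up over the last 20 intron nucleotides
    return undef if $start < 0 || $start + 23 > length($seq);
    return substr( $seq, $start, 23 );
}

# Toy transcript assembled so that both windows match the example sites.
my $exon1    = 'ACGCAG';
my $intron   = 'GTTGGC' . 'TTTT' . 'CTCTACTACTATCTATCTAG';
my $exon2    = 'ATCGGG';
my $pre_mrna = $exon1 . $intron . $exon2;

my $donor    = five_prime_window( $pre_mrna, length($exon1) );
my $acceptor = three_prime_window( $pre_mrna, length($exon1) + length($intron) );
print "$donor\n$acceptor\n";
```

Each helper returns undef when the window would run off either end of the sequence, so a caller can skip such sites instead of passing a short string to the scoring functions.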
Both functions will provide error messages on the standard error stream if a splice site of improper length is passed by reference.
Additional errors include an invalid genetic alphabet (the sequence must contain only [ACTGactg] nucleotides; the algorithm does not allow 'N' nucleotides) and passing a non-reference to the scoring subroutine(s).
On error, each function still returns a value so that output file structure is preserved. The possible values are:
'invalid_length' An invalid splice site length is provided.
'invalid_alphabet' Nucleotides other than [ACTGactg] were encountered, and the splice site cannot be scored.
'invalid_invocation' A value that was not a scalar reference was passed to the scoring subroutine.
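Because errors come back as ordinary strings rather than exceptions, a caller may want to tell them apart from numeric scores. The sketch below shows one possible way to do that; is_score_error is a hypothetical helper, not something the module exports, and the result rows are stand-ins for values a caller would have collected from score5() or score3().

```perl
use strict;
use warnings;

# The three error strings documented above.
my %IS_ERROR_VALUE = map { $_ => 1 }
    qw( invalid_length invalid_alphabet invalid_invocation );

# Hypothetical helper: returns 1 if a value returned by a scoring
# function is one of the documented error strings, 0 otherwise.
sub is_score_error {
    my ($value) = @_;
    return exists $IS_ERROR_VALUE{$value} ? 1 : 0;
}

# Stand-in results; 6.71 is the 3'ss score quoted in the usage example.
my @results = (
    [ 'ctctactactatctatctagatc', 6.71 ],
    [ 'ctctactactatctatctaga',   'invalid_length' ],
    [ 'ctctactactatctatctagatn', 'invalid_alphabet' ],
);

# One tab-separated row per site, with the error rows flagged.
for my $row (@results) {
    my ( $site, $value ) = @$row;
    my $status = is_score_error($value) ? 'skipped' : 'scored';
    print join( "\t", $site, $value, $status ), "\n";
}
```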
When passed a reference to a scalar containing a nonamer sequence representing a 5' splice site to score, returns a scalar containing the score.
5' splice sites must be 9 nucleotides long and must contain the 3' (terminal) 3 nucleotides of the exon and the first 6 nucleotides of the 5' end of the intron.
Both splice site scoring functions will provide error messages on the standard error stream if a splice site of improper length is passed by reference.
When passed a reference to a scalar containing a 23mer sequence representing a 3' splice site to score, returns a scalar containing the score.
3' splice sites must be 23 nucleotides long and must contain the 3' (terminal) 20 nucleotides of the intron and the first 3 nucleotides of the 5' end of the exon.
The same error messages generated by score5() will be returned for an invalid subroutine invocation, an invalid 3'ss length, or an invalid genetic alphabet.
The following subroutines are used internally by the above splice site scoring functions.
Returns the maxEntScore for the 3'ss. This subroutine was developed from the getmaxentscore() subroutine in the original score3.pl script provided with maxEntScan from MIT.
Returns the score matrix value for a provided 5'ss.
Returns the sequence matrix value for a provided 5'ss.
Converts an oligonucleotide sequence (all uppercase) to a 4-radix integer. This approach was used in the original maxEntScan score3.pl program.
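A base-4 conversion of this kind can be sketched as follows. The subroutine name base4_index is hypothetical, and the digit assignment A=0, C=1, G=2, T=3 is an assumption for illustration; the module's internal subroutine may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical name. Assumes an uppercase sequence over A, C, G, T and
# the digit mapping A=0, C=1, G=2, T=3.
sub base4_index {
    my ($seq) = @_;
    my %digit = ( A => 0, C => 1, G => 2, T => 3 );
    my $value = 0;
    for my $nt ( split //, $seq ) {
        # Shift the running value one base-4 place and add the new digit.
        $value = $value * 4 + $digit{$nt};
    }
    return $value;
}

print base4_index('ACGT'), "\n";    # 0*64 + 1*16 + 2*4 + 3 = 27
```

Every k-mer maps to a distinct integer between 0 and 4**k - 1, which makes the result usable as an index into a precomputed score table.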
Returns 1 (TRUE) if the sequence passed to the subroutine is in a valid genetic alphabet, 0 (FALSE) otherwise.
When passed a sequence and an expected length, returns 1 (TRUE) if the sequence is the expected length, and 0 (FALSE) otherwise.
Checks the first argument to see if it is a reference, returning 1 (TRUE) if yes, otherwise 0 (FALSE).
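The three checks described above might look roughly like the sketch below. All subroutine names here are hypothetical, the order in which validate_site applies the checks is an assumption, and 'ok' is a placeholder return value invented for this example.

```perl
use strict;
use warnings;

# Hypothetical names throughout; each predicate returns 1 or 0.

sub has_valid_alphabet {
    my ($seq) = @_;
    return $seq =~ /\A[ACGTacgt]+\z/ ? 1 : 0;
}

sub has_expected_length {
    my ( $seq, $expected ) = @_;
    return length($seq) == $expected ? 1 : 0;
}

sub is_scalar_reference {
    my ($arg) = @_;
    return ref($arg) eq 'SCALAR' ? 1 : 0;
}

# Assumed order: invocation first, then length, then alphabet.
# Returns a documented error string, or the placeholder 'ok'.
sub validate_site {
    my ( $arg, $expected ) = @_;
    return 'invalid_invocation' unless is_scalar_reference($arg);
    return 'invalid_length'     unless has_expected_length( $$arg, $expected );
    return 'invalid_alphabet'   unless has_valid_alphabet($$arg);
    return 'ok';
}

my $site = 'CAGGTTGGC';
print validate_site( \$site, 9 ), "\n";    # passes all three checks
print validate_site( $site,  9 ), "\n";    # not a reference
```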
Returns the base-2 logarithm of its argument. See the documentation for Perl's built-in `log` function, which computes the natural logarithm.
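Perl's built-in log computes natural logarithms, so a base-2 logarithm is usually obtained by a change of base. A minimal sketch, using the hypothetical name log_base2:

```perl
use strict;
use warnings;

# Hypothetical name. log() is Perl's natural logarithm, so dividing by
# log(2) changes the base to 2.
sub log_base2 {
    my ($x) = @_;
    return log($x) / log(2);
}

printf "%.4f\n", log_base2(8);    # prints 3.0000
```

As with the built-in, the argument must be positive; Perl raises an error when asked for the logarithm of zero or a negative number.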
When passed a splice site consensus dinucleotide and the splice site type as an integer (either 5 or 3), scores the splice donor or splice acceptor dinucleotide according to background values specific for the specified splice site type. This subroutine is used by both score5() and score3() subroutines.
When passed a scalar splice site sequence and the splice site type as an integer (either 5 or 3), splits the splice site into the splice donor/acceptor dinucleotide and the concatenated remainder of the scalar. This subroutine is used by both the score5() and score3() subroutines.
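The split described above can be sketched as follows. The name split_splice_site is hypothetical, and the offsets are inferred from the window definitions: the donor dinucleotide follows the 3 exonic nucleotides of the 9-mer, and the acceptor dinucleotide is the last 2 of the 20 intronic nucleotides of the 23-mer.

```perl
use strict;
use warnings;

# Hypothetical name. $type is 5 (donor) or 3 (acceptor). Returns the
# consensus dinucleotide and the rest of the site joined together.
sub split_splice_site {
    my ( $site, $type ) = @_;
    # 0-based offset of the consensus dinucleotide within the site.
    my $offset       = $type == 5 ? 3 : 18;
    my $dinucleotide = substr( $site, $offset, 2 );
    my $remainder    = substr( $site, 0, $offset ) . substr( $site, $offset + 2 );
    return ( $dinucleotide, $remainder );
}

my ( $donor, $rest5 ) = split_splice_site( 'CAGGTTGGC', 5 );
print "$donor $rest5\n";    # GT CAGTGGC
```

For a 5' site the remainder is 7 nucleotides long, and for a 3' site it is 21, which are the portions left for the position-dependent part of the score.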
Algorithm:
Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004;11(2-3):377-94. PMID: 15285897
Brian Sebastian Cole, <colebr@mail.med.upenn.edu>
maxEntScan algorithm: Copyright (C) 2004 by Gene Yeo and Chris Burge
This distribution: Copyright (C) 2014, 2015 by Brian Sebastian Cole
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
The author would like to acknowledge the support of his thesis advisor Dr. Kristen Lynch, PhD.
Thanks go to John Karr of the Philadelphia Perl Mongers for the sagacious suggestion of using data submodules to hold splice models.
To install Bio::RNA::SpliceSites::Scoring::MaxEntScan, copy and paste the appropriate command into your terminal.
cpanm
cpanm Bio::RNA::SpliceSites::Scoring::MaxEntScan
CPAN shell
perl -MCPAN -e shell
install Bio::RNA::SpliceSites::Scoring::MaxEntScan
For more information on module installation, please visit the detailed CPAN module installation guide.