NAME

Bio::RNA::SpliceSites::Scoring::MaxEntScan - Perl module for pre-mRNA splice site scoring by the maxEntScan algorithm of Gene Yeo and Chris Burge.

SYNOPSIS

use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ score5 /;

my $five_prime_splice_site = q/CAGGTTGGC/; #No padding inside q//: spaces would count toward the 9-nucleotide length.

my $five_prime_splice_site_score = score5( \$five_prime_splice_site ); #Return value is a scalar, not a reference.

use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ score3 /;

my $three_prime_splice_site = q/ctctactactatctatctagatc/; #Both scoring subroutines are case-insensitive.

my $three_prime_splice_site_score = score3( \$three_prime_splice_site ); #Returns 6.71.

use Bio::RNA::SpliceSites::Scoring::MaxEntScan qw/ :all /; #Imports both subroutines.

DESCRIPTION

This module scores 5' and 3' splice sites using the maxEntScan algorithm. See the original publication (citation below) for details on the scoring algorithm.

EXPORT

None by default. The following two functions are available for export:

score5 score3

Both of these functions emulate the original maxEntScan scripts of the same names, except that they do not return a sequence string, only the score. See below for descriptions.

The all tag:

:all

...imports both subroutines.

5' splice sites must be 9 nucleotides long: the last 3 nucleotides of the exon followed by the first 6 nucleotides of the intron. 3' splice sites must be 23 nucleotides long: the last 20 nucleotides of the intron followed by the first 3 nucleotides of the exon.

Both functions print an error message to the standard error stream if the splice site passed by reference has the wrong length.

Other errors are an invalid genetic alphabet (the sequence must contain only [ACTGactg] nucleotides; the algorithm does not allow 'N' nucleotides) and passing a non-reference to the scoring subroutine.

On error, the function still returns a value, so that every input produces one output and the structure of an output file is preserved. The error values are:

'invalid_length': the splice site has the wrong length.

'invalid_alphabet': nucleotides other than [ACTGactg] were encountered, and the splice site cannot be scored.

'invalid_invocation': a value that was not a scalar reference was passed to the scoring subroutine.

SUBROUTINES FOR SPLICE SITE SCORING

score5

When passed a reference to a scalar containing a nonamer sequence representing a 5' splice site to score, returns a scalar containing the score.

5' splice sites must be 9 nucleotides long: the last 3 nucleotides of the exon followed by the first 6 nucleotides of the intron.

Both splice site scoring functions print an error message to the standard error stream if the splice site passed by reference has the wrong length.

Other errors are an invalid genetic alphabet (the sequence must contain only [ACTGactg] nucleotides; the algorithm does not allow 'N' nucleotides) and passing a non-reference to the scoring subroutine.

On error, the function still returns a value, so that every input produces one output and the structure of an output file is preserved. The error values are:

'invalid_length': the splice site has the wrong length.

'invalid_alphabet': nucleotides other than [ACTGactg] were encountered, and the splice site cannot be scored.

'invalid_invocation': a value that was not a scalar reference was passed to the scoring subroutine.
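
Because an error comes back in the same position as a score, calling code has to tell the two apart. The following is a minimal sketch of one way to do that, assuming that scores are plain numbers and that the three strings above are the only non-numeric return values. The helper is_error_code() and the values in @results are hypothetical placeholders, not real output from this module.

```perl
use strict;
use warnings;
use Scalar::Util qw/ looks_like_number /;

# The three error strings documented above.
my %ERROR_CODE = map { $_ => 1 } qw/ invalid_length invalid_alphabet invalid_invocation /;

# Hypothetical helper: returns 1 if the value is a documented error string, else 0.
sub is_error_code {
    my ($value) = @_;
    return ( defined $value && exists $ERROR_CODE{$value} ) ? 1 : 0;
}

# Made-up placeholder values standing in for what score5()/score3() might return.
my @results = ( 8.5, 'invalid_length', -3.2, 'invalid_alphabet' );

for my $result (@results) {
    if ( is_error_code($result) ) {
        print "skipped\t$result\n";          # keep the row, flag the reason
    }
    elsif ( looks_like_number($result) ) {
        printf "score\t%.2f\n", $result;     # a usable numeric score
    }
}
```

Checking against the exact error strings, rather than testing only for "not a number", keeps an unexpected return value from being silently treated as a known error.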

score3

When passed a reference to a scalar containing a 23mer sequence representing a 3' splice site to score, returns a scalar containing the score.

3' splice sites must be 23 nucleotides long: the last 20 nucleotides of the intron followed by the first 3 nucleotides of the exon.

score3() reports the same errors as score5(): an invalid subroutine invocation, an invalid 3'ss length, or an invalid genetic alphabet.
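
The window layouts expected by score5() and score3() can be sketched as two substr calls on a longer sequence. This is an illustration only: five_prime_window() and three_prime_window() are hypothetical helpers that are not part of this module, the coordinates are assumed to be 0-based, and the toy sequence is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: 9-mer for a 5' splice site, given the 0-based index
# of the first intron base. Returns undef if the window leaves the sequence.
sub five_prime_window {
    my ( $seq, $intron_start ) = @_;
    my $start = $intron_start - 3;    # back up over the last 3 exon bases
    return if $start < 0 || $start + 9 > length $seq;
    return substr $seq, $start, 9;
}

# Hypothetical helper: 23-mer for a 3' splice site, given the 0-based index
# of the first base of the downstream exon.
sub three_prime_window {
    my ( $seq, $exon_start ) = @_;
    my $start = $exon_start - 20;     # back up over the last 20 intron bases
    return if $start < 0 || $start + 23 > length $seq;
    return substr $seq, $start, 23;   # the remaining bases come from the exon
}

# Toy sequence: exon, intron, exon.
my $exon1  = 'TTCAG';
my $intron = 'GTTGGC' . 'CTCTACTACTATCTATCTAG';
my $exon2  = 'ATCGG';
my $seq    = $exon1 . $intron . $exon2;

print five_prime_window( $seq, length $exon1 ), "\n";              # CAGGTTGGC
print three_prime_window( $seq, length( $exon1 . $intron ) ), "\n"; # CTCTACTACTATCTATCTAGATC
```

Returning undef for a window that runs off either end of the sequence lets the caller skip such sites before scoring them.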

INTERNAL SUBROUTINES

The following subroutines are used internally by the above splice site scoring functions.

get_max_ent_score

Returns the maxEntScore for the 3'ss. This subroutine was developed from the getmaxentscore() subroutine in the original score3.pl script provided with maxEntScan from MIT.

get_splice_5_score_matrix_value

Returns the score matrix value for a provided 5'ss.

get_splice_5_sequence_matrix_value

Returns the sequence matrix value for a provided 5'ss.

hash_seq

Converts an oligonucleotide sequence (all uppercase) to an integer by reading it as a radix-4 (base-4) number, one digit per nucleotide. This approach was used in the original maxEntScan score3.pl program.
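
The conversion can be sketched as follows. This is an illustrative re-implementation, not the module's own code, and it assumes the digit mapping A=0, C=1, G=2, T=3; hash_seq_sketch() is a hypothetical name.

```perl
use strict;
use warnings;

# Illustrative only. Assumed digit mapping: A=0, C=1, G=2, T=3.
sub hash_seq_sketch {
    my ($seq) = @_;
    my %digit = ( A => 0, C => 1, G => 2, T => 3 );
    my $value = 0;
    for my $base ( split //, $seq ) {
        # Shift the running total one base-4 place, then add the new digit.
        $value = $value * 4 + $digit{$base};
    }
    return $value;
}

print hash_seq_sketch('ACGT'), "\n";    # 0*64 + 1*16 + 2*4 + 3 = 27
```

Every k-mer maps to a distinct integer between 0 and 4**k - 1, which makes the result usable as an index into a table of precomputed values.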

is_genetic_alphabet

Returns 1 (TRUE) if the sequence passed to the subroutine is in a valid genetic alphabet, 0 (FALSE) otherwise.

is_kmer

When passed a sequence and an expected length, returns 1 (TRUE) if the sequence is the expected length, and 0 (FALSE) otherwise.

is_scalar_reference

Checks the first argument to see if it is a reference, returning 1 (TRUE) if yes, otherwise 0 (FALSE).
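
The three validation helpers above map naturally onto the three error values the scoring functions return. The sketch below shows one possible arrangement. The *_sketch subroutines are illustrative stand-ins rather than the module's code, the order of the checks is an assumption, and validate_splice_site() is a hypothetical wrapper that returns undef for a valid input.

```perl
use strict;
use warnings;

# Illustrative stand-ins for the internal helpers (not the module's code).
sub is_scalar_reference_sketch {
    my ($arg) = @_;
    return ( ref($arg) eq 'SCALAR' ) ? 1 : 0;
}

sub is_kmer_sketch {
    my ( $seq, $k ) = @_;
    return ( length($seq) == $k ) ? 1 : 0;
}

sub is_genetic_alphabet_sketch {
    my ($seq) = @_;
    return ( $seq =~ /\A[ACGTacgt]+\z/ ) ? 1 : 0;
}

# Hypothetical wrapper: returns an error string, or undef if the input is valid.
# The order of the checks is an assumption.
sub validate_splice_site {
    my ( $arg, $k ) = @_;
    return 'invalid_invocation' unless is_scalar_reference_sketch($arg);
    return 'invalid_length'     unless is_kmer_sketch( ${$arg}, $k );
    return 'invalid_alphabet'   unless is_genetic_alphabet_sketch( ${$arg} );
    return;
}

my $site  = 'CAGGTNGGC';
my $error = validate_splice_site( \$site, 9 );
print defined $error ? "$error\n" : "ok\n";    # invalid_alphabet
```

Testing for a reference first means the later checks can dereference the argument safely.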

log2

Returns the base-2 logarithm of its argument. See the Perl documentation for the built-in `log` function, which returns the natural logarithm.
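
Since the built-in `log` is the natural logarithm, a base-2 logarithm is usually obtained with the change-of-base identity. A minimal sketch, using the hypothetical name log2_sketch() so as not to imply this is the module's exact code:

```perl
use strict;
use warnings;

# Change of base: log2(x) = ln(x) / ln(2).
# Like the built-in log, this dies for arguments that are zero or negative.
sub log2_sketch {
    my ($x) = @_;
    return log($x) / log(2);
}

printf "%.1f\n", log2_sketch(8);    # 3.0
```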

score_consensus

When passed a splice site consensus dinucleotide and the splice site type as an integer (either 5 or 3), scores the splice donor or splice acceptor dinucleotide according to background values specific for the specified splice site type. This subroutine is used by both score5() and score3() subroutines.

split_sequence

When passed a scalar splice site sequence and the splice site type as an integer (either 5 or 3), splits the splice site into the splice donor/acceptor dinucleotide and the concatenated remainder of the scalar. This subroutine is used by both the score5() and score3() subroutines.
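
The split can be sketched with substr. This illustration assumes that the consensus dinucleotide is the first two intron bases of a 5' splice site (positions 4 and 5 of the 9-mer) and the last two intron bases of a 3' splice site (positions 19 and 20 of the 23-mer). The name split_sequence_sketch() and the order of the returned values are hypothetical.

```perl
use strict;
use warnings;

# Illustrative only: returns (dinucleotide, remainder) for a 5' or 3' splice site.
sub split_sequence_sketch {
    my ( $seq, $type ) = @_;
    # 0-based offset of the dinucleotide: after 3 exon bases for a 5' site,
    # after the first 18 intron bases for a 3' site.
    my $offset       = $type == 5 ? 3 : 18;
    my $dinucleotide = substr $seq, $offset, 2;
    my $rest         = substr( $seq, 0, $offset ) . substr( $seq, $offset + 2 );
    return ( $dinucleotide, $rest );
}

my ( $donor, $rest5 ) = split_sequence_sketch( 'CAGGTTGGC', 5 );
print "$donor $rest5\n";    # GT CAGTGGC
```

The remainder is 7 nucleotides long for a 5' splice site and 21 nucleotides long for a 3' splice site.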

SEE ALSO

Algorithm:

Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004;11(2-3):377-94. PMID: 15285897

AUTHOR

Brian Sebastian Cole, <colebr@mail.med.upenn.edu>

COPYRIGHT AND LICENSE

maxEntScan algorithm: Copyright (C) 2004 by Gene Yeo and Chris Burge

This distribution: Copyright (C) 2014,2015 by Brian Sebastian Cole

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

ACKNOWLEDGEMENTS

The author would like to acknowledge the support of his thesis advisor Dr. Kristen Lynch, PhD.

Thanks go to John Karr of the Philadelphia Perl Mongers for the sagacious suggestion of using data submodules to hold splice models.