NAME
BioX::SeqUtils::RandomSequence - Creates a random nucleotide or protein sequence with given nucleotide frequencies
VERSION
This document describes BioX::SeqUtils::RandomSequence version 0.9.4
SYNOPSIS
The randomizer object accepts parameters for sequence length (l), codon table (s), sequence type (y), and frequencies for each of the nucleotide bases in DNA (a, c, g, t). The defaults are shown below:
use BioX::SeqUtils::RandomSequence;
my $randomizer = BioX::SeqUtils::RandomSequence->new({ l => 2,
s => 1,
y => "dna",
a => 1,
c => 1,
g => 1,
t => 1 });
print $randomizer->rand_seq(), "\n";
DESCRIPTION
Create random DNA, RNA and protein sequences.
NUCLEOTIDE FREQUENCIES
All four frequencies are set to "1" by default (so that the probability of each of A, C, G, and T is 0.25). The frequencies should always be positive integers, and you should choose them with the algorithm in mind. The algorithm creates a template whose length L is the sum of A_freq, C_freq, G_freq, and T_freq, containing exactly that many of each base. The template is randomly reordered for each L-length part of the required sequence, and the result is trimmed to the required length. For example, using the default frequencies, a sequence 100 bases long will have exactly 25 A, 25 C, 25 G, and 25 T. If you want sequences from a wider distribution, use four-digit (or greater) values for the frequencies. For a sequence length of a few dozen bases, this example would be broad enough to create repeat islands: ($A_freq, $C_freq, $G_freq, $T_freq) = (2245, 2755, 2755, 2245).
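The template approach described above can be sketched in a few lines of plain Perl. This is an illustration only, not the module's actual code: the helper name template_rand_dna is hypothetical, and List::Util's shuffle is assumed as the reordering step.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (not part of the module's API): builds a sequence by
# repeatedly shuffling a template that holds each base in its assigned count.
sub template_rand_dna {
    my ($length, %freq) = @_;

    # Template of length L = a + c + g + t with exact counts of each base.
    my @template;
    for my $base (qw(A C G T)) {
        push @template, ($base) x $freq{ lc $base };
    }

    # Append one freshly shuffled template per L-length part of the sequence.
    my $seq = '';
    while (length($seq) < $length) {
        $seq .= join '', shuffle @template;
    }

    # Trim any overshoot to the required length.
    return substr($seq, 0, $length);
}

# Default frequencies: every block of four bases holds one of each base.
my $dna = template_rand_dna(100, a => 1, c => 1, g => 1, t => 1);

my %count;
$count{$_}++ for split //, $dna;

print "$dna\n";
print join(' ', map { "$_=$count{$_}" } sort keys %count), "\n";
```

With the default frequencies the counts are fixed at 25 each because 100 is a multiple of the template length 4. Larger frequency values lengthen the template, so a short sequence is cut from a single shuffled template and its composition can vary more.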
NUCLEOTIDE FREQUENCIES UNDERLIE PROTEINS
Protein sequences are translated from random DNA sequence of the necessary length using the assigned nucleotide frequencies. This module does not allow you to directly influence the amino acid frequencies. If you need this sort of functionality, please contact the author.
METHODS
rand_seq()
After creating a randomizer object, each sequence type can be accessed using the "y" (tYpe) parameter with rand_seq(). The default type is "2" (for dinucleotide, a length two dna sequence). The other types are "d" (dna), "r" (rna), "p" (protein), and "s" (protein set).
You can use the same randomizer object to create all types of sequences, by passing the changing parameters with each call.
my $dinucleotide = $randomizer->rand_seq(); # Default settings
my $nuc_short = $randomizer->rand_seq({ y => 'd', l => 21 }); # Create DNA length 21
my $nuc_long = $randomizer->rand_seq({ l => 2200 }); # Still DNA, now length 2200
my $nuc_richer = $randomizer->rand_seq({ a => 225, c => 275, g => 275, t => 225 }); # Still length 2200, GC richer
my $protein_now = $randomizer->rand_seq({ y => 'p' }); # Still richer GC
my $protein_def = $randomizer->rand_seq({ a => 1 }); # Missing bases resets all freq to 1
my $protein_new = $randomizer->rand_seq({ y => 'p', s => 3 }); # Use codon table 'Yeast Mitochondrial'
The type parameter only works with rand_seq().
rand_dna()
This method may be used directly to create DNA sequences.
my $dinucleotide = $randomizer->rand_dna();
my $dna = $randomizer->rand_dna({ l => 2200 });
$dna = $randomizer->rand_dna({ l => 200, a => 225, c => 275, g => 275, t => 225 }); # Larger variance
rand_rna()
This method may be used directly to create RNA sequences.
my $rna = $randomizer->rand_rna({ l => 21 });
$rna = $randomizer->rand_rna({ l => 1000, a => 225, c => 275, g => 275, t => 225 });
rand_pro()
This method may be used directly to create protein sequences.
A protein of the given length L is created by translating a random DNA sequence of length L * 3 with the given nucleotide frequencies.
my $protein = $randomizer->rand_pro();
rand_pro_set()
This method may be used directly to create a protein sequence set.
A protein set is correlatable at the DNA level by creating a random DNA sequence with the given nucleotide frequencies of length L * 3 + 1, removing the first base for sequence 1 and removing the last base for sequence 2, then translating them into proteins.
This method uses wantarray(), and will either return a list or list reference (scalar) depending on the context:
my ($pro1, $pro2) = $randomizer->rand_pro_set();
my $protein_set = $randomizer->rand_pro_set();
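The construction of the shared DNA and the context-sensitive return can be illustrated at the DNA level with a short sketch. The helper names uniform_dna and shifted_dna_pair are hypothetical and are not part of the module. A uniform random generator stands in for the frequency-based one, and the translation step (which the module performs with Bio::Tools::CodonTable) is left out so the example needs only core Perl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's DNA generator: uniform random bases.
sub uniform_dna {
    my ($n) = @_;
    my @bases = qw(A C G T);
    return join '', map { $bases[ int rand @bases ] } 1 .. $n;
}

# Hypothetical helper: returns the two frame-shifted DNA sequences that would
# be translated into the protein set. Returns a list in list context and an
# array reference in scalar context, using wantarray().
sub shifted_dna_pair {
    my ($aa_length) = @_;

    # Shared DNA of length L * 3 + 1.
    my $shared = uniform_dna($aa_length * 3 + 1);

    my @pair = (
        substr($shared, 1),        # sequence 1: first base removed
        substr($shared, 0, -1),    # sequence 2: last base removed
    );
    return wantarray ? @pair : \@pair;
}

my ($dna1, $dna2) = shifted_dna_pair(5);    # list context
my $pair_ref      = shifted_dna_pair(5);    # scalar context

print "$dna1\n$dna2\n";
print scalar(@$pair_ref), " sequences in the set\n";
```

Each sequence has length L * 3, and the two share all but one base, offset by a single position, so the resulting proteins come from the same DNA read in two different frames.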
SCRIPTS
The package includes scripts for random dna, rna, dinucleotide, and protein sequences. The length and frequency parameters should always be integers.
To create a dinucleotide sequence:
./random-dna.pp # Defaults: length 2, all frequencies 1
./random-dna.pp -a250 -c250 -g250 -t250 # Create broader distribution
To create a dna sequence:
./random-dna.pp -l21 # Defaults: all frequencies 1 ( p = .25 )
./random-dna.pp -l2200 -a23 -c27 -g27 -t23 # Enrich GC content with length 2200
To create a rna sequence:
./random-rna.pp -l100
./random-rna.pp -l2200 -a23 -c27 -g27 -t23
To create a protein sequence:
./random-protein.pp # Defaults: length 2, all frequencies 1 ( p = .25 )
./random-protein.pp -l2200 -a23 -c27 -g27 -t23 # Enrich underlying GC content, aa length 2200
To create a protein set (with common DNA shifted by one base):
./random-protein-set.pp # Defaults: length 2, all frequencies 1 ( p = .25 )
./random-protein-set.pp -l2200 -a23 -c27 -g27 -t23 # Enrich underlying GC content
Additionally, a "master script" accepts a tYpe parameter (y) to create any of the sequence types:
./random-sequence.pp # Type 2 dinucleotide
./random-sequence.pp -yd -l100 # Type d dna
./random-sequence.pp -yr -l100 # Type r rna
./random-sequence.pp -yp -l100 # Type p protein
./random-sequence.pp -ys -l100 # Type s protein set
This module uses Bio::Tools::CodonTable for translations, and the parameter s can be used to change from the default (1) "Standard":
./random-protein.pp -l2200 -s2 # Non-standard codon table
CONFIGURATION AND ENVIRONMENT
None.
DEPENDENCIES
Class::Std;
Class::Std::Utils;
Bio::Tools::CodonTable;
INCOMPATIBILITIES
None reported.
BUGS AND LIMITATIONS
No bugs have been reported.
Please report any bugs or feature requests to bug-biox-sequtils-randomsequence@rt.cpan.org, or through the web interface at http://rt.cpan.org.
AUTHOR
Roger A Hall <rogerhall@cpan.org>
LICENSE AND COPYRIGHT
Copyleft (c) 2009, Roger A Hall <rogerhall@cpan.org>. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.
DISCLAIMER OF WARRANTY
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.