NAME

BioX::SeqUtils::RandomSequence - Creates a random nuc or prot sequence with given nuc frequencies

VERSION

This document describes BioX::SeqUtils::RandomSequence version 0.9.4

SYNOPSIS

The randomizer object accepts parameters for sequence length (l), codon table (s), sequence type (y), and frequencies for each of the nucleotide bases in DNA (a, c, g, t). The defaults are shown below:

    use BioX::SeqUtils::RandomSequence;

    my $randomizer = BioX::SeqUtils::RandomSequence->new({ l => 2, 
                                                           s => 1,
                                                           y => "dna",
                                                           a => 1,
                                                           c => 1,
                                                           g => 1,
                                                           t => 1 });
    print $randomizer->rand_seq(), "\n";

DESCRIPTION

Create random DNA, RNA and protein sequences.

NUCLEOTIDE FREQUENCIES

All four frequencies default to "1", so that the probability of each of A, C, G, and T is 0.25. The frequencies should always be positive integers, and their magnitude matters as well as their ratio. The algorithm builds a template whose length L is the sum of A_freq, C_freq, G_freq, and T_freq, containing exactly that many copies of each base. The template is randomly re-sorted for each L-length block of the required sequence, and the result is trimmed to the required length. For example, using the default frequencies, a sequence 100 bases long will have exactly 25 A, 25 C, 25 G, and 25 T.

If you want sequences from a wider distribution, use four-digit (or greater) values for the frequencies. For a sequence length of a few dozen bases, this example would be broad enough to create repeat islands: ($A_freq, $C_freq, $G_freq, $T_freq) = (2245, 2755, 2755, 2245).
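
The template approach described above can be sketched in plain Perl. This is only an illustration of the idea, not the module's actual code: the helper name template_random_dna is hypothetical, and List::Util's shuffle stands in for whatever re-sorting the module performs internally.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (not part of the module's API): build a sequence of
# $length bases by reshuffling a frequency template once per block.
sub template_random_dna {
    my ($length, %freq) = @_;

    # The template holds exactly $freq{X} copies of each base X.
    my @template;
    for my $base (qw(A C G T)) {
        push @template, ($base) x ($freq{$base} || 0);
    }
    die "At least one frequency must be a positive integer\n" unless @template;

    my $seq = '';
    while (length($seq) < $length) {
        $seq .= join '', shuffle(@template);   # reshuffle for every block
    }
    return substr($seq, 0, $length);           # trim to the requested length
}

# With all frequencies at 1 the template is 4 bases long, so a 100-base
# sequence is 25 complete blocks and contains exactly 25 of each base.
my $seq = template_random_dna(100, A => 1, C => 1, G => 1, T => 1);
my %count;
$count{$_}++ for split //, $seq;
print "$_: $count{$_}\n" for sort keys %count;
```

With frequencies such as (2245, 2755, 2755, 2245) the template is 10000 bases long, so a sequence of a few dozen bases is a small sample from a single shuffled block and its composition can vary widely.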

NUCLEOTIDE FREQUENCIES UNDERLIE PROTEINS

Protein sequences are translated from random DNA sequence of the necessary length using the assigned nucleotide frequencies. This module does not allow you to directly influence the amino acid frequencies. If you need this sort of functionality, please contact the author.
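
To make the relationship between nucleotide frequencies and protein output concrete, here is a minimal, self-contained sketch. It assumes the standard genetic code (NCBI table 1) written out by hand as a stand-in for Bio::Tools::CodonTable, and the helper names random_dna and translate_dna are hypothetical. How the module itself treats stop codons is not stated in this document; the sketch simply emits them as "*".

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my @BASES = qw(T C A G);

# Build the codon => amino acid lookup from the compact string above.
my %CODON;
my $i = 0;
for my $b1 (@BASES) {
    for my $b2 (@BASES) {
        for my $b3 (@BASES) {
            $CODON{"$b1$b2$b3"} = substr($AMINO, $i++, 1);
        }
    }
}

# Hypothetical helper: translate DNA in frame 0, ignoring a trailing
# partial codon. Unknown codons become 'X'.
sub translate_dna {
    my ($dna) = @_;
    my @codons = uc($dna) =~ /(...)/g;
    return join '', map { $CODON{$_} // 'X' } @codons;
}

# Hypothetical helper: random DNA from a reshuffled frequency template.
sub random_dna {
    my ($length, %freq) = @_;
    my @template;
    push @template, ($_) x ($freq{$_} || 0) for qw(A C G T);
    die "At least one frequency must be a positive integer\n" unless @template;
    my $dna = '';
    $dna .= join '', shuffle(@template) while length($dna) < $length;
    return substr($dna, 0, $length);
}

# A protein of 10 residues needs 30 bases of underlying DNA.
my $dna     = random_dna(10 * 3, A => 23, C => 27, G => 27, T => 23);
my $protein = translate_dna($dna);
print "$dna\n$protein\n";
```

Because the amino acids are only a by-product of the codons drawn, changing the GC content shifts the amino acid composition indirectly, which is why the frequencies cannot be set per amino acid.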

METHODS

  • rand_seq()

    After creating a randomizer object, each sequence type can be accessed using the "y" (tYpe) parameter with rand_seq(). The default type is "2" (for dinucleotide, a DNA sequence of length two). The other types are "d" (DNA), "r" (RNA), "p" (protein), and "s" (protein set).

    You can use the same randomizer object to create all types of sequences by passing only the parameters that change with each call.

        my $dinucleotide  = $randomizer->rand_seq();                       # Default settings
        my $nuc_short     = $randomizer->rand_seq({ y => 'd', l => 21 });  # Create DNA length 21
        my $nuc_long      = $randomizer->rand_seq({ l => 2200 });          # Still DNA, now length 2200
        my $nuc_richer    = $randomizer->rand_seq({ a => 225, 
                                                    c => 275, 
                                                    g => 275, 
                                                    t => 225 });           # Still length 2200, GC richer
        my $protein_now   = $randomizer->rand_seq({ y => 'p' });           # Still richer GC
        my $protein_def   = $randomizer->rand_seq({ a => 1 });             # Missing bases resets all freq to 1
        my $protein_new   = $randomizer->rand_seq({ y => 'p',
                                                    s => 3 });             # Use codon table 'Yeast Mitochondrial'

    The type parameter only works with rand_seq().

  • rand_dna()

    This method may be used directly to create DNA sequences.

        my $dinucleotide  = $randomizer->rand_dna();
        my $dna           = $randomizer->rand_dna({ l => 2200 });
           $dna           = $randomizer->rand_dna({ l => 200,
                                                    a => 225, 
                                                    c => 275, 
                                                    g => 275, 
                                                    t => 225 });           # Larger variance

  • rand_rna()

    This method may be used directly to create RNA sequences.

        my $rna           = $randomizer->rand_rna({ l => 21 });
           $rna           = $randomizer->rand_rna({ l => 1000, 
                                                    a => 225, 
                                                    c => 275, 
                                                    g => 275, 
                                                    t => 225 });

  • rand_pro()

    This method may be used directly to create protein sequences.

    A protein of the given length L is created by translating a random DNA sequence of length L * 3 with the given nucleotide frequencies.

        my $protein       = $randomizer->rand_pro();

  • rand_pro_set()

    This method may be used directly to create a protein sequence set.

    A protein set is correlatable at the DNA level by creating a random DNA sequence with the given nucleotide frequencies of length L * 3 + 1, removing the first base for sequence 1 and removing the last base for sequence 2, then translating them into proteins.

    This method uses wantarray(), and will either return a list or list reference (scalar) depending on the context:

        my ($pro1, $pro2) = $randomizer->rand_pro_set();
        my $protein_set   = $randomizer->rand_pro_set();
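
The frame shift and the context-sensitive return used by rand_pro_set() can be sketched as follows. The helper name shifted_frames is hypothetical and the sketch stops before translation; it only shows how two overlapping reading frames come out of one DNA string of length L * 3 + 1, and how wantarray() selects between a list and a list reference.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the two DNA frames of a protein set from a
# single DNA string of length 3L + 1.
sub shifted_frames {
    my ($dna) = @_;
    my @frames = (
        substr($dna, 1),                      # sequence 1: first base removed
        substr($dna, 0, length($dna) - 1),    # sequence 2: last base removed
    );
    # List context returns the frames themselves; scalar context returns
    # a reference to the list.
    return wantarray ? @frames : \@frames;
}

my ($frame1, $frame2) = shifted_frames('GATGAAATGA');   # 10 bases = 3 * 3 + 1
my $frame_set         = shifted_frames('GATGAAATGA');

print "$frame1 $frame2\n";          # ATGAAATGA GATGAAATG
print scalar(@$frame_set), "\n";    # 2
```

Both frames are 3L bases long and share all but one base, so the two translated proteins are related at the DNA level even though their codons differ.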

SCRIPTS

The package includes scripts for random DNA, RNA, dinucleotide, and protein sequences. The length and frequency parameters should always be integers.

To create a dinucleotide sequence:

    ./random-dna.pp                                      # Defaults: length 2, all frequencies 1
    ./random-dna.pp -a250 -c250 -g250 -t250              # Create broader distribution

To create a DNA sequence:

    ./random-dna.pp -l21                                 # Defaults: all frequencies 1 ( p = .25 )
    ./random-dna.pp -l2200 -a23 -c27 -g27 -t23           # Enrich GC content with length 2200

To create an RNA sequence:

    ./random-rna.pp -l100                                     
    ./random-rna.pp -l2200 -a23 -c27 -g27 -t23           

To create a protein sequence:

    ./random-protein.pp                                  # Defaults: length 2, all frequencies 1 ( p = .25 )
    ./random-protein.pp -l2200 -a23 -c27 -g27 -t23       # Enrich underlying GC content, aa length 2200

To create a protein set (with common DNA shifted by one base):

    ./random-protein-set.pp                              # Defaults: length 2, all frequencies 1 ( p = .25 )
    ./random-protein-set.pp -l2200 -a23 -c27 -g27 -t23   # Enrich underlying GC content 

Additionally, a "master script" accepts a tYpe parameter (-y) to create any of the sequence types above:

    ./random-sequence.pp                                 # Type 2 dinucleotide
    ./random-sequence.pp -yd -l100                       # Type d dna
    ./random-sequence.pp -yr -l100                       # Type r rna
    ./random-sequence.pp -yp -l100                       # Type p protein
    ./random-sequence.pp -ys -l100                       # Type s protein set

This module uses Bio::Tools::CodonTable for translations, and the parameter s can be used to change from the default (1) "Standard":

    ./random-protein.pp -l2200 -s2                       # Non-standard codon table

CONFIGURATION AND ENVIRONMENT

None.

DEPENDENCIES

    Class::Std;
    Class::Std::Utils;
    Bio::Tools::CodonTable;

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-biox-sequtils-randomsequence@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHOR

Roger A Hall <rogerhall@cpan.org>

LICENSE AND COPYRIGHT

Copyleft (c) 2009, Roger A Hall <rogerhall@cpan.org>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.