The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

BioUtil::Seq - Utilities for sequence

Some great modules like BioPerl provide many robust solutions. However, it is not easy to install for someone in some platforms. And for some simple task scripts, a lite module may be a good choice. So I reinvented some wheels and added some useful utilities into this module, hoping it would be helpful.

VERSION

Version 2015.0309

EXPORT

    FastaReader
    read_sequence_from_fasta_file 
    write_sequence_to_fasta_file 
    format_seq

    validate_sequence 
    complement
    revcom 
    base_content 
    degenerate_seq_to_regexp
    match_regexp
    dna2peptide 
    codon2aa 
    generate_random_seqence

    shuffle_sequences 
    rename_fasta_header 
    clean_fasta_header 

SYNOPSIS

  use BioUtil::Seq;

SUBROUTINES/METHODS

FastaReader

FastaReader is a fasta file parser using closure. FastaReader returns an anonymous subroutine, when called, it return a fasta record which is reference of an array containing fasta header and sequence.

FastaReader could also read from STDIN when the file name is "STDIN" or "stdin".

A boolean argument is optional. If set as "true", spaces including blank, tab, "return" ("\r") and "new line" ("\n") symbols in sequence will not be trimed.

FastaReader speeds up by utilizing the special Perl variable $/ (set to "\n>"), with kind help of Mario Roy, author of MCE (https://code.google.com/p/many-core-engine-perl/). A lot of optimizations were also done by him.

Example:

   # do not trim the spaces and \n
   # $not_trim = 1;
   # my $next_seq = FastaReader("test.fa", $not_trim);
   
   # read from STDIN
   # my $next_seq = FastaReader('STDIN');
   
   # read from file
   my $next_seq = FastaReader("test.fa");

   while ( my $fa = &$next_seq() ) {
       my ( $header, $seq ) = @$fa;

       print ">$header\n$seq\n";
   }

read_sequence_from_fasta_file

Read all sequences from fasta file.

Example:

    my $seqs = read_sequence_from_fasta_file($file);
    for my $header (keys %$seqs) {
        my $seq = $$seqs{$header};
        print ">$header\n$seq\n";
    }

write_sequence_to_fasta_file

Example:

    my $seq = {"seq1" => "acgagaggag"};
    write_sequence_to_fasta_file($seq, "seq.fa");

format_seq

Format sequence to readable text

Example:

    printf ">%s\n%s", $head, format_seq($seq, 60);

validate_sequence

Validate a sequence.

Legale symbols:

    DNA: ACGTRYSWKMBDHV
    RNA: ACGURYSWKMBDHV
    Protein: ACDEFGHIKLMNPQRSTVWY
    gap and space: - *.

Example:

    if (validate_sequence($seq)) {
        # do some thing
    }

complement

Complement sequence

IUPAC nucleotide code: ACGTURYSWKMBDHVN

http://droog.gs.washington.edu/parc/images/iupac.html

    code    base    Complement
    A   A   T
    C   C   G
    G   G   C
    T/U T   A

    R   A/G Y
    Y   C/T R
    S   C/G S
    W   A/T W
    K   G/T M
    M   A/C K

    B   C/G/T   V
    D   A/G/T   H
    H   A/C/T   D
    V   A/C/G   B

    X/N A/C/G/T X
    .   not A/C/G/T
     or-    gap

my $comp = complement($seq);

revcom

Reverse complement sequence

my $recom = revcom($seq);

base_content

Example:

    my $gc_cotent = base_content('gc', $seq);

degenerate_seq_to_regexp

Translate degenerate sequence to regular expression

match_regexp

Find all sites matching the regular expression.

See https://github.com/shenwei356/bio_scripts/blob/master/sequence/fasta_locate_motif.pl

dna2peptide

Translate DNA sequence into a peptide

codon2aa

Translate a DNA 3-character codon to an amino acid

generate_random_seqence

Example:

    my @alphabet = qw/a c g t/;
    my $seq = generate_random_seqence( \@alphabet, 50 );

shuffle sequences

Example:

    shuffle_sequences($file, "$file.shuf.fa");

rename_fasta_header

Rename fasta header with regexp.

Example:

    # delete some symbols
    my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
    print "$n records renamed\n";

clean_fasta_header

Rename given symbols to repalcement string. Because, some symbols in fasta header will cause unexpected result.

Example:

    my  $file = "test.fa";
    my $n = clean_fasta_header($file, "$file.rename.fa");
    # replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
    # my $n = clean_fasta_header($file, "$file.rename.fa", '',  '\/:*?"<>|');
    print "$n records renamed\n";