NAME
BioUtil::Seq - Utilities for sequences
Great modules such as BioPerl provide many robust solutions. However, BioPerl is not easy to install on some platforms, and for simple scripts a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping it will be helpful.
VERSION
Version 2015.0309
EXPORT
FastaReader
read_sequence_from_fasta_file
write_sequence_to_fasta_file
format_seq
validate_sequence
complement
revcom
base_content
degenerate_seq_to_regexp
match_regexp
dna2peptide
codon2aa
generate_random_seqence
shuffle_sequences
rename_fasta_header
clean_fasta_header
SYNOPSIS
use BioUtil::Seq;
SUBROUTINES/METHODS
FastaReader
FastaReader is a FASTA file parser built on a closure. It returns an anonymous subroutine; each call to that subroutine returns the next FASTA record as a reference to an array containing the header and the sequence.
FastaReader can also read from STDIN when the file name is "STDIN" or "stdin".
An optional boolean argument controls trimming. If it is true, whitespace in the sequence, including blanks, tabs, carriage returns ("\r"), and newlines ("\n"), will not be trimmed.
FastaReader gains speed by setting the special Perl variable $/ to "\n>", with the kind help of Mario Roy, author of MCE (https://code.google.com/p/many-core-engine-perl/), who also contributed many other optimizations.
Example:
# do not trim the spaces and \n
# $not_trim = 1;
# my $next_seq = FastaReader("test.fa", $not_trim);
# read from STDIN
# my $next_seq = FastaReader('STDIN');
# read from file
my $next_seq = FastaReader("test.fa");
while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
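The closure and $/ technique described above can be sketched in a few lines. This is only an illustration, not the module's actual code: fasta_reader_sketch is a hypothetical name, it takes an already opened filehandle instead of a file name, and the in-memory FASTA text is made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical sketch (not BioUtil::Seq's implementation): return a closure
# that yields one [header, sequence] pair per call and undef at end of input.
sub fasta_reader_sketch {
    my ( $fh, $not_trim ) = @_;
    return sub {
        local $/ = "\n>";    # read one whole FASTA record per readline
        while ( defined( my $record = <$fh> ) ) {
            chomp $record;          # drop the trailing "\n>" separator
            $record =~ s/^>//;      # only the first record keeps its '>'
            next if $record eq '';
            my ( $header, $seq ) = split /\n/, $record, 2;
            $seq = '' unless defined $seq;
            $header =~ s/\r$//;
            $seq =~ s/\s+//g unless $not_trim;
            return [ $header, $seq ];
        }
        return;                     # no more records
    };
}

# Sample data held in memory so the sketch needs no file on disk.
my $fasta = ">seq1 first record\nACGT\nacgt\n>seq2\nTTGG\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $next_seq = fasta_reader_sketch($fh);
while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
```

Setting $/ inside the closure with local keeps the change scoped to each call, so the caller's own line reading is unaffected.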
read_sequence_from_fasta_file
Read all sequences from a FASTA file. As the example shows, the result is a hash reference keyed by header.
Example:
my $seqs = read_sequence_from_fasta_file($file);
for my $header (keys %$seqs) {
    my $seq = $$seqs{$header};
    print ">$header\n$seq\n";
}
write_sequence_to_fasta_file
Example:
my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");
format_seq
Format a sequence as readable text. The second argument (60 in the example) is presumably the line width.
Example:
printf ">%s\n%s", $head, format_seq($seq, 60);
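As a rough illustration of what such formatting involves, the sketch below breaks a sequence into fixed-width lines. wrap_sequence is a hypothetical helper, and the assumption that every line, including the last, ends with a newline is mine; it is suggested by the printf format above but not confirmed by the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for format_seq: split $seq into lines of at most
# $width characters, each terminated by "\n" (an assumption).
sub wrap_sequence {
    my ( $seq, $width ) = @_;
    $width = 60 unless defined $width && $width > 0;
    my $out = '';
    for ( my $i = 0 ; $i < length($seq) ; $i += $width ) {
        $out .= substr( $seq, $i, $width ) . "\n";
    }
    return $out;
}

printf ">%s\n%s", 'seq1', wrap_sequence( 'ACGTACGTAC', 4 );
```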
validate_sequence
Validate a sequence.
Legal symbols:
DNA: ACGTRYSWKMBDHV
RNA: ACGURYSWKMBDHV
Protein: ACDEFGHIKLMNPQRSTVWY
gap and space: - *.
Example:
if (validate_sequence($seq)) {
    # do something
}
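One way to express the symbol lists above is a single character class. The sketch below is an assumption about how such a check could look, not the module's code: is_valid_sequence is a hypothetical name, it accepts the union of the DNA, RNA, and protein letters plus "-", space, "*", and ".", and it ignores case.

```perl
use strict;
use warnings;

# Hypothetical check: true (1) if $seq is non-empty and contains only the
# legal symbols listed above; case-insensitivity is an assumption.
sub is_valid_sequence {
    my ($seq) = @_;
    return 0 unless defined($seq) && length($seq);
    return $seq =~ /\A[ABCDEFGHIKLMNPQRSTUVWY\- *.]+\z/i ? 1 : 0;
}

for my $seq ( 'ACGT-N', 'acgu', 'ACGJ' ) {
    printf "%s: %s\n", $seq, is_valid_sequence($seq) ? 'valid' : 'invalid';
}
```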
complement
Complement sequence
IUPAC nucleotide code: ACGTURYSWKMBDHVN
http://droog.gs.washington.edu/parc/images/iupac.html
code    base        Complement
A       A           T
C       C           G
G       G           C
T/U     T           A
R       A/G         Y
Y       C/T         R
S       C/G         S
W       A/T         W
K       G/T         M
M       A/C         K
B       C/G/T       V
D       A/G/T       H
H       A/C/T       D
V       A/C/G       B
X/N     A/C/G/T     X
.       not A/C/G/T
or -    gap
my $comp = complement($seq);
revcom
Reverse complement of a sequence
my $revcom = revcom($seq);
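The IUPAC complement table maps naturally onto Perl's tr operator. The following is a minimal sketch under that assumption; complement_sketch and revcom_sketch are hypothetical names, U is complemented to A, and symbols outside the table (such as gaps) pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical sketch: complement IUPAC nucleotide codes with tr, keeping case.
sub complement_sketch {
    my ($seq) = @_;
    ( my $comp = $seq ) =~
      tr/ACGTURYSWKMBDHVNacgturyswkmbdhvn/TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn/;
    return $comp;
}

# Reverse complement: complement every symbol, then reverse the string.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('AACGT'), "\n";
print revcom_sketch('AACGT'),     "\n";
```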
base_content
Example:
my $gc_content = base_content('gc', $seq);
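For readers who want to see the idea behind such a call, here is a small sketch. It assumes the result is the fraction of positions occupied by any of the given bases, counted case-insensitively; base_fraction is a hypothetical name, and the module's real return format may differ.

```perl
use strict;
use warnings;

# Hypothetical sketch: fraction of $seq made up of any character in $bases.
sub base_fraction {
    my ( $bases, $seq ) = @_;
    return 0 if !defined($seq)   || $seq eq '';
    return 0 if !defined($bases) || $bases eq '';
    my $class = quotemeta($bases);
    my $count = () = $seq =~ /[$class]/gi;    # count matching positions
    return $count / length($seq);
}

printf "GC content: %.2f\n", base_fraction( 'gc', 'ACGTGGCA' );
```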
degenerate_seq_to_regexp
Translate a degenerate sequence into a regular expression
match_regexp
Find all sites matching the regular expression.
See https://github.com/shenwei356/bio_scripts/blob/master/sequence/fasta_locate_motif.pl
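The two steps, turning a degenerate sequence into a pattern and then locating every match, can be sketched as follows. Both function names are hypothetical, the IUPAC expansion follows the complement table in this document, and reporting sites as 1-based [start, end, matched text] triples that may overlap is my assumption rather than the module's documented behaviour.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes expanded to character classes.
my %IUPAC = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T', U => 'U',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical sketch: build a regexp source string from a degenerate sequence.
sub degenerate_to_regexp_sketch {
    my ($seq) = @_;
    my $pattern = '';
    for my $symbol ( split //, uc $seq ) {
        $pattern .= exists $IUPAC{$symbol} ? $IUPAC{$symbol} : quotemeta($symbol);
    }
    return $pattern;
}

# Hypothetical sketch: return [start, end, match] for every site, 1-based,
# stepping one base after each hit so overlapping sites are reported too.
sub find_sites_sketch {
    my ( $seq, $pattern ) = @_;
    my @sites;
    while ( $seq =~ /($pattern)/gi ) {
        push @sites, [ $-[1] + 1, $+[1], $1 ];
        pos($seq) = $-[1] + 1;
    }
    return @sites;
}

my $pattern = degenerate_to_regexp_sketch('AYGN');
print "$pattern\n";
print join( "\t", @$_ ), "\n" for find_sites_sketch( 'TTACGATG', $pattern );
```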
dna2peptide
Translate DNA sequence into a peptide
codon2aa
Translate a DNA 3-character codon to an amino acid
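To make the relationship between the two subroutines concrete, here is a sketch using the standard genetic code. The names codon_to_aa_sketch and translate_sketch are hypothetical, only reading frame 1 is handled, unknown codons are reported as "X" by my own choice, and the module's real signatures (for example frame or codon table options) are not assumed.

```perl
use strict;
use warnings;

# Standard genetic code, laid out with bases ordered T, C, A, G at each position.
my @BASES = qw(T C A G);
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %CODON_TABLE;
my $index = 0;
for my $b1 (@BASES) {
    for my $b2 (@BASES) {
        for my $b3 (@BASES) {
            $CODON_TABLE{"$b1$b2$b3"} = substr( $AMINO, $index++, 1 );
        }
    }
}

# Hypothetical sketch: one codon to one amino acid; 'X' marks unknown codons.
sub codon_to_aa_sketch {
    my ($codon) = @_;
    $codon = uc $codon;
    $codon =~ tr/U/T/;
    return exists $CODON_TABLE{$codon} ? $CODON_TABLE{$codon} : 'X';
}

# Hypothetical sketch: translate frame 1, ignoring a trailing partial codon.
sub translate_sketch {
    my ($dna) = @_;
    my $peptide = '';
    for ( my $pos = 0 ; $pos + 3 <= length($dna) ; $pos += 3 ) {
        $peptide .= codon_to_aa_sketch( substr( $dna, $pos, 3 ) );
    }
    return $peptide;
}

print translate_sketch('ATGGCCTAA'), "\n";
```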
generate_random_seqence
Example:
my @alphabet = qw/a c g t/;
my $seq = generate_random_seqence( \@alphabet, 50 );
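A function with this interface can be approximated by drawing each position uniformly from the alphabet. The sketch below assumes uniform sampling with Perl's built-in rand; random_sequence_sketch is a hypothetical name and not the module's code.

```perl
use strict;
use warnings;

# Hypothetical sketch: build a sequence of $length symbols, each drawn
# uniformly at random from the array referenced by $alphabet.
sub random_sequence_sketch {
    my ( $alphabet, $length ) = @_;
    my $seq = '';
    for ( 1 .. $length ) {
        $seq .= $alphabet->[ int( rand( scalar @$alphabet ) ) ];
    }
    return $seq;
}

my @alphabet = qw/a c g t/;
print random_sequence_sketch( \@alphabet, 50 ), "\n";
```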
shuffle_sequences
Example:
shuffle_sequences($file, "$file.shuf.fa");
rename_fasta_header
Rename fasta header with regexp.
Example:
# delete some symbols
my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
print "$n records renamed\n";
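The effect of the pattern in the example can be tried on in-memory text. This sketch is an illustration only: rename_headers_sketch is a hypothetical helper that works on a string instead of files, applies the substitution case-insensitively to header lines, and counts only headers that actually changed. Those three choices are assumptions, not documented behaviour of rename_fasta_header.

```perl
use strict;
use warnings;

# Hypothetical sketch: apply s/$pattern/$replacement/gi to every header line
# of $fasta_text and return the new text plus the number of changed headers.
sub rename_headers_sketch {
    my ( $pattern, $replacement, $fasta_text ) = @_;
    my $renamed = 0;
    my @out;
    for my $line ( split /\n/, $fasta_text ) {
        if ( $line =~ /^>(.*)/ ) {
            my $header = $1;
            ( my $new = $header ) =~ s/$pattern/$replacement/gi;
            $renamed++ if $new ne $header;
            push @out, ">$new";
        }
        else {
            push @out, $line;
        }
    }
    return ( join( "\n", @out ) . "\n", $renamed );
}

my ( $text, $n ) = rename_headers_sketch( '[^a-z\d\s\-\_\(\)\[\]\|]', '',
    ">seq1\@A#\nACGT\n>seq2\nTTGG\n" );
print $text;
print "$n records renamed\n";
```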
clean_fasta_header
Replace the given symbols in FASTA headers with a replacement string, because some symbols in a FASTA header can cause unexpected results.
Example:
my $file = "test.fa";
my $n = clean_fasta_header($file, "$file.rename.fa");
# replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
# my $n = clean_fasta_header($file, "$file.rename.fa", '', '\/:*?"<>|');
print "$n records renamed\n";