NAME
PerlIO::via::SeqIO - PerlIO layer for biological sequence formats
SYNOPSIS
    use PerlIO::via::SeqIO;

    # open a FASTA file for reading:
    open( my $f, "<:via(SeqIO)", 'my.fas' );

    # open an EMBL file for writing
    open( my $e, ">:via(SeqIO::embl)", 'my.embl' );

    # convert
    print $e $_ while (<$f>);

    # add comments (this really works)
    while (<$f>) {
        # get the real sequence object
        my $seq = O($_);
        if ( $seq->desc =~ /Pongo/ ) {
            print $e "# this one is almost human...";
        }
        print $e $_;
    }

    # a one-liner, sort of
    $ alias scvt="perl -Ilib \"-MPerlIO::via::SeqIO qw(open)\" -e \"open(STDIN, '<:via(SeqIO)'); open(STDOUT, '>:via(SeqIO::'.shift().')'); while (<STDIN>) { print }\""
    $ cat my.fas | scvt gcg > my.gcg
DESCRIPTION
PerlIO::via::SeqIO attempts to provide an easy option for harnessing the magic sequence format I/O of the BioPerl (http://bioperl.org) toolkit. Opening a biological sequence file under via(SeqIO) yields a filehandle that can be used to read and write Bio::Seq objects sequentially with an absolute minimum of setup code.
via(SeqIO) also allows the user to mix plain text and sequence formats on a single filehandle transparently. Different sequence formats can be written to a single file by a simple filehandle tweak.
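For readers unfamiliar with how a ":via(...)" layer plugs into open(), here is a minimal sketch using only core Perl. The layer name PerlIO::via::UpperSketch is hypothetical and is not part of this module; it only upper-cases each line it reads. It is meant to show the general PerlIO::via callback mechanism (PUSHED and FILL), not how PerlIO::via::SeqIO is actually implemented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical toy layer, NOT part of PerlIO::via::SeqIO:
# it upper-cases every line read through it.
package PerlIO::via::UpperSketch;

# Called when the layer is pushed onto a handle by open().
sub PUSHED {
    my ( $class, $mode, $fh ) = @_;
    return bless {}, $class;
}

# Called when the read buffer is empty; $fh is the layer below.
# Returns the next chunk of data, or undef at end of file.
sub FILL {
    my ( $self, $fh ) = @_;
    my $line = <$fh>;
    return defined $line ? uc $line : undef;
}

package main;

# Write a small FASTA-like file to read back through the layer.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} ">seq1 test\nacgt\n";
close $tmp or die "close: $!";

open( my $in, '<:via(UpperSketch)', $path ) or die "open: $!";
my @lines = <$in>;
close $in;

print @lines;
```

A real format layer would do considerably more work in these callbacks (and would add WRITE and FLUSH for output), but the open() call has the same shape as the via(SeqIO) examples in this document.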
DETAILS
- Basics
-
Here's the basic idea, in code converting FASTA to EMBL format:
    open( $in,  '<:via(SeqIO)',       'my.fas' );
    open( $out, '>:via(SeqIO::embl)', 'my.embl' );
    while (<$in>) {
        print $out $_;
    }
- Specifying sequence formats (or not)
-
On reading, you can rely on Bio::SeqIO's format guesser by invoking an unqualified
    open( $in, '<:via(SeqIO)', 'mystery.txt' );
or you can specify the format, like so:
    open( $in, '<:via(SeqIO::embl)', 'mystery.txt' );
On writing, a qualified invocation is required:
    open( $out, '>:via(SeqIO)', 'my.fas' );          # throws
    open( $out, '>:via(SeqIO::fasta)', 'my.fas' );   # that's better
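The rule above (format optional on read, required on write) can be sketched as a small check on the mode string. The helper parse_seqio_mode() below is hypothetical and is not provided by the module; it only illustrates the rule using core Perl, and the module's own parsing may well differ.

```perl
use strict;
use warnings;

# parse_seqio_mode() is a hypothetical helper for illustration only.
# It returns (direction, format); format is undef when the layer
# is unqualified, i.e. plain via(SeqIO).
sub parse_seqio_mode {
    my ($mode) = @_;
    my ( $dir, $format ) = $mode =~ /^([<>]):via\(SeqIO(?:::(\w+))?\)$/
        or die "not a via(SeqIO) mode: '$mode'\n";
    die "a format is required for writing: '$mode'\n"
        if $dir eq '>' && !defined $format;
    return ( $dir, $format );
}

for my $mode ( '<:via(SeqIO)', '<:via(SeqIO::embl)', '>:via(SeqIO::fasta)', '>:via(SeqIO)' ) {
    my ( $dir, $format ) = eval { parse_seqio_mode($mode) };
    if ($@) {
        print "$mode => error: $@";
    }
    else {
        printf "%s => %s, format %s\n", $mode,
            ( $dir eq '<' ? 'read' : 'write' ),
            ( defined $format ? $format : '(guessed)' );
    }
}
```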
- Retrieving the sequence object itself
-
This does what you mean:
    open( $in,  '<:via(SeqIO)',       'my.fas' );
    open( $out, '>:via(SeqIO::embl)', 'my.embl' );
    while (<$in>) {
        print $out $_;
    }
However, $_ here is not the sequence object itself. To get that, use the all-purpose object getter O():
    while (<$in>) {
        print join( "\t", O($_)->id, O($_)->desc ), "\n";
    }
If you
then this DWYM:
    while (<$in>) {
        print O->id;
    }
- Writing a de novo sequence object
-
Use the T() mapper to convert a Bio::Seq object into a thing that can be formatted by via(SeqIO):
    use Bio::SearchIO;

    open( $seqfh, ">:via(SeqIO::embl)", "my.embl" );
    my $result = Bio::SearchIO->new( -file => 'my.blast' )->next_result;
    while ( my $hit = $result->next_hit() ) {
        while ( my $hsp = $hit->next_hsp() ) {
            my $aln = $hsp->get_aln;
            print $seqfh T($_) for ( $aln->each_seq );
        }
    }
- Writing plain text
-
Interspersing plain text among your sequences is easy; just print the desired text to the handle. See the "SYNOPSIS".
Even the following works:
    open( $in, "<:via(SeqIO)", 'my.fas' );
    open( $out, ">:via(SeqIO::embl)", 'annotated.txt' );
    $seq = <$in>;
    print $out "In EMBL format, the sequence would be rendered:", $seq;
- Pipe through a gzip layer
-
You can use the PerlIO layer PerlIO::via::gzip to decompress and compress via(SeqIO) input and output.
Compressed output:
    open( my $tfh, "<:via(SeqIO)", "test.fas" );
    open( my $zfh, '>:via(SeqIO::embl):via(gzip)', 'test.embl.gz' );
    while (<$tfh>) {
        print $zfh $_;
    }
    close($zfh);
GOTCHA: the close is required.
Decompressed input:
    open( $tfh, "<:via(gzip):via(SeqIO::fasta)", "test.fas.gz" );
    open( my $zfh, '>:via(SeqIO::embl)', 'test.embl' );
    while (<$tfh>) {
        print $zfh $_;
    }
When reading via gzip, the sequence format must be explicitly specified in the via(SeqIO) mode spec.
Conversion, gzip to gzip:
    open( my $tfh, "<:via(gzip):via(SeqIO::fasta)", "test.fas.gz" );
    open( my $zfh, ">:via(gzip):via(SeqIO::embl)", "test.embl.gz" );
    local $/;
    print $zfh <$tfh>;
    close($zfh);
- Redirecting STDIN/STDOUT/DATA through via(SeqIO)
-
Import the open() function provided by the module, like so:
    use PerlIO::via::SeqIO qw(open);
This will provide the following kind of two-argument open functionality:
    open( STDIN,  '<:via(SeqIO)' );
    open( STDOUT, '>:via(SeqIO::gcg)' );
    while (<STDIN>) {
        print;
    }
which will allow
    cat my.gcg | perl your.pl > out
your.pl can read STDIN and acquire the sequence objects by using the object getter O():
    open( STDIN, '<:via(SeqIO)' );
    while (<STDIN>) {
        $seqobj = O($_);
        ...
    }
The format of the input in this case will be guessed by the Bio::SeqIO machinery.
The imported open() should pass through other uses of open unharmed. This is tested in 001_passthru.t. Please ping the "AUTHOR" if there are issues.
- Switching write formats
-
You can also easily switch write formats. (Why? Because...who knows?) Use set_write_format right off the handle:
    open( $in, "<:via(SeqIO)", 'my.fas' );
    open( $out, ">:via(SeqIO::embl)", 'multi.txt' );
    $seq1 = <$in>;
    print $out "This is sequence 1 in embl format:\n";
    print $out $seq1;
    $out->set_write_format('gcg');
    print $out "while this is sequence 1 in GCG format:\n";
    print $out $seq1;
- Supported Formats
-
The supported formats are contained in @PerlIO::via::SeqIO::SUPPORTED_FORMATS. Currently they are
    fasta, embl, gcg, genbank, pir
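Since writing requires a supported, explicitly named format, it can be handy to validate a format name before building the mode string. The sketch below assumes the format list shown above and hard-codes a copy of it so that it runs without the module; write_mode_for() is a hypothetical helper, not something the module exports. Real code would presumably consult @PerlIO::via::SeqIO::SUPPORTED_FORMATS after loading the module.

```perl
use strict;
use warnings;

# Assumption: copied by hand from the documented list; real code would
# read @PerlIO::via::SeqIO::SUPPORTED_FORMATS instead.
my @SUPPORTED = qw(fasta embl gcg genbank pir);

# write_mode_for() is a hypothetical helper, not provided by the module.
# It returns an open() mode string for writing, or dies on an unknown format.
sub write_mode_for {
    my ($format) = @_;
    die "unsupported sequence format '$format'\n"
        unless grep { $_ eq $format } @SUPPORTED;
    return ">:via(SeqIO::$format)";
}

print write_mode_for('embl'), "\n";

my $ok = eval { write_mode_for('nexus'); 1 };
print "rejected: $@" unless $ok;
```

The returned string can then be passed as the second argument of a three-argument open().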
UTILITIES
The O() and T() methods are exported by default.
The open hook needs to be available for the 2-argument open redirections (see "DETAILS") to work. Do
    use PerlIO::via::SeqIO qw(open);
O()
    Title   : O
    Usage   : $o = O($sym) # not an object method
    Function: get the object "represented" by the argument
    Returns : the right object
    Args    : PerlIO::via::SeqIO GLOB, or
              *PerlIO::via::SeqIO::TFH (tied fh) or
              scalar string (sprintf-rendered Bio::SeqI object)
    Example : $seqobj = O($s = <$seqfh>);
T()
    Title   : T
    Usage   : T($seqobj) # not an object method
    Function: Transform a real Bio::Seq object to a
              via(SeqIO)-writeable thing
    Returns : A thing writeable as a formatted sequence
              by a via(SeqIO) filehandle
    Args    : a[n array of] Bio::Seq or related object[s]
    Example : print $seqfh T($seqobj);
set_write_format()
    Title   : set_write_format
    Usage   : $fh->set_write_format($format)
    Function: Set a write handle to write a specified
              sequence format
    Returns : true on success
    Args    : scalar string; a supported format (see
              @PerlIO::via::SeqIO::SUPPORTED_FORMATS)
    Note    : call off filehandle directly
SEE ALSO
PerlIO, PerlIO::via, Bio::SeqIO, Bio::Seq, http://bioperl.org