NAME

BioX::Seq::Stream - Parse FASTA and FASTQ files sequentially

SYNOPSIS

    use BioX::Seq::Stream;

    my $parser = BioX::Seq::Stream->new; #defaults to STDIN
    my $parser = BioX::Seq::Stream->new( $filename );
    my $parser = BioX::Seq::Stream->new( $filehandle );

    while (my $seq = $parser->next_seq) {

        # $seq is a BioX::Seq object

    }

DESCRIPTION

BioX::Seq::Stream is a sequential parser for FASTA and FASTQ files. It should handle any valid input, except for FASTA files that use semicolons to indicate comments (this could easily be implemented, but I have never seen such a FASTA file in the wild, and the NCBI FASTA specification does not allow this usage). In particular, it properly handles FASTQ files with multi-line (wrapped) sequence and quality strings. I have never seen such a FASTQ file either, but the layout is apparently technically valid, and a few programs still produce it.
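To illustrate why wrapped FASTQ needs special care, here is a minimal sketch of one common way to read such records. It is not this module's implementation; the function name read_fastq_record and the returned hash layout are assumptions made for the example. The key point is that quality lines are consumed until their combined length matches the sequence length, because '@' and '+' are themselves valid quality characters and cannot be used as record delimiters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioX::Seq::Stream): read one FASTQ
# record, allowing wrapped sequence and quality strings.
sub read_fastq_record {
    my ($fh) = @_;

    my $header = <$fh>;
    return if !defined $header;    # end of input
    chomp $header;
    $header =~ /^\@(\S+)/
        or die "Bad FASTQ header: $header\n";
    my $id = $1;

    my ($seq, $qual) = ('', '');

    # Sequence lines continue until the '+' separator line
    while (defined(my $line = <$fh>)) {
        chomp $line;
        last if $line =~ /^\+/;
        $seq .= $line;
    }

    # Quality lines continue until as many characters have been read
    # as there are in the sequence
    while (length($qual) < length($seq)) {
        my $line = <$fh>;
        die "Truncated FASTQ record: $id\n" if !defined $line;
        chomp $line;
        $qual .= $line;
    }
    die "Sequence/quality length mismatch: $id\n"
        if length($qual) != length($seq);

    return { id => $id, seq => $seq, qual => $qual };
}
```

Calling the function repeatedly on an open filehandle yields one hash reference per record and an empty return at end of input.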

CONSTRUCTOR

new

    my $parser = BioX::Seq::Stream->new();
    my $parser = BioX::Seq::Stream->new( $filename );
    my $parser = BioX::Seq::Stream->new( $filehandle );
    my $parser = BioX::Seq::Stream->new( $filename, %args );

Create a new BioX::Seq::Stream parser. If no arguments are given (or if the first argument given has an undefined value), the parser will read from STDIN. Otherwise, the parser will determine whether a filename or a filehandle is provided and act accordingly. Returns a BioX::Seq::Stream parser object.
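The argument handling described above can be sketched roughly as follows. This is an illustration only, not the module's actual code; the helper name resolve_input is hypothetical, and the use of Scalar::Util::openhandle to tell a filehandle from a filename is an assumption.

```perl
use strict;
use warnings;
use Scalar::Util qw(openhandle);

# Hypothetical helper (not part of BioX::Seq::Stream): turn a
# constructor-style argument into a readable filehandle.
sub resolve_input {
    my ($src) = @_;

    # No argument, or an undefined one: fall back to STDIN
    return \*STDIN if !defined $src;

    # Already an open filehandle: use it as-is
    return $src if defined openhandle($src);

    # Otherwise treat the argument as a filename
    open my $fh, '<', $src
        or die "Failed to open $src for reading: $!\n";
    return $fh;
}
```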

The first argument is always a filename or filehandle. Subsequent key/value arguments can include:

fast

    my $parser = BioX::Seq::Stream->new( $filename, fast => 1 );

In version 0.007004, a check was added during FASTA parsing to validate each sequence string. Previously, no validation had been performed, for the sake of speed. The new check, while safer, results in somewhat slower parsing. It can be explicitly turned off by setting this parameter to a true value. It can also be toggled explicitly using the fast() method described below.

METHODS

next_seq

    while (my $seq = $parser->next_seq()) {
        # do something
    }

Reads the next sequence from the filehandle. Returns a BioX::Seq object, or undef if the end of the file is reached.

The first time this is called, the parser will try to determine the file format automatically and will throw an exception if detection fails. In practice this should seldom or never happen, as the supported file formats can be reliably distinguished based on the first few bytes of the file.
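As a rough illustration of this kind of detection (not the module's own code), FASTA records begin with '>' and FASTQ records begin with '@', so the first character is enough to tell them apart. The helper name guess_format and its return values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioX::Seq::Stream): guess the format
# from the first character of the first line of input.
sub guess_format {
    my ($first_line) = @_;
    my $c = substr($first_line, 0, 1);

    return 'fasta' if $c eq '>';
    return 'fastq' if $c eq '@';

    # Mirror the documented behavior: fail loudly if detection fails
    die "Unable to detect file format\n";
}
```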

fast

    $parser->fast(1);
    $parser->fast(); # same as $parser->fast(1);
    $parser->fast(0);

Sets/unsets 'fast' mode. If a true value is given (or no value at all), certain validation steps during parsing are disabled for the sake of speed, as described above under CONSTRUCTOR.

DECOMPRESSION

If a filename is passed to the constructor, the module will read the first four bytes and match against known file compression magic bytes. If a compressed file is suspected, and a compatible decompression program can be found in the system path, a piped filehandle is opened for reading. Currently the following formats are supported (if appropriate binaries are found):

  * GZIP

  * BZIP2

  * DSRC v2 (released versions are buggy; currently not tested)

  * FQZCOMP
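The magic-byte check can be sketched as below. This is a simplified illustration rather than the module's code: it covers only the well-known GZIP (0x1f 0x8b) and BZIP2 ("BZh") signatures and leaves out DSRC and FQZCOMP. The helper name guess_compression is hypothetical.

```perl
use strict;
use warnings;

# Leading-byte signatures for two of the supported formats
# (DSRC and FQZCOMP are deliberately omitted from this sketch)
my %MAGIC = (
    gzip  => "\x1f\x8b",
    bzip2 => "BZh",
);

# Hypothetical helper (not part of BioX::Seq::Stream): return the name
# of the suspected compression format, or nothing if no signature matches.
sub guess_compression {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename
        or die "Failed to open $filename: $!\n";
    my $head = '';
    defined( read($fh, $head, 4) )
        or die "Failed to read $filename: $!\n";
    close $fh;

    for my $fmt (sort keys %MAGIC) {
        my $sig = $MAGIC{$fmt};
        return $fmt if substr($head, 0, length $sig) eq $sig;
    }
    return;
}
```

Once a format is suspected, a piped filehandle could be opened on the matching decompression binary in place of the file itself.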

Benchmarking indicated a fairly significant speed difference between decompressing with external binaries and decompressing with Perl modules, so the current implementation uses the former to decompress on the fly. This may require additional work to compile the proper binaries for a given platform. The module tries to locate each binary by its typical name. If a binary is installed under a non-standard name, the following package variables can be set:

$BioX::Seq::Stream::GZIP_BIN

By default, looks for a binary in PATH named 'pigz' or 'gzip'

$BioX::Seq::Stream::BZIP_BIN

By default, looks for a binary in PATH named 'pbzip2' or 'bzip2'

$BioX::Seq::Stream::DSRC_BIN

By default, looks for a binary in PATH named 'dsrc2' or 'dsrc'

$BioX::Seq::Stream::FQZC_BIN

By default, looks for a binary in PATH named 'fqz_comp'
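The lookup described above (try each typical name in PATH, unless an override is set) might look something like the following sketch. It is not the module's implementation; the helper name find_binary and the $override variable are hypothetical stand-ins, with $override playing the role of a package variable such as $BioX::Seq::Stream::GZIP_BIN.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of BioX::Seq::Stream): return the full
# path of the first candidate found as an executable file in PATH.
sub find_binary {
    my (@candidates) = @_;
    for my $name (@candidates) {
        for my $dir (File::Spec->path) {
            my $full = File::Spec->catfile($dir, $name);
            return $full if -f $full && -x _;
        }
    }
    return;    # nothing found
}

# A user-supplied override takes precedence over the PATH search
my $override;    # hypothetical; left undefined here
my $gzip_bin = $override // find_binary('pigz', 'gzip');
```

Listing the parallel implementation first means it is preferred whenever both binaries are installed.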

CAVEATS AND BUGS

Minimal input validation is performed. FASTQ ID lines are checked for proper format, and sequence and quality lengths are compared, but the contents of FASTQ sequence and quality strings are not sanity-checked. FASTA sequence strings are validated unless 'fast' mode is enabled (see CONSTRUCTOR).

Please report bugs to the author.

AUTHOR

Jeremy Volkening <jeremy *at* base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2014-2016 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.