NAME

Compress::BGZF::Writer - Performs blocked GZIP (BGZF) compression

SYNOPSIS

    use Compress::BGZF::Writer;

    # Use as filehandle
    my $fh_bgz = Compress::BGZF::Writer->new_filehandle( $bgz_filename );
    print ref($fh_bgz), "\n"; # prints 'GLOB'
    while ( my $chunk = generate_data() ) {
        print {$fh_bgz} $chunk;
    }
    close $fh_bgz;

    # Use as object
    my $writer = Compress::BGZF::Writer->new( $bgz_filename );
    print ref($writer), "\n"; # prints 'Compress::BGZF::Writer'
    while ( my ($id,$content) = generate_record() ) {
        my $virt_offset = $writer->add_data( $content );
        my $content_len = length $content;
        print {$idx_file} "$id\t$virt_offset\t$content_len\n";
    }
    $writer->finalize(); # flush remaining buffer

DESCRIPTION

Compress::BGZF::Writer is a module for writing blocked GZIP (BGZF) files from any input. There are two main modes of construction: as an object (using new()) and as a filehandle glob (using new_filehandle()). The filehandle mode is straightforward for general use. The object mode is useful for tracking the virtual offsets of data chunks as they are added (for instance, to generate a custom index).

METHODS

Filehandle Functions

new_filehandle
    my $fh_out = Compress::BGZF::Writer->new_filehandle();
    my $fh_out = Compress::BGZF::Writer->new_filehandle( $output_fn );

Create a new Compress::BGZF::Writer engine and tie it to an IO::File handle, which is returned. Takes an optional single argument for the filename to be written to (defaults to STDOUT).

print
close
    print {$fh_out} $some_data;
    close $fh_out;

These functions emulate the standard Perl functions of the same name.

Object-oriented Methods

new
    my $writer = Compress::BGZF::Writer->new();
    my $writer = Compress::BGZF::Writer->new( $output_fn );

Create a new Compress::BGZF::Writer engine. Takes an optional single argument for the filename to be written to (defaults to STDOUT).

set_level
    $writer->set_level( $compression_level );

Set the DEFLATE compression level to use (0-9). Available constants include Z_NO_COMPRESSION, Z_BEST_SPEED, Z_DEFAULT_COMPRESSION, Z_BEST_COMPRESSION (defaults to Z_DEFAULT_COMPRESSION). The author's observations suggest that the default is reasonable unless speed is of the essence, in which case setting a level of 1-2 can sometimes halve the compression time.

set_write_eof
    $writer->set_write_eof;    # turn on
    $writer->set_write_eof(1); # turn on
    $writer->set_write_eof(0); # turn off

The htslib bgzf.c library, which might be considered the reference BGZF implementation, uses a special empty block to indicate EOF as an extra check of file integrity. This method turns on or off a flag telling the Compress::BGZF::Writer object whether to append this special block to the output file for the sake of compatibility. Default: off.
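
As an illustration of what this flag controls, the sketch below checks whether a file ends with the EOF block. It is not part of this module: the helper name has_bgzf_eof is hypothetical, and the 28 marker bytes are taken from the BGZF section of the SAM/BAM specification rather than from this module's source.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# The 28-byte empty BGZF block used as an EOF marker
# (byte values as listed in the SAM/BAM specification).
my $BGZF_EOF = pack 'H*', join '', qw(
    1f8b0804 00000000 00ff 0600
    4243 0200 1b00
    0300 00000000 00000000
);

# Hypothetical helper: returns 1 if the named file ends with the marker,
# 0 otherwise.
sub has_bgzf_eof {
    my ($fn) = @_;
    my $len = length $BGZF_EOF;
    open my $fh, '<:raw', $fn or die "Cannot open $fn: $!";
    return 0 if -s $fh < $len;           # too short to hold the marker
    seek $fh, -$len, SEEK_END or die "Cannot seek in $fn: $!";
    my $tail = '';
    read $fh, $tail, $len;               # read the final 28 bytes
    close $fh;
    return $tail eq $BGZF_EOF ? 1 : 0;
}
```

Calling has_bgzf_eof( $bgz_filename ) on a finished output file should return 1 only if the marker was appended.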

add_data
    $writer->add_data( $content );

Adds a block of content to the write buffer. Actual compression/writes take place as the buffer reaches the target size (64k minus header/footer space). Returns the virtual offset to the start of the data added.
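
The encoding of the returned virtual offset is not spelled out here. Assuming it follows the usual BGZF convention from the SAM/BAM specification (the upper 48 bits hold the file offset of the compressed block, the lower 16 bits hold the position within the uncompressed block), a value can be taken apart as in this sketch. Both helper names are hypothetical, and a Perl built with 64-bit integers is assumed.

```perl
use strict;
use warnings;

# Split a BGZF virtual offset into (compressed block start, offset within
# the uncompressed block), assuming the standard 48-bit/16-bit layout.
sub split_virtual_offset {
    my ($vo) = @_;
    return ( $vo >> 16, $vo & 0xFFFF );
}

# Inverse operation: combine the two parts into one virtual offset.
sub join_virtual_offset {
    my ($block_start, $within) = @_;
    return ( $block_start << 16 ) | $within;
}

# Round trip with made-up values
my $vo = join_virtual_offset( 1234, 56 );
my ($block_start, $within) = split_virtual_offset($vo);
print "$block_start $within\n";   # prints "1234 56"
```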

finalize
    $writer->finalize();

Write any remaining buffer contents. While this method should be automatically called during cleanup of the Compress::BGZF::Writer object, it is probably safer to call it explicitly to avoid unexpected behavior. Keep in mind that if both you and the object destruction process fail to call this, you will almost certainly generate an incomplete file (and probably won't notice since it will still be valid BGZF).

write_index
    $writer->write_index( $index_fn );

Write offset index to the specified file. Index format (as defined by htslib) consists of little-endian int64-coded values. The first value is the number of offsets in the index. The rest of the values consist of pairs of block offsets relative to the compressed and uncompressed data. The first offset (always 0,0) is not included.
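
The layout described above can be decoded with core Perl alone. The following is a minimal sketch rather than the module's own reader: parse_bgzf_index is a hypothetical helper, the demonstration values are made up, and a Perl built with 64-bit integer support is assumed (needed for the "q" pack format).

```perl
use strict;
use warnings;

# Hypothetical helper: decode the raw bytes of an index file into a list
# of [compressed_offset, uncompressed_offset] pairs.
sub parse_bgzf_index {
    my ($bytes) = @_;
    die "Index length is not a multiple of 8 bytes\n" if length($bytes) % 8;
    # First value is the number of pairs; the rest are the offsets.
    my ($n, @vals) = unpack 'q<*', $bytes;
    die "Truncated or malformed index\n"
        unless defined $n && @vals == 2 * $n;
    my @pairs;
    while ( my ($c_off, $u_off) = splice @vals, 0, 2 ) {
        push @pairs, [ $c_off, $u_off ];
    }
    return @pairs;
}

# Build a two-entry index in memory (made-up offsets) and decode it.
my $demo = pack 'q<*', 2, 18_000, 65_280, 35_500, 130_560;
for my $pair ( parse_bgzf_index($demo) ) {
    printf "compressed=%d uncompressed=%d\n", @$pair;
}
```

To apply this to a real index, read the file in binary mode (for example with open my $fh, '<:raw', $index_fn) and pass its full contents to the helper.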

Note that calling write_index() will also call finalize() and so should always be called after all data has been queued for write (it is hard to imagine a case where this would not be the desirable behavior).

For small(er) files (up to a few hundred MB), on-the-fly index generation with Compress::BGZF::Reader is relatively fast and an on-disk index is probably not necessary. For larger files, storing a paired index file can significantly decrease initialization times for Compress::BGZF::Reader objects.

These index files should be fully compatible with the htslib bgzip tool.

CAVEATS AND BUGS

This code is in the alpha testing stage. The filehandle behavior should not change in the future, but the object-oriented API is not guaranteed to be stable.

Please report bugs to the author.

AUTHOR

Jeremy Volkening <jeremy *at* base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2015 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.