NAME

Compress::BGZF - Read/write blocked GZIP (BGZF) files

SYNOPSIS

    use Compress::BGZF::Writer;
    use Compress::BGZF::Reader;

    # create a BGZF file

    my @records = generate_data();
    
    my $fh_out = Compress::BGZF::Writer->new_filehandle( 'somefile.gz' );
    print {$fh_out} $_ for (@records);
    close $fh_out;

    # perform non-sequential reads

    my $fh_in = Compress::BGZF::Reader->new_filehandle( 'somefile.gz' );

    # read 32 bytes from uncompressed file offset 3020 
    seek $fh_in, 3020, 0;
    read $fh_in, my $buffer, 32;
    print "data: $buffer\n";

DESCRIPTION

Compress::BGZF contains a pair of modules for working with blocked GZIP (BGZF) files. BGZF is a specialized GZIP format that is compatible with existing GZIP tools and libraries, but which allows for fast random access at the cost of a modest increase in file size. It does this by concatenating multiple complete GZIP blocks, each of which has a full header and footer and can therefore be decompressed individually without reading through earlier parts of the file, and by including an extra field in each header that contains the size of the block.

When a Reader object is created, an index containing the compressed and uncompressed offsets of the start of each block is either read from disk or generated from the data itself. seek, read, and tell (or their object-oriented counterparts) can then be performed on the compressed file as if it were uncompressed. Seeks are fast: in the worst case, at most 64 KiB of preceding data must be decompressed to reach the data of interest.
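
As an illustration of how such an index can turn an uncompressed position into a block to decompress, here is a minimal sketch. It is not taken from the Reader module: the find_block function, the index layout (one [compressed offset, uncompressed offset] pair per block), and the numbers are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical index: one [compressed_offset, uncompressed_offset] pair per
# block, sorted by uncompressed offset. The values are invented for the demo.
my @index = (
    [     0,      0 ],
    [ 18000,  65280 ],
    [ 35500, 130560 ],
    [ 52000, 195840 ],
);

# Binary search for the last block whose uncompressed start is <= $target.
# Returns the block's compressed offset and the number of bytes to skip
# inside that block's uncompressed payload.
sub find_block {
    my ($index, $target) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo < $hi) {
        my $mid = int( ($lo + $hi + 1) / 2 );
        if ($index->[$mid][1] <= $target) { $lo = $mid }
        else                              { $hi = $mid - 1 }
    }
    my ($c_off, $u_off) = @{ $index->[$lo] };
    return ($c_off, $target - $u_off);
}

my ($block_start, $within) = find_block(\@index, 70000);
print "seek to compressed offset $block_start, then skip $within bytes\n";
```

Only the one block found this way needs to be decompressed, which is why the cost of a seek is bounded by the block size rather than by the file size.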

Selected Implementation Notes

According to the BGZF specification, each GZIP block is limited to 64 KiB in size (including an 18-byte header and 8-byte footer). While in theory the uncompressed size could be larger, limits of the virtual offset calculation and ease of implementation mean that this size limit is also enforced on the uncompressed data.
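
The header layout can be sketched with Perl's unpack. The code below is an illustration only, not the module's own parser: parse_bgzf_header is a hypothetical helper, and it assumes the common case in which the "BC" subfield is the only extra subfield, so that the header is exactly 18 bytes. The sample input is the 28-byte empty block that BGZF writers conventionally append as an end-of-file marker.

```perl
use strict;
use warnings;

# Sample input: the 28-byte empty BGZF block (18-byte header, 2 bytes of
# deflate data, 8-byte footer).
my $block = pack 'H*', join '', qw(
    1f8b0804 00000000 00ff 0600
    4243 0200 1b00
    0300 00000000 00000000
);

# Hypothetical helper: validate the 18-byte header and return the total
# block size. Assumes the "BC" subfield comes first in the extra field.
sub parse_bgzf_header {
    my ($buf) = @_;
    die "header too short\n" if length($buf) < 18;
    my ($id1, $id2, $cm, $flg, $mtime, $xfl, $os,
        $xlen, $si1, $si2, $slen, $bsize)
        = unpack 'C4 V C2 v C2 v v', substr($buf, 0, 18);
    die "not a BGZF block\n"
        unless $id1 == 31 && $id2 == 139 && $cm == 8 && ($flg & 4)
            && $si1 == 66 && $si2 == 67 && $slen == 2;
    return $bsize + 1;    # BSIZE stores the total block size minus one
}

my $size = parse_bgzf_header($block);

# The footer is a CRC32 followed by ISIZE, the uncompressed length,
# both stored as little-endian 32-bit integers.
my ($crc32, $isize) = unpack 'V V', substr($block, $size - 8, 8);

print "block size: $size bytes, uncompressed size: $isize bytes\n";
```

Because BSIZE is a 16-bit field holding the block size minus one, a block can never exceed 65536 bytes, which is where the 64 KiB limit comes from.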

Virtual offsets are calculated as follows: for any given position in the uncompressed file, let A be the starting byte offset (relative to the compressed file) of the block in which the position occurs, and let B be the byte offset of the position within the uncompressed payload of that block. The virtual offset is then VO = (A << 16) | B. This single value contains sufficient information to quickly seek to the given location and begin extracting data.
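
The calculation and its inverse can be sketched in a few lines of Perl. The function names below are hypothetical and are not part of the Reader or Writer API; the sketch also assumes a perl built with 64-bit integers, since A << 16 overflows 32 bits for compressed offsets of 64 KiB or more.

```perl
use strict;
use warnings;

# Pack a compressed block offset (A) and a within-block offset (B)
# into a single virtual offset.
sub to_virtual_offset {
    my ($block_start, $within) = @_;
    die "within-block offset must fit in 16 bits\n"
        if $within < 0 || $within > 0xFFFF;
    return ($block_start << 16) | $within;
}

# Recover (A, B) from a virtual offset.
sub from_virtual_offset {
    my ($vo) = @_;
    return ($vo >> 16, $vo & 0xFFFF);
}

my $vo = to_virtual_offset(18000, 4720);
my ($block_start, $within) = from_virtual_offset($vo);
print "VO $vo -> block at $block_start, offset $within within block\n";
```

The 16-bit mask makes the constraint explicit: B can only address positions 0 through 65535 within a block's uncompressed payload.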

METHODS

See individual POD of Reader and Writer modules.

A demonstration script named "bgzip.pl" is included under bin/. It is designed to emulate the functionality of the "bgzip" program that comes with the htslib distribution.

AUTHOR

Jeremy Volkening <jdv *at* base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2015 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.