
NAME

Gzip::BinarySearch - binary search a sorted, gzipped flatfile database

SYNOPSIS

  use Gzip::BinarySearch qw(tsv_column);

  my $db = Gzip::BinarySearch->new(
      file => 'file.gz',
      key_func => tsv_column(2),
  );
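  my $key = 'example';             # key to look up (placeholder value)
  print $db->find($key);           # a single matching line
  print for $db->find_all($key);   # every matching line, in file order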

DESCRIPTION

This module can binary search gzipped databases, such as TSVs, without decompressing the entire file. You need only declare how the file is sorted.

Behind the scenes, we use Gzip::RandomAccess to perform the random-access decompression.

METHODS

new (%args)

Constructs a new Gzip::BinarySearch object from the arguments described below. You may also pass index_file, index_span and cleanup arguments. These are passed directly to Gzip::RandomAccess; see that module's perldoc for details.

file (required)

Path to the gzip file you want to search.

key_func (default: first field, whitespace-separated)

A function that takes a line (aliased to $_) and returns the key for that line. The key is what gets compared when searching.

For TSVs, you can use tsv_column to generate a key function (see below).
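As an illustration, a hand-written key function might look like the sketch below. It assumes a comma-separated file whose second field is the key. The sample line and the variable names are made up for this example, and the loop at the end only imitates a caller supplying the line in $_.

```perl
use strict;
use warnings;

# Hypothetical key function: the key is the second comma-separated field.
# The line arrives in $_, as described above.
my $key_func = sub {
    my $line = $_;          # copy, so the caller's line is not modified
    $line =~ s/\r?\n\z//;   # drop the trailing line ending, if any
    return (split /,/, $line)[1];
};

# Imitate a call with the line aliased to $_.
my $key;
for ("1001,smith,London\n") {
    $key = $key_func->();
}
print "$key\n";   # prints "smith"
```

Copying $_ before editing it keeps the function free of side effects on the caller's data.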

cmp_func (default: Perl's 'cmp' operator)

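A function that accepts two keys, $a and $b, and returns a value indicating which is 'greater' in the same way as a comparison block for Perl's sort builtin. This must match the order in which the file is actually sorted; otherwise the binary search will look in the wrong half of the file and may miss lines that are present.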
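The short sketch below shows why the comparison has to agree with the file's sort order: the same keys come out in a different sequence under string and numeric comparison. It is only an illustration. In particular, it assumes the two keys arrive as arguments in @_, which the text above does not spell out, and $numeric_cmp is a name invented for this example.

```perl
use strict;
use warnings;

# Assumption for this sketch: the two keys arrive as arguments in @_.
my $numeric_cmp = sub {
    my ($x, $y) = @_;
    return $x <=> $y;
};

my @keys = (9, 10, 100);

# String order, which is what the default 'cmp' implies.
my @by_string  = sort { $a cmp $b } @keys;
# Numeric order, which is what $numeric_cmp implies.
my @by_numeric = sort { $numeric_cmp->($a, $b) } @keys;

print "string:  @by_string\n";    # prints "string:  10 100 9"
print "numeric: @by_numeric\n";   # prints "numeric: 9 10 100"
```

A file sorted with a numeric sort would therefore need a numeric comparison function, not the default.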

est_line_length

An estimate of the maximum line length in the gzip file, which helps Gzip::BinarySearch decide how much data to decompress at a time. The default is 512 bytes. A wrong estimate only affects speed; searches will still work.

surrounding_lines_blocksize

How many bytes to search on either side of a matching line to find adjacent matching lines when using find_all. If you have a lot of rows with the same key, increasing this value will speed things up. The default is 4096 bytes.

find ($key)

Returns the line matching the supplied key. If nothing is found, returns nothing (undef in scalar context, an empty list in list context).

find_all ($key)

Return all lines matching the key supplied, or an empty list if none found. The lines will be returned in the order they appear in the file.
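To make the behaviour concrete, here is a minimal in-memory model of a key-based lookup that returns every match in file order. It is not the module's implementation: it searches a plain array instead of a gzip file, the sample data is made up, the helper names ($key_of, $find_all_in_memory) are hypothetical, and the comparison function is assumed to receive its keys in @_.

```perl
use strict;
use warnings;

# Sorted sample data standing in for the decompressed file (made up).
my @lines = (
    "apple\t1\n",
    "banana\t2\n",
    "banana\t3\n",
    "cherry\t4\n",
);

my $key_func = sub { (split /\t/, $_)[0] };   # key is the first column
my $cmp_func = sub { $_[0] cmp $_[1] };       # data is in string order

# Apply the key function with the line in $_.
my $key_of = sub {
    my ($line) = @_;
    local $_ = $line;
    return $key_func->();
};

# Hypothetical helper: return every line whose key equals $key.
my $find_all_in_memory = sub {
    my ($key) = @_;

    # Binary search for the first line whose key is not less than $key.
    my ($lo, $hi) = (0, scalar @lines);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($cmp_func->($key_of->($lines[$mid]), $key) < 0) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }

    # Collect the run of equal keys, which preserves file order.
    my @found;
    while ($lo < @lines && $cmp_func->($key_of->($lines[$lo]), $key) == 0) {
        push @found, $lines[$lo++];
    }
    return @found;
};

my @bananas = $find_all_in_memory->('banana');
my @missing = $find_all_in_memory->('durian');
print scalar(@bananas), " match(es) for banana\n";   # prints "2 match(es) for banana"
print scalar(@missing), " match(es) for durian\n";   # prints "0 match(es) for durian"
```

Searching for the first line that is not less than the key, and then walking forward, is one simple way to get all duplicates in their original order.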

gzip

Returns the Gzip::RandomAccess object we're using.

EXPORTED FUNCTIONS

tsv_column ($column_number)

Returns a key function that will parse each line as a TSV and return the specified column number as a key.

fs_column ($field_separator, $column_number)

Returns a key function that will split a line by the field separator provided and return the specified column. $field_separator may be a regex or a string. Columns are numbered from 1.

For example, to split like awk(1) and use the first column:

  key_func => fs_column(qr/\s+/, 1)
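For readers curious how such a generator can be built, the following is a rough sketch of a closure-based equivalent. my_fs_column is a hypothetical stand-in written for this example, not the module's code. Treating a plain string as a literal separator and stripping the line ending are assumptions of the sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fs_column, written only for illustration.
sub my_fs_column {
    my ($sep, $column) = @_;
    # Treat a plain string as a literal separator, a qr// object as a regex.
    my $re = ref($sep) eq 'Regexp' ? $sep : qr/\Q$sep\E/;
    return sub {
        my $line = $_;
        $line =~ s/\r?\n\z//;                # drop the line ending
        my @fields = split $re, $line, -1;   # -1 keeps trailing empty fields
        return $fields[$column - 1];         # columns are numbered from 1
    };
}

my $first = my_fs_column(qr/\s+/, 1);
my $third = my_fs_column("\t", 3);

# Imitate calls with each line aliased to $_.
my @keys;
for ("apple 3 red\n") { push @keys, $first->() }
for ("a\tb\tc\n")     { push @keys, $third->() }
print "@keys\n";   # prints "apple c"
```

Returning a closure lets the separator and column be fixed once, so the resulting function needs nothing but the line in $_.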

est_line_length

surrounding_lines_blocksize

Accessor methods for the constructor arguments of the same name.

CAVEATS

Currently only works with Unix line endings (LF, ASCII 0x0A).

Does not support multibyte encodings (specifically UTF-8), but I aim to add support in a later release.

Isn't as efficient as it could be - aligning decompression to the indexed points in the gzip would help, as would caching decompressed blocks.

AUTHOR

Richard Harris <richardjharris@gmail.com>