Gzip::BinarySearch - binary search a sorted, gzipped flatfile database
    use Gzip::BinarySearch qw(tsv_column);

    my $db = Gzip::BinarySearch->new(
        file     => 'file.gz',
        key_func => tsv_column(2),
    );

    print $db->find($key);
    print for $db->find_all($key);
This module can binary search gzipped databases, such as TSVs, without decompressing the entire file. You need only declare how the file is sorted.
Behind the scenes, we use Gzip::RandomAccess to perform the random-access decompression.
You may pass index_file, index_span and cleanup arguments, which are passed directly to Gzip::RandomAccess, so check that module's perldoc for more info.
file: Path to the gzip file you want to search.
key_func: A function that is called with a line aliased to $_ and returns the key for that line. The key is what gets compared when searching.
For TSVs, you can use tsv_column to generate a key function (see below).
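When the key is not a plain column, a hand-written key function is needed. The following is a minimal sketch of the contract described above, not code from the module: the name `$leading_token` and the sample line are illustrative, and the sketch assumes the key is the first whitespace-delimited token of each line. It simulates a call by aliasing `$_` to the line with a `for` loop.

```perl
use strict;
use warnings;

# Hypothetical key function: matches against the line in $_ and
# returns its first whitespace-delimited token as the key.
my $leading_token = sub {
    my ($token) = /^(\S+)/;
    return $token;
};

# Simulate a call by aliasing $_ to a sample line.
my $key;
for ("2024-01-02T03:04:05 GET /index\n") {
    $key = $leading_token->();
}
print "$key\n";
```

A function like this would be supplied as the `key_func` argument in place of `tsv_column(...)`.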
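A function that accepts two keys, $a and $b, and returns a negative, zero, or positive value to indicate their relative order, in the same way as a comparison block passed to Perl's sort builtin. This must match the order in which the file is actually sorted; otherwise lookups can miss keys that are present.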
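To see why the comparison has to agree with the file's ordering, the sketch below runs the same keys through a string comparator and a numeric one. It is independent of the module: it assumes the comparison function reads its operands from `$a` and `$b` just as a `sort` block does, and it uses `sort` itself to show the ordering each comparator implies. The variable names are illustrative.

```perl
use strict;
use warnings;

my @keys = (10, 9, 100, 1);

# String comparison: matches a file sorted lexically.
my $by_string = sub { $a cmp $b };
# Numeric comparison: matches a file sorted numerically.
my $by_number = sub { $a <=> $b };

my @string_order  = sort $by_string @keys;
my @numeric_order = sort $by_number @keys;

print "string:  @string_order\n";
print "numeric: @numeric_order\n";
```

The two orders differ (lexically, 100 sorts before 9), so a file sorted numerically would need the numeric comparator and vice versa.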
Providing an estimate of the maximum line length in the gzip file can help Gzip::BinarySearch know how much data to uncompress. The default is 512 bytes - getting it wrong will affect speed, but it'll still work.
How many bytes to search either side of a matching line to find adjacent matching lines when using find_all. If you have a lot of rows with the same key, upping this value will speed things up. The default is 4096 bytes.
find: Return the line matching the key supplied, or nothing (undef in scalar context, an empty list in list context) if no line matches.
find_all: Return all lines matching the key supplied, or an empty list if none are found. The lines are returned in the order they appear in the file.
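The lookup semantics can be illustrated with an in-memory analogue. This is only a sketch under stated assumptions, not the module's implementation: it searches a plain Perl array instead of a gzip file, compares keys as strings, and the names `@lines`, `key_of` and `find_all_in_memory` are hypothetical.

```perl
use strict;
use warnings;

# In-memory stand-in for a flatfile sorted by its first column.
my @lines = (
    "apple\t1\n",
    "banana\t2\n",
    "banana\t3\n",
    "cherry\t4\n",
);

# Key function in the style described above: reads the line from $_.
my $key_func = sub { (split /\t/)[0] };

# Call the key function with $_ aliased to a copy of the line.
sub key_of {
    my ($line) = @_;
    local $_ = $line;
    return $key_func->();
}

# Return every line whose key equals $want, in file order.
sub find_all_in_memory {
    my ($want) = @_;
    my ($lo, $hi) = (0, scalar @lines);

    # Lower-bound binary search: first index whose key is >= $want.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if (key_of($lines[$mid]) lt $want) { $lo = $mid + 1 }
        else                               { $hi = $mid }
    }

    # Collect the run of adjacent lines that share the key.
    my @found;
    while ($lo < @lines && key_of($lines[$lo]) eq $want) {
        push @found, $lines[$lo++];
    }
    return @found;
}

my @bananas = find_all_in_memory('banana');
my @missing = find_all_in_memory('durian');
print scalar(@bananas), " match(es) for banana\n";
print scalar(@missing), " match(es) for durian\n";
```

A missing key yields an empty list, and duplicate keys come back in their original order.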
Returns the Gzip::RandomAccess object we're using.
tsv_column: Returns a key function that parses each line as a TSV and returns the column with the given number as the key.
fs_column: Returns a key function that splits each line on the field separator provided and returns the column with the given number as the key. $field_separator may be a regex or a string.
For example, to split like awk(1) and use the first column:
key_func => fs_column(qr/\s+/, 1)
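For readers curious how such a generator can work, here is a rough stand-in written from the description above. It is not the module's source: `my_fs_column` is a hypothetical name, and the sketch assumes columns are counted from 1, as the "first column" example suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fs_column: returns a key function that splits
# $_ on the separator and returns the given column (counted from 1).
sub my_fs_column {
    my ($field_separator, $column) = @_;
    return sub {
        my $line = $_;
        chomp $line;
        my @fields = split $field_separator, $line;
        return $fields[$column - 1];
    };
}

# Build a key function for the first whitespace-separated column.
my $first = my_fs_column(qr/\s+/, 1);

# Simulate a call by aliasing $_ to a sample line.
my $key;
for ("alpha  beta\tgamma\n") {
    $key = $first->();
}
print "$key\n";
```

Returning a closure lets the separator and column be fixed once, so the resulting function only needs the line in `$_`.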
Accessors for constructor arguments.
Currently only works with Unix line endings (LF, ASCII 0x0A).
Does not support fancy multibyte encodings (specifically UTF-8) but I aim to add support in a later release.
Isn't as efficient as it could be - aligning decompression to the indexed points in the gzip would help, as would caching decompressed blocks.
Richard Harris <richardjharris@gmail.com>