The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Tie::Hash::MinPerfHashTwoLevel::OnDisk - construct or tie a "two level" minimal perfect hash based on disk

SYNOPSIS

  use Tie::Hash::MinPerfHashTwoLevel::OnDisk;

  Tie::Hash::MinPerfHashTwoLevel::OnDisk->make_file(
    file => $some_file,
    source_hash => $some_hash,
    comment => "this is a comment",
    debug => 0,
  );

  my %hash;
  tie %hash, "Tie::Hash::MinPerfHashTwoLevel::OnDisk", $some_file;

DESCRIPTION

This module allows one to either construct, or use a precomputed minimal perfect hash on disk via tied interface. The disk image of the hash is loaded by using mmap, which means that multiple processes may use the same disk image at the same time without memory duplication. The hash is readonly, and may only contain string values.

METHODS

make_file

Construct a new file from a given 'source_hash' argument. The constructed buffer is written to the file specified by the 'file' argument. A comment may be added to the file via the 'comment' argument, note that comments may not contain null characters, although keys and value may. A predetermined seed may be provided to the hash function (16 bytes) via the 'seed' argument, however note that if it does not produce hash values that allow for the construction of a valid two level perfect hash then a different seed will be automatically selected (this will not affect the ability to use the constructed hash, it just may not be deterministic). The 'debug' argument outputs some basic status infromation about the construction process.

validate_file

Validate the file specified by the 'file' argument. Returns a list of two values, 'variant' and 'message'. If the file fails validation the 'variant' will be undef and the 'message' will contain an error message. If the file passes validation the 'variant' will specify the variant of the file (currently only 0 is valid), and 'message' will contain some basic information about the file, such as how many keys it contains, the comment it was created with, etc.

FILE FORMAT

Currently there is only one file format, variant 0.

The file structure consists of a header, followed by a byte vector of seed/state data for the hash function, followed by a bucket table with records of a fixed size, followed by a bitvector of the flags for the keys with two bits per key, followed by a bitvector of flags for values with one bit per value, followed by a string table containing the comment for the file and the strings it contains. The key flags may be 0 for "latin-1/not-utf8", 1 for "is-utf8", and 2 for "was-utf8" which is used for keys which can be represented as latin-1, but should be restored as unicode/utf8. The val flags are similar but do not (need to) support "was-utf8".

Structure:

    Header
    Hash-state
    Bucket-table
    Key flags
    Val flags
    Strings

Header:

    U32 magic_num       -> 1278363728 -> "PH2L"
    U32 variant         -> 0
    U32 num_buckets     -> number of buckets/keys in hash
    U32 state_ofs       -> offset in file where hash preseeded state is found
    U32 table_ofs       -> offset in file where bucket table starts
    U32 key_flags_ofs   -> offset in file where key flags are located
    U32 val_flags ofs   -> offset in file where val flags are located
    U32 str_buf_ofs     -> offset in file where strings are located
    U64 table_checksum  -> hash value checksum of table and key/val flags
    U64 str_buf_checksum-> hash value checksum of string data

All "_ofs" members in the header are aligned on 16 byte boundaries and may be right padded with nulls if necessary to make them a multiple of 16 bytes long, including the string buffer.

The string buffer contains the comment at str_buf_ofs+1, its length can be found with strlen(), the comment may NOT contain nulls, and will be null terminated. All other strings in the table are NOT null padded, the length data stored in the bucket records should be used to determine the length of the keys and values.

The table_checksum is the hash (using the seed/state data stored at state_ofs) of the data in the file from table_ofs to str_buf_ofs, eg it includes the key_flags bit vector and val_flags bit vector. The str_buf_checksum is similar but of the data from the str_buf_ofs to the end of the file.

Buckets:

   U32 xor_val      -> the xor_val for this bucket's h1 lookups (0 means none)
   U32 key_ofs      -> offset from str_buf_ofs to find this key (nonzero always)
   U32 val_ofs      -> offset from str_buf_ofs to find this value (0 means undef)
   U16 key_len      -> length of key
   U16 val_len      -> length of value

The hash function used is stadtx hash, which uses a 16 byte seed to produce a 32 byte state vector.

EXPORT

None by default.

SEE ALSO

Algorithm::MinPerfHashTwoLevel

AUTHOR

Yves Orton

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Yves Orton

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.18.4 or, at your option, any later version of Perl 5 you may have available.