The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Algorithm::MinPerfHashTwoLevel - construct a "two level" minimal perfect hash

SYNOPSIS

  use Algorithm::MinPerfHashTwoLevel;
  my $buckets= Algorithm::MinPerfHashTwoLevel->compute(\%source_hash);

DESCRIPTION

This module implements an algorithm to construct (relatively efficiently) a minimal perfect hash using the "two level" algorithm. A perfect hash is one which has no collisions for any keys, a minimal perfect hash has exactly the same number of buckets as it has keys. The "two level" algorithm involves computing two hash values for each key. The first is used as an initial lookup into the bucket array to find a mask value which is used to modify the second hash value to determine the actual bucket to read from. This module computes the appropriate mask values.

In this implementation only one 64 bit hash value is computed, but the high and low 32 bits are used as if there were two hash values. The hash function used is stadtxhash. (The full 64 bit hash is called h0, the high 32 bits are called h1, and the low 32 bits are called h2.)

Computing the hash and mask is done in C (via XS).

The process for looking up a value in a two level hash with n buckets is as follows:

    0. compute the h0 for the key. (giving: h1 = h0 >> 32; h2 = h0 & 0xFFFFFFFF;)
    1. compute idx1 = h1 % n;
    2. find the xor_val for bucket[idx1]
    3. if the xor_val is zero we are done, the key is not in the hash
    4. compute idx2:
        if variant > 0 and int(xor_val) < 0
            idx2 = -xor_val-1
        else
            idx2 = INTHASH(h2 ^ xor_val) % n;
    5. compare the key data associated with bucket[idx2] with the key provided
    6. if they match return the desired value, otherwise the key is not in the hash.

In essence this module performs the task of computing the xor_val for each bucket such that the idx2 for every element is unique, it does it in C/XS so that it is fast.

The INTHASH() function used depends by variant, with variant 2 and later it is:

    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x);

which is just a simple 32 bit integer hash function I found at https://stackoverflow.com/a/12996028, but any decent reversible integer hash function would do. For variant 0 and 1 it is the identity function. The default is variant 3.

*NOTE* in Perl a given string may have differing binary representations if it is encoded as utf8 or not. This module uses the same conventions as Perl itself, which is that keys are stored in their minimal form when possible, and are only stored in their unicode (utf8) form when they cannot be downgraded to latin-1. This ensures that the unicode and latin-1 representations of a given string are treated as the same key. This module deals with this by "normalizing" the keys and values into latin-1, but tracking the representation as a flag. See key_normalized and key_is_utf8 (and their 'val' equivalents) documented in the construct method.

METHODS

new

Construct a new Algorithm::MinPerfHashTwoLevel object. Optional arguments which may be provided are 'source_hash' which is a hash reference to use as the source for the minimal perfect hash, 'seed' which is expected to be a 16 byte string, and 'debug' which is expected to be 0 or 1, as well as variant, which may be 0, 1 or 2. The default is 2.

compute

Compute the buckets for a two level minimal perfect hash. Either operates on the 'source_hash' passed into the constructor, or requires one to passed in as an argument.

Returns an array of hashes containing information about each bucket:

          {
            "h1_keys" => 2,
            "h0" => "17713559403787135240",
            "idx" => 0,
            "key" => "\x{103}",
            "key_is_utf8" => 1,
            "key_normalized" => "\304\203",
            "val" => "\x{103}",
            "val_is_utf8" => 1,
            "val_normalized" => "\304\203",
            "xor_val" => 2
          },

The meaning of these keys is as follows:

h1_keys

The number of keys which collide into this bucket and which are disambiguated by the 'xor_val' for this bucket.

h0

The hash value computed for this key.

idx

The index of this bucket.

key

The key for this bucket as a perl string. (See key_normalized.)

key_is_utf8

Whether this key is encoded as utf8. Will be one of 0 for "not utf8", 1 for "is utf8", and 2 for "was utf8" meaning the key is stored as latin-1, but will be upgraded when fetched.

key_normalized

The raw bytes of the normalized key. (See key_is_utf8.)

val

The value for this bucket as a perl string. (See val_normalized.)

val_is_utf8

Whether this key is encoded as utf8. Will be either 0 for "not utf8" or 1 for "is utf8".

val_normalized

The raw bytes of the normalized key. (See val_is_utf8).

xor_val

The mask to be xor'ed with the second hash (h2) to determine the actual lookup bucket. If the xor_val for a given bucket is 0 then the key is not in the hash.

EXPORT

None by default.

SEE ALSO

Tie::Hash::MinPerfHashTwoLevel::OnDisk

AUTHOR

Yves Orton

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Yves Orton

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.18.4 or, at your option, any later version of Perl 5 you may have available.