NAME
Data::HyperLogLog::Shared - shared-memory HyperLogLog cardinality estimator for Linux
SYNOPSIS
use Data::HyperLogLog::Shared;
# default precision 14 (16384 registers, ~0.8% std error), anonymous mapping
my $hll = Data::HyperLogLog::Shared->new;
$hll->add("alice");
$hll->add("bob");
$hll->add("alice"); # already seen: no register increases
my $n = $hll->count; # ~2 (estimated distinct items)
# bulk add in a single lock acquisition
my $new = $hll->add_many([ map { "user-$_" } 1 .. 1000 ]);
# merge another estimator of equal precision (register-wise max)
my $other = Data::HyperLogLog::Shared->new;
$other->add_many([ map { "user-$_" } 500 .. 1500 ]);
$hll->merge($other);
my $union = $hll->count; # ~1500 distinct across both
# share across processes via a backing file
my $shared = Data::HyperLogLog::Shared->new("/tmp/visitors.hll", 14);
DESCRIPTION
A HyperLogLog estimator in shared memory: it counts the number of distinct items it has seen (a probabilistic distinct-count) in a fixed, tiny amount of memory, regardless of how many items pass through it. It never stores the items; it keeps only an array of m = 2**precision single-byte registers (16 KB at the default precision 14) plus a fixed ~16 KB region for the cross-process reader table, so the whole mapping is about 32 KB at precision 14 (see mmap_size in "STATS"). It yields a relative standard error of roughly 0.8%.
Each item is hashed with XXH3; the top precision bits of the hash select a register, and the position of the first set bit in the remaining bits is folded into that register as a running maximum. The harmonic mean of the registers, with a small-range linear-counting correction, gives the estimate.
Because the register array lives in a shared mapping, several processes share one estimator: any process that opens the same backing file, inherits the anonymous mapping across fork, or reopens a passed memfd, sees the others' additions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may add and count concurrently. Two estimators of equal precision can be combined with merge (register-wise max), which gives the cardinality of the union of their item sets without double-counting overlaps.
Items are added by their byte content (encode wide/utf8 strings first). Linux-only. Requires 64-bit Perl.
METHODS
Constructors
my $hll = Data::HyperLogLog::Shared->new($path, $precision);
my $hll = Data::HyperLogLog::Shared->new; # anonymous, precision 14
my $hll = Data::HyperLogLog::Shared->new(undef, 16); # anonymous, precision 16
my $hll = Data::HyperLogLog::Shared->new_memfd($name, $precision);
my $hll = Data::HyperLogLog::Shared->new_from_fd($fd);
$path is the backing file (undef or omitted for an anonymous mapping). $precision is optional and defaults to 14; it must be between 4 and 18 (new croaks otherwise). The register count is m = 2**precision (precision 4 -> 16 registers, 18 -> 262144), and the relative standard error is about 1.04 / sqrt(m) (precision 14 -> ~0.81%, 16 -> ~0.41%, 12 -> ~1.63%). When reopening an existing file or memfd, the stored precision wins and the caller's argument is ignored. new_memfd creates a Linux memfd (transferable via its memfd descriptor); new_from_fd reopens one in another process.
Adding and counting
my $bumped = $hll->add($item); # 1 if a register increased, else 0
my $added = $hll->add_many(\@items); # count of items that increased a register
my $n = $hll->count; # estimated distinct items (rounded)
$hll->clear; # reset to empty (count -> 0)
add hashes $item (taken by its bytes; wide characters croak, encode first) and updates the relevant register, returning 1 if that register increased and 0 otherwise. A return of 0 does not mean the item was seen before -- it means its hash did not beat the current register value; treat the return only as "did this change the sketch". add_many takes an array reference and does the whole batch under a single write lock, returning how many of its elements increased a register. count returns the current estimate rounded to the nearest integer.
Merging
$hll->merge($other);
Folds $other's registers into $hll by register-wise maximum, so $hll then estimates the cardinality of the union of the two item sets. Both estimators must have the same precision (merge croaks on a mismatch). $other is read under its own lock into a private snapshot first, so merging is deadlock-free even if two processes merge each other concurrently; $other is not modified.
Introspection and lifecycle
$hll->precision; $hll->registers; $hll->count; $hll->stats;
$hll->path; $hll->memfd; $hll->sync; $hll->unlink; # or Class->unlink($path)
precision is the configured precision; registers is the register count m (2**precision). sync flushes the mapping to its backing store (a no-op for anonymous and memfd estimators, which have none); unlink removes the backing file (also callable as Class->unlink($path)); path returns the backing path (undef for anonymous, memfd, or fd-reopened estimators) and memfd the backing descriptor -- the memfd of a new_memfd estimator or the dup'd fd of a new_from_fd estimator, and -1 for file-backed or anonymous estimators.
STATS
stats() returns a hashref: precision, registers (the register count m), count (the current rounded estimate), ops (running count of mutating operations -- add, add_many, merge, clear), and mmap_size (bytes of the shared mapping).
ACCURACY
The relative standard error is approximately 1.04 / sqrt(m) where m = 2**precision. At the default precision 14 that is about 0.8%; roughly two thirds of estimates fall within one standard error of the truth and almost all within three. For very small cardinalities (when the estimate is at or below 2.5 * m and some registers are still zero) the estimator switches to linear counting, which is accurate in that range. There is no large-range (2^32) correction: with 64-bit hashes the hash space is effectively collision-free for any cardinality this is used for.
SHARING ACROSS PROCESSES
The estimator lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls new($path, ...) on the same path), an anonymous mapping inherited across fork, or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via SCM_RIGHTS, or via /proc/$pid/fd/$n) and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into and reads from the same set of registers, so the cardinality reflects the union of what all of them have seen.
# producer and consumer share one estimator with no coordination
my $hll = Data::HyperLogLog::Shared->new; # before fork
unless (fork) { $hll->add_many([ map { "ev-$_" } 1 .. 1000 ]); exit }
wait;
print $hll->count, "\n"; # ~1000, counting the child's additions
SECURITY
The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes.
CRASH SAFETY
Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each register update is a single byte store, so a crash leaves the estimator consistent up to the last completed add. Limitation: PID reuse is not detected (very unlikely in practice).
SEE ALSO
Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the Data::*::Shared family.
AUTHOR
vividsnow
LICENSE
This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.