NAME

minhash_cmp - compares large text files using MinHash & SpeedyFx

VERSION

version 0.005

SYNOPSIS

    minhash_cmp [options] FILE1 FILE2

DESCRIPTION

MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. minhash_cmp uses it to compare the contents of FILE1 and FILE2.
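
The idea behind the estimate can be sketched in a few lines. The following is a minimal illustration in Python, not the code of minhash_cmp itself: the function names, the use of BLAKE2b as a seeded hash family, and the whitespace tokenization in the usage note are all assumptions made for the example. Each of the k hash functions contributes the smallest hash value seen in a set; the fraction of positions where two signatures agree approximates the Jaccard similarity of the sets.

```python
import hashlib


def hash_token(token, seed):
    """Hypothetical seeded hash: 64-bit BLAKE2b digest of "seed:token"."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, k=400, seed=0):
    """Return k minimum hash values, one per hash function.

    `tokens` must be a non-empty set; function i is simulated by seed + i.
    """
    return [min(hash_token(t, seed + i) for t in tokens) for i in range(k)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which both sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

For two texts, one could build the token sets with something like `set(text.split())`, compute both signatures with the same `k` and `seed`, and compare `estimate_similarity` against `jaccard`. The signatures have a fixed size regardless of how large the inputs are, which is what makes the approach attractive for large text data.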

OPTIONS

--help

Show this help text.

--binmode

Sets the mode in which the input files are read, e.g. :raw for binary mode, :utf8, etc. (default: :utf8).

--epsilon

Expected error of the similarity estimate, used to compute the number of different hash functions (default: 0.05).

--k

Number of different hash functions to use (default: 400; overrides --epsilon).

--seed

Custom seed (integer).
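
The two defaults above are consistent with the usual MinHash rule of thumb that k hash functions give an expected error of about 1/sqrt(k), since 1 / 0.05^2 = 400. Whether minhash_cmp computes k with exactly this formula is an assumption; the sketch below (illustrative Python with hypothetical function names) only shows the arithmetic, together with the memory estimate that follows from the roughly 2 MB per hash function mentioned under CAVEATS.

```python
def hashes_for_error(epsilon):
    """Number of hash functions for a target error, assuming k = 1 / epsilon^2."""
    # round() absorbs floating-point noise such as 399.99999999999994
    return round(1 / epsilon ** 2)


def estimated_memory_mb(k, mb_per_hash=2.0):
    """Rough memory footprint, assuming ~2 MB per initialized hash function."""
    return k * mb_per_hash


k = hashes_for_error(0.05)
print(k, estimated_memory_mb(k))  # 400 hash functions, about 800 MB
```

Under this assumption, halving the error quadruples the number of hash functions and therefore the memory needed, so passing a smaller --k is the obvious lever when RAM is tight.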

CAVEATS

Uses a lot of RAM: each initialized hash function takes about 2 MB, so the default of 400 hash functions needs roughly 800 MB.

SEE ALSO

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.