NAME
minhash_cmp - compares large text files using MinHash and SpeedyFx
VERSION
version 0.005
SYNOPSIS
minhash_cmp [options] FILE1 FILE2
DESCRIPTION
MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
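The idea can be illustrated with a minimal Python sketch. It is not the code of minhash_cmp: it assumes word-level tokenization and uses BLAKE2b from the standard hashlib module as a stand-in for SpeedyFx, and the function names (tokens, token_hash, signature, similarity) are hypothetical. Each of the k hash functions keeps only the smallest hash value it sees in a set; the fraction of positions where two signatures agree estimates the Jaccard index of the two sets.

```python
import hashlib
import re


def tokens(text):
    """Split text into a set of lowercase word tokens (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def token_hash(token, index):
    """64-bit hash of a token; the salt selects one of the k hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=index.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(token_set, k, seed=0):
    """MinHash signature: the minimum hash per hash function (non-empty set)."""
    return [min(token_hash(t, seed + i) for t in token_set) for i in range(k)]


def similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = tokens("the quick brown fox jumps over the lazy dog")
b = tokens("the quick brown fox leaps over the lazy cat")

estimate = similarity(signature(a, 400), signature(b, 400))
exact = len(a & b) / len(a | b)  # 6 shared tokens out of 10 distinct ones
print(f"estimated: {estimate:.3f}  exact: {exact:.3f}")
```

The signatures have a fixed length of k regardless of input size, which is what makes the comparison cheap for large texts; the price is that the result is an estimate, not the exact Jaccard index.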
OPTIONS
- --help
Show this help message.
- --binmode
Sets the mode in which the files are read: binary (:raw), :utf8, etc. Default: :utf8
- --epsilon
Expected error value used to compute the number of different hash functions (default: 0.05).
- --k
Number of different hash functions to use (default: 400; overrides --epsilon).
- --seed
Custom seed (integer).
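The defaults for --epsilon and --k are consistent with the usual MinHash error bound, in which the expected error with k hash functions is about 1/sqrt(k), so k = 1/epsilon^2. The Python sketch below assumes that relation; the exact formula and rounding used by minhash_cmp are not stated here, and the function names are hypothetical.

```python
import math


def hash_count_for_epsilon(epsilon):
    """Number of hash functions needed for an expected error of epsilon,
    assuming the standard MinHash bound epsilon ~ 1/sqrt(k)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The small offset guards against floating-point noise pushing an
    # exact integer result (e.g. 400) up to the next integer.
    return math.ceil(1.0 / epsilon ** 2 - 1e-9)


def expected_error_for_k(k):
    """Expected error for k hash functions under the same assumption."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 / math.sqrt(k)


print(hash_count_for_epsilon(0.05))
print(expected_error_for_k(400))
```

Under this assumption, halving the expected error requires four times as many hash functions, which matters given the per-function memory cost described under CAVEATS.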
CAVEATS
Uses a lot of RAM: each initialized hash function consumes about 2 MB, so the default of 400 hash functions needs roughly 800 MB.
SEE ALSO
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.