minhash_cmp - uses MinHash & SpeedyFx to compare large text data
version 0.012
minhash_cmp [options] FILE1 FILE2
MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
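As an illustration of the idea (not the code minhash_cmp itself runs), here is a minimal Python sketch of MinHash. It assumes whitespace tokenization and uses a hypothetical hash family built from salted BLAKE2b; the names `tokens`, `seeded_hash`, `signature` and `estimate_similarity` are made up for this example, and SpeedyFx is not involved.

```python
import hashlib


def tokens(text):
    """Split text into a set of lowercase, whitespace-separated tokens."""
    return set(text.lower().split())


def seeded_hash(token, seed):
    """Hypothetical hash family: one 64-bit hash function per integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(token_set, k):
    """MinHash signature: the smallest hash value under each of k functions.

    Assumes token_set is non-empty.
    """
    return [min(seeded_hash(t, seed) for t in token_set) for seed in range(k)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


a = tokens("the quick brown fox jumps over the lazy dog")
b = tokens("the quick brown fox leaps over the lazy cat")
k = 400
estimate = estimate_similarity(signature(a, k), signature(b, k))
print(f"exact={jaccard(a, b):.3f} estimate={estimate:.3f}")
```

The two signatures have a fixed length k regardless of input size, which is what makes the comparison cheap for large texts: only the signatures need to be compared, not the full token sets.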
This.
Expected error value used to compute the number of different hash functions (default: 0.05).
Number of different hash functions to use (default: 400; overrides --epsilon).
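The two defaults are consistent with the usual MinHash error bound, where the expected error with k hash functions is on the order of 1/sqrt(k). The Python sketch below assumes that relationship (k = ceil(1/epsilon^2)); whether minhash_cmp uses exactly this formula is an assumption, and the function names are hypothetical.

```python
import math


def hash_count_for_error(epsilon):
    """Smallest k with 1/sqrt(k) <= epsilon (assumed relationship)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return math.ceil(1.0 / epsilon ** 2)


def expected_error(k):
    """Expected error of a MinHash estimate built from k hash functions."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 / math.sqrt(k)


print(hash_count_for_error(0.05))  # 400
print(expected_error(400))         # 0.05
```

Under this assumption, halving the expected error quadruples the number of hash functions, and with it the time and memory needed.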
Custom seed (integer).
How many bits to use to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes.
With the bits=18 setting, each initialized hash function consumes about 500 KB.
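To see why 18 bits is enough for those three planes, the following Python sketch counts the code points addressable with a given number of bits. It assumes the bit width simply bounds the largest code point that can be distinguished; how the tool handles characters above that bound is not covered here, and the helper names are hypothetical.

```python
# Unicode planes named in the option description: (name, first, last code point).
PLANES = [
    ("Basic Multilingual", 0x0000, 0xFFFF),
    ("Supplementary Multilingual", 0x10000, 0x1FFFF),
    ("Supplementary Ideographic", 0x20000, 0x2FFFF),
]


def code_points_covered(bits):
    """Number of distinct code points addressable with the given bit width."""
    return 1 << bits


def planes_covered(bits):
    """Names of the planes that fit entirely below 2**bits."""
    limit = code_points_covered(bits)
    return [name for name, _first, last in PLANES if last < limit]


for bits in (8, 16, 18):
    print(bits, code_points_covered(bits), planes_covered(bits))
```

With 8 bits only 256 code points are distinguishable, so no full plane fits; 18 bits gives 262144 code points, which is above the last code point of the Supplementary Ideographic plane (0x2FFFF).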
MinHash
Text::SpeedyFx
Stanislaw Pusep <stas@sysd.org>
This software is copyright (c) 2014 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.