NAME
Text::SpeedyFx - tokenize/hash large amount of strings efficiently
VERSION
version 0.006
SYNOPSIS
use Data::Dumper;
use Text::SpeedyFx;
my $sfx = Text::SpeedyFx->new;
my $words_bag = $sfx->hash('To be or not to be?');
print Dumper $words_bag;
#$VAR1 = {
# '1422534433' => '1',
# '4120516737' => '2',
# '1439817409' => '2',
# '3087870273' => '1'
# };
my $feature_vector = $sfx->hash_fv("thats the question", 8);
print unpack('b*', $feature_vector);
# 01001000
DESCRIPTION
XS implementation of a very fast combined parser/hasher which works well on a variety of bag-of-word problems.
Original implementation is in Java and was adapted for a better Unicode compliance.
METHODS
new([$seed, $bits])
Initialize parser/hasher, can be customized with the options:
$seed
-
Hash seed (default: 1).
$bits
-
How many bits do represent one character. The default value, 8, sacrifices Unicode handling but is fast and low on memory footprint. The value of 18 encompasses Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes. See also "UNICODE SUPPORT"
hash($octets)
Parses $octets
and returns a hash reference where keys are the hashed tokens and values are their respective count. $octets
are assumed to represent UTF-8 string unless Text::SpeedyFx is instantiated with "$bits" == 8 (which forces Latin-1 mode, see "UNICODE SUPPORT"). Note that this is the slowest form due to the (computational) complexity of the Perl hash structure itself: hash_fv()
is 350% faster, while hash_min()
is up to 400% faster.
hash_fv($octets, $n)
Parses $octets
and returns a feature vector (string of bits) with length $n
. $n
is supposed to be a multiplier of 8, as the length of the resulting feature vector is ceil($n / 8)
. See the included utilities cosine_sim and uniq_wc.
hash_min($octets)
Parses $octets
and returns the hash with the lowest value. Useful in MinHash implementation. See also the included minhash_cmp utility.
UNICODE SUPPORT
Due to the nature of Perl, Unicode support is handled differently from the original implementation. By default, Text::SpeedyFx recognizes UTF-8 encoded code points in the range 00000-2FFFF:
Plane 0, the Basic Multilingual Plane (BMP, 0000–FFFF)
Plane 1, the Supplementary Multilingual Plane (SMP, 10000–1FFFF)
Plane 2, the Supplementary Ideographic Plane (SIP, 20000–2FFFF)
There are planes up to 16; however, as in Perl v5.16.2, there are no code points matching
isALNUM_utf8()
there (so it's irrelevant for proper algorithm operation).
Although, there is a major drawback: in this mode, each instance allocates up to 1 MB of memory.
If the application doesn't need to support code points beyond the Plane 0 (like the original SpeedyFx implementation) it is possible to constraint the address space to 16 bits, which lowers memory allocation to up to 256 KB. In fact, Text::SpeedyFx constructor accepts bit range between 8 and 18 to address code points.
LATIN-1 SUPPORT
8 bit address space has one special meaning: it completely disables multibyte support. In 8 bit mode, each instance will only allocate 256 bytes and hashing will be run up to 66% faster. Tokenization will fallback to ISO 8859-1 West European languages (Latin-1) character definitions.
BENCHMARK
The test platform configuration:
Intel® Core™ i7-2600 CPU @ 3.40GHz with 8 GB RAM;
Ubuntu 11.10 (64-bit);
Perl v5.16.2 (installed via perlbrew);
enwik8 from the Large Text Compression Benchmark.
Rate hash hash_min_utf8 hash_fv hash_min
hash 19.5 MB/s -- -68% -78% -80%
hash_min_utf8 60.2 MB/s 209% -- -34% -40%
hash_fv 90.5 MB/s 364% 50% -- -9%
hash_min 99.6 MB/s 411% 66% 10% --
All the tests except hash_min_utf8
were made in Latin-1 mode.
REFERENCES
Extremely Fast Text Feature Extraction for Classification and Indexing by George Forman and Evan Kirshenbaum
cosine_sim, minhash_cmp and uniq_wc utilities from this distribution
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.