NAME
Text::SpeedyFx - tokenize/hash a large number of strings efficiently
VERSION
version 0.004
SYNOPSIS
    use Data::Dumper;
    use Text::SpeedyFx;

    my $sfx = Text::SpeedyFx->new;

    my $words_bag = $sfx->hash('To be or not to be?');
    print Dumper $words_bag;
    # $VAR1 = {
    #     '1422534433' => '1',
    #     '4120516737' => '2',
    #     '1439817409' => '2',
    #     '3087870273' => '1'
    # };

    my $feature_vector = $sfx->hash_fv("thats the question", 5);
    print Dumper $feature_vector;
    # $VAR1 = [
    #     '0',
    #     '1',
    #     '0',
    #     '1',
    #     '0'
    # ];
DESCRIPTION
An XS implementation of a very fast combined parser/hasher that works well on a variety of bag-of-words problems.
The original implementation is in Java; it was adapted here for better Unicode compliance.
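As an illustration of one such bag-of-words problem, the sketch below compares two bags by cosine similarity. It assumes only that each bag is a hash reference mapping hashed tokens to counts, as hash() is documented to return. The function name cosine_similarity and the sample bags are hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Cosine similarity between two bags (hash refs: hashed token => count).
# Hypothetical helper for illustration; not provided by Text::SpeedyFx.
sub cosine_similarity {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);

    for my $key (keys %$x) {
        $norm_x += $x->{$key} ** 2;
        # Only tokens present in both bags contribute to the dot product.
        $dot += $x->{$key} * $y->{$key} if exists $y->{$key};
    }
    $norm_y += $_ ** 2 for values %$y;

    # An empty bag has no direction; treat it as dissimilar to everything.
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

# Made-up bags standing in for two results of $sfx->hash(...).
my $bag_a = { 101 => 1, 202 => 2 };
my $bag_b = { 202 => 1, 303 => 4 };
printf "similarity: %.3f\n", cosine_similarity($bag_a, $bag_b);
```

With the module installed, the two arguments would presumably be the results of two hash() calls on the same Text::SpeedyFx object, so that both bags share one seed.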
METHODS
new([$seed])
Initializes the parser/hasher, optionally using the specified $seed (default: 1).
hash($string)
Parses $string and returns a hash reference in which the keys are hashed tokens and the values are their respective counts.
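The shape of that result can be pictured with a pure-Perl sketch. Everything in it is an assumption made for illustration: toy_hash is a djb2-style stand-in rather than the SpeedyFx function, and the tokenizer simply lowercases the input and takes runs of word characters. The case folding is at least consistent with the SYNOPSIS, where 'To be or not to be?' yields four keys.

```perl
use strict;
use warnings;

# Stand-in 32-bit string hash (djb2-style). NOT the SpeedyFx function;
# it only plays the same role in this sketch.
sub toy_hash {
    my ($token) = @_;
    my $h = 5381;
    $h = ($h * 33 + ord($_)) % 4294967296 for split //, $token;
    return $h;
}

# Assumed tokenization: lowercase, then take runs of word characters.
# Each token is hashed, and the hash is used as the counting key.
sub toy_bag {
    my ($string) = @_;
    my %bag;
    $bag{ toy_hash($_) }++ for lc($string) =~ /(\w+)/g;
    return \%bag;
}

my $bag = toy_bag('To be or not to be?');
printf "%d distinct tokens\n", scalar keys %$bag;   # prints "4 distinct tokens"
```

The keys produced here will not match those in the SYNOPSIS, since the hash function differs. Only the structure (hashed token => count) is meant to carry over.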
hash_fv($string, $n)
Parses $string and returns a feature vector with $n elements.
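One plausible way to picture such a vector is to fold each hashed token into one of $n slots. The sketch below assumes the slot index is the hashed token modulo $n and that a slot holds 1 once any token lands in it, which matches the 0/1 values shown in the SYNOPSIS. The module's actual mapping is not documented here, so this is an assumption. The helper name fold_to_vector and the sample keys are hypothetical.

```perl
use strict;
use warnings;

# Fold hashed tokens into a binary vector of $n slots.
# Assumption: slot index = hashed token modulo $n.
sub fold_to_vector {
    my ($bag, $n) = @_;
    my @fv = (0) x $n;
    $fv[ $_ % $n ] = 1 for keys %$bag;
    return \@fv;
}

# Made-up hashed tokens: 7 and 12 both land in slot 2, 20 lands in slot 0.
my $bag = { 7 => 1, 12 => 2, 20 => 1 };
my $fv  = fold_to_vector($bag, 5);
print join(',', @$fv), "\n";   # prints "1,0,1,0,0"
```

Three tokens set only two slots here because two of them collide. The same effect is visible in the SYNOPSIS, where a three-word string sets two of five elements.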
hash_min($string)
Parses $string and returns the lowest hash value among its tokens.
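If the token hashes are the same ones that hash() uses as keys (an assumption, not something stated above), the result would correspond to the smallest key of the bag. A minimal sketch using the keys from the SYNOPSIS output, with a hypothetical helper name:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Smallest hashed token in a bag (hash ref: hashed token => count).
# Hypothetical helper; returns undef for an empty bag.
sub min_key {
    my ($bag) = @_;
    return min(keys %$bag);
}

# Keys copied from the SYNOPSIS output for 'To be or not to be?'.
my $bag = {
    1422534433 => 1,
    4120516737 => 2,
    1439817409 => 2,
    3087870273 => 1,
};
print min_key($bag), "\n";   # prints "1422534433"
```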
REFERENCES
"Extremely Fast Text Feature Extraction for Classification and Indexing" by George Forman and Evan Kirshenbaum
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.