NAME

Text::SpeedyFx - tokenize/hash large amounts of strings efficiently

VERSION

version 0.004

SYNOPSIS

    use Data::Dumper;
    use Text::SpeedyFx;

    my $sfx = Text::SpeedyFx->new;

    my $words_bag = $sfx->hash('To be or not to be?');
    print Dumper $words_bag;
    #$VAR1 = {
    #          '1422534433' => '1',
    #          '4120516737' => '2',
    #          '1439817409' => '2',
    #          '3087870273' => '1'
    #        };

    my $feature_vector = $sfx->hash_fv("thats the question", 5);
    print Dumper $feature_vector;
    #$VAR1 = [
    #          '0',
    #          '1',
    #          '0',
    #          '1',
    #          '0'
    #        ];

DESCRIPTION

An XS implementation of a very fast combined parser/hasher that works well on a variety of bag-of-words problems.

The original implementation is in Java; this version was adapted for better Unicode compliance.

METHODS

new([$seed])

Initialize parser/hasher, optionally using a specified $seed (default: 1).

hash($string)

Parses $string and returns a hash reference whose keys are the hashed tokens and whose values are their respective counts.
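
Because the result is an ordinary hash reference, two bags can be compared with plain Perl. The following is a minimal sketch and not part of the module: cosine_similarity is a hypothetical helper, and the sample bags are hand-made stand-ins for values that would come from hash().

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Text::SpeedyFx): cosine similarity
# between two bags shaped like the hash() result (hashed token => count).
sub cosine_similarity {
    my ($bag_a, $bag_b) = @_;

    my ($dot, $norm_a, $norm_b) = (0, 0, 0);
    for my $key (keys %$bag_a) {
        $norm_a += $bag_a->{$key} ** 2;
        # Only tokens present in both bags contribute to the dot product.
        $dot += $bag_a->{$key} * $bag_b->{$key} if exists $bag_b->{$key};
    }
    $norm_b += $_ ** 2 for values %$bag_b;

    # An empty bag has no direction; treat it as dissimilar to everything.
    return 0 unless $norm_a && $norm_b;
    return $dot / sqrt($norm_a * $norm_b);
}

# Stand-in bags; in practice these would come from $sfx->hash($string).
my $bag_a = { 101 => 1, 202 => 2 };
my $bag_b = { 202 => 2, 303 => 1 };

printf "%.2f\n", cosine_similarity($bag_a, $bag_b);    # prints 0.80
```

Working on the hashed keys directly avoids keeping the original token strings around, which is presumably the point of hashing them in the first place.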

hash_fv($string, $n)

Parses $string and returns a feature vector with $n elements.
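
How tokens are mapped onto the $n elements is not stated here. One common way to build a fixed-size vector from hashed tokens is the hashing trick: fold each token hash into a slot by taking it modulo $n. The sketch below illustrates that idea in plain Perl. fold_to_vector is a hypothetical helper, and whether hash_fv uses this exact mapping is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a bag of hashed tokens into $n presence flags.
# Assumption: slot = hash modulo $n; hash_fv may map tokens differently.
sub fold_to_vector {
    my ($bag, $n) = @_;
    my @vector = (0) x $n;
    $vector[ $_ % $n ] = 1 for keys %$bag;
    return \@vector;
}

# Keys copied from the hash() output shown in the SYNOPSIS.
my $bag = {
    1422534433 => 1,
    4120516737 => 2,
    1439817409 => 2,
    3087870273 => 1,
};

print join(',', @{ fold_to_vector($bag, 5) }), "\n";    # prints 0,0,1,1,1
```

Two of the four keys land in the same slot here, which shows the trade-off of a small $n: the vector stays compact, but distinct tokens can collide.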

hash_min($string)

Parses $string and returns the hash with the lowest value.
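
A lowest-hash value of this kind is the building block of MinHash-style similarity estimation. The sketch below assumes that objects created with different seeds behave as different hash functions, which the text above does not confirm. signature and estimate_similarity are hypothetical helpers that rely only on the documented hash_min($string) call.

```perl
use strict;
use warnings;

# Hypothetical helper: collect one hash_min() value per hasher object.
# Each hasher is expected to be a Text::SpeedyFx instance with its own seed.
sub signature {
    my ($text, @hashers) = @_;
    return [ map { $_->hash_min($text) } @hashers ];
}

# Hypothetical helper: fraction of positions where two signatures agree,
# used as a rough estimate of how similar the two token sets are.
sub estimate_similarity {
    my ($sig_a, $sig_b) = @_;
    return 0 unless @$sig_a && @$sig_a == @$sig_b;
    my $matches = grep { $sig_a->[$_] == $sig_b->[$_] } 0 .. $#$sig_a;
    return $matches / @$sig_a;
}
```

With the module installed, the hashers could be built as map { Text::SpeedyFx->new($_) } 1 .. 64, and two documents compared with estimate_similarity(signature($doc_a, @hashers), signature($doc_b, @hashers)). More hashers give a steadier estimate at the cost of one extra pass over the text per hasher.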

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.