Grapheme::Ngram - n-grams of Unicode Extended Grapheme Clusters
use Grapheme::Ngram;
my $class = 'Grapheme::Ngram';
my @ngrams = $class->ngram($string,$width);
Many applications are better served by working on graphemes (what a reader perceives as one character) than on individual code points.
Building n-grams is one of them.
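The difference shows up as soon as a string contains combining characters. The following is a minimal sketch using only core Perl, not this module; the sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
# three code points, but only two graphemes.
my $string = "e\x{0301}a";

my @code_points = split //, $string;     # splits between every code point
my @graphemes   = $string =~ m/(\X)/g;   # \X matches one extended grapheme cluster

printf "code points: %d, graphemes: %d\n",
    scalar @code_points, scalar @graphemes;
# prints "code points: 3, graphemes: 2"
```

N-grams built from @code_points would separate the accent from its base letter; n-grams built from @graphemes keep them together.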
$object = Grapheme::Ngram->new();
my $array_ref = $object->ngram($string, $width);
$string ...... string of characters
$width ....... length of the resulting tokens. Default is 1.
$array_ref ... reference to array of ngram tokens
Returns a single token containing the unmodified $string if the number of graphemes in $string is less than $width.
Returns an empty $array_ref if $string is empty or undef. NOTE: this behaviour may change in a future release.
Falls back to $width = 1 if $width is not an integer greater than 0.
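The rules above can be sketched in plain Perl. This is not the module's source: grapheme_ngrams is a hypothetical helper, and the sliding-window step is an assumption about how the tokens are formed.

```perl
use strict;
use warnings;

# Hypothetical helper that mirrors the documented rules for ngram.
sub grapheme_ngrams {
    my ($string, $width) = @_;

    # Fall back to 1 unless $width is an integer greater than 0.
    $width = 1 unless defined $width && $width =~ /\A[1-9][0-9]*\z/;

    # Empty or undef input yields a reference to an empty array.
    return [] unless defined $string && length $string;

    my @graphemes = $string =~ m/(\X)/g;

    # Fewer graphemes than $width: one token holding the unmodified string.
    return [$string] if @graphemes < $width;

    # Assumed: slide a window of $width graphemes across the string.
    return [ map { join '', @graphemes[ $_ .. $_ + $width - 1 ] }
             0 .. @graphemes - $width ];
}

print join('|', @{ grapheme_ngrams('abcd', 2) }), "\n";   # ab|bc|cd
print join('|', @{ grapheme_ngrams('ab',   3) }), "\n";   # ab
print scalar @{ grapheme_ngrams('', 2) }, "\n";           # 0
```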
my @ngram = $object->from_tokens(\@tokens, $width);
Same as ngram, but takes a reference to an array of tokens instead of a string. ngram itself calls this method.
This makes it possible to use a custom tokenizer, e.g. one that also treats 'sh' as a single grapheme:
my @tokens = $string =~ m/(Sh|sh|\X)/g;
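To see what such a tokenizer changes, here is a sketch in core Perl that does not call the module. The sample word and the windowing step are assumptions; the latter stands in for what from_tokens is described as doing with $width = 2.

```perl
use strict;
use warnings;

my $string = 'fishes';

# Tokenizer from the text: 'Sh'/'sh' form one token, anything else is one grapheme.
my @tokens = $string =~ m/(Sh|sh|\X)/g;   # f, i, sh, e, s

# Hypothetical stand-in for from_tokens: window of $width tokens.
my $width   = 2;
my @bigrams = map { join '', @tokens[ $_ .. $_ + $width - 1 ] }
              0 .. @tokens - $width;

print join('|', @bigrams), "\n";   # fi|ish|she|es
```

With a plain \X tokenizer the same word would give fi|is|sh|he|es, so 'sh' would only appear as a bigram of its own rather than as part of 'ish' and 'she'.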
my @graphemes = $object->_tokenize($string);
This internal method splits $string into a list of graphemes.
http://github.com/wollmers/Grapheme-Ngram
Helmut Wollmersdorfer, <helmut.wollmersdorfer@gmail.com>
Copyright (C) 2014 by Helmut Wollmersdorfer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.