Add many tokens to the batch by supplying the string to be tokenized, along with arrays of token starts and token ends.
Take an array of Perl scalars and map their string contents to the texts for each token in the batch.
Return a Perl array whose elements correspond to the token texts in this batch.
KinoSearch::Analysis::TokenBatch - A collection of tokens.
    # create a TokenBatch with a single Token
    my $source_batch = KinoSearch::Analysis::TokenBatch->new(
        text => 'Key Lime Pie',
    );

    # lowercase and split text into multiple tokens, append to new batch
    my $dest_batch = KinoSearch::Analysis::TokenBatch->new;
    while ( my $source_token = $source_batch->next ) {
        my $source_text = $source_token->get_text;
        while ( $source_text =~ /\s*?(\S+)/g ) {
            my $new_token = KinoSearch::Analysis::Token->new(
                text         => lc($1),
                start_offset => $-[1],
                end_offset   => $+[1],
            );
            $dest_batch->append($new_token);
        }
    }

    # prints 'keylimepie'
    while ( my $token = $dest_batch->next ) {
        print $token->get_text;
    }
A TokenBatch is a collection of Token objects which you can add to, then iterate over.
    my $batch = KinoSearch::Analysis::TokenBatch->new(
        text => $utf8_text,
    );

    # ... which is equivalent to:
    my $batch = KinoSearch::Analysis::TokenBatch->new;
    my $token = KinoSearch::Analysis::Token->new(
        text         => $utf8_text,
        start_offset => 0,
        end_offset   => length($utf8_text),
    );
    $batch->append($token);
Constructor. Takes one optional hash-style argument.
text - UTF-8 encoded text, used to prime the TokenBatch with a single initial Token.
$batch->append($token);
Tack a Token onto the end of the batch.
    $batch->add_many_tokens( $string, \@starts, \@ends );

    # or...
    $batch->add_many_tokens( $string, \@starts, \@ends, \@boosts );
High-efficiency method for adding multiple tokens to the batch with one call. The starts and ends, which must be specified in characters (not bytes), identify the substrings of $string that are supplied as token texts to Token->new.
(Note: boosts should be supplied only for fields which are set to store_pos_boost.)
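As a rough illustration of the "characters, not bytes" requirement, the sketch below collects start and end offsets from a decoded string using Perl's @- and @+ arrays. The sample string, the variable names, and the use of the core Encode module are assumptions made for this example only. The add_many_tokens call itself appears only in a comment, so the snippet runs without KinoSearch installed.

```perl
use strict;
use warnings;
use Encode qw( encode );

# A character string containing three non-ASCII characters (assumed sample).
my $string = "Cr\x{e8}me Br\x{fb}l\x{e9}e Tart";

my ( @starts, @ends );
while ( $string =~ /(\S+)/g ) {
    push @starts, $-[1];    # character offset where the token begins
    push @ends,   $+[1];    # character offset just past the token's end
}

# Each (start, end) pair should recover the token text via substr.
my @texts = map { substr( $string, $starts[$_], $ends[$_] - $starts[$_] ) }
    0 .. $#starts;

# Byte offsets would differ: the UTF-8 encoding is longer than the string.
my $char_length = length($string);
my $byte_length = length( encode( 'UTF-8', $string ) );

# With KinoSearch available, the arrays would then be passed as:
#   $batch->add_many_tokens( $string, \@starts, \@ends );
```

Because the offsets are taken from the decoded string, they line up with what substr sees. Offsets computed against the encoded bytes would drift after the first multi-byte character.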
    while ( my $token = $batch->next ) {
        # ...
    }
Return the next token in the TokenBatch, or undef if out of tokens.
$batch->reset;
Reset the TokenBatch's iterator, so that the next call to next() returns the first Token in the batch.
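The next/reset contract described above can be modelled in a few lines of plain Perl. MiniBatch below is a hypothetical stand-in written for this illustration. It is not KinoSearch's implementation and it holds plain strings rather than Token objects. It only mimics the documented behaviour: next() returns undef once the elements run out, and reset() rewinds to the first element.

```perl
use strict;
use warnings;

# MiniBatch is a hypothetical stand-in, NOT part of KinoSearch.
package MiniBatch;

sub new {
    my ( $class, @texts ) = @_;
    return bless { tokens => [@texts], tick => 0 }, $class;
}

# Return the next element, or undef once the batch is exhausted.
sub next {
    my $self = shift;
    return undef if $self->{tick} >= @{ $self->{tokens} };
    return $self->{tokens}[ $self->{tick}++ ];
}

# Rewind so that the following call to next() yields the first element.
sub reset {
    my $self = shift;
    $self->{tick} = 0;
    return;
}

package main;

my $batch = MiniBatch->new(qw( key lime pie ));

# First pass consumes every element.
my @first_pass;
while ( defined( my $text = $batch->next ) ) {
    push @first_pass, $text;
}

# The iterator stays exhausted until it is reset.
my $after_end = $batch->next;

# After reset, a second pass sees the same elements again.
$batch->reset;
my @second_pass;
while ( defined( my $text = $batch->next ) ) {
    push @second_pass, $text;
}
```

The same pattern applies when a batch has to be traversed more than once: call reset between passes, since a finished iterator keeps returning undef.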
Copyright 2005-2007 Marvin Humphrey
See KinoSearch version 0.20.