Lingua::EN::Tokenizer::Offsets - Finds word (token) boundaries, and returns their offsets.
version 0.01_03
    use Lingua::EN::Tokenizer::Offsets qw/token_offsets get_tokens/;

    my $str = <<END;
    Hey! Mr. Tambourine Man, play a song for me.
    I'm not sleepy and there is no place I’m going to.
    END

    my $offsets = token_offsets($str);     ## Get the offsets.
    foreach my $o (@$offsets) {
        my $start  = $o->[0];
        my $length = $o->[1] - $o->[0];

        my $token = substr($str, $start, $length);  ## Get a token.
        # ...
    }

    ### or

    my $tokens = get_tokens($str);
    foreach my $token (@$tokens) {
        ## do something with $token
    }
Returns a tokenized version of $text (space-separated tokens).
$text can be a scalar or a scalar reference.
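A minimal sketch of how a function might accept either form of input. The helper name `text_length` is hypothetical and is not part of this module; it only illustrates normalizing the argument to a scalar reference before working on it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): accept a plain scalar or a
# reference to one, and work on a reference internally to avoid copying.
sub text_length {
    my ($text) = @_;
    my $ref = ref($text) eq 'SCALAR' ? $text : \$text;
    return length $$ref;
}

my $str = "play a song for me.";
print text_length($str),  "\n";   # plain scalar
print text_length(\$str), "\n";   # scalar reference, same result
```

Passing a reference can be worthwhile for large texts, since the string itself is not copied into the function's argument variable.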
Returns a reference to an array containing pairs of character offsets, corresponding to the start and end positions of tokens from $text.
Splits $text into tokens, returning an array reference.
Makes minor adjustments to the offsets (leading/trailing whitespace, etc.).
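One kind of adjustment, trimming whitespace from the edges of each span, could look roughly like the sketch below. The function name `trim_offsets` and its logic are assumptions made for illustration; the module's own adjustment rules are not documented here and may differ.

```perl
use strict;
use warnings;

# Illustrative only: shrink each [start, end) pair so that it no longer
# covers leading or trailing whitespace. Pairs that become empty are dropped.
sub trim_offsets {
    my ($text, $offsets) = @_;
    my @adjusted;
    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        $start++ while $start < $end && substr($text, $start, 1) =~ /\s/;
        $end--   while $end > $start && substr($text, $end - 1, 1) =~ /\s/;
        push @adjusted, [ $start, $end ] if $end > $start;
    }
    return \@adjusted;
}

my $adjusted = trim_offsets(" Hey ! ", [ [0, 5], [5, 7] ]);
print join(' ', map { sprintf '[%d,%d]', @$_ } @$adjusted), "\n";
# prints "[1,4] [5,6]"
```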
Performs a first, naive delimitation of tokens.
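As a rough illustration of what a naive first pass can look like, the sketch below treats every run of non-whitespace characters as a token and records its offsets. This is an assumption about what "naive" means, not the module's actual algorithm, and `naive_offsets` is a hypothetical name.

```perl
use strict;
use warnings;

# Illustrative only: whitespace-delimited runs as a first approximation.
# $-[0] and $+[0] hold the start and end offsets of the last match.
sub naive_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\S+/g) {
        push @offsets, [ $-[0], $+[0] ];
    }
    return \@offsets;
}

my $offsets = naive_offsets("play a song  for me.");
print join(' ', map { sprintf '[%d,%d]', @$_ } @$offsets), "\n";
# prints "[0,4] [5,6] [7,11] [13,16] [17,20]"
```

Note that this pass leaves punctuation attached to the preceding word ("me." is one span); a real tokenizer needs later refinement steps to split such cases.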
Given a list of token boundary offsets and a text, returns an array with the text split into tokens.
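The conversion from offsets to tokens follows the same start/end convention used in the synopsis (length is end minus start). A minimal sketch, using a hypothetical helper named `tokens_from_offsets` and hand-written offsets chosen for illustration rather than produced by the module:

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's own code): extract the substring
# delimited by each [start, end) pair.
sub tokens_from_offsets {
    my ($text, $offsets) = @_;
    return [ map { substr($text, $_->[0], $_->[1] - $_->[0]) } @$offsets ];
}

my $text    = "Hey! Mr. Tambourine Man";
# Hand-picked offsets for illustration; the module may split differently.
my $offsets = [ [0, 3], [3, 4], [5, 8], [9, 19], [20, 23] ];
my $tokens  = tokens_from_offsets($text, $offsets);
print join('|', @$tokens), "\n";
# prints "Hey|!|Mr.|Tambourine|Man"
```

Keeping offsets rather than the token strings alone makes it possible to map each token back to its exact position in the original text.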
Based on the original tokenizer written by Josh Schroeder and provided by Europarl http://www.statmt.org/europarl/.
Lingua::EN::Sentence::Offsets, Lingua::FreeLing3::Tokenizer
André Santos <andrefs@cpan.org>
This software is copyright (c) 2012 by Andre Santos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install Lingua::EN::Tokenizer::Offsets, copy and paste the appropriate command into your terminal.
cpanm
    cpanm Lingua::EN::Tokenizer::Offsets
CPAN shell
    perl -MCPAN -e shell
    install Lingua::EN::Tokenizer::Offsets