Lingua::EN::Splitter - Split text into words, paragraphs, segments, and tiles
    use Lingua::EN::Splitter qw(words paragraphs paragraph_breaks
                                segment_breaks tiles set_tokens_per_tile);

    my $text = <<EOT;
    Lingua::EN::Splitter is a useful module that allows text to be split up
    into words, paragraphs, segments, and tiles.

    Paragraphs are by default indicated by blank lines. Known segment breaks are
    indicated by a line with only the word "segment_break" in it.

    segment_break

    This module does not make any attempt to guess segment boundaries. For that,
    see L<Lingua::EN::Segmenter::TextTiling>.

    EOT

    # Set the number of tokens per tile to 20 (the default)
    set_tokens_per_tile(20);

    my @words            = words $text;
    my @paragraphs       = paragraphs $text;
    my @paragraph_breaks = paragraph_breaks $text;
    my @segment_breaks   = segment_breaks $text;
    my @tiles            = tiles words $text;

    print "@words[0..3,5]";     # Prints "lingua en splitter is useful"
    print "@words[43..46,53]";  # Prints "this module does not guess"
    print $paragraphs[2];       # Prints the third paragraph of the above text
    print $paragraph_breaks[2]; # Prints which tile the 3rd paragraph starts on
    print $segment_breaks[1];   # Prints which tile the 2nd segment starts on
    print $tiles[1];            # Prints @words[20..39] filtered for stopwords
                                # and stemmed

    # This module can also be used in an object-oriented fashion
    my $splitter = Lingua::EN::Splitter->new;
    @words = $splitter->words($text);
See synopsis.
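To make the terms in the synopsis concrete, here is a minimal pure-Perl sketch of the three ideas it relies on: lowercased word tokens, paragraphs separated by blank lines, and tiles as fixed-size runs of tokens. It does not use or reproduce Lingua::EN::Splitter itself. The names sketch_words, sketch_paragraphs, and sketch_tiles are hypothetical, the tokenizing regex is an assumption, and the sketch skips the stopword filtering and stemming that the synopsis says the module applies to tiles.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; none of these are part of
# Lingua::EN::Splitter, and the module's real rules may differ.

# Lowercase every run of letters, digits, underscores, or apostrophes.
sub sketch_words {
    my ($text) = @_;
    return map { lc($_) } $text =~ /([A-Za-z0-9_']+)/g;
}

# Treat one or more blank lines as a paragraph separator.
sub sketch_paragraphs {
    my ($text) = @_;
    return grep { /\S/ } split /\n\s*\n/, $text;
}

# Group tokens into consecutive tiles of at most $size tokens each.
sub sketch_tiles {
    my ($size, @tokens) = @_;
    my @tiles;
    while (@tokens) {
        push @tiles, [ splice @tokens, 0, $size ];
    }
    return @tiles;
}

my $text = "One two three.\n\nFour five.\n\nSix seven eight nine.\n";

my @words      = sketch_words($text);
my @paragraphs = sketch_paragraphs($text);
my @tiles      = sketch_tiles(4, @words);

# Prints "9 words, 3 paragraphs, 3 tiles"
print scalar(@words), " words, ", scalar(@paragraphs),
      " paragraphs, ", scalar(@tiles), " tiles\n";

# Prints "five six seven eight"
print "@{ $tiles[1] }\n";
```

With a tile size of 20, the second tile in this sketch would cover tokens 20 to 39, which matches the range the synopsis gives for $tiles[1] before filtering.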
This module can be used in an object-oriented fashion or the routines can be exported.
David James <splice@cpan.org>
Lingua::EN::Segmenter::TextTiling, Class::Exporter, http://www.cs.toronto.edu/~james