David James


Lingua::EN::Splitter - Split text into words, paragraphs, segments, and tiles


  use Lingua::EN::Splitter qw(words paragraphs paragraph_breaks 
                              segment_breaks tiles set_tokens_per_tile);
  my $text = <<EOT;
  Lingua::EN::Splitter is a useful module that allows text to be split up
  into words, paragraphs, segments, and tiles.

  Paragraphs are by default indicated by blank lines. Known segment breaks are
  indicated by a line with only the word "segment_break" in it.

  segment_break

  This module does not make any attempt to guess segment boundaries. For that,
  see L<Lingua::EN::Segmenter::TextTiling>.
  EOT

  # Set the number of tokens per tile to 20 (the default)
  set_tokens_per_tile(20);

  my @words = words $text;
  my @paragraphs = paragraphs $text;
  my @paragraph_breaks = paragraph_breaks $text;
  my @segment_breaks = segment_breaks $text;
  my @tiles = tiles words $text;
  print "@words[0..3,5]";     # Prints "lingua en splitter is useful"
  print "@words[43..46,53]";  # Prints "this module does not guess"
  print $paragraphs[2];       # Prints the third paragraph of the above text
  print $paragraph_breaks[2]; # Prints which tile the 3rd paragraph starts on
  print $segment_breaks[1];   # Prints which tile the 2nd segment starts on
  print $tiles[1];            # Prints @words[20..39] filtered for stopwords 
                              # and stemmed

  # This module can also be used in an object-oriented fashion
  my $splitter = Lingua::EN::Splitter->new;
  @words = $splitter->words($text);


See synopsis.

This module can be used in an object-oriented fashion, or its routines can be exported.
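
As a rough illustration of the ideas in the synopsis (paragraphs separated by blank lines, words grouped into fixed-size tiles), here is a minimal plain-Perl sketch. It does not use or reproduce Lingua::EN::Splitter itself: the helper names split_paragraphs, split_words, and group_tiles are hypothetical, and the sketch omits the stopword filtering and stemming that the module applies to tiles.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; these are NOT the
# Lingua::EN::Splitter implementation.

# Split text into paragraphs on one or more blank lines.
sub split_paragraphs {
    my ($text) = @_;
    return grep { /\S/ } split /\n\s*\n/, $text;
}

# Lowercase each run of letters, digits, and underscores.
sub split_words {
    my ($text) = @_;
    return map { lc $_ } ($text =~ /(\w+)/g);
}

# Group a list of words into consecutive tiles of $size tokens each.
# The last tile may be shorter than $size.
sub group_tiles {
    my ($size, @words) = @_;
    my @tiles;
    push @tiles, [ splice @words, 0, $size ] while @words;
    return @tiles;
}

my $sample     = "First paragraph here.\n\nSecond paragraph follows it.\n";
my @paragraphs = split_paragraphs($sample);
my @words      = split_words($sample);
my @tiles      = group_tiles(3, @words);   # tile size 3 chosen for brevity

print scalar(@paragraphs), " paragraphs\n";   # Prints "2 paragraphs"
print "@{ $tiles[0] }\n";                     # Prints "first paragraph here"
```

Each tile here is an array reference holding the words it covers. The return type of the module's own tiles routine is not stated in this document, so treat that representation as an assumption of the sketch.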


David James <splice@cpan.org>


Lingua::EN::Segmenter::TextTiling, Class::Exporter, http://www.cs.toronto.edu/~james
