Text::Summarize::En - Routine to summarize English text.


  use strict;
  use warnings;
  use Text::Summarize::En;
  use Data::Dump qw(dump);
  my $summarizerEn = Text::Summarize::En->new();
  my $text         = 'All people are equal. All men are equal. All are equal.';
  dump $summarizerEn->getSummaryUsingSumbasic(listOfText => [$text]);


Text::Summarize::En contains routines that rank the sentences of English text for inclusion in a summary using the sumBasic algorithm.



The method new creates an instance of the Text::Summarize::En class with the following parameters:

 endingSentenceTag => 'PP'

endingSentenceTag is the part-of-speech tag that should be used to indicate the end of a sentence. The default is 'PP'. The value of this tag must be a tag generated by the module Lingua::EN::Tagger.

 listOfPOSTypesToKeep => [qw(CONTENT_WORDS)]

The sumBasic algorithm preprocesses the text so that only certain parts-of-speech (POS) are retained and used to rank the sentences. The module Lingua::EN::Tagger is used to tag the parts-of-speech of the text. The parts-of-speech retained can be specified by word type, where the types are any combination of 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', or 'VERBS'. The default is [qw(CONTENT_WORDS)], which is equivalent to [qw(CONTENT_ADVERBS VERBS ADJECTIVES NOUNS)].

 listOfPOSTagsToKeep => [...]

listOfPOSTagsToKeep provides finer control over the parts-of-speech to be retained when filtering the tagged text. For a list of all possible tags, call getListOfPartOfSpeechTags().



getSummaryUsingSumbasic computes the summary of text using the sumBasic algorithm.

 listOfStemmedTaggedSentences => [...]

listOfStemmedTaggedSentences is an array reference containing the list of stemmed and part-of-speech tagged sentences from Text::StemTagPos. If listOfStemmedTaggedSentences is not defined, then the text to be processed should be provided via listOfText.

 listOfText => [...]

listOfText is an array reference containing the strings of text to be summarized. listOfText is only used if listOfStemmedTaggedSentences is undefined.

 tokenWeight => {}

tokenWeight is an optional hash reference that provides the weights for the tokens supplied by listOfStemmedTaggedSentences or listOfText. If tokenWeight is not defined, then the weight of a token is just its frequency of occurrence in the filtered text. If textRankParameters is defined, then the token weights are computed using Text::Categorize::Textrank.
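To make the role of the token weights concrete, the following is a minimal, self-contained sketch of a simplified sumBasic ranking. It is not the module's implementation: the function name rank_sentences_sumbasic and its arguments are hypothetical, the sentences are assumed to be already tokenized and filtered, and the selection rule (take the sentence with the highest average token weight, then square the weights of its tokens) is the commonly described simplification of the published algorithm.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize::En.
# $sentences    : array ref of array refs of already filtered tokens.
# $token_weight : optional hash ref of token => weight.
# Returns an array ref of sentence indices, best first.
sub rank_sentences_sumbasic {
    my ($sentences, $token_weight) = @_;

    # Default weights: frequency of each token in the filtered text.
    my %weight;
    if ($token_weight) {
        %weight = %$token_weight;
    }
    else {
        $weight{$_}++ for map { @$_ } @$sentences;
    }

    # Normalize the weights so they behave like probabilities.
    my $total = 0;
    $total += $_ for values %weight;
    return [] unless $total > 0;
    $_ /= $total for values %weight;

    # Only non-empty sentences can be ranked.
    my %remaining = map { $_ => 1 }
        grep { scalar @{ $sentences->[$_] } } 0 .. $#$sentences;

    my @order;
    while (%remaining) {
        # Score each unselected sentence by its average token weight.
        my ($best, $best_score);
        for my $i (sort { $a <=> $b } keys %remaining) {
            my $tokens = $sentences->[$i];
            my $score  = 0;
            $score += ($weight{$_} // 0) for @$tokens;
            $score /= @$tokens;
            ($best, $best_score) = ($i, $score)
                if !defined $best_score || $score > $best_score;
        }
        push @order, $best;
        delete $remaining{$best};

        # Square the weight of each token just covered (once per token),
        # so that later picks favor content not yet in the summary.
        my %seen;
        $weight{$_} = $weight{$_}**2
            for grep { exists $weight{$_} && !$seen{$_}++ }
            @{ $sentences->[$best] };
    }
    return \@order;
}

# Usage with the sentences from the synopsis, lowercased and unfiltered.
my $order = rank_sentences_sumbasic([
    [qw(all people are equal)],
    [qw(all men are equal)],
    [qw(all are equal)],
]);
print "@$order\n";    # prints "2 0 1"
```

The third sentence ranks first because it consists only of the three most frequent tokens; once their weights are squared, the remaining two sentences tie and are taken in their original order.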

 textRankParameters => undef

If textRankParameters is defined, then the token weights for the sumBasic algorithm are computed using Text::Categorize::Textrank. The parameters to pass to Text::Categorize::Textrank, excluding its listOfTokens parameter, can be set using the hash reference given by textRankParameters. For example, textRankParameters => {directedGraph => 1} causes the textrank weights to be computed using a directed token graph.
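As an illustration of how the parameters described above might be combined, the sketch below builds a summarizer that keeps only nouns and verbs and requests textrank-based token weights on a directed graph. It uses only the parameter names documented here; the choice of [qw(NOUNS VERBS)] is an arbitrary example, and the structure of the returned value is not described in this document, so it is simply dumped for inspection.

```perl
use strict;
use warnings;
use Text::Summarize::En;
use Data::Dump qw(dump);

# Keep only nouns and verbs when filtering the tagged text
# (an arbitrary choice for illustration; the default is CONTENT_WORDS).
my $summarizerEn = Text::Summarize::En->new(
    endingSentenceTag    => 'PP',
    listOfPOSTypesToKeep => [qw(NOUNS VERBS)],
);

my $text = 'All people are equal. All men are equal. All are equal.';

# Request token weights from Text::Categorize::Textrank,
# computed on a directed token graph.
my $summary = $summarizerEn->getSummaryUsingSumbasic(
    listOfText         => [$text],
    textRankParameters => { directedGraph => 1 },
);

# The result's structure is not documented here, so just print it.
dump $summary;
```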


Use CPAN to install the module and all its prerequisites:

  perl -MCPAN -e shell
  >install Text::Summarize


The SumBasic algorithm for ranking sentences is from "Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion" by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.