Jeff Kubina
and 1 contributors

NAME

Text::Summarize - Routine to compute summaries of text.

SYNOPSIS

  use strict;
  use warnings;
  use Text::Summarize;
  use Data::Dump qw(dump);
  my $listOfSentences = [
    { id => 0, listOfTokens => [qw(all people are equal)] },
    { id => 1, listOfTokens => [qw(all men are equal)] },
    { id => 2, listOfTokens => [qw(all are equal)] },
  ];
  dump getSumbasicRankingOfSentences(listOfSentences => $listOfSentences);

DESCRIPTION

Text::Summarize contains a routine to score a list of sentences for inclusion in a summary of the text using the SumBasic algorithm from the report "Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion" by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.

ROUTINES

getSumbasicRankingOfSentences

  use Text::Summarize;
  use Data::Dump qw(dump);
  my $listOfSentences = [
    { id => 0, listOfTokens => [qw(all people are equal)] },
    { id => 1, listOfTokens => [qw(all men are equal)] },
    { id => 2, listOfTokens => [qw(all are equal)] },
  ];
  dump getSumbasicRankingOfSentences(listOfSentences => $listOfSentences);

getSumbasicRankingOfSentences computes the SumBasic score of each sentence in the list provided. It returns an array reference containing the pairs [id, score] sorted in descending order of score, where id is the identifier given in listOfSentences.
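To make the ranking concrete, the following is a minimal sketch of a SumBasic-style greedy ranking; it is not the module's actual implementation. It assumes that token weights start as relative frequencies, that a sentence's score is the average weight of its tokens, and that the weights of the tokens in each selected sentence are squared before the next selection. The name sketchSumbasicRanking is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize: a simplified
# SumBasic-style ranking that returns [id, score] pairs.
sub sketchSumbasicRanking {
    my ($listOfSentences) = @_;

    # Assumption: initial token weight = relative frequency over all sentences.
    my %weight;
    my $total = 0;
    for my $sentence (@$listOfSentences) {
        for my $token (@{ $sentence->{listOfTokens} }) {
            $weight{$token}++;
            $total++;
        }
    }
    $_ /= $total for values %weight;

    my %remaining = map { $_->{id} => $_->{listOfTokens} } @$listOfSentences;
    my @ranking;
    while (%remaining) {
        # Pick the remaining sentence with the highest average token weight.
        my ($bestId, $bestScore);
        for my $id (sort keys %remaining) {
            my $tokens = $remaining{$id};
            my $score  = 0;
            $score += $weight{$_} for @$tokens;
            $score /= @$tokens if @$tokens;
            ($bestId, $bestScore) = ($id, $score)
                if !defined $bestScore || $score > $bestScore;
        }
        push @ranking, [$bestId, $bestScore];

        # Square the weight of each distinct token in the selected sentence.
        my %seen;
        $weight{$_} **= 2 for grep { !$seen{$_}++ } @{ $remaining{$bestId} };
        delete $remaining{$bestId};
    }
    return \@ranking;
}

my $ranking = sketchSumbasicRanking([
    { id => 0, listOfTokens => [qw(all people are equal)] },
    { id => 1, listOfTokens => [qw(all men are equal)] },
    { id => 2, listOfTokens => [qw(all are equal)] },
]);
printf "%s\t%.4f\n", @$_ for @$ranking;
```

Under these assumptions sentence 2 is selected first, because each of its tokens occurs in all three sentences and so its average weight is the highest.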

listOfSentences
 listOfSentences => [{id => '..', listOfTokens => [...]}, ..., {id => '..', listOfTokens => [...]}]

listOfSentences holds the list of sentences that are to be scored. Each item in the list is a hash reference of the form {id => '..', listOfTokens => [...]} where id is a unique identifier for the sentence and listOfTokens is an array reference of the list of tokens comprising the sentence.
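As an illustration of the expected structure, here is a rough sketch that builds such a list from raw text. The helper name buildListOfSentences, the sentence-splitting rule, and the lowercase word tokenizer are all assumptions chosen for brevity; real text will usually need more careful tokenization.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize: split text into
# sentences naively and turn each into a {id, listOfTokens} hash reference.
sub buildListOfSentences {
    my ($text) = @_;
    my @listOfSentences;

    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    for my $sentence (split /(?<=[.!?])\s+/, $text) {
        # Assumption: tokens are lowercased runs of word characters.
        my @tokens = map { lc $_ } $sentence =~ /(\w+)/g;
        next unless @tokens;

        # The number of sentences stored so far serves as a unique id.
        push @listOfSentences,
            { id => scalar(@listOfSentences), listOfTokens => \@tokens };
    }
    return \@listOfSentences;
}

my $listOfSentences =
    buildListOfSentences('All people are equal. All men are equal! All are equal?');
```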

tokenWeight
 tokenWeight => {}

tokenWeight is an optional hash reference that provides the weight of the tokens defined in listOfSentences. If tokenWeight is defined, but undefined for a token in a sentence, then the token's weight defaults to zero unless ignoreUndefinedTokens is true, in which case the token is ignored and not used to compute the average weight of the sentences containing it. If tokenWeight is undefined then the weights of the tokens are either their frequency of occurrence in the filtered text, or their textranks if textRankParameters is defined.
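One way to prepare such a hash is sketched below. The weighting scheme (raw counts) and the stop list are assumptions made purely for illustration; any numeric weights keyed by token would fit the description above.

```perl
use strict;
use warnings;

my $listOfSentences = [
    { id => 0, listOfTokens => [qw(all people are equal)] },
    { id => 1, listOfTokens => [qw(all men are equal)] },
    { id => 2, listOfTokens => [qw(all are equal)] },
];

# Hypothetical stop list: these tokens get no entry in the weight hash,
# so tokenWeight is undefined for them.
my %isStopWord = map { $_ => 1 } qw(all are);

# Assumption: weight every other token by its raw count in the text.
my %tokenWeight;
for my $sentence (@$listOfSentences) {
    for my $token (@{ $sentence->{listOfTokens} }) {
        next if $isStopWord{$token};
        $tokenWeight{$token}++;
    }
}
```

The result would be passed as tokenWeight => \%tokenWeight. The stop words, having no entry, are then either scored as zero or skipped, depending on ignoreUndefinedTokens.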

ignoreUndefinedTokens
 ignoreUndefinedTokens => 0

If ignoreUndefinedTokens is true, then any tokens for which tokenWeight is undefined are ignored and not used to compute the average weight of a sentence; the default is false.
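The difference between the two settings can be sketched as follows. The helper sketchAverageWeight is hypothetical and only mirrors the behaviour described above; returning zero for a sentence with no usable tokens is an additional assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize: average token weight of
# one sentence under the two documented treatments of unweighted tokens.
sub sketchAverageWeight {
    my ($listOfTokens, $tokenWeight, $ignoreUndefinedTokens) = @_;
    my ($sum, $count) = (0, 0);
    for my $token (@$listOfTokens) {
        my $weight = $tokenWeight->{$token};
        if (!defined $weight) {
            next if $ignoreUndefinedTokens;    # token is left out of the average
            $weight = 0;                       # token counts with weight zero
        }
        $sum += $weight;
        $count++;
    }
    # Assumption: a sentence with no usable tokens scores zero.
    return $count ? $sum / $count : 0;
}

my $tokenWeight = { people => 2, equal => 4 };
my $tokens      = [qw(all people are equal)];
print sketchAverageWeight($tokens, $tokenWeight, 0), "\n";    # (0 + 2 + 0 + 4) / 4
print sketchAverageWeight($tokens, $tokenWeight, 1), "\n";    # (2 + 4) / 2
```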

tokenWeightUpdateFunction
 tokenWeightUpdateFunction => \&subroutine (currentTokenWeight, initialTokenWeight, token, selectedSentenceId, selectedSentenceWeight)

tokenWeightUpdateFunction is an optional parameter for defining the function that updates the weight of a token when it is contained in a selected sentence. Five parameters are passed to the subroutine: the token's current weight (float), the token's initial weight (float), the token (string), the id of the selected sentence (string), and the current average weight of the tokens in the selected sentence (float). The default is tokenWeightUpdateFunction_Squared.
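A custom update function might look like the sketch below. The name halveTokenWeight and its damping rule are hypothetical; the sketch also assumes that the subroutine's return value is used as the token's new weight, as the descriptions of the built-in update functions suggest.

```perl
use strict;
use warnings;

# Hypothetical custom update function: halve the token's current weight,
# but never let it fall below one percent of its initial weight.
sub halveTokenWeight {
    my ($currentTokenWeight, $initialTokenWeight, $token,
        $selectedSentenceId, $selectedSentenceWeight) = @_;
    my $floor  = 0.01 * $initialTokenWeight;
    my $halved = 0.5 * $currentTokenWeight;
    return $halved > $floor ? $halved : $floor;
}

# Called directly here with made-up values to show the argument order.
print halveTokenWeight(0.4, 0.4, 'equal', 2, 0.3), "\n";
```

It would be supplied to the routine as tokenWeightUpdateFunction => \&halveTokenWeight.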

textRankParameters
  textRankParameters => undef

If textRankParameters is defined, then the token weights are computed using Text::Categorize::Textrank. The parameters to use for Text::Categorize::Textrank, excluding the listOfTokens parameters, can be set using the hash reference defined by textRankParameters. For example, textRankParameters => {directedGraph => 1} would make the textrank weights be computed using a directed token graph.

tokenWeightUpdateFunction_Squared

Returns the token's current weight squared.

tokenWeightUpdateFunction_Multiplicative

Returns the token's current weight times its initial weight.

tokenWeightUpdateFunction_Sentence

Returns the token's current weight times the average weight of the tokens in the selected sentence.
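Read literally, the three descriptions above amount to the following formulas. These are sketches with hypothetical names, written from the descriptions rather than taken from the module's source.

```perl
use strict;
use warnings;

# Each sketch receives the five documented arguments:
# (currentTokenWeight, initialTokenWeight, token,
#  selectedSentenceId, selectedSentenceWeight)

# Current weight squared.
sub sketchUpdateSquared {
    my ($current) = @_;
    return $current * $current;
}

# Current weight times initial weight.
sub sketchUpdateMultiplicative {
    my ($current, $initial) = @_;
    return $current * $initial;
}

# Current weight times the average token weight of the selected sentence.
sub sketchUpdateSentence {
    my ($current, $sentenceWeight) = @_[0, 4];
    return $current * $sentenceWeight;
}
```

When the weights lie between 0 and 1, all three rules lower the weight of a token once a sentence containing it has been selected, which makes later selections favor sentences that use other tokens.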

INSTALLATION

Use CPAN to install the module and all its prerequisites:

  perl -MCPAN -e shell
  >install Text::Summarize

BUGS

Please email bug reports or feature requests to bug-text-summarize@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Summarize. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.

AUTHOR

 Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

information processing, summary, summaries, summarization, summarize, sumbasic, textrank

SEE ALSO

Log::Log4perl, Text::Categorize::Textrank, Text::Summarize::En

The SumBasic algorithm for ranking sentences is from "Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion" by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.