Text::Categorize::Textrank - Method to rank potential keywords of text.
  use strict;
  use warnings;
  use Text::Categorize::Textrank;
  use Data::Dump qw(dump);
  my $listOfTokens = [
    [qw(This is the first sentence)],
    [qw(Here is the second sentence)],
  ];
  my $hashOfTextrankValues = getTextrankOfListOfTokens(listOfTokens => $listOfTokens);
  dump $hashOfTextrankValues;
Text::Categorize::Textrank provides a routine for ranking the words in text as potential keywords. It implements a version of the textrank algorithm from the report TextRank: Bringing Order into Texts by R. Mihalcea and P. Tarau.
getTextrankOfListOfTokens
The routine getTextrankOfListOfTokens returns a hash reference containing the textrank value for all the tokens in the lists provided; the textrank values sum to one. The textrank of a token is its pagerank in the graph obtained by joining neighboring tokens with an edge, called the token graph.
Usually, listOfTokens should not contain all the words of the text. The complete textrank algorithm first filters the words in the text, keeping only the nouns and adjectives. See Text::Categorize::Textrank::En to compute the textrank of English text.
listOfTokens
listOfTokens => [[...], [...], ...[...]]
listOfTokens is an array reference containing the list of tokens that are to be ranked using textrank. Each list is also an array reference of tokens that should correspond to the list of tokens in a sentence. For example, [[qw(This is the first sentence)], [qw(Here is the second sentence)]].
edgeCreationSpan
edgeCreationSpan => 1
For each token in the listOfTokens, edgeCreationSpan is the number of successive tokens used to make an edge in the token graph. For example, if edgeCreationSpan is two, then given the token sequence "apple orange pear" the edges [apple, orange] and [apple, pear] will be added to the token graph for the token apple. The default is one.
Note that loop edges are ignored. For example, if edgeCreationSpan is two, then given the token sequence "daba daba doo" the edge [daba, daba] is discarded but the edge [daba, doo] is added to the token graph.
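The edge rule for edgeCreationSpan can be sketched in plain Perl. This is only an illustration of the rule described above, not the module's code: the helper edgesForTokens is a hypothetical name, and it simply lists every pair it finds, so a repeated pair appears more than once (how the module treats repeated edges is not stated here).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Categorize::Textrank): returns the
# edges produced for one list of tokens with the given span.
sub edgesForTokens {
    my ($tokens, $span) = @_;
    my @edges;
    for my $i (0 .. $#$tokens) {
        # Pair token $i with each of the next $span tokens.
        for my $j ($i + 1 .. $i + $span) {
            last if $j > $#$tokens;
            # Loop edges (a token joined to itself) are ignored.
            next if $tokens->[$i] eq $tokens->[$j];
            push @edges, [ $tokens->[$i], $tokens->[$j] ];
        }
    }
    return \@edges;
}

for my $tokens ([qw(apple orange pear)], [qw(daba daba doo)]) {
    my $edges = edgesForTokens($tokens, 2);
    my @shown = map { sprintf('[%s, %s]', @$_) } @$edges;
    print join(' ', @$tokens), ': ', join(' ', @shown), "\n";
}
```

For "apple orange pear" the sketch yields [apple, orange] and [apple, pear] for the token apple, plus [orange, pear] for the token orange.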
directedGraph
directedGraph => 0
If directedGraph is true, the textranks are computed from the directed token graph; if false, they are computed from the undirected version of the graph. The default is false.
dampeningFactor
dampeningFactor => 0.85
When computing the textranks of the token graph, the dampening factor specified by dampeningFactor will be used; it should range from zero to one. The default is 0.85.
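To show where the dampening factor enters, here is a minimal sketch of pagerank by power iteration on an undirected, unweighted token graph. It is an assumption-laden illustration, not the implementation used by the module or by Graph::Centrality::Pagerank: the helper pagerankOfEdges and its fixed iteration count are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pagerank by power iteration on an undirected,
# unweighted graph given as a list of [tokenA, tokenB] edges.
sub pagerankOfEdges {
    my ($edges, $damping, $iterations) = @_;
    my %neighbors;
    for my $edge (@$edges) {
        my ($u, $v) = @$edge;
        $neighbors{$u}{$v} = 1;
        $neighbors{$v}{$u} = 1;
    }
    my @nodes = sort keys %neighbors;
    my $n     = scalar @nodes;
    my %rank;
    $rank{$_} = 1 / $n for @nodes;    # start from the uniform distribution
    for (1 .. $iterations) {
        my %next;
        # Every node receives the same "teleport" share, (1 - damping) / n.
        $next{$_} = (1 - $damping) / $n for @nodes;
        for my $u (@nodes) {
            my @out = keys %{ $neighbors{$u} };
            # The damped rank of $u is split evenly among its neighbors.
            my $share = $damping * $rank{$u} / scalar @out;
            $next{$_} += $share for @out;
        }
        %rank = %next;
    }
    return \%rank;
}

my $rank = pagerankOfEdges([ [qw(this is)], [qw(is the)] ], 0.85, 50);
printf "%s %.4f\n", $_, $rank->{$_} for sort keys %$rank;
```

Because every node in the sketch has at least one neighbor, the ranks keep summing to one after each iteration, which matches the statement that the textrank values sum to one.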
addEdgesSpanningLists
addEdgesSpanningLists => 1
If addEdgesSpanningLists is true, then when building the token graph, links between the tokens at the end of a list and the beginning of the next list will be made. For example, for the lists [[qw(This is the first list)], [qw(Here is the second list)]] the edge [list, Here] will be added to the token graph. The default is true.
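The effect of addEdgesSpanningLists can be sketched as follows, assuming an edgeCreationSpan of one. The helper adjacentEdges is hypothetical and only mirrors the description above: when the flag is true, the lists are treated as one continuous sequence, so the last token of one list is joined to the first token of the next.

```perl
use strict;
use warnings;

# Hypothetical helper: edges between adjacent tokens (span of one only).
sub adjacentEdges {
    my ($listOfTokens, $addEdgesSpanningLists) = @_;
    # When spanning lists, treat all the lists as one sequence of tokens.
    my @sequences = $addEdgesSpanningLists
        ? ([ map { @$_ } @$listOfTokens ])
        : @$listOfTokens;
    my @edges;
    for my $tokens (@sequences) {
        for my $i (0 .. $#$tokens - 1) {
            next if $tokens->[$i] eq $tokens->[$i + 1];    # skip loop edges
            push @edges, [ $tokens->[$i], $tokens->[$i + 1] ];
        }
    }
    return \@edges;
}

my $lists = [ [qw(This is the first list)], [qw(Here is the second list)] ];
for my $flag (1, 0) {
    my $edges = adjacentEdges($lists, $flag);
    my $hasSpanningEdge = grep { $_->[0] eq 'list' && $_->[1] eq 'Here' } @$edges;
    printf "addEdgesSpanningLists => %d: %d edges, [list, Here] %s\n",
        $flag, scalar @$edges, $hasSpanningEdge ? 'present' : 'absent';
}
```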
tokenWeights
tokenWeights => {}
tokenWeights is an optional hash reference that can provide a weight for a subset of the tokens provided by listOfTokens. If tokenWeights defines no weight for any token in listOfTokens, then each token has a weight of one. If tokenWeights defines a weight for at least one token in the graph, then any token without a defined weight defaults to a weight of zero.
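The defaulting rule for tokenWeights can be sketched as below. The helper resolveTokenWeights is a hypothetical name that only mirrors the rule as described; how the module then uses the weights when computing textranks is not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the effective weight of every token.
sub resolveTokenWeights {
    my ($listOfTokens, $tokenWeights) = @_;
    $tokenWeights ||= {};
    my %weights;
    $weights{$_} = undef for map { @$_ } @$listOfTokens;
    # Tokens without a weight default to one only when no token has a weight.
    my $anyDefined = grep { defined $tokenWeights->{$_} } keys %weights;
    my $default    = $anyDefined ? 0 : 1;
    for my $token (keys %weights) {
        $weights{$token} = defined $tokenWeights->{$token}
            ? $tokenWeights->{$token}
            : $default;
    }
    return \%weights;
}

my $lists = [ [qw(This is the first sentence)], [qw(Here is the second sentence)] ];
for my $given ({}, { sentence => 2 }) {
    my $weights = resolveTokenWeights($lists, $given);
    print join(', ', map { "$_ => $weights->{$_}" } sort keys %$weights), "\n";
}
```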
To install the module run the following commands:
  perl Makefile.PL
  make
  make test
  make install
If you are on a Windows machine you should use 'nmake' rather than 'make'.
Please email bug reports or feature requests to bug-text-categorize-textrank@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Categorize-Textrank. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
Jeff Kubina <jeff.kubina@gmail.com>
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
categorize, keywords, keyphrases, nlp, pagerank, textrank
Graph, Graph::Centrality::Pagerank, Log::Log4perl, Text::Categorize::Textrank::En