NAME
SYNOPSIS
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
dump $stemTagger->getStemmedAndTaggedText ($text);
DESCRIPTION
"Text::StemTagPOS" uses the modules Lingua::Stem::Snowball and
Lingua::EN::Tagger to do part-of-speech tagging and stemming of English
text. It was developed to pre-process text for the module
Text::Categorize::Textrank. Encoding of all text should be in Perl's
internal format; see Encode for converting text from various encodings to
a Perl string.
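As a minimal sketch of the conversion mentioned above, the core Encode
module can decode a byte string into a Perl character string before the
text is passed to the tagger. The sample bytes and the choice of UTF-8
as the source encoding are assumptions made for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input arrives as UTF-8 encoded bytes; "\xC3\xA9" is
# the two-byte UTF-8 sequence for the single character e-acute.
my $bytes = "caf\xC3\xA9 au lait";

# decode() converts the octets into Perl's internal character string.
my $text = decode('UTF-8', $bytes);

# The decoded string counts characters, not bytes.
printf "%d bytes, %d characters\n", length($bytes), length($text);
```

The resulting $text is the kind of string that should be handed to
"getStemmedAndTaggedText".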
CONSTRUCTOR
"new"
The method "new" creates an instance of the "Text::StemTagPOS" class
with the following parameters:
"isoLangCode"
isoLangCode => 'en'
"isoLangCode" is the ISO language code of the language that will be
tagged and stemmed by the object. It must be 'en', which is the
default; other languages may be added when POS taggers for them are
added to CPAN.
"endingSentenceTag"
endingSentenceTag => 'PP'
"endingSentenceTag" is the part-of-speech tag from
Lingua::EN::Tagger that will be used to indicate the end of a
sentence. The default is 'PP'. The value of "endingSentenceTag" must
be a tag generated by the module Lingua::EN::Tagger; see the method
"getListOfPartOfSpeechTags" for all the possible tags, which are
based on the Penn Treebank tagset.
"listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
The method "getTaggedTextToKeep" uses "listOfPOSTypesToKeep" and
"listOfPOSTagsToKeep" to build the default list of the
parts-of-speech to be retained when filtering previously tagged
text. The default list is "[qw(TEXTRANK_WORDS)]", which is all the
nouns and adjectives in the text, as used in the textrank algorithm.
Permitted types for "listOfPOSTypesToKeep" are 'ALL', 'ADJECTIVES',
'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
'TEXTRANK_WORDS', and 'VERBS'. "listOfPOSTagsToKeep" provides finer
control over the parts-of-speech to be retained. For a list of all
the possible tags see method "getListOfPartOfSpeechTags".
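The constructor parameters above might be combined as in the sketch
below. The particular values chosen are only an illustration, and the
load of the module is guarded so that the snippet still runs on a
system where Text::StemTagPOS is not installed.

```perl
use strict;
use warnings;

# Options named in the parameter list above; this particular
# combination of values is an illustrative assumption.
my %options = (
    isoLangCode          => 'en',                # the only supported language
    endingSentenceTag    => 'PP',                # the documented default
    listOfPOSTypesToKeep => [qw(NOUNS VERBS)],   # keep nouns and verbs
);

# Guarded load: fall through quietly if the module is unavailable.
if (eval { require Text::StemTagPOS; 1 }) {
    my $stemTagger = Text::StemTagPOS->new(%options);
    print "created a ", ref($stemTagger), " object\n";
}
else {
    print "Text::StemTagPOS is not installed; options were only built\n";
}
```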
METHODS
"getStemmedAndTaggedText"
getStemmedAndTaggedText (@Text, $Text, \@Text)
The method "getStemmedAndTaggedText" returns a hierarchy of array
references containing the stemmed words, the original words, their
part-of-speech tag, and their word position index within the original
text. The hierarchy is of the form
[
  [ # sentence level: first sentence.
    [ # word level: first word.
      stemmed word, original word, part-of-speech tag, word index
    ],
    [ # word level: second word.
      stemmed word, original word, part-of-speech tag, word index
    ],
    ...
  ],
  [ # sentence level: second sentence.
    [ # word level: first word.
      stemmed word, original word, part-of-speech tag, word index
    ],
    [ # word level: second word.
      stemmed word, original word, part-of-speech tag, word index
    ],
    ...
  ],
]
Its parameters are any combination of strings of text passed as
scalars, references to scalars, arrays of strings, or references to
arrays of strings. The examples below show the various ways to call
the method; note that the constants
Text::StemTagPOS::WORD_STEMMED, Text::StemTagPOS::WORD_ORIGINAL,
Text::StemTagPOS::WORD_POSTAG, and Text::StemTagPOS::WORD_INDEX are used to
access the information about each word.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemmedTaggedText;
# $stemmedTaggedText will contain the following:
# [
# [
# ["the", "The", "/DET", 0],
# ["first", "first", "/JJ", 1],
# ["sentenc", "sentence", "/NN", 2],
# [".", ".", "/PP", 3],
# ],
# [
# ["sentenc", "Sentence", "/NN", 4],
# ["number", "number", "/NN", 5],
# ["two", "two", "/CD", 6],
# [".", ".", "/PP", 7],
# ],
# ]
my $word = $stemmedTaggedText->[0][0];
print
'WORD_STEMMED: ' .
"'" . $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
'WORD_ORIGINAL: ' .
"'" . $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
'WORD_POSTAG: ' .
"'" . $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
'WORD_INDEX: ' .
$word->[Text::StemTagPOS::WORD_INDEX] . "\n";
# WORD_STEMMED: 'the'
# WORD_ORIGINAL: 'The'
# WORD_POSTAG: '/DET'
# WORD_INDEX: 0
The following example shows the various ways the text can be passed to
the method:
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is a sentence with seven words.';
dump $stemTagger->getStemmedAndTaggedText ($text,
[$text, \$text], ($text, \$text));
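The returned hierarchy can be walked with two nested loops, one over
sentences and one over word records. The sketch below works on a copy
of the sample output shown earlier rather than calling the module, and
the local constants are an assumption that simply mirrors the
documented column order; with the module loaded, the constants
Text::StemTagPOS::WORD_STEMMED and so on would be used instead.

```perl
use strict;
use warnings;

# Assumption: local stand-ins that follow the documented column order
# (stemmed word, original word, part-of-speech tag, word index).
use constant {
    WORD_STEMMED  => 0,
    WORD_ORIGINAL => 1,
    WORD_POSTAG   => 2,
    WORD_INDEX    => 3,
};

# Sample data copied from the documented output for
# 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
    [ ["the", "The", "/DET", 0], ["first", "first", "/JJ", 1],
      ["sentenc", "sentence", "/NN", 2], [".", ".", "/PP", 3] ],
    [ ["sentenc", "Sentence", "/NN", 4], ["number", "number", "/NN", 5],
      ["two", "two", "/CD", 6], [".", ".", "/PP", 7] ],
];

# Outer loop: sentences; inner loop: the word records of each sentence.
my @lines;
for my $sentence (@$stemmedTaggedText) {
    for my $word (@$sentence) {
        push @lines, sprintf '%d %s%s (stem: %s)',
            $word->[WORD_INDEX], $word->[WORD_ORIGINAL],
            $word->[WORD_POSTAG], $word->[WORD_STEMMED];
    }
}
print "$_\n" for @lines;
```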
"getTaggedTextToKeep"
getTaggedTextToKeep (stemmedTaggedText => [...],
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);
The method "getTaggedTextToKeep" returns all the array references of the
words that have a part-of-speech tag that is of a type specified by
"listOfPOSTypesToKeep" or "listOfPOSTagsToKeep". The word lists returned
have the same hierarchical sentence structure used by
"stemmedTaggedText". Note that "listOfPOSTypesToKeep" and
"listOfPOSTagsToKeep" are optional parameters; if neither is defined,
the values set when the object was instantiated are used. If either
is defined, its values override the default values.
"stemmedTaggedText"
stemmedTaggedText => [...]
"stemmedTaggedText" is the array reference returned by
"getStemmedAndTaggedText" or a previous call to
"getTaggedTextToKeep".
"listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
"listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
parts-of-speech types to be retained when filtering previously
tagged text. Permitted values for "listOfPOSTypesToKeep" are
'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
Both parameters are optional; if neither is defined, the values
set when the object was instantiated are used. If either is defined,
its values override the default values.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is the first sentence. This is the last sentence.';
my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemTagger->getTaggedTextToKeep (
stemmedTaggedText => $stemmedTaggedText);
# only the nouns and adjectives are retained by default.
# [
# [
# ["first", "first", "/JJ", 3],
# ["sentenc", "sentence", "/NN", 4],
# ],
# [
# ["last", "last", "/JJ", 9],
# ["sentenc", "sentence", "/NN", 10],
# ],
# ]
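The effect of the filtering can be sketched in plain Perl. This is not
the module's implementation: "keep_tags" is a hypothetical helper, the
data is a copy of the sample output documented for
"getStemmedAndTaggedText", and the tag strings are taken as they appear
in that output. The sketch keeps a sentence entry even when no word in
it survives; whether the module does the same is not stated here.

```perl
use strict;
use warnings;

# Sample data copied from the documented output for
# 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
    [ ["the", "The", "/DET", 0], ["first", "first", "/JJ", 1],
      ["sentenc", "sentence", "/NN", 2], [".", ".", "/PP", 3] ],
    [ ["sentenc", "Sentence", "/NN", 4], ["number", "number", "/NN", 5],
      ["two", "two", "/CD", 6], [".", ".", "/PP", 7] ],
];

# Hypothetical helper: keep only the word records whose tag (third
# column) is a key of %$keep, preserving the sentence level.
sub keep_tags {
    my ($taggedText, $keep) = @_;
    my @result;
    for my $sentence (@$taggedText) {
        push @result, [ grep { $keep->{ $_->[2] } } @$sentence ];
    }
    return \@result;
}

# Keep adjectives and singular nouns only.
my $kept = keep_tags($stemmedTaggedText, { '/JJ' => 1, '/NN' => 1 });
for my $sentence (@$kept) {
    print join(' ', map { $_->[1] } @$sentence), "\n";
}
```

Because word records are copied by reference, each retained word keeps
its original word index.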
"getWordsPhrasesInTaggedText"
getWordsPhrasesInTaggedText (stemmedTaggedText => ...,
listOfPhrasesToFind => [...], listOfPOSTypesToKeep => [...],
listOfPOSTagsToKeep => [...]);
The method "getWordsPhrasesInTaggedText" returns a reference to an array
in which each entry corresponds, in order, to a word or phrase in
"listOfPhrasesToFind". The value of each entry is a list of word indices
where the word or phrase was found. Each list contains integer pairs
of the form [first-word-index, last-word-index] where first-word-index
is the index of the first word of the phrase and last-word-index the
index of the last word. The index values are those assigned to
the stemmed and tagged words in "stemmedTaggedText".
[
  [ # first phrase locations
    [first word index, last word index],
    [first word index, last word index], ...
  ],
  [ # second phrase locations
    [first word index, last word index],
    [first word index, last word index], ...
  ],
  ...
]
"stemmedTaggedText"
stemmedTaggedText => [...]
"stemmedTaggedText" is the array reference returned by
"getStemmedAndTaggedText" or "getTaggedTextToKeep".
"listOfPhrasesToFind"
listOfPhrasesToFind => [...]
"listOfPhrasesToFind" is an array reference containing a list of
strings of text that are either single words or phrases that are to
be located in the text provided by "stemmedTaggedText". Before the
words or phrases are located they are filtered using
"listOfPOSTypesToKeep" or "listOfPOSTagsToKeep".
"listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
"listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
parts-of-speech types to be retained when filtering previously
tagged text. Permitted values for "listOfPOSTypesToKeep" are
'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
Both parameters are optional; if neither is defined, the values
set when the object was instantiated are used. If either is defined,
its values override the default values.
The code below illustrates the output format:
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is the first sentence. This is the last sentence.';
my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemmedTaggedText;
my $listOfWordsOrPhrasesToFind = ['first sentence','this is',
'third sentence', 'sentence'];
my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
listOfPOSTypesToKeep => [qw(ALL)],
stemmedTaggedText => $stemmedTaggedText,
listOfWordsOrPhrasesToFind => $listOfWordsOrPhrasesToFind);
dump $phraseLocations;
# [
# [[3, 4]], # 'first sentence'
# [[0, 1], [6, 7]], # 'this is': note period in text has index 5.
# [], # 'third sentence'
# [[4, 4], [10, 10]] # 'sentence'
# ]
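How the [first-word-index, last-word-index] pairs arise can be sketched
without the module. The helper "find_phrase" below is hypothetical and
matches lowercased original words within one sentence at a time; the
module itself may match on stemmed words, so treat this only as an
illustration of the output format.

```perl
use strict;
use warnings;

# Word records as [lowercased word, word index], grouped by sentence, for
# 'This is the first sentence. This is the last sentence.'
my @sentences = (
    [ ['this', 0], ['is', 1], ['the', 2], ['first', 3], ['sentence', 4], ['.', 5] ],
    [ ['this', 6], ['is', 7], ['the', 8], ['last', 9], ['sentence', 10], ['.', 11] ],
);

# Hypothetical helper: return [first-index, last-index] pairs for every
# occurrence of $phrase that lies within a single sentence.
sub find_phrase {
    my ($phrase, @sents) = @_;
    my @want = split ' ', lc $phrase;
    return [] unless @want;
    my @found;
    for my $sentence (@sents) {
        # Try every start position that leaves room for the whole phrase.
        for my $start (0 .. @$sentence - @want) {
            my $hit = 1;
            for my $offset (0 .. $#want) {
                if ($sentence->[$start + $offset][0] ne $want[$offset]) {
                    $hit = 0;
                    last;
                }
            }
            push @found, [ $sentence->[$start][1], $sentence->[$start + $#want][1] ]
                if $hit;
        }
    }
    return \@found;
}

my @phrases = ('first sentence', 'this is', 'third sentence', 'sentence');
my @locations = map { find_phrase($_, @sentences) } @phrases;

for my $i (0 .. $#phrases) {
    my @pairs = map { '[' . $_->[0] . ', ' . $_->[1] . ']' } @{ $locations[$i] };
    print "'$phrases[$i]': ", join(', ', @pairs), "\n";
}
```

A phrase that is never found yields an empty list, so the result always
has one entry per phrase searched for.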
"getListOfPartOfSpeechTags"
The method "getListOfPartOfSpeechTags" takes no parameters. It returns
an array reference where each item in the list is of the form "[part of
speech tag, description, examples]". It is meant for getting the
part-of-speech tags that can be used to populate "listOfPOSTagsToKeep".
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
dump $stemTagger->getListOfPartOfSpeechTags;
"getListOfStemmedWordsInText"
The method "getListOfStemmedWordsInText" returns an array reference of
the sorted stemmed words in the text given by "stemmedTaggedText".
"stemmedTaggedText"
stemmedTaggedText => [...]
"stemmedTaggedText" is the array reference returned by
"getStemmedAndTaggedText" or "getTaggedTextToKeep" of the text.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemTagger->getListOfStemmedWordsInText (stemmedTaggedText => $stemmedTaggedText);
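A sorted list of stems can be derived from the hierarchy as in the
sketch below, which works on a copy of the sample output documented for
"getStemmedAndTaggedText". It assumes that duplicate stems are
collapsed and that plain string ordering is used; neither detail is
confirmed by the text above.

```perl
use strict;
use warnings;

# Sample data copied from the documented output for
# 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
    [ ["the", "The", "/DET", 0], ["first", "first", "/JJ", 1],
      ["sentenc", "sentence", "/NN", 2], [".", ".", "/PP", 3] ],
    [ ["sentenc", "Sentence", "/NN", 4], ["number", "number", "/NN", 5],
      ["two", "two", "/CD", 6], [".", ".", "/PP", 7] ],
];

# Collect the stemmed form (first column) of every word record; the
# hash collapses duplicates such as "sentenc".
my %seen;
for my $sentence (@$stemmedTaggedText) {
    $seen{ $_->[0] } = 1 for @$sentence;
}

# Default sort compares strings, so punctuation sorts before letters.
my @stems = sort keys %seen;
print join(' ', @stems), "\n";
```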
"getListOfStemmedWordsInAllDocuments"
The method "getListOfStemmedWordsInAllDocuments" returns an array
reference to the sorted list of stemmed words common to all the
documents given by "listOfStemmedTaggedText", that is, the
intersection of their stemmed words.
"listOfStemmedTaggedText"
listOfStemmedTaggedText => [...]
"listOfStemmedTaggedText" is a list of document references returned
by "getStemmedAndTaggedText" or "getTaggedTextToKeep".
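The intersection described above can be sketched as follows. The two
documents are hypothetical and are written directly in the documented
hierarchy; the code illustrates the idea and is not the module's own
implementation.

```perl
use strict;
use warnings;

# Hypothetical documents in the documented hierarchy
# (document -> sentences -> word records).
my @documents = (
    [ [ ["first", "first", "/JJ", 0], ["sentenc", "sentence", "/NN", 1] ] ],
    [ [ ["last", "last", "/JJ", 0], ["sentenc", "sentences", "/NNS", 1] ],
      [ ["first", "first", "/JJ", 2], ["word", "word", "/NN", 3] ] ],
);

# Count in how many documents each stem occurs, at most once per document.
my %documentCount;
for my $document (@documents) {
    my %stemsInDocument;
    for my $sentence (@$document) {
        $stemsInDocument{ $_->[0] } = 1 for @$sentence;
    }
    $documentCount{$_}++ for keys %stemsInDocument;
}

# A stem belongs to the intersection when every document contains it.
my @common = grep { $documentCount{$_} == @documents } keys %documentCount;
@common = sort @common;
print join(' ', @common), "\n";
```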
INSTALLATION
To install the module run the following commands:
perl Makefile.PL
make
make test
make install
If you are on a Windows system you should use 'nmake' rather than 'make'.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself.
The full text of the license can be found in the LICENSE file included
with this module.
KEYWORDS
natural language processing, NLP, part of speech tagging, POS, stemming
SEE ALSO
Encode, perlunicode, Lingua::Stem::Snowball, Lingua::EN::Tagger,
Text::Iconv, Text::Categorize::Textrank, utf8