NAME
SYNOPSIS
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      dump $stemTagger->getStemmedAndTaggedText ($text);
DESCRIPTION
    "Text::StemTagPOS" uses the modules Lingua::Stem::Snowball and
    Lingua::EN::Tagger to do part-of-speech tagging and stemming of English
    text. It was developed to pre-process text for the module
    Text::Categorize::Textrank. All text should be encoded in Perl's
    internal format; see Encode for converting text from various encodings
    to a Perl string.
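    As a minimal sketch of the conversion mentioned above, the core Encode
    module can decode raw bytes into a Perl string before the text is
    passed on. The sample bytes are an assumption chosen for illustration
    (UTF-8 input); other encodings would need a different first argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
my $bytes = "caf\xC3\xA9 au lait";

# decode() turns the octets into a Perl character string.
my $string = decode('UTF-8', $bytes);

# The two-byte sequence \xC3\xA9 becomes the single character U+00E9.
printf "%d bytes, %d characters\n", length($bytes), length($string);
# 13 bytes, 12 characters
```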
CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::StemTagPOS" class
    with the following parameters:
    "isoLangCode"
         isoLangCode => 'en'
        "isoLangCode" is the ISO language code of the language that will be
        tagged and stemmed by the object. It must be 'en', which is the
        default; other languages may be added when POS taggers for them are
        added to CPAN.
    "endingSentenceTag"
         endingSentenceTag => 'PP'
        "endingSentenceTag" is the part-of-speech tag from
        Lingua::EN::Tagger that will be used to indicate the end of a
        sentence. The default is 'PP'. The value of "endingSentenceTag" must
        be a tag generated by the module Lingua::EN::Tagger; see the method
        "getListOfPartOfSpeechTags" for all the possible tags, which are
        based on the Penn Treebank tagset.
    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
        The method "getTaggedTextToKeep" uses "listOfPOSTypesToKeep" and
        "listOfPOSTagsToKeep" to build the default list of the
        parts-of-speech to be retained when filtering previously tagged
        text. The default list is "[qw(TEXTRANK_WORDS)]", which is all the
        nouns and adjectives in the text, as used in the textrank algorithm.
        Permitted types for "listOfPOSTypesToKeep" are 'ALL', 'ADJECTIVES',
        'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', and 'VERBS'. "listOfPOSTagsToKeep" provides finer
        control over the parts-of-speech to be retained. For a list of all
        the possible tags see method "getListOfPartOfSpeechTags".
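    The constructor parameters above might be combined as in the sketch
    below. It assumes "new" accepts them as a flat list of name/value
    pairs, in the same style the other methods are called in this
    document. The module is loaded at run time inside "eval" so the script
    still runs, and reports the problem, when Text::StemTagPOS is not
    installed.

```perl
use strict;
use warnings;

# Constructor options; every name and value comes from the list above.
my %options = (
  isoLangCode          => 'en',
  endingSentenceTag    => 'PP',
  listOfPOSTypesToKeep => [qw(NOUNS VERBS)],
);

# Load the module at run time and trap any failure in $@.
my $stemTagger = eval {
  require Text::StemTagPOS;
  Text::StemTagPOS->new(%options);
};

print defined $stemTagger
  ? "created a " . ref($stemTagger) . " object\n"
  : "could not create the tagger: $@\n";
```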
METHODS
  "getStemmedAndTaggedText"
     getStemmedAndTaggedText (@Text, $Text, \@Text)
    The method "getStemmedAndTaggedText" returns a hierarchy of array
    references containing the stemmed words, the original words, their
    part-of-speech tag, and their word position index within the original
    text. The hierarchy is of the form
      [
        [ # sentence level: first sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
        [ # sentence level: second sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
      ]
    Its only parameters are any combination of strings of text as scalars,
    references to scalars, arrays of strings of text, or references to
    arrays of strings of text. The following examples show the various
    ways to call the method; note that the constants
    Text::StemTagPOS::WORD_STEMMED, Text::StemTagPOS::WORD_ORIGINAL,
    Text::StemTagPOS::WORD_POSTAG, and Text::StemTagPOS::WORD_INDEX are used to
    access the information about each word.
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;
      # $stemmedTaggedText will contain the following:
      # [
      #   [
      #     ["the", "The", "/DET", 0],
      #     ["first", "first", "/JJ", 1],
      #     ["sentenc", "sentence", "/NN", 2],
      #     [".", ".", "/PP", 3],
      #   ],
      #   [
      #     ["sentenc", "Sentence", "/NN", 4],
      #     ["number", "number", "/NN", 5],
      #     ["two", "two", "/CD", 6],
      #     [".", ".", "/PP", 7],
      #   ],
      # ]
      my $word = $stemmedTaggedText->[0][0];
      print
        'WORD_STEMMED: ' .
        "'" . $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
        'WORD_ORIGINAL: ' .
        "'" . $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
        'WORD_POSTAG: ' .
        "'" . $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
        'WORD_INDEX: ' .
        $word->[Text::StemTagPOS::WORD_INDEX] . "\n";
      # WORD_STEMMED: 'the'
      # WORD_ORIGINAL: 'The'
      # WORD_POSTAG: '/DET'
      # WORD_INDEX: 0
    The following example shows the various ways the text can be passed to
    the method:
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is a sentence with seven words.';
      dump $stemTagger->getStemmedAndTaggedText ($text,
        [$text, \$text], ($text, \$text));
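    To illustrate how the sentence/word hierarchy can be traversed, the
    sketch below walks a hard-coded copy of the example output shown
    earlier, so it runs without the module installed. The local WORD_*
    constants are stand-ins for the Text::StemTagPOS::WORD_* constants;
    their values 0 to 3 are an assumption based on the documented field
    order.

```perl
use strict;
use warnings;

# Local stand-ins for the Text::StemTagPOS::WORD_* constants; the values
# are assumed from the field order: stem, original, tag, index.
use constant {
  WORD_STEMMED  => 0,
  WORD_ORIGINAL => 1,
  WORD_POSTAG   => 2,
  WORD_INDEX    => 3,
};

# Sample data copied from the example output above.
my $stemmedTaggedText = [
  [
    ["the", "The", "/DET", 0],
    ["first", "first", "/JJ", 1],
    ["sentenc", "sentence", "/NN", 2],
    [".", ".", "/PP", 3],
  ],
  [
    ["sentenc", "Sentence", "/NN", 4],
    ["number", "number", "/NN", 5],
    ["two", "two", "/CD", 6],
    [".", ".", "/PP", 7],
  ],
];

# Outer loop: sentences. Inner loop: the words of each sentence.
my @lines;
my $sentenceNumber = 0;
foreach my $sentence (@$stemmedTaggedText) {
  $sentenceNumber++;
  foreach my $word (@$sentence) {
    push @lines, sprintf("%d:%d %s%s", $sentenceNumber,
      $word->[WORD_INDEX], $word->[WORD_ORIGINAL], $word->[WORD_POSTAG]);
  }
}
print "$_\n" for @lines;
# The first line printed is "1:0 The/DET".
```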
  "getTaggedTextToKeep"
     getTaggedTextToKeep (stemmedTaggedText => [...],
      listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);
    The method "getTaggedTextToKeep" returns all the array references of the
    words that have a part-of-speech tag that is of a type specified by
    "listOfPOSTypesToKeep" or "listOfPOSTagsToKeep". The word lists returned
    have the same hierarchical sentence structure used by
    "stemmedTaggedText". Note that "listOfPOSTypesToKeep" and
    "listOfPOSTagsToKeep" are optional parameters; if neither is defined,
    the values used when the object was instantiated are used. If one
    of them is defined, its values override the default values.
    "stemmedTaggedText"
         stemmedTaggedText => [...]
        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or a previous call to
        "getTaggedTextToKeep".
    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are
        'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
        'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
        of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
        Note that "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" are
        optional parameters; if neither is defined, the values used when the
        object was instantiated are used. If one of them is defined, its
        values override the default values.
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getTaggedTextToKeep (
        stemmedTaggedText => $stemmedTaggedText);
      # only the nouns and adjectives are retained by default.
      # [
      #   [
      #     ["first", "first", "/JJ", 3],
      #     ["sentenc", "sentence", "/NN", 4],
      #   ],
      #   [
      #     ["last", "last", "/JJ", 9],
      #     ["sentenc", "sentence", "/NN", 10],
      #   ],
      # ]
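    The effect of filtering by tag can be pictured with the plain-Perl
    sketch below. It is not the module's implementation, only an
    illustration of the idea: word entries whose tag is in a chosen set
    are kept, and the sentence structure is preserved. The data is the
    tagged output of 'The first sentence. Sentence number two.' shown
    earlier, and the tag set (/JJ and /NN) is an assumption chosen for
    brevity.

```perl
use strict;
use warnings;

# Tagged text copied from the getStemmedAndTaggedText example.
my $stemmedTaggedText = [
  [
    ["the", "The", "/DET", 0],
    ["first", "first", "/JJ", 1],
    ["sentenc", "sentence", "/NN", 2],
    [".", ".", "/PP", 3],
  ],
  [
    ["sentenc", "Sentence", "/NN", 4],
    ["number", "number", "/NN", 5],
    ["two", "two", "/CD", 6],
    [".", ".", "/PP", 7],
  ],
];

# Tags to retain, stored as hash keys for constant-time lookup.
my %tagsToKeep = map { ($_ => 1) } qw(/JJ /NN);

# Filter each sentence separately so the hierarchy is kept intact.
my @keptText;
foreach my $sentence (@$stemmedTaggedText) {
  # The tag is the third field (index 2) of each word entry.
  my @keptWords = grep { exists $tagsToKeep{ $_->[2] } } @$sentence;
  push @keptText, \@keptWords;
}

foreach my $sentence (@keptText) {
  print join(' ', map { $_->[1] } @$sentence), "\n";
}
# first sentence
# Sentence number
```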
  "getWordsPhrasesInTaggedText"
     getWordsPhrasesInTaggedText (stemmedTaggedText => ...,
        listOfPhrasesToFind => [...],  listOfPOSTypesToKeep => [...],
        listOfPOSTagsToKeep => [...]);
    The method "getWordsPhrasesInTaggedText" returns a reference to an array
    where each entry in the array corresponds to the word or phrase in
    "listOfPhrasesToFind". The value of each entry is a list of word indices
    where the words or phrases were found. Each list contains integer pairs
    of the form [first-word-index, last-word-index] where first-word-index
    is the index to the first word of the phrase and last-word-index the
    index of the last word. The values of the index are those assigned to
    the stemmed and tagged word in "stemmedTaggedText".
      [
        [ # first phrase locations
          [first word index, last word index],
          [first word index, last word index],
          ...
        ]
        [ # second phrase locations
          [first word index, last word index],
          [first word index, last word index],
          ...
        ]
        ...
      ]
    "stemmedTaggedText"
         stemmedTaggedText => [...]
        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep".
    "listOfPhrasesToFind"
         listOfPhrasesToFind => [...]
        "listOfPhrasesToFind" is an array reference containing a list of
        strings of text that are either single words or phrases that are to
        be located in the text provided by "stemmedTaggedText". Before the
        words or phrases are located they are filtered using
        "listOfPOSTypesToKeep" or "listOfPOSTagsToKeep".
    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are
        'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
        'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
        of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
        Note that "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" are
        optional parameters; if neither is defined, the values used when the
        object was instantiated are used. If one of them is defined, its
        values override the default values.
    The code below illustrates the output format:
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;
      my $listOfWordsOrPhrasesToFind = ['first sentence','this is',
        'third sentence', 'sentence'];
      my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
        listOfPOSTypesToKeep => [qw(ALL)],
        stemmedTaggedText => $stemmedTaggedText,
        listOfWordsOrPhrasesToFind => $listOfWordsOrPhrasesToFind);
      dump $phraseLocations;
      # [
      #   [[3, 4]],           # 'first sentence'
      #   [[0, 1], [6, 7]],   # 'this is': note period in text has index 5.
      #   [],                 # 'third sentence'
      #   [[4, 4], [10, 10]]  # 'sentence'
      # ]
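    The [first-word-index, last-word-index] pairs can be reproduced with a
    small stand-alone sketch. It is only an illustration of the output
    format, not the module's algorithm: it assumes matching on lowercased
    original words, whereas the module may well match on stems. The word
    list is the example text with each word's position used as its index.

```perl
use strict;
use warnings;

# Words of the example text; the list position is the word index.
my @words = map { lc($_) } ('This', 'is', 'the', 'first', 'sentence', '.',
  'This', 'is', 'the', 'last', 'sentence', '.');

my @phrases = ('first sentence', 'this is', 'third sentence', 'sentence');

# One list of [first, last] index pairs per phrase, in phrase order.
my @phraseLocations;
foreach my $phrase (@phrases) {
  my @phraseWords = split ' ', lc($phrase);
  my @locations;
  # Try every start position that leaves room for the whole phrase.
  for my $start (0 .. $#words - $#phraseWords) {
    my $matched = 1;
    for my $offset (0 .. $#phraseWords) {
      if ($words[$start + $offset] ne $phraseWords[$offset]) {
        $matched = 0;
        last;
      }
    }
    push @locations, [$start, $start + $#phraseWords] if $matched;
  }
  push @phraseLocations, \@locations;
}

foreach my $i (0 .. $#phrases) {
  my @pairs = map { "[$_->[0], $_->[1]]" } @{ $phraseLocations[$i] };
  print "'$phrases[$i]': ", join(', ', @pairs), "\n";
}
```

    A phrase that never occurs, such as 'third sentence', yields an empty
    list, so the result always has one entry per phrase searched for.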
  "getListOfPartOfSpeechTags"
    The method "getListOfPartOfSpeechTags" takes no parameters. It returns
    an array reference where each item in the list is of the form "[part of
    speech tag, description, examples]". It is meant for getting the
    part-of-speech tags that can be used to populate "listOfPOSTagsToKeep".
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      dump $stemTagger->getListOfPartOfSpeechTags;
  "getListOfStemmedWordsInText"
    The method "getListOfStemmedWordsInText" returns an array reference of
    the sorted stemmed words in the text given by "stemmedTaggedText".
    "stemmedTaggedText"
         stemmedTaggedText => [...]
        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep" of the text.
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getListOfStemmedWordsInText (stemmedTaggedText => $stemmedTaggedText);
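    What a sorted list of stemmed words looks like can be sketched in
    plain Perl on the tagged data shown earlier. Two assumptions are made
    here that the text above does not state: duplicate stems are collapsed
    to one entry, and the sort is Perl's default string sort.

```perl
use strict;
use warnings;

# Tagged text copied from the getStemmedAndTaggedText example.
my $stemmedTaggedText = [
  [
    ["the", "The", "/DET", 0],
    ["first", "first", "/JJ", 1],
    ["sentenc", "sentence", "/NN", 2],
    [".", ".", "/PP", 3],
  ],
  [
    ["sentenc", "Sentence", "/NN", 4],
    ["number", "number", "/NN", 5],
    ["two", "two", "/CD", 6],
    [".", ".", "/PP", 7],
  ],
];

# Flatten the sentences, then take the stem (first field) of each word.
my @allStems = map { $_->[0] } map { @$_ } @$stemmedTaggedText;

# Drop duplicates (an assumption), then sort as strings.
my %seen;
my @uniqueStems = grep { !$seen{$_}++ } @allStems;
my @sortedStems = sort @uniqueStems;

print join(' ', @sortedStems), "\n";
# . first number sentenc the two
```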
  "getListOfStemmedWordsInAllDocuments"
    The method "getListOfStemmedWordsInAllDocuments" returns an array
    reference of the sorted stemmed words of the intersection of all the
    words in the documents given by "listOfStemmedTaggedText".
    "listOfStemmedTaggedText"
         listOfStemmedTaggedText => [...]
        "listOfStemmedTaggedText" is a list of document references returned
        by "getStemmedAndTaggedText" or "getTaggedTextToKeep".
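    The intersection described above can be sketched as follows. For
    brevity each document is a flat list of stems rather than the full
    tagged hierarchy, and the stems themselves are illustrative stand-ins,
    not output produced by the module.

```perl
use strict;
use warnings;

# Hypothetical stems for two documents, flattened for brevity.
my @documents = (
  [qw(the first sentenc . sentenc number two .)],
  [qw(this is the first sentenc . this is the last sentenc .)],
);

# Count, for each stem, the number of documents it occurs in.
my %documentCount;
foreach my $stems (@documents) {
  my %seenInDocument;
  my @uniqueStems = grep { !$seenInDocument{$_}++ } @$stems;
  $documentCount{$_}++ for @uniqueStems;
}

# A stem is in the intersection when every document contains it.
my $totalDocuments = scalar @documents;
my @commonStems = grep { $documentCount{$_} == $totalDocuments }
  keys %documentCount;
my @sortedCommonStems = sort @commonStems;

print join(' ', @sortedCommonStems), "\n";
# . first sentenc the
```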
INSTALLATION
    To install the module run the following commands:
      perl Makefile.PL
      make
      make test
      make install
    If you are on a Windows machine, use 'nmake' rather than 'make'.
AUTHOR
     Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
    Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.
    The full text of the license can be found in the LICENSE file included
    with this module.
KEYWORDS
    natural language processing, NLP, part of speech tagging, POS, stemming
SEE ALSO
    Encode, perlunicode, Lingua::Stem::Snowball, Lingua::EN::Tagger,
    Text::Iconv, Text::Categorize::Textrank, utf8