Aaron Coburn

NAME

Lingua::EN::Tagger - Part-of-speech tagger for English natural language processing.

SYNOPSIS

        use Lingua::EN::Tagger;

        # Create a parser object
        my $p = Lingua::EN::Tagger->new;

        # Add part-of-speech tags to a text
        my $tagged_text = $p->add_tags( $text );

        ...

        # Get a list of all nouns and noun phrases with occurrence counts
        my %word_list = $p->get_words( $text );

        ...

        # Get a readable version of the tagged text
        my $readable_text = $p->get_readable( $text );

DESCRIPTION

The module is a probability-based, corpus-trained tagger that assigns part-of-speech (POS) tags to English text using a lookup dictionary and probability values. It chooses tags by conditional probability: for each word, it looks at the preceding tag to decide which tag is most appropriate for the current word. Unknown words are classified according to word morphology, or can be set to be treated as nouns or other parts of speech.
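The idea of combining a word's lexicon entry with the preceding tag can be sketched as follows. This is a toy illustration, not the module's implementation: the lexicon, the transition probabilities, the fallback values, and the helper `tag_words` are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical toy data: P(tag | word) and P(tag | previous tag).
my %lexicon = (
    the   => { det => 1.0 },
    dog   => { nn  => 0.9, vb  => 0.1 },
    barks => { vbz => 0.6, nns => 0.4 },
);
my %transition = (
    start => { det => 0.6, nn  => 0.3,  vb  => 0.1 },
    det   => { nn  => 0.8, vb  => 0.01, nns => 0.19 },
    nn    => { vbz => 0.5, nns => 0.2,  nn  => 0.3 },
);

# Greedy tagging: for each word, pick the tag that maximizes
# P(tag | word) * P(tag | previous tag).
sub tag_words {
    my @words = @_;
    my $prev  = 'start';
    my @tags;
    for my $word (@words) {
        # Assumption for this sketch: unknown words are treated as nouns.
        my $candidates = $lexicon{ lc $word } || { nn => 1.0 };
        my ( $best, $best_score ) = ( undef, -1 );
        for my $tag ( sort keys %$candidates ) {
            # Unseen transitions get a small made-up probability.
            my $score = $candidates->{$tag} * ( $transition{$prev}{$tag} // 0.0001 );
            ( $best, $best_score ) = ( $tag, $score ) if $score > $best_score;
        }
        push @tags, $best;
        $prev = $best;
    }
    return @tags;
}

my @tags = tag_words(qw(The dog barks));
print join( ' ', @tags ), "\n";
```

Here "barks" is tagged as a verb rather than a plural noun because the preceding tag is a noun, which is the kind of context the description refers to.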

The tagger also recursively extracts as many nouns and noun phrases as it can, using a set of regular expressions.

CLASS METHODS

METHODS

new %PARAMS

Class constructor. Takes a hash with the following parameters (shown with default values):

unknown_word_tag => ''

Tag to assign to unknown words.

stem => 1

Stem single words using Lingua::Stem::EN.

weight_noun_phrases => 1

When returning occurrence counts for a noun phrase, multiply the value by the number of words in the noun phrase (NP).

longest_noun_phrase => 50

Noun phrases longer than this threshold are ignored. This affects only the get_words() and get_nouns() methods.

relax => 0

Relax the Hidden Markov Model. This may improve accuracy for uncommon words, particularly words used polysemously.
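As a minimal sketch, the parameters above can be passed to the constructor as key/value pairs; anything left out keeps its default. The sample sentence and the chosen values are arbitrary.

```perl
use strict;
use warnings;
use Lingua::EN::Tagger;

# Override a few of the documented defaults.
my $tagger = Lingua::EN::Tagger->new(
    stem                => 0,    # keep words unstemmed
    weight_noun_phrases => 0,    # report raw occurrence counts
    longest_noun_phrase => 5,    # ignore noun phrases longer than 5 words
);

my %words = $tagger->get_words('The quick brown fox jumps over the lazy dog.');
printf "%-20s %d\n", $_, $words{$_} for sort keys %words;
```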

add_tags TEXT

Examines the string provided and returns it fully tagged (XML style).
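Because the result is a plain string with XML-style tags, it can be post-processed with ordinary regular expressions. The sketch below assumes each token is wrapped as `<tag>word</tag>` with lowercase tag names; the sample string is hand-written for illustration, not captured output from the module.

```perl
use strict;
use warnings;

# Hand-written sample in the assumed <tag>word</tag> format.
my $tagged = '<det>The</det> <nn>dog</nn> <vbd>chased</vbd> <det>the</det> <nn>cat</nn>';

# Collect [word, tag] pairs in document order.
my @pairs;
while ( $tagged =~ m{<(\w+)>([^<]+)</\1>}g ) {
    push @pairs, [ $2, $1 ];
}

print "$_->[0]\t$_->[1]\n" for @pairs;
```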

get_words TEXT

Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:

  • Tag the text

  • Extract all the maximal noun phrases

  • Recursively extract all noun phrases from the MNPs
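The last two stages can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the module's actual grammar: the `<tag>word</tag>` sample is hand-written, a maximal noun phrase is taken to be a run of adjectives and nouns ending in a noun, and sub-phrases are produced by dropping leading words.

```perl
use strict;
use warnings;

# Hand-written sample in the assumed <tag>word</tag> format.
my $tagged = '<det>the</det> <jj>big</jj> <jj>red</jj> <nn>dog</nn> '
           . '<vbd>chased</vbd> <det>a</det> <nn>cat</nn>';

# Stage 2 (simplified): find maximal runs of adjectives/nouns ending in a noun.
my $token = qr{<(?:jj|nn)>[^<]+</(?:jj|nn)>\s*};
my $noun  = qr{<nn>[^<]+</nn>};
my @maximal;
while ( $tagged =~ m{((?:$token)*$noun)}g ) {
    ( my $phrase = $1 ) =~ s/<[^>]+>//g;    # strip the tags
    push @maximal, $phrase;
}

# Stage 3 (simplified): count each phrase and every shorter phrase
# obtained by dropping words from the front.
my %count;
for my $phrase (@maximal) {
    my @words = split ' ', $phrase;
    while (@words) {
        $count{ join ' ', @words }++;
        shift @words;
    }
}

# Optional weighting, mirroring the weight_noun_phrases description:
# multiply each count by the number of words in the phrase.
my %weighted;
for my $phrase ( keys %count ) {
    my @words = split ' ', $phrase;
    $weighted{$phrase} = $count{$phrase} * @words;
}

print "$_ => $weighted{$_}\n" for sort keys %weighted;
```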

get_readable TEXT

Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.
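One way such a reformatting step could look is shown below. It assumes the XML-style input uses `<tag>word</tag>` and that the readable form is `word/TAG` with uppercase tags; both are assumptions for this sketch, and the sample string is hand-written.

```perl
use strict;
use warnings;

# Hand-written sample in the assumed <tag>word</tag> format.
my $tagged = '<det>The</det> <nn>dog</nn> <vbd>barked</vbd>';

# Rewrite each <tag>word</tag> token as word/TAG.
( my $readable = $tagged ) =~ s{<(\w+)>([^<]+)</\1>}{$2/\U$1}g;

print "$readable\n";
```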

get_nouns TAGGED_TEXT

Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.

get_max_noun_phrases TAGGED_TEXT

Given a POS-tagged text, this method returns only the maximal noun phrases. May be called directly, but is also used by get_noun_phrases.

get_noun_phrases TAGGED_TEXT

Similar to get_words, but requires a POS-tagged text as an argument.
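The methods that take TAGGED_TEXT can share a single call to add_tags, which avoids tagging the same text more than once. The sketch below assumes each of them returns a flat list of phrase/count pairs that can be assigned to a hash, as get_words does in the SYNOPSIS; the sample sentence is arbitrary.

```perl
use strict;
use warnings;
use Lingua::EN::Tagger;

my $tagger = Lingua::EN::Tagger->new;
my $text   = 'The quick brown fox jumped over the lazy dog near the river bank.';

# Tag once, then reuse the tagged string for each extraction method.
my $tagged = $tagger->add_tags($text);

# Assumption: each method returns phrase => count pairs.
my %nouns       = $tagger->get_nouns($tagged);
my %max_phrases = $tagger->get_max_noun_phrases($tagged);
my %phrases     = $tagger->get_noun_phrases($tagged);

for my $set ( [ nouns => \%nouns ], [ maximal => \%max_phrases ], [ all => \%phrases ] ) {
    my ( $label, $counts ) = @$set;
    print "$label:\n";
    print "  $_ => $counts->{$_}\n" for sort keys %$counts;
}
```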

install

Reads some included corpus data and saves it in a stored hash on the local filesystem. This is called automatically if the tagger can't find the stored lexicon.

HISTORY

0.03

11/03 Fixed some errors in the text scrubbing methods. Shortened and moved the lexicon, made things run faster. Added a testing suite. (Aaron Coburn)

0.02

5/03 Applied fixes for module installer from Nathaniel Irons

0.01

Created 10/02 by Aaron Coburn as LSI::Parser::POS. Moved to Lingua::EN::Tagger 2/03. (Maciej Ceglowski)

AUTHORS

        Maciej Ceglowski <developer@ceglowski.com>
        Aaron Coburn <acoburn@middlebury.edu>

This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
