NAME

Lingua::EN::WSD::CorpusBased - Word Sense Disambiguation using a domain corpus

SYNOPSIS

   use WordNet::QueryData;
   use Lingua::EN::WSD::CorpusBased;
   use Lingua::EN::WSD::CorpusBased::Corpus;

   my $wn = WordNet::QueryData->new;
   my $corpus = Lingua::EN::WSD::CorpusBased::Corpus->new('corpus' => '_democorpus_',
                                                          'wnref' => $wn);

   my $wsd = Lingua::EN::WSD::CorpusBased->new('wnref' => $wn,
                                               'cref' => $corpus);

   print join(', ', @{$wsd->wsd('e-mail application')}); # prints 'application#n#3'

DESCRIPTION

This module disambiguates word senses using a domain corpus. It relies on the assumption that within a single corpus, only one sense of a word is used. For each sense of the term, the module counts the corpus occurrences of that sense's synonyms. The sense with the highest count is taken to be the right one.
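The counting idea can be illustrated with a minimal, self-contained sketch. It does not use this module or WordNet: the sense labels, the synonym lists, the toy corpus and the helper names count_occurrences and best_senses are all made up for illustration, and the module's real weighting may differ.

```perl
use strict;
use warnings;

# Hypothetical sense inventory for one ambiguous term:
# sense label => synonyms of that sense (toy data, not taken from WordNet).
my %synonyms_of = (
    program_sense => [ 'program', 'software' ],
    use_sense     => [ 'use', 'utilization' ],
);

# Toy domain corpus, one sentence per element.
my @corpus = (
    'the mail program crashed',
    'install the software first',
    'this has no practical use',
);

# Number of corpus sentences that contain the given word or phrase.
sub count_occurrences {
    my ($phrase, $sentences) = @_;
    return scalar grep { /\b\Q$phrase\E\b/i } @$sentences;
}

# Sum the synonym counts per sense and return a reference to the list of
# sense labels that share the highest score (more than one on a tie).
sub best_senses {
    my ($senses, $sentences) = @_;
    my %score;
    for my $sense (keys %$senses) {
        $score{$sense} = 0;
        $score{$sense} += count_occurrences($_, $sentences)
            for @{ $senses->{$sense} };
    }
    my ($max) = sort { $b <=> $a } values %score;
    return [ sort grep { $score{$_} == $max } keys %score ];
}

print join(', ', @{ best_senses(\%synonyms_of, \@corpus) }), "\n";
```

In this toy data the synonyms of program_sense occur twice and those of use_sense once, so only program_sense is returned.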

Corpus

The corpus is managed by a separate module, Lingua::EN::WSD::CorpusBased::Corpus. It stores the corpus and provides fast access to its sentences. Read the documentation of the corpus module before use, since it expects the corpus to be preprocessed.

METHODS

new

Creates a new object. Takes the following named arguments:

wnref A reference to a WordNet::QueryData object. Obligatory.

cref A reference to a Corpus object. Obligatory.

debug A switch for the debug mode of the object. Optional, default: 0.

stem If set to 1, the term in question is lemmatized using the stem module. If set to 0, only the original term is sent to WordNet. In that case, WordNet may have no entry for the term, and the wsd method returns an empty list. Optional, default: 1.

strict Controls whether the algorithm returns all senses or no sense when all senses are weighted equally. This happens in particular when the terms are not mentioned in the corpus at all (in which case I would recommend a larger corpus). Optional, default: 0.

hyponyms Controls whether hyponyms are used in addition to synonyms. Optional, default: 1.

hypernyms Controls whether hypernyms are used in addition to synonyms. Optional, default: 1.

Returns a blessed reference to the object, or -1 if you did not supply references to objects of WordNet::QueryData and Lingua::EN::WSD::CorpusBased::Corpus.
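As a rough sketch, the named arguments above can be collected in one place before calling new. The helper wsd_options is hypothetical and not part of this module. It only fills in the defaults documented above and lets the caller override them.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the argument list for new(), using the
# documented defaults for the optional switches unless overridden.
sub wsd_options {
    my ($wn, $corpus, %override) = @_;
    return (
        wnref     => $wn,       # WordNet::QueryData object (obligatory)
        cref      => $corpus,   # Corpus object (obligatory)
        debug     => 0,
        stem      => 1,
        strict    => 0,
        hyponyms  => 1,
        hypernyms => 1,
        %override,              # later pairs win over the defaults
    );
}

# Intended use, assuming $wn and $corpus were created as in the SYNOPSIS:
#
#   my $wsd = Lingua::EN::WSD::CorpusBased->new(
#       wsd_options($wn, $corpus, strict => 1, hypernyms => 0));
#   die "wnref or cref missing" unless ref $wsd;   # new returns -1 on failure
```

Checking ref $wsd is one way to catch the documented -1 return value before calling any method on the result.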

wsd
    $obj->wsd($term);

Performs the word sense disambiguation. Returns a reference to a list of the senses that seem most probable for the given term. This list can be empty, depending on your setting for 'strict'. Returns -1 if you do not provide a term to disambiguate.

term The term you want to disambiguate. Required.
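Because wsd can return a list reference, an empty list or -1, callers may want to tell the three cases apart. The following is a minimal sketch. The helper name describe_result and its messages are illustrative and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: turns the documented return values of wsd()
# into a human-readable string.
sub describe_result {
    my ($result) = @_;

    # -1 (or anything that is not an array reference): no term was given.
    return 'no term given' unless ref($result) eq 'ARRAY';

    # Empty list: no sense could be selected.
    return 'no sense selected' unless @$result;

    # One or more equally probable senses.
    return join(', ', @$result);
}

# Intended use, assuming $wsd was created as in the SYNOPSIS:
#
#   print describe_result($wsd->wsd('e-mail application')), "\n";
```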

debug

Returns the debug level in which the object is currently running.

Internal Methods

init

Internal method. Prepares the object for a disambiguation run. It is called automatically by the wsd method and must be called before any call to sense, because it does some preprocessing. Takes one parameter.

term The term in question. Required.

sense
    $obj->sense;

Internal method. Iterates over all senses of the term set via init and returns a reference to a list of the best senses. Takes no arguments.

v
    $obj->v($synset);

Internal method. Calculates the weight for a synset as sense of the given term. Returns the weight or -1 if $synset is undefined or an empty string.

count
    $obj->count(@words);

Internal method. Just a wrapper for the appropriate method of the corpus-object. Returns the number of occurrences or -1 if the corpus object is not available.

hyponyms
    $obj->hyponyms($synset);

Internal method. Returns a list of hyponyms (synsets) for the synset given as argument. If the synset argument is missing, undefined or an empty string, the method returns an empty list.

hypernyms
    $obj->hypernyms($synset);

Internal method. Returns a list of hypernyms (synsets) for the synset given as argument. If the synset argument is missing, undefined or an empty string, the method returns an empty list.

synonyms
    $obj->synonyms($synset);

Internal method. Returns a list of synonyms for the synset given as argument. The returned list contains words, not synsets. If the synset argument is missing, undefined or an empty string, the method returns an empty list.

synsets
    $obj->synsets($word);

Internal method. Returns a reference to a list of synsets for the given term. This list includes all possible parts of speech (as long as they are defined in WordNet). Returns a reference to an empty list if something goes wrong (for example, if no term has been given to the object).

term_replace

Internal method. Returns the term in question after replacing the last word with the second argument. The returned string has underscores instead of spaces. Returns -1 if no term is known to the object.

Internal method. Returns the grammatical head of the term. In case of multi-word expressions, this is the last word of the expression, otherwise it's the word itself. Returns -1 if no term is given.

term

Internal method. Returns the term in question with underscores instead of spaces. Returns -1 if no term is given.
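The string handling described for the term helpers above can be sketched in plain Perl. These are standalone functions with hypothetical names (split_words, sketch_term, sketch_head, sketch_term_replace). They follow the descriptions given here and are not the module's actual implementation.

```perl
use strict;
use warnings;

# Split a term into its words; an undefined term yields no words.
sub split_words {
    my ($term) = @_;
    return defined $term ? split(' ', $term) : ();
}

# The term with underscores instead of spaces, or -1 if there is no term.
sub sketch_term {
    my @words = split_words(shift);
    return @words ? join('_', @words) : -1;
}

# The grammatical head: the last word of a multi-word expression,
# otherwise the word itself. Returns -1 if there is no term.
sub sketch_head {
    my @words = split_words(shift);
    return @words ? $words[-1] : -1;
}

# The term with its last word replaced by $new, joined with underscores.
# Returns -1 if there is no term.
sub sketch_term_replace {
    my ($term, $new) = @_;
    my @words = split_words($term);
    return -1 unless @words;
    $words[-1] = $new;
    return join('_', @words);
}

print sketch_term('e-mail application'), "\n";
print sketch_head('e-mail application'), "\n";
print sketch_term_replace('e-mail application', 'program'), "\n";
```

For the term 'e-mail application' this prints e-mail_application, application and e-mail_program, one per line.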

ready

Internal method. Returns a true value if the object is ready for disambiguation. In particular, it checks that the term has been set via init and that the preprocessing succeeded.

BUGS

None so far. If you find any, please report them to me, reiter@cpan.org.

TODO

  • A lot more useful debug output

  • Making more methods externally useful, allowing a more flexible use of the module.

SEE ALSO

It might be interesting to look at the modules WordNet::SenseRelate::AllWords and WordNet::SenseRelate::TargetWord, since they work in the same area.

COPYRIGHT

Copyright (c) 2006 by Nils Reiter.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.