NAME

Lingua::YaTeA::TestifiedTerm - Perl extension for Testified Term

SYNOPSIS

  use Lingua::YaTeA::TestifiedTerm;
  Lingua::YaTeA::TestifiedTerm->new($num_content_words,$words_a,$tag_set,$source,$match_type);

DESCRIPTION

The module implements a representation of testified terms, i.e. terms from a terminological resource. Testified terms are used to find the corresponding terms in the corpus. Each testified term is described by its identifier (ID), its inflected form (IF), its list of part-of-speech tags (POS), its lemma (LF), its terminological source (SOURCE), the list of its word components (WORDS), the regular expression used to identify it in the corpus (REG_EXP), a flag indicating whether the testified term has been found (FOUND), its list of occurrences (OCCURRENCES) and the list of word index entries (INDEX_SET).

The three fields IF, POS and LF are computed from the information carried by the word components.

METHODS

new()

    new($num_content_words,$words_a,$tag_set,$source,$match_type);

This method creates a new object representing a testified term. It sets the fields IF, POS, LF, REG_EXP, INDEX_SET and SOURCE. $words_a and $tag_set are used to initialise the linguistic information (IF, POS, LF). $source initialises the SOURCE field. $match_type defines the type of matching used to find the terms in the corpus.

isInLexicon()

    isInLexicon($filtering_lexicon_h, $match_type);

This method checks whether all the words of a testified term appear in the lexicon of the text ($filtering_lexicon_h), according to the matching type $match_type: loose (each word matches either an inflected form or a lemmatised form), strict (each word matches an inflected form with the correct part-of-speech tag) or default (each word matches an inflected form). The method returns 1 if all the words of the testified term are found in the lexicon; otherwise it returns 0.

$filtering_lexicon_h is a hash table containing, for each word in the text, the inflected form, the lemmatised form and the concatenation of the inflected form and the part-of-speech tag (separated by a ~ character).
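
The three matching types can be illustrated with a small standalone sketch. It does not use the module itself: the word records (plain hashes with IF, POS and LF keys), the function name is_in_lexicon and the sample data are assumptions made for the illustration, and the reading of the loose type (inflected form or lemma of the word present in the lexicon) is one possible interpretation. Only the layout of the lexicon follows the description above.

```perl
use strict;
use warnings;

# Hypothetical lexicon of the text: inflected forms, lemmas and IF~POS keys.
my %filtering_lexicon = map { $_ => 1 } (
    'cells', 'cell', 'cells~NNS',
    'stem',  'stem~NN',
);

# Hypothetical word records of a testified term (not YaTeA objects).
my @words = (
    { IF => 'stem',  POS => 'NN',  LF => 'stem' },
    { IF => 'cells', POS => 'NNS', LF => 'cell' },
);

# Returns 1 if every word is found in the lexicon under $match_type, else 0.
sub is_in_lexicon {
    my ($words_a, $lexicon_h, $match_type) = @_;
    foreach my $word (@$words_a) {
        my $found;
        if ($match_type eq 'loose') {
            # Assumed reading: inflected form or lemma is enough.
            $found = exists $lexicon_h->{ $word->{IF} }
                  || exists $lexicon_h->{ $word->{LF} };
        }
        elsif ($match_type eq 'strict') {
            # Inflected form together with its part-of-speech tag.
            $found = exists $lexicon_h->{ $word->{IF} . '~' . $word->{POS} };
        }
        else {
            # Default: inflected form only.
            $found = exists $lexicon_h->{ $word->{IF} };
        }
        return 0 unless $found;
    }
    return 1;
}

print is_in_lexicon(\@words, \%filtering_lexicon, 'strict'), "\n";
```

Storing the three kinds of entries in a single hash keeps each lookup to one exists test, whichever matching type is selected.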

buildLinguisticInfos()

    buildLinguisticInfos($words, $tagset);

The method returns the inflected form, the part-of-speech tag list and the lemma of the term candidate as an array (each piece of information is the concatenation of the word information found in the array $words and the part-of-speech tags $tagset).

getWords()

    getWords();

The method returns the list of the words that are components of the term candidate.

setIF()

    setIF();

The method sets the inflected form of the term candidate.

setPOS()

    setPOS();

The method sets the list of the part-of-speech tags of the term candidate.

setLF()

    setLF();

The method sets the canonical form (lemma) of the term candidate.

getIF()

    getIF();

The method returns the inflected form of the term candidate.

getPOS()

    getPOS();

The method returns the list of the part-of-speech tags of the term candidate.

getLF()

    getLF();

The method returns the canonical form (lemma) of the term candidate.

getID()

    getID();

This method returns the identifier of the term candidate.

buildKey()

    buildKey();

This method builds the key of the testified term, i.e. the concatenation of the inflected form, the postag list and the lemma (separated by the character '~').
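
As a rough illustration of how such a key can be assembled, the sketch below joins per-word values into IF, POS and LF and then concatenates them with '~'. It is not the module's code: the word records are plain hashes, the helper names build_linguistic_infos and build_key are hypothetical, and the space used to join the per-word values is an assumption (only the '~' separator comes from the description above).

```perl
use strict;
use warnings;

# Hypothetical word records (not YaTeA objects).
my @words = (
    { IF => 'stem',  POS => 'NN',  LF => 'stem' },
    { IF => 'cells', POS => 'NNS', LF => 'cell' },
);

# Joins the per-word values; the space separator is an assumption.
sub build_linguistic_infos {
    my ($words_a) = @_;
    my $if  = join ' ', map { $_->{IF} } @$words_a;
    my $pos = join ' ', map { $_->{POS} } @$words_a;
    my $lf  = join ' ', map { $_->{LF} } @$words_a;
    return ($if, $pos, $lf);
}

# Key = inflected form, tag list and lemma separated by '~'.
sub build_key {
    my ($if, $pos, $lf) = @_;
    return join '~', $if, $pos, $lf;
}

my $key = build_key(build_linguistic_infos(\@words));
print "$key\n";    # prints: stem cells~NN NNS~stem cell
```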

getSource()

    getSource();

The method returns the terminological resource from which the testified term originates.

buildRegularExpression()

    buildRegularExpression($match_type);

The method computes the regular expression corresponding to the term according to the type of matching defined by $match_type. This regular expression will be used to find the term in the corpus.
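
The format of the expression built by the module is not described here, so the following is only a generic sketch of the idea of compiling a term into a pattern and searching a text with it. The helper build_regular_expression below is hypothetical, takes plain strings rather than word objects, ignores the matching type, and assumes that the words of a term are separated by whitespace in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: matches the given inflected forms in sequence,
# separated by whitespace, case-insensitively and on word boundaries.
sub build_regular_expression {
    my (@inflected_forms) = @_;
    # quotemeta protects characters that are special in a regular expression.
    my $pattern = join '\s+', map { quotemeta($_) } @inflected_forms;
    return qr/\b$pattern\b/i;
}

my $reg_exp  = build_regular_expression('stem', 'cells');
my $sentence = 'Adult stem  cells were isolated.';

if ($sentence =~ $reg_exp) {
    # $-[0] holds the start offset of the last successful match.
    print "found at offset $-[0]\n";
}
```

Compiling the pattern once with qr// allows the same expression to be reused for every sentence of the corpus.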

getRegExp()

    getRegExp();

The method returns the regular expression corresponding to the testified term (field REG_EXP).

getWord()

    getWord($index);

The method returns the word at the position $index in the list of the components of the term candidate.

addOccurrence()

    addOccurrence($phrase_occurrence,$phrase,$key,$fh);

This method looks for the current testified term in the occurrence $phrase_occurrence of the phrase $phrase (according to the key $key). The occurrence is then recorded in the list of occurrences (OCCURRENCES). $fh is the file handle of a debugging file.

getPositionInPhrase()

    getPositionInPhrase($phrase,$index_a,$fh);

The method returns the position (start and end offsets) of the phrase $phrase according to the index array $index_a. $fh is the file handle of a debugging file.

setIndexSet()

    setIndexSet($size);

This method initialises the index set with the numbers between 0 and $size (usually the number of words).

getIndexSet()

    getIndexSet();

This method returns the index set (field INDEX_SET) of the word components.

getOccurrences()

    getOccurrences();

This method returns the list of the occurrences of the term candidate, as an array reference.

SEE ALSO

Sophie Aubin and Thierry Hamon. Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006). pages 380-387. Tapio Salakoski, Filip Ginter, Sampo Pyysalo, Tapio Pahikkala (Eds). August 2006. LNAI 4139.

AUTHOR

Thierry Hamon <thierry.hamon@univ-paris13.fr> and Sophie Aubin <sophie.aubin@lipn.univ-paris13.fr>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Thierry Hamon and Sophie Aubin

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.