Thierry Hamon

NAME

yatea - Perl script for extracting terms from a corpus of texts and providing a syntactic analysis in a head-modifier representation.

SYNOPSIS

yatea [--help] [--man] [--rcfile=file] file

OPTIONS

--help, -h, -? brief help message
--man, -m full documentation
--rcfile=file load the given configuration file
file corpus of texts in Flemm or TreeTagger output format

DESCRIPTION

YaTeA aims at extracting noun phrases that look like terms from a corpus. It also provides their syntactic analysis in a head-modifier format.

As input, the term extractor requires a corpus which has been segmented into words and sentences, lemmatized and tagged with part-of-speech (POS) information. The input file must be encoded in UTF-8. Currently, the text has to be POS-tagged by TreeTagger (for French and English), or by TreeTagger+Flemm for French.

As output, the script creates a directory containing the results in various formats (according to the configuration): XML, text and TreeTagger-like output.

USE OF YATEA

Using YaTeA requires TreeTagger output (<http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html>), which serves as the input of YaTeA.

To run yatea, a configuration file is needed (usually yatea.rc in /etc/yatea). This file describes the behaviour of the term extractor for French and English texts. You have to indicate the language of the configuration file you use (see section CONFIGURATION FILE FORMAT for more details, below). It also indicates the path of the configuration files for the linguistic analysis. You have to adapt the path if your configuration is not standard.

An example of the configuration file is available in etc/yatea/yatea.rc from the archive directory.

The most common command line to run YaTeA is:

yatea TreeTaggerOutputFile.ttg

It is assumed that the directory containing the program yatea is in your PATH variable and that the configuration file is /etc/yatea/yatea.rc.

If you are not allowed to copy the configuration file yatea.rc into the directory /etc/yatea (or to create this directory), or if you want to use your own configuration file, you can specify the file with its path by using the option --rcfile:

yatea --rcfile MyYaTeAConfig.rc TreeTaggerOutputFile.ttg
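
When YaTeA is called from another program, the two command lines above can be assembled programmatically. The following is only a minimal sketch in Python: the function names yatea_command and run_yatea are hypothetical helpers introduced for illustration, and the sketch assumes, as stated above, that yatea is in your PATH.

```python
import subprocess
from typing import List, Optional


def yatea_command(corpus: str, rcfile: Optional[str] = None) -> List[str]:
    """Build the yatea argument list (hypothetical helper, not part of YaTeA)."""
    cmd = ["yatea"]
    if rcfile is not None:
        # Same form as in the example above: --rcfile followed by the path.
        cmd += ["--rcfile", rcfile]
    cmd.append(corpus)
    return cmd


def run_yatea(corpus: str, rcfile: Optional[str] = None) -> None:
    """Run yatea and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(yatea_command(corpus, rcfile), check=True)


# Only print the command here; call run_yatea() to actually launch it.
print(yatea_command("TreeTaggerOutputFile.ttg", "MyYaTeAConfig.rc"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.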

EXPLANATION ON THE ANALYSIS

The main strategy of analysis of the term candidates is based on the exploitation of simple parsing patterns and endogenous disambiguation. Exogenous disambiguation is also made possible for the identification and the analysis of term candidates by the use of external resources, i.e. lists of testified terms.

Endogenous disambiguation consists in exploiting intermediate chunking and parsing results to parse a given Maximal Noun Phrase (MNP). This feature allows complex noun phrases to be parsed with a limited number of simple parsing patterns (80 patterns containing at most 3 content words in the authors' experiments). All the MNPs that correspond to a parsing pattern are parsed first. In a second step, the remaining unparsed MNPs are processed using the results of the first step as islands of reliability.

An island of reliability is a subsequence (contiguous or not) of an MNP that corresponds to a shorter term candidate parsed during the first step of the parsing process. This subsequence, along with its internal analysis, is used as an anchor in the parsing of the MNP. Islands are used to simplify the POS sequence of an MNP for which no parsing pattern was found: the subsequence covered by the island is reduced to its syntactic head. In addition, islands increase the reliability of the parse.

For example, no parsing pattern is defined for the complete POS sequence "NN NN NN of NN", which corresponds to the term candidate (TC) "Northern blot analysis of cwlH". When no resource is provided, the progressive method is applied: the TC is bracketed from right to left, which results in a poor-quality analysis. When the island of reliability "northern blot analysis" is taken into account, the correct bracketing is found.
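
The reduction of an island to its syntactic head can be illustrated with a short Python sketch. It is not YaTeA's implementation: the function reduce_island is hypothetical, it only handles contiguous islands (the text above also allows non-contiguous ones), and it assumes that the head of "northern blot analysis" is "analysis".

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def reduce_island(tokens: Sequence[Token], island: Sequence[str], head: str) -> List[Token]:
    """Replace the first contiguous occurrence of `island` by its head token.

    Returns the tokens unchanged if the island does not occur.
    Raises ValueError if `head` is not one of the island's words.
    """
    words = [word.lower() for word, _ in tokens]
    target = [word.lower() for word in island]
    head_offset = target.index(head.lower())
    size = len(target)
    for start in range(len(words) - size + 1):
        if words[start:start + size] == target:
            return (list(tokens[:start])
                    + [tokens[start + head_offset]]
                    + list(tokens[start + size:]))
    return list(tokens)


# The MNP from the text, with the POS sequence "NN NN NN of NN".
mnp = [("Northern", "NN"), ("blot", "NN"), ("analysis", "NN"), ("of", "of"), ("cwlH", "NN")]
reduced = reduce_island(mnp, ["northern", "blot", "analysis"], head="analysis")
print(" ".join(pos for _, pos in reduced))  # NN of NN
```

The reduced sequence is short enough to be covered by a simple parsing pattern, while the internal analysis of the island is kept from the first parsing step.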

INPUT/OUTPUT FILE FORMATS

YaTeA input: TreeTagger output

TODO

YaTeA XML output

TODO

INPUT FILE

The input is the output of a Part-of-Speech tagger and lemmatizer such as TreeTagger and Flemm.

INPUT FILE FORMATS

The input file contains the morpho-syntactic information and the lemma associated with each word. Each line contains three fields separated by tabs: the inflected form of the word, its part-of-speech tag and its lemma. Basically, the input format is the TreeTagger output format. For instance:

 Combined       VBN     Combine
 action         NN      action
 of             IN      of
 two            CD      two
 transcription  NN      transcription
 factors        NNS     factor
 regulates      VBZ     regulate
 genes          NNS     gene
 encoding       VBG     encode
 spore          NN      spore
 coat           NN      coat
 proteins       NNS     protein
 of             IN      of
 Bacillus       NN      Bacillus
 subtilis       NN      subtilis
 .              SENT    .
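
To check that a file follows this format before running YaTeA, the three-field lines can be read with a few lines of Python. This is a minimal sketch, not part of YaTeA: the names Token and read_sentences are hypothetical, and the sketch assumes that fields are separated by a single tab and that the SENT tag closes a sentence, as in the example above.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    form: str   # inflected form of the word
    pos: str    # part-of-speech tag
    lemma: str  # lemma


def read_sentences(lines: Iterable[str], sentence_tag: str = "SENT") -> List[List[Token]]:
    """Group tab-separated (form, POS, lemma) lines into sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        token = Token(*fields)
        current.append(token)
        if token.pos == sentence_tag:
            sentences.append(current)
            current = []
    if current:  # last sentence without a closing SENT tag
        sentences.append(current)
    return sentences


sample = [
    "action\tNN\taction",
    "of\tIN\tof",
    "two\tCD\ttwo",
    "transcription\tNN\ttranscription",
    "factors\tNNS\tfactor",
    ".\tSENT\t.",
]
sentences = read_sentences(sample)
print(len(sentences), [token.lemma for token in sentences[0]])
```

For a real corpus, pass an open file handle (opened with encoding="utf-8") instead of the sample list.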

OUTPUT FILES

  • xml/candidates.xml

    This file contains all the information on the extracted terms. The DTD of the XML file is in share/doc/YaTeA/DTD/yatea.dtd.

  • raw/termCandidates.ttg

    This file contains the list of the term candidates in the TreeTagger output format (each line contains three fields separated by tabs: the inflected form of the word, its part-of-speech tag and its lemma).

  • raw/termList.txt

    This file contains the list of the term candidates in a tabular format. The columns are the term ID, the inflected form and the lemmatized form of the term, the term frequency, the C-Value, and other ranking metrics. The last three columns are the IDs of the head, of the modifier, and of the main head. Lines starting with # are comment lines.
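
The column layout of raw/termList.txt described above can be read as follows. This Python sketch is not part of YaTeA and rests on several assumptions: columns are separated by tabs, the frequency is an integer, the C-Value is a decimal number, and the number of other ranking metrics may vary. The function read_term_list and the sample values are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def read_term_list(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse term list lines, skipping empty lines and # comment lines."""
    rows: List[Dict[str, Any]] = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:  # 5 leading columns + 3 trailing columns
            raise ValueError(f"line {number}: expected at least 8 columns, got {len(fields)}")
        rows.append({
            "id": fields[0],
            "form": fields[1],
            "lemma": fields[2],
            "frequency": int(fields[3]),
            "c_value": float(fields[4]),
            "other_metrics": fields[5:-3],  # kept as strings, count not assumed
            "head_id": fields[-3],
            "modifier_id": fields[-2],
            "main_head": fields[-1],
        })
    return rows


# Made-up sample: the IDs and scores are invented for this sketch.
sample = [
    "# comment line",
    "term1\ttranscription factors\ttranscription factor\t3\t4.75\t0.5\tterm2\tterm3\tterm2",
]
terms = read_term_list(sample)
print(terms[0]["lemma"], terms[0]["frequency"], terms[0]["head_id"])
```

Reading the last three columns from the end of the line keeps the sketch usable even if the number of ranking metrics differs between versions.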

CONFIGURATION FILE FORMAT

Main configuration file (usually yatea.rc)

The configuration file of YaTeA is divided into two sections:

  • Section DefaultConfig

    • CONFIG_DIR : directory containing the configuration files according to the language

    • LOCALE_DIR : directory containing the environment files according to the language

    • RESULT_DIR : directory where the results are stored (probably not useful)

  • Section OPTIONS

    • language language : Definition of the language of the corpus. Values are FR (French - TreeTagger output - TagSet <http://www.ims.uni-stuttgart.de/~schmid/french-tagset.html>), FR-Flemm (French - output of the Flemm analyser) or EN (English - TreeTagger or GeniaTagger output - PennTreeBank Tagset)

    • suffix suffix : Specification of a name for the current version of the analysis. Results are gathered in a specific directory of this name and result files also carry this suffix

    • output-path : set the path to the directory that will contain the results for the current corpus (default: working directory)

    • termino File : Name of a file containing a list of testified terms. The testified terms have to be provided in the TreeTagger output format.

    • monolexical-all : all occurrences of monolexical phrases are considered as term candidates. The value is 0 or 1.

    • monolexical-included : occurrences of monolexical term candidates that appear in complex term candidates are also displayed. The value is 0 or 1.

    • match-type [loose or strict] :

      • loose : testified terms match either inflected or lemmatized forms of each word

      • strict : testified terms match the combination of inflected form and POS tag of each word

      • unspecified option: testified terms match inflected forms of words

    • xmlout : display of the parsed term candidates in XML format. The value is 0 or 1.

    • termList : display of a list of terms and sub-terms along with their frequency. To display only term candidates containing more than one word (multi-word term candidates), specify the value multi. With the value all, or if no value is specified, all term candidates (monolexical and multi-word) are displayed.

    • printChunking : display of the corpus marked with phrases in an HTML file, along with an indication of whether they are term candidates or not. The value is 0 or 1.

    • TC-for-BioLG : annotation of the corpus with term candidates in an XML format compatible with the BioLG software. The value is 0 or 1.

    • TT-for-BioLG : annotation of the corpus with testified terms in an XML format compatible with the BioLG software (http://www.it.utu.fi/biolg/, a biologically tuned version of the Link Grammar Parser). The value is 0 or 1.

    • XML-corpus-for-BioLG : creation of a BioLG compatible XML version of the corpus with PoS tags marked for each word. The value is 0 or 1.

    • debug : display of information on parsed phrases (i.e. term candidates) in a text format. The value is 0 or 1.

    • annotate-only : only annotate testified terms (no acquisition). The value is 0 or 1.

    • TTG-style-term-candidates : term candidates are displayed in TreeTagger output format. The term separator is the sentence boundary tag SENT. To extract only term candidates containing more than one word (multi-word term candidates), specify the value multi. With the value all, or if no value is specified, all term candidates (monolexical and multi-word) are displayed.
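
The three behaviours of the match-type option listed above can be illustrated with a small Python sketch. It is one possible reading of the descriptions, not YaTeA's matching code: the functions word_matches and term_matches are hypothetical, comparison is assumed to be case-sensitive, and "loose" is read as "the inflected forms are equal or the lemmas are equal".

```python
from typing import Optional, Sequence, Tuple

Token = Tuple[str, str, str]  # (inflected form, POS tag, lemma)


def word_matches(term_word: Token, corpus_word: Token, match_type: Optional[str] = None) -> bool:
    """Compare one word of a testified term with one corpus word."""
    t_form, t_pos, t_lemma = term_word
    c_form, c_pos, c_lemma = corpus_word
    if match_type == "loose":
        return t_form == c_form or t_lemma == c_lemma
    if match_type == "strict":
        return t_form == c_form and t_pos == c_pos
    if match_type is None:  # option left unspecified
        return t_form == c_form
    raise ValueError(f"unknown match type: {match_type!r}")


def term_matches(term: Sequence[Token], window: Sequence[Token], match_type: Optional[str] = None) -> bool:
    """A term matches a corpus window when every word matches in order."""
    return len(term) == len(window) and all(
        word_matches(t, c, match_type) for t, c in zip(term, window)
    )


testified = [("transcription", "NN", "transcription"), ("factor", "NN", "factor")]
corpus = [("transcription", "NN", "transcription"), ("factors", "NNS", "factor")]

for mode in ("loose", "strict", None):
    print(mode, term_matches(testified, corpus, mode))
```

In this example only the loose mode accepts the plural occurrence "transcription factors" for the testified term "transcription factor", because the match goes through the lemma.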

Linguistic configuration files

Available in share/YaTeA/config from the archive directory.

TODO

EXAMPLES

TODO

CONTRIBUTORS

  • Charlotte Roze has defined the configuration files to process a corpus tagged with Flemm.

  • Wiktoria Golik, Robert Bossy and Claire Nédellec (MIG/INRA) have corrected bugs and improved the mapping of testified terms.

SEE ALSO

Sophie Aubin and Thierry Hamon. Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006). pages 380-387. Tapio Salakoski, Filip Ginter, Sampo Pyysalo, Tapio Pahikkala (Eds). August 2006. LNAI 4139.

AUTHORS

Thierry Hamon <thierry.hamon@limsi.fr> and Sophie Aubin <sophie.aubin@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2005 by Thierry Hamon and Sophie Aubin

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.