NAME

Treex::Tool::Parser::MSTperl - a non-projective dependency natural language parser (pure Perl implementation of the MST parser)

VERSION

version 0.11949

SYNOPSIS

Analysis of the Czech sentence "Martin jde po ulici." ("Martin walks on the street.") when only the word forms are available (i.e. you do not have a tagger that would provide POS tags and/or lemmas).

In shell (or in any other way):

 # Download the config file
 wget http://ufallab.ms.mff.cuni.cz/tectomt/share/data/models/parser/mst_perl/cs/pdt_form.config
 # Download and ungzip the unlabelled parsing model
 wget http://ufallab.ms.mff.cuni.cz/tectomt/share/data/models/parser/mst_perl/cs/pdt_form.model.gz
 gunzip pdt_form.model.gz
 # Download and ungzip the deprel labelling model
 wget http://ufallab.ms.mff.cuni.cz/tectomt/share/data/models/parser/mst_perl/cs/pdt_form.lmodel.gz
 gunzip pdt_form.lmodel.gz

(the pdt_form model uses only the wordforms to build dependency trees)

In Perl:

 # the words = child nodes
 my @words = (['Martin'], ['jde'], ['po'], ['ulici'], ['.']);
 # potential parent nodes
 my @words_with_root = @words;
 unshift @words_with_root, ['ROOT'];
 # i.e. @words_with_root = (['ROOT'], ['Martin'], ['jde'], ['po'], ['ulici'], ['.']);

 use Treex::Tool::Parser::MSTperl;

 # Initialize MSTperl
 my $mstperl = Treex::Tool::Parser::MSTperl->new( model_name => 'pdt_form' );

 # Parse the sentence - returns (ArrayRef[Int], ArrayRef[Str])
 my ($parents, $deprels) = $mstperl->parse_labelled( \@words );

 # Let's see what we got:
 print "child -> parent (deprel):\n------------------------\n";
 for (my $i = 0; $i < @words; $i++) {
    my $word = $words[$i]->[0];
    my $parent_ord = $parents->[$i];
    my $parent = $words_with_root[$parent_ord]->[0];
    my $deprel = $deprels->[$i];
    print "$word -> $parent ($deprel)\n";
 }

This should return:

 child -> parent (deprel):
 ------------------------
 Martin -> jde (Sb)
 jde -> ROOT (Pred)
 po -> jde (AuxP)
 ulici -> po (Adv)
 . -> ROOT (AuxK)

which is the correct parse tree with correct deprels assigned.

DESCRIPTION

This is a Perl implementation of the MST Parser described in McDonald et al.: Non-projective Dependency Parsing using Spanning Tree Algorithms, 2005, in Proc. HLT/EMNLP.

Treex::Tool::Parser::MSTperl contains an unlabelled parser (Treex::Tool::Parser::MSTperl::Parser) and a dependency relation (deprel) labeller (Treex::Tool::Parser::MSTperl::Labeller), which, if chained together, provide a labelled dependency parser.

The Treex::Tool::Parser::MSTperl package serves as a wrapper for the underlying packages and should be sufficient for the basic tasks. For any special needs, feel free to use the underlying packages directly.

Please note that the parser does non-projective parsing and is therefore best suited to languages with frequent non-projective constructions (e.g. Czech or Dutch). Mostly projective languages (e.g. English) can be parsed by MSTperl as well, but non-projective edges can sometimes appear in the output. To do real projective parsing, it would be necessary to change the core algorithm of the parser (the Eisner algorithm would have to be used instead of Chu-Liu-Edmonds).
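If you need to know whether a particular parse contains non-projective edges, the returned parents array can be checked directly. The following is a minimal sketch and not part of MSTperl: the helper name nonprojective_edges is hypothetical, and it assumes the parents format described under parse_labelled() (1-based parent ords, 0 for the artificial root) and that the input is a well-formed tree.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MSTperl).
# Returns the 1-based ords of words whose edge to their parent is
# non-projective, i.e. some word lying between the child and its parent
# is not a descendant of that parent.
# Assumes the parents array describes a tree (no cycles).
sub nonprojective_edges {
    my ($parents) = @_;
    my @parent_of = (undef, @$parents);    # shift to 1-based; 0 = root
    my @bad;
    for my $child (1 .. $#parent_of) {
        my $parent = $parent_of[$child];
        my ($lo, $hi) = $parent < $child ? ($parent, $child) : ($child, $parent);
        for my $between ($lo + 1 .. $hi - 1) {
            # climb from $between towards the root, looking for $parent
            my $node = $between;
            $node = $parent_of[$node] while $node != 0 && $node != $parent;
            if ($node != $parent) {
                push @bad, $child;
                last;
            }
        }
    }
    return @bad;
}

# Parents matching the SYNOPSIS output: a projective tree.
my @none = nonprojective_edges([2, 0, 2, 3, 0]);
# A made-up 4-word tree in which the edge 4 -> 2 crosses the root edge of word 3.
my @crossing = nonprojective_edges([3, 4, 0, 3]);

print "non-projective in synopsis parse: ", scalar(@none), "\n";
print "non-projective in made-up parse: @crossing\n";
```

The check is quadratic in sentence length in the worst case, which is usually acceptable for single sentences.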

Please note that the parser does dependency parsing, producing a dependency tree as its output. The parser cannot be used to produce phrase-structure trees.

Models necessary for these tools can be downloaded from http://ufallab.ms.mff.cuni.cz/tectomt/share/data/models/mst_perl_parser/. The .model files are unlabelled parsing models and .lmodel are labelling models. Many models for Czech and a few models for English are provided.

If you have a dependency treebank, you can train your own model - see Treex::Tool::Parser::MSTperl::TrainerLabelling and Treex::Tool::Parser::MSTperl::TrainerUnlabelled. The parameters and the feature set in the .config files are tuned for parsing Czech, so some tuning might be helpful when parsing other languages (all of the necessary settings can be made in the config file - see Treex::Tool::Parser::MSTperl::Config).

No models are currently provided for languages other than Czech or English. If you want to use the parser for another language, you have to train your own model.

METHODS

my $mstperl = Treex::Tool::Parser::MSTperl->new( model_dir => '.', model_name => 'pdt_form' );

Creates an instance of MSTperl, capable of parsing sentences, using the config file model_dir/model_name.config (required), the unlabelled parsing model file model_dir/model_name.model (required) and the labelling model file model_dir/model_name.lmodel (required only for labelled parsing). The required files can be downloaded from http://ufallab.ms.mff.cuni.cz/tectomt/share/data/models/mst_perl_parser/; or, you can create your own config file, train your own model(s) following your config and use these files for parsing.

The model_dir parameter is optional and defaults to . (i.e. the current directory). The model_name parameter is required.
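Before constructing the parser, it can be convenient to check that the files named above are actually present. The following is a minimal sketch under the file-naming scheme described here (model_dir/model_name.config, .model and .lmodel). The helper missing_model_files and its labelled flag are hypothetical and not part of MSTperl.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of MSTperl): list the files the parser
# would need that are not present on disk.
sub missing_model_files {
    my (%arg) = @_;
    my $dir  = defined $arg{model_dir} ? $arg{model_dir} : '.';
    my $name = $arg{model_name};
    my @ext  = ('config', 'model');
    push @ext, 'lmodel' if $arg{labelled};    # only needed for labelled parsing
    my @missing;
    for my $ext (@ext) {
        my $path = File::Spec->catfile($dir, "$name.$ext");
        push @missing, $path unless -f $path;
    }
    return @missing;
}

my @missing = missing_model_files(model_name => 'pdt_form', labelled => 1);
print @missing ? "Missing: @missing\n" : "All model files found.\n";
```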

my ($parents, $deprels) = $mstperl->parse_labelled($sentence);

Performs labelled parsing of the sentence.

The sentence is represented as (a reference to) an array of the words of the sentence. Each word is represented as (a reference to) an array of the fields required by the config. For example, the config may contain:

 field_names:
  - form
  - lemma
  - coarse_tag
  - parent_ord
  - afun

These are the fields used by the models. Their meaning depends on the treebank used for training the models. We typically used PDT for Czech models and CoNLL for English models. (The coarse tag often stands for the first two characters of the full POS tag. For Czech, the coarse tag devised by Collins is used.)

The fields specified in the config file as the parent_ord and the label, e.g.:

 parent_ord: parent_ord
 label: afun

are the fields computed by the unlabelled parser (parent_ord) and the labeller (label). Obviously these are not to be specified on the input.

The sentence "The sheep eat grass." to be parsed using such a config would then be represented e.g. as:

 $sentence = [
    ["The", "the", "DT"],
    ["sheep", "sheep", "NN"],
    ["eat", "eat", "VB"],
    ["grass", "grass", "NN"],
    [".", ".", "."],
 ];
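If your tokens are stored as hashes rather than positional arrays, the input structure can be derived from the field list. The following is a minimal sketch. The helper sentence_from_records is hypothetical, and it assumes (as in the config excerpt above) that the computed fields parent_ord and afun come last in field_names, so the input fields are simply the remaining ones in config order.

```perl
use strict;
use warnings;

# Field names copied from the config excerpt, and the two fields that the
# parser and the labeller compute themselves.
my @field_names = qw(form lemma coarse_tag parent_ord afun);
my %computed    = map { $_ => 1 } qw(parent_ord afun);

# Hypothetical helper (not part of MSTperl): turn hashrefs into the
# array-of-arrays input, keeping the field order of the config.
sub sentence_from_records {
    my ($records) = @_;
    my @input_fields = grep { !$computed{$_} } @field_names;
    my @sentence;
    for my $record (@$records) {
        push @sentence, [ map { $record->{$_} } @input_fields ];
    }
    return \@sentence;
}

my $sentence = sentence_from_records([
    { form => 'The',   lemma => 'the',   coarse_tag => 'DT' },
    { form => 'sheep', lemma => 'sheep', coarse_tag => 'NN' },
    { form => 'eat',   lemma => 'eat',   coarse_tag => 'VB' },
    { form => 'grass', lemma => 'grass', coarse_tag => 'NN' },
    { form => '.',     lemma => '.',     coarse_tag => '.'  },
]);

print join(' ', map { $_->[0] } @$sentence), "\n";
```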

MSTperl returns two array refs. The first one describes the dependency tree structure by listing a parent node for each word of the sentence, represented by an integer. The numbering of the parents is 1-based, 0 standing for the artificial root node. The second one contains deprels assigned to the words (or, to be more accurate, to the edges between each word and its parent), as strings.
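Because the parents array is indexed from 0 while the parent ords it contains are 1-based, it is easy to be off by one when walking the tree. The following minimal sketch inverts the parents array into per-node child lists. The helper children_of is hypothetical, and the parents values are those implied by the SYNOPSIS output rather than taken from a live parser run.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MSTperl): invert the parents array into
# a list of children for every node. Index 0 is the artificial root,
# words are numbered from 1.
sub children_of {
    my ($parents) = @_;
    my @children = map { [] } 0 .. scalar @$parents;
    for my $i (0 .. $#$parents) {
        my $ord = $i + 1;    # array index -> 1-based word ord
        push @{ $children[ $parents->[$i] ] }, $ord;
    }
    return \@children;
}

my @forms    = ('Martin', 'jde', 'po', 'ulici', '.');
my $parents  = [2, 0, 2, 3, 0];    # as implied by the SYNOPSIS output
my $children = children_of($parents);

for my $ord (0 .. $#$children) {
    my @kids = map { $forms[$_ - 1] } @{ $children->[$ord] };
    next unless @kids;
    my $head = $ord == 0 ? 'ROOT' : $forms[$ord - 1];
    print "$head: @kids\n";
}
# prints:
# ROOT: jde .
# jde: Martin po
# po: ulici
```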

my $parents = $mstperl->parse_unlabelled($sentence);

Similar to parse_labelled(), but only unlabelled parsing is performed (a labelling model is not used) and only the parents are returned.

AUTHORS

Rudolf Rosa <rosa@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2012 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.