Handles the configuration of the parser.
Fields describing fields used with nodes, such as form, pos, lemma...
- field_names (ArrayRef[Str])
Field names (for conversion of field index to field name)
- field_names_hash (HashRef[Str])
1 for each field name to easily check if a field name exists
- field_indexes (HashRef[Str])
Index of each field name in field_names (for conversion of field name to field index)
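As a rough illustration of how these three structures relate, the sketch below derives the existence hash and the index hash from a list of field names. The field list is a made-up example and the plain variables merely stand in for the object's attributes; this is not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical field names; the real ones come from the config file.
my @field_names = qw(ord form lemma pos parent);

# 1 for each existing field name, for quick existence checks.
my %field_names_hash = map { $_ => 1 } @field_names;

# Field name -> position of that name in @field_names.
my %field_indexes;
@field_indexes{@field_names} = ( 0 .. $#field_names );

print "pos has index $field_indexes{pos}\n";
print "there is no field called tag\n" if !exists $field_names_hash{tag};
```

Keeping both directions precomputed makes each lookup a single hash or array access.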
Most of the settings are set by a config file in YAML format. However, you do not have to understand YAML to change the settings, provided that you keep things like the formatting of the file unchanged (some whitespace is significant, etc.). In fact, only a subset of what YAML provides is used.
The contents of a line from the # character to the end of the line are a comment and are ignored (if you need to actually use the # sign, you can quote it, e.g. '#empty#' is interpreted as #empty#). Lines that are empty or contain only whitespace characters are ignored as well.
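The comment rule can be sketched in a few lines of Perl. This is a simplified illustration only: the helper name strip_comment is hypothetical, it handles single quotes only, and the real YAML parser used by the module follows more detailed quoting rules.

```perl
use strict;
use warnings;

# Remove a trailing comment from one config line, leaving a '#' that
# appears inside single quotes untouched (simplified quoting rules).
sub strip_comment {
    my ($line) = @_;
    my $result   = '';
    my $in_quote = 0;
    for my $char ( split //, $line ) {
        if ( $char eq q{'} ) {
            $in_quote = !$in_quote;
        }
        elsif ( $char eq '#' && !$in_quote ) {
            last;    # the rest of the line is a comment
        }
        $result .= $char;
    }
    $result =~ s/\s+\z//;    # drop trailing whitespace
    return $result;
}

for my $line ( "DEBUG: 0   # no debug output", "form: '#empty#'  # quoted" ) {
    my $content = strip_comment($line);
    print "$content\n" if $content ne '';    # skip lines with no content
}
```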
Some of the settings are ignored when in parsing mode (i.e. not training). These are use_edge_features_cache (turned off) and number_of_iterations (irrelevant).
These are settings which are acquired from the configuration file:
Lowercase names of fields in the input file (the data fields are to be separated by tabs in the input file). Use [a-z0-9_] only, using always at least one letter. Use unique names, i.e. devise some names even for unused fields.
Field values to set for the (technical) root node.
Name of field containing ord of the parent of the node (also called "head" or "governing node").
Buckets to use for the distance() function (positive integers in any order). Each distance gets bucketed in the highest lower bucket (absolute-value-wise).
    distance_buckets:
      - 1
      - 2
      - 3
      - 4
      - 5
      - 11
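A minimal sketch of the bucketing, under stated assumptions: "lower" is read as "lower or equal", the sign of the distance is kept, and a distance below the smallest bucket maps to 0. The function name bucket_distance is hypothetical and the module's own implementation may differ in these details.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Map a signed distance to the largest bucket that does not exceed its
# absolute value, keeping the sign (assumed behaviour, see the text above).
sub bucket_distance {
    my ( $distance, @buckets ) = @_;
    my $abs        = abs $distance;
    my @candidates = grep { $_ <= $abs } @buckets;
    return 0 if !@candidates;    # assumption: nothing fits below the smallest bucket
    my $bucket = max @candidates;
    return $distance < 0 ? -$bucket : $bucket;
}

my @distance_buckets = ( 1, 2, 3, 4, 5, 11 );
for my $distance ( 3, 7, -12 ) {
    print "$distance -> ", bucket_distance( $distance, @distance_buckets ), "\n";
}
```

With the buckets from the example, distances 6 to 10 all share bucket 5, so only a few long-distance classes have to be learned.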
Features to be computed on data.
Features for the unlabelled parser are set under
features, the labeller features under
Use the (lowercase) input file field names (e.g. pos) to use the field of the (child) node, uppercase them (e.g. POS) to use the field of the parent, joined together by the | sign to form the features (e.g.
Prefix the field names by 1. or 2. to use the field on the first or second node in the sentence, based on their order in the sentence, regardless of which is parent and which is child (e.g. 1.pos for pos of first of the nodes).
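The lowercase/uppercase convention can be illustrated with a small sketch. The template 'pos|LEMMA', the node hashes and the function name feature_value are hypothetical examples, and the 1./2. prefixes are left out for brevity; the module's real feature computation is more involved.

```perl
use strict;
use warnings;

# Build a feature value from a template: lowercase names read the child
# node, uppercase names read the parent node, parts are joined by '|'.
sub feature_value {
    my ( $template, $child, $parent ) = @_;
    my @values;
    for my $name ( split /\|/, $template ) {
        my $node  = $name eq uc($name) ? $parent : $child;
        my $value = $node->{ lc $name };
        push @values, defined $value ? $value : '';
    }
    return join '|', @values;
}

# Hypothetical child and parent nodes as plain hashes.
my $child  = { form => 'dogs', lemma => 'dog',  pos => 'NNS' };
my $parent = { form => 'bark', lemma => 'bark', pos => 'VBP' };

print feature_value( 'pos|LEMMA', $child, $parent ), "\n";
```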
There are also several predefined functions that you can make use of. Usually you can write the function name in lowercase to invoke them on the child field, uppercase for parent, or prefixed by 1. or 2. for first or second node (e.g.
CHILDNO() to get the number of parent node's children). The parameter of a function must be a (child) field name, or an integer (as the
bucketed ord-wise distance of child and parent:
parent - child attachment direction: signum(ORD minus ord)
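In Perl, the signum of a difference is exactly what the numeric comparison operator returns. A minimal sketch, with a hypothetical function name:

```perl
use strict;
use warnings;

# signum(ORD minus ord): 1 if the parent follows the child in the
# sentence, -1 if it precedes it, 0 if both ords are equal.
sub attachment_direction {
    my ( $child_ord, $parent_ord ) = @_;
    return $parent_ord <=> $child_ord;
}

print attachment_direction( 2, 5 ), "\n";    # parent to the right of the child
print attachment_direction( 5, 2 ), "\n";    # parent to the left of the child
```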
value of the specified field on the ord-wise preceding node (use PRECEDING(field) to get field on node preceding the PARENT)
value of the specified field on the ord-wise following node
value of the specified field for each node which is ord-wise between the child node and the parent node
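A sketch of collecting the values between two nodes, assuming the sentence is an array of node hashes indexed by ord (index 0 being the technical root). The data layout and the name between_values are illustrative assumptions.

```perl
use strict;
use warnings;

# Return the given field of every node that lies strictly between the
# child and the parent in the sentence order.
sub between_values {
    my ( $nodes, $field, $child_ord, $parent_ord ) = @_;
    my ( $low, $high ) = sort { $a <=> $b } ( $child_ord, $parent_ord );
    return map { $nodes->[$_]{$field} } ( $low + 1 .. $high - 1 );
}

# Hypothetical sentence; the array index equals the ord of the node.
my @nodes = map { { pos => $_ } } qw(#root# DT JJ NN VB);

print join( ' ', between_values( \@nodes, 'pos', 1, 4 ) ), "\n";
```

For adjacent nodes the range is empty, so no values are produced.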
1 if the value of field1 is the same as the value of field2. For fields with multiple values, it has the meaning of an "exists" operator: it returns 1 if there is at least one pair of values of each field that are the same.
0 if the values don't match.
-1 if (at least) one of the values is undef (may be also represented by an empty string)
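The three return values can be sketched as follows. The sketch assumes that multiple values of a field are stored in one string separated by single spaces; the name field_values_equal is hypothetical.

```perl
use strict;
use warnings;

# 1 if the two fields share at least one value, 0 if they share none,
# -1 if either field is undefined or empty.
sub field_values_equal {
    my ( $value1, $value2 ) = @_;
    for my $value ( $value1, $value2 ) {
        return -1 if !defined $value || $value eq '';
    }
    my %seen = map { $_ => 1 } split / /, $value1;
    for my $candidate ( split / /, $value2 ) {
        return 1 if $seen{$candidate};
    }
    return 0;
}

print field_values_equal( 'NN VB', 'JJ VB' ), "\n";    # one shared value
print field_values_equal( 'NN',    'VB' ),    "\n";    # no shared value
print field_values_equal( undef,   'VB' ),    "\n";    # undefined input
```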
field1 is taken from parent node and field2 from child node
equalspc but looks at the given position (1 character) in the given field
substring of field value beginning at given start position (0-based) of given length; standard substr behaviour, i.e. both start and length can be negative and length can be omitted, feature function to be then written as
array_field's value is an array of values separated by single spaces (' '), index_field's value is a zero-based index of a value in the array to be returned (used e.g. for tree distance)
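The array lookup amounts to a split and an index. In this minimal sketch both field values are passed in directly, and the name array_at is hypothetical:

```perl
use strict;
use warnings;

# Split the array field on single spaces and return the element at the
# zero-based index taken from the index field (undef if out of range).
sub array_at {
    my ( $array_value, $index ) = @_;
    my @values = split / /, $array_value;
    return $values[$index];
}

# Hypothetical array field holding one number per node of the sentence.
print array_at( '0 1 2 2 3', 3 ), "\n";
```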
returns 1 if node is the first in the sentence, 0 otherwise
1 if node is the last in the sentence, 0 otherwise
1 if node is the first child of its parent, 0 otherwise
1 if node is the last child of its parent, 0 otherwise
returns number of node's children
is the rightmost of all left children of its parent
is the leftmost of all right children of its parent
label of parent (to be used only in labeller features); label is somewhat special, it cannot be used as
Features containing the LABEL() function are dynamic, i.e. they cannot be precomputed and are always computed just at the time they are needed.
label of previous sibling (to be used only in labeller features); prevlabel is somewhat special, it cannot be used as
Features containing the prevlabel() function are dynamic, i.e. they cannot be precomputed and are always computed just at the time they are needed.
These settings are probably better left as they are, but it might be advantageous to have the ability of changing them sometimes, especially when experimenting.
You can set the values in various ways. The order of priorities is:
- 1 set in runtime
i.e. set after having created a new Config object:
    my $config = Treex::Tool::Parser::MSTperl::Config->new(
        config_file => 'my_config.config',
    );
    $config->DEBUG(4);
The value is only valid from the time of setting.
- 2 set in config file
in the perl script:
    my $config = Treex::Tool::Parser::MSTperl::Config->new(
        config_file => 'my_config.config',
    );
- 3 set in the constructor
i.e. set while creating a new Config object:
in the config file:
    # DEBUG: 0
in the perl script:
    my $config = Treex::Tool::Parser::MSTperl::Config->new(
        config_file => 'my_config.config',
        DEBUG       => 4,
    );
For the setting to take effect, you must not set another value in the config file (you can comment out setting it with '#').
- 4 the default value
Used if the value is not set in runtime, in constructor or in the config file.
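The order of priorities can be pictured as successive overrides. The sketch below uses plain hashes in place of the real Config object; resolve_settings and the sample values are hypothetical and only illustrate the precedence described above.

```perl
use strict;
use warnings;

my %defaults = ( DEBUG => 0, number_of_iterations => 3 );

# Later sources override earlier ones: defaults first, then constructor
# arguments, then values read from the config file.
sub resolve_settings {
    my ( $constructor_args, $config_file_values ) = @_;
    return ( %defaults, %{$constructor_args}, %{$config_file_values} );
}

# DEBUG is not set in the (hypothetical) config file, so the value
# passed to the constructor applies.
my %settings = resolve_settings( { DEBUG => 4 }, { number_of_iterations => 10 } );
print "DEBUG = $settings{DEBUG}\n";

# A value set at runtime overrides everything, but only from now on.
$settings{DEBUG} = 1;
print "DEBUG = $settings{DEBUG}\n";
```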
Please note that setting some of the values at runtime might not be a good idea.
The options are listed here together with their defaults.
- DEBUG: 0
An integer specifying how much debug information you will be getting while running the program, ranging from 0 (no debug info) through 1 (progress messages) through 2, 3 and 4 to 5 (more and more debug info).
If you set this value to something higher than 1, you should always redirect the output to a file as printing it to the console is very very slow (and there is so much info that you wouldn't be able to read anything anyway).
The ability to change the value while the program is running might be beneficial, e.g. if you want to debug only a particular part of the program.
- number_of_iterations: 3, labeller_number_of_iterations: 3
How many times the trainer (Tagger::MSTperl::Trainer) should go through all the training data.
- use_edge_features_cache: 0, labeller_use_edge_features_cache: 0
Currently deprecated, unmaintained and probably to be removed.
Turns on and off using the edge features cache. Using cache should be turned on (1) if training with a lot of RAM or on small training data, as it uses a lot of memory but speeds up the training greatly (approx. by 30% to 50%). If you need to save RAM, turn it off (0).
- labeller_algorithm: 16
Algorithm used for Viterbi labelling as well as for training. Several possibilities were tried out, especially regarding the emission probabilities used in the Viterbi algorithm; this is for development purposes only, preferably do not use.
- (0) MIRA-trained weights
recomputed by +abs(min) and converted to probs, transitions by MLE on labels
- (1) ditto, NOT converted to probs
should be same as 0
- (2) ditto, sum in Viterbi instead of product
new_prob = old_prob + emiss*trans
- (3) ditto, no recomputation
just strip <= 0
- (4) basic MLE
no MIRA, no smoothing, uniform feature weights, blind (unigram) transition backoff, blind emission backoff (but should not be necessary)
- (5) full Viterbi
ditto, transition probs lambda smoothing by EM
- (8) MIRA for all
completely new, based on reading, no MLE, MIRA for all, same features for label unigrams and label bigrams
- (9) ditto, initialize emissions and transitions by MLE
- (10) 0 + fixed best state selection
- (11) 10 + tries to use all possible labels
- (12) 10 + EM for smoothing of transitions
- (13) 11 + EM for smoothing of transitions
- (14) 10 + update uses transition probs as well
- (15) 12 + update uses transition probs as well
- (16) 8 + transitions by MLE & EM on label pairs
multiplied with emission score in Viterbi and added to last state score
- (17) ditto, different transition computation for negative scores
- (18) 16 + no Viterbi summing
- (19) 16, better formula for combining emissions and transitions
- (20) MIRA for all
- (21) MIRA for all, with Viterbi
- (22) MIRA for all, sentence = one sequence (disregarding tree structure)
- SEQUENCE_BOUNDARY_LABEL: '###'
This is only a technical thing; a label must be assigned to the (basically virtual) boundary of a sequence, different from any label used in the data. The default value is '###', so if you use this exact label as a valid label in your data, change the setting to something else. If nothing goes wrong, you should never see this label in the output; however, it is contained in the model and used for "transition scores" to score the "transition" between the sequence boundary and the first/last node (i.e. it determines the scores of labels used as the first or last label in the sequence where no actual transition takes place and the transition scores would otherwise get ignored).
Number of states to keep when pruning. The pruning takes place after each Viterbi step (i.e. after each computation of possible labels and their scores for one edge). For more details see the
- EM_EPSILON: 0.00001
Stopping criterion of the EM algorithm, which is used to compute smoothing parameters for linear combination smoothing of transition probabilities in some variants of the Labeller (when the sum of changes of the smoothing parameters is lower than the epsilon, the algorithm stops).
- EM_heldout_data_at: 0.9
A number between 0 and 1 specifying where in the training data the heldout data for the EM algorithm start (e.g. 0.75 means that the first 75% of sentences are training data and the last 25% are heldout data).
The training/heldout data division only affects computation of transition probabilities by MLE, it does not affect MIRA training or MLE for emission probabilities.
If EM is not used for smoothing, all data are used as training data.
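The two EM settings above can be sketched as follows. Both helpers are hypothetical illustrations: the split assumes the boundary is rounded down to a whole sentence, and the convergence test assumes that "sum of change" means the sum of absolute changes of the smoothing parameters.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Split the sentences at the given fraction: the first part is training
# data, the rest is heldout data.
sub split_training_heldout {
    my ( $sentences, $heldout_at ) = @_;
    my $boundary = int( @{$sentences} * $heldout_at );
    my @training = @{$sentences}[ 0 .. $boundary - 1 ];
    my @heldout  = @{$sentences}[ $boundary .. $#{$sentences} ];
    return ( \@training, \@heldout );
}

# True once the summed absolute change of the parameters drops below epsilon.
sub em_converged {
    my ( $old_params, $new_params, $epsilon ) = @_;
    my $change = sum( 0,
        map { abs( $new_params->[$_] - $old_params->[$_] ) } 0 .. $#{$new_params} );
    return $change < $epsilon;
}

my @sentences = qw(s1 s2 s3 s4);    # hypothetical sentence identifiers
my ( $training, $heldout ) = split_training_heldout( \@sentences, 0.75 );
print scalar( @{$training} ), " training, ", scalar( @{$heldout} ), " heldout\n";
```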
Provide access to things needed in more than one of the other packages.
Provides access to unlabelled features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
Provides access to labeller features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
The best source of information about all the possible settings is the configuration file itself (usually called config.txt), as it is richly commented and accompanied by real examples at the same time.
- my $config = Treex::Tool::Parser::MSTperl::Config->new(config_file => 'file.config')
Reads the configuration file (in YAML format) and applies the settings.
- field_name2index ($field_name)
Fields are referred to by names in the config files but by indexes in the code. Therefore this conversion function is necessary; the other direction of the conversion is ensured by the
Rudolf Rosa <email@example.com>
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.