
NAME

Treex::Tool::Parser::MSTperl::ModelLabelling

VERSION

version 0.08055

DESCRIPTION

This is an in-memory representation of a labelling model, extended from Treex::Tool::Parser::MSTperl::ModelBase.

FIELDS

Inherited from base package

Fields inherited from Treex::Tool::Parser::MSTperl::ModelBase.

config

Instance of Treex::Tool::Parser::MSTperl::Config containing settings to be used for the model.

Currently the settings most relevant to the model are the following:

EM_EPSILON

See "EM_EPSILON" in Treex::Tool::Parser::MSTperl::Config.

labeller_algorithm

See "labeller_algorithm" in Treex::Tool::Parser::MSTperl::Config.

labelledFeaturesControl

See "labelledFeaturesControl" in Treex::Tool::Parser::MSTperl::Config.

SEQUENCE_BOUNDARY_LABEL

See "SEQUENCE_BOUNDARY_LABEL" in Treex::Tool::Parser::MSTperl::Config.

featuresControl

Provides access to labeller features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.

Label scoring

emissions

Emission scores for Viterbi. They follow the edge-based factorization and provide scores for various labels for an edge based on its features.

The structure is:

  emissions->{feature}->{label} = score

Depending on the algorithm used, the scores may or may not be probabilities, and they are either computed by MIRA or obtained by standard MLE.
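
As a minimal sketch of how such a structure could be read, the following snippet uses a plain hash in place of the model field. The feature strings, labels and scores are made up for illustration, and best_label_for_feature is a hypothetical helper, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical emission scores: feature => { label => score }.
# Feature names, labels and values are illustrative only.
my $emissions = {
    'POS|noun' => { Subj => 0.6, Obj  => 0.3, Atr => 0.1 },
    'POS|adj'  => { Atr  => 0.9, Subj => 0.1 },
};

# Return the highest-scoring label for one feature,
# or nothing if the feature is unknown to the model.
sub best_label_for_feature {
    my ( $emissions, $feature ) = @_;
    my $scores = $emissions->{$feature} or return;
    my ($best) =
        sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
    return $best;
}

print best_label_for_feature( $emissions, 'POS|noun' ), "\n";    # prints "Subj"
```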

transitions

Transition scores for Viterbi. They follow the first-order Markov chain edge-based factorization and provide scores for various labels for an edge, always based on the label of the previous edge and possibly also on the features of the edge.

Depending on the algorithm used, the scores may or may not be probabilities, and they are either obtained by standard MLE or computed by MIRA.

The structure is:

  transitions->{label_prev}->{label_this} = prob

or

  transitions->{feature}->{label_prev}->{label_this} = score
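
To illustrate how emission and transition scores can be combined in a first-order Viterbi search, here is a small self-contained sketch. It is not the labeller's actual implementation: it assumes the scores are probabilities that are multiplied, that each edge has exactly one feature, and that the first (feature-less) form of transitions is used. The '###' boundary label, the feature names and all numbers are made up.

```perl
use strict;
use warnings;

my $BOUNDARY = '###';    # stand-in for SEQUENCE_BOUNDARY_LABEL
my @labels   = qw(Subj Obj);

# transitions->{label_prev}->{label_this} = prob (hypothetical values)
my $transitions = {
    $BOUNDARY => { Subj => 0.8, Obj => 0.2 },
    Subj      => { Subj => 0.1, Obj => 0.9 },
    Obj       => { Subj => 0.5, Obj => 0.5 },
};

# emissions->{feature}->{label} = score (hypothetical values)
my $emissions = {
    f1 => { Subj => 0.7, Obj => 0.3 },
    f2 => { Subj => 0.4, Obj => 0.6 },
};

# First-order Viterbi over a sequence of edges, one feature per edge.
# Returns the best-scoring label sequence as an array ref.
sub viterbi {
    my ($features) = @_;

    # For each last label keep only the best path ending in it:
    # label => [ score, [ labels so far ] ]
    my %best = ( $BOUNDARY => [ 1, [] ] );
    for my $feature (@$features) {
        my %next;
        for my $prev ( sort keys %best ) {
            my ( $prev_score, $prev_path ) = @{ $best{$prev} };
            for my $label (@labels) {
                my $score = $prev_score
                    * ( $transitions->{$prev}{$label} // 0 )
                    * ( $emissions->{$feature}{$label} // 0 );
                if ( !$next{$label} || $score > $next{$label}[0] ) {
                    $next{$label} = [ $score, [ @$prev_path, $label ] ];
                }
            }
        }
        %best = %next;
    }
    my ($winner) =
        sort { $best{$b}[0] <=> $best{$a}[0] || $a cmp $b } keys %best;
    return $best{$winner}[1];
}

print join( ' ', @{ viterbi( [ 'f1', 'f2' ] ) } ), "\n";    # prints "Subj Obj"
```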

Transitions smoothing

In some algorithms linear combination smoothing is used for transition probabilities. The resulting transition probability is then obtained as:

 PROB(label|prev_label) =
    smooth_bigrams  * transitions->{prev_label}->{label} +
    smooth_unigrams * unigrams->{label} +
    smooth_uniform  * uniform_prob

smooth_bigrams
smooth_unigrams
smooth_uniform

The actual smoothing parameters, computed by the EM algorithm. Each of them is between 0 and 1 and together they sum to 1.
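
The linear combination can be sketched as follows. This is an illustration with plain hashes, not the module's code: all values are hypothetical, smoothed_transition_prob is a made-up helper, and the sketch assumes that the uniform term is weighted by the uniform label probability, as the description of compute_smoothing_params suggests.

```perl
use strict;
use warnings;

# Hypothetical probabilities estimated from training data.
my $unigrams    = { Subj => 0.5, Obj => 0.3, Atr => 0.2 };
my $transitions = { Subj => { Obj => 0.7, Atr => 0.3 } };

# Hypothetical smoothing parameters; together they sum to 1.
my %smooth = ( bigrams => 0.6, unigrams => 0.3, uniform => 0.1 );

# One divided by the number of known labels.
my $uniform_prob = 1 / scalar( keys %$unigrams );

# Linear-combination smoothing of a transition probability.
# Unseen bigrams and unigrams contribute 0 to their terms.
sub smoothed_transition_prob {
    my ( $label, $prev_label ) = @_;
    my $bigram  = $transitions->{$prev_label}{$label} // 0;
    my $unigram = $unigrams->{$label}                 // 0;
    return $smooth{bigrams} * $bigram
        + $smooth{unigrams} * $unigram
        + $smooth{uniform} * $uniform_prob;
}

printf "%.4f\n", smoothed_transition_prob( 'Obj', 'Subj' );    # prints "0.5433"
```

Note that a transition never seen in training (here Subj followed by Subj) still gets a non-zero probability from the unigram and uniform terms, which is the point of the smoothing.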

uniform_prob

Uniform probability of a label, computed as 1 / ( keys %{ $self->unigrams } ), i.e. one divided by the number of distinct labels.

Set in compute_smoothing_params.

unigrams

Basic MLE from data, the structure is

 unigrams->{label} = prob

To be used for transitions smoothing and/or backoff (can be used both for emissions and transitions). It also contains the SEQUENCE_BOUNDARY_LABEL probability (the SEQUENCE_BOUNDARY_LABEL is counted once for each sequence), which might be inappropriate in some cases (e.g. for emission probabilities).

EM_heldout_data

Just an array ref with the sentences that represent the heldout data to be able to run the EM algorithm in prepare_for_mira(). Used only in training.

METHODS

Inherited

Subroutines inherited from Treex::Tool::Parser::MSTperl::ModelBase.

Load and store

store

See "store" in Treex::Tool::Parser::MSTperl::ModelBase.

store_tsv

See "store_tsv" in Treex::Tool::Parser::MSTperl::ModelBase.

load

See "load" in Treex::Tool::Parser::MSTperl::ModelBase.

load_tsv

See "load_tsv" in Treex::Tool::Parser::MSTperl::ModelBase.

Overridden

Subroutines overriding stubs in Treex::Tool::Parser::MSTperl::ModelBase.

Load and store

$data = get_data_to_store(), $data = get_data_to_store_tsv()

Returns the model data, containing the following fields: unigrams, transitions, emissions, smooth_uniform, smooth_unigrams, smooth_bigrams, uniform_prob.

load_data($data), load_data_tsv($data)

Tries to get all necessary data from $data (see get_data_to_store for the data that are stored). Also does basic checks on the data, e.g. for non-emptiness, but nothing sophisticated. The method is algorithm-sensitive, i.e. data that are not needed by the algorithm used do not have to be contained in the hash.

Training support

prepare_for_mira

Called after preprocessing training data, before entering the MIRA phase.

Its function varies depending on the algorithm used. Usually it converts the counts stored in emissions, transitions and unigrams (accumulated by add_emission, add_transition and add_unigram) to probabilities. It also calls compute_smoothing_params to estimate the parameters for smoothing of transition probabilities.

get_feature_count

Only to provide information about the model. Returns number of features in the model (where a "feature" can stand for various things depending on the algorithm used).

Technical methods

BUILD

 my $model = Treex::Tool::Parser::MSTperl::ModelLabelling->new(
    config => $config,
 );

Creates an empty model. If you are training the model, this is probably what you want, otherwise you can use load or load_tsv to load an existing labelling model from a file.

However, most often you would probably use a model for a labeller (Treex::Tool::Parser::MSTperl::Labeller) or a labelling trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling), both of which automatically create the model on build. The labeller also provides wrapping methods "load_model" in Treex::Tool::Parser::MSTperl::Labeller and "load_model_tsv" in Treex::Tool::Parser::MSTperl::Labeller which you can call to load the model from a file. (As you might expect, the trainer in turn provides methods "store_model" in Treex::Tool::Parser::MSTperl::TrainerLabelling and "store_model_tsv" in Treex::Tool::Parser::MSTperl::TrainerLabelling.)

MLE on training data

emissions and transitions can be either MIRA-trained or estimated directly from training data using MLE (Maximum Likelihood Estimate). unigrams are always estimated by MLE.

add_unigram ($label)

Increment count for the label in unigrams.

add_transition ($label_this, $label_prev)
add_transition ($label_this, $label_prev, $feature)

Increment count for the transition in transitions, possibly including a feature on "this" edge if the algorithm uses features with transitions.

add_emission ($feature, $label)

Increment count for this label on an edge with this feature in emissions.
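
The counting performed by these three methods can be pictured with plain hashes. The sketch below is an assumption about the general idea, not the module's code: the training sequence, the feature strings and the '###' boundary label are made up, and only the feature-less form of transitions is shown.

```perl
use strict;
use warnings;

my ( %unigrams, %transitions, %emissions );

# Hypothetical training sequence: one [ label, feature ] pair per edge.
my @edges = (
    [ 'Subj', 'POS|noun' ],
    [ 'Atr',  'POS|adj' ],
    [ 'Subj', 'POS|noun' ],
);

my $prev = '###';    # stand-in for SEQUENCE_BOUNDARY_LABEL
for my $edge (@edges) {
    my ( $label, $feature ) = @$edge;
    $unigrams{$label}++;                 # cf. add_unigram
    $transitions{$prev}{$label}++;       # cf. add_transition
    $emissions{$feature}{$label}++;      # cf. add_emission
    $prev = $label;
}

print join( ' ',
    $unigrams{Subj},
    $transitions{Atr}{Subj},
    $emissions{'POS|noun'}{Subj} ), "\n";    # prints "2 1 2"
```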

compute_probs_from_counts ($self->emissions)

Takes a hash reference with label counts and changes the counts to probabilities (this is the actual MLE). May be called in prepare_for_mira on emissions, transitions and unigrams.
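
A minimal sketch of such a conversion for a flat label => count hash follows. The helper name counts_to_probs is hypothetical, and how the real method handles the nested emissions and transitions structures is not shown here.

```perl
use strict;
use warnings;

# Convert label counts to relative frequencies in place (MLE).
# Leaves the hash untouched if all counts are zero.
sub counts_to_probs {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return if $total == 0;
    $counts->{$_} /= $total for keys %$counts;
    return;
}

# Hypothetical unigram counts.
my $unigrams = { Subj => 2, Obj => 1, Atr => 1 };
counts_to_probs($unigrams);
printf "%.2f %.2f %.2f\n", @{$unigrams}{qw(Subj Obj Atr)};    # prints "0.50 0.25 0.25"
```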

EM algorithm

compute_smoothing_params()

The main method containing an implementation of the Expectation Maximization Algorithm to compute smoothing parameters (smooth_bigrams, smooth_unigrams, smooth_uniform) for transition probabilities smoothing by linear combination of bigram, unigram and uniform probability. Iteratively tries to find such parameters that the probabilities from training data (transitions, unigrams and uniform_prob) combined together by the smoothing parameters model well enough the heldout data (EM_heldout_data), i.e. tries to maximize the probability of the heldout data given the training data probabilities by adjusting the smoothing parameters values.

Uses EM_EPSILON as a stopping criterion, i.e. stops when the sum of the absolute values of the changes to all smoothing parameters is lower than the value of EM_EPSILON.
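
The general technique can be sketched as the textbook EM procedure for the weights of a linear interpolation, shown below. This is not the module's implementation: the data, the epsilon value, the helper names and the cap of 1000 iterations are all assumptions made for the illustration, and heldout data are simplified to a flat list of label pairs.

```perl
use strict;
use warnings;

# Hypothetical probabilities from training data.
my $unigrams    = { Subj => 0.6, Obj => 0.4 };
my $transitions = {
    Subj => { Obj  => 1.0 },
    Obj  => { Subj => 0.5, Obj => 0.5 },
};
my $uniform_prob = 1 / scalar( keys %$unigrams );

# Hypothetical heldout data as [ prev_label, label ] pairs.
my @heldout = ( [ 'Subj', 'Obj' ], [ 'Obj', 'Obj' ], [ 'Subj', 'Subj' ] );

my $EM_EPSILON = 1e-6;    # illustrative value only

# Component probabilities in the order: uniform, unigram, bigram.
sub probs_array {
    my ( $prev, $label ) = @_;
    return [
        $uniform_prob,
        $unigrams->{$label}            // 0,
        $transitions->{$prev}{$label} // 0,
    ];
}

# Returns the three weights in the same order as probs_array.
sub estimate_lambdas {
    my @lambda = ( 1 / 3, 1 / 3, 1 / 3 );
    for my $iteration ( 1 .. 1000 ) {    # safety cap, an assumption

        # E-step: expected share of each component in each event.
        my @expected = ( 0, 0, 0 );
        for my $pair (@heldout) {
            my $p   = probs_array(@$pair);
            my $mix = 0;
            $mix += $lambda[$_] * $p->[$_] for 0 .. 2;
            next if $mix == 0;
            $expected[$_] += $lambda[$_] * $p->[$_] / $mix for 0 .. 2;
        }

        # M-step: renormalize the expected counts to new weights.
        my $total = $expected[0] + $expected[1] + $expected[2];
        last if $total == 0;
        my @new    = map { $_ / $total } @expected;
        my $change = 0;
        $change += abs( $new[$_] - $lambda[$_] ) for 0 .. 2;
        @lambda = @new;
        last if $change < $EM_EPSILON;
    }
    return @lambda;
}

printf "uniform=%.3f unigram=%.3f bigram=%.3f\n", estimate_lambdas();
```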

count_expected_counts_all()
count_expected_counts_tree($root_node)
count_expected_counts_sequence($labels_sequence)

Support methods to compute_smoothing_params, in the order in which they call each other.

Scoring

A bunch of methods to score the likelihood of a label being assigned to an edge based on the edge's features and the label assigned to the previous edge.

get_all_labels()

Returns (a reference to) an array of all labels found in the training data (e.g. ['Subj', 'Obj', 'Atr']).

get_label_score($label, $label_prev, $features)

Computes a score of assigning the given label to an edge, given the features of the edge and the label assigned to the previous edge.

A higher score always means a more likely label for the edge. Some algorithms may give a negative score.

Is semantically equivalent to calling get_emission_score and get_transition_score and then combining the two results.

get_emission_score($label, $feature)

Computes the "emission score" of assigning the given label to an edge, given one of the features of the edge and disregarding the label assigned to the previous edge.

get_transition_score($label_this, $label_prev, $feature)

Computes the "transition score" of assigning the given label to an edge, given the label assigned to the previous edge and possibly also one of the features of the edge but NOT including the emission score returned by get_emission_score.

$result = get_transition_probs_array ($label_this, $label_prev)

Returns (a reference to) an array of the probabilities of the transition from label_prev to label_this (to be smoothed together), having the following structure:

    $result->[0] = uniform prob
    $result->[1] = unigram prob
    $result->[2] = bigram prob

$result = get_emission_scores($features)

Get scores of assigning each of the possible labels to an edge based on all the features of the edge. Is semantically equivalent to doing:

 foreach label
    foreach feature
        get_emission_score(label, feature)

The structure is:

 $result->{label} = score

Actually only serves as a switch for several implementations of the method (get_emission_scores_basic_MIRA and get_emission_scores_no_MIRA); the method to be used is selected based on the algorithm being used.
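
The nested loop above can be sketched with a plain hash as follows. The sketch assumes that per-feature scores are combined by summation, as is common for linear models; the actual implementations may combine them differently. The helper name emission_scores_for, the features and the scores are made up.

```perl
use strict;
use warnings;

# Hypothetical emission scores: feature => { label => score }.
my $emissions = {
    'POS|noun'  => { Subj => 2.0, Obj => 1.0 },
    'ORD|first' => { Subj => 1.5, Atr => 0.5 },
};

# Sum the per-feature scores of every label over all features
# of an edge; features unknown to the model are skipped.
sub emission_scores_for {
    my ($features) = @_;
    my %result;
    for my $feature (@$features) {
        my $by_label = $emissions->{$feature} or next;
        $result{$_} += $by_label->{$_} for keys %$by_label;
    }
    return \%result;
}

my $result = emission_scores_for( [ 'POS|noun', 'ORD|first', 'unseen' ] );
print "$_ = $result->{$_}\n" for sort keys %$result;
```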

get_emission_scores_basic_MIRA($features)

A get_emission_scores implementation used with algorithms where the emission scores are computed by MIRA (this is currently the most successful implementation).

get_emission_scores_no_MIRA($features)

A get_emission_scores implementation using only MLE. Probably obsolete now.

Changing the scores

Methods used by the trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling) to adjust the scores to whatever seems to be the best idea at the moment. Used only in MIRA training (MLE uses add_unigram, add_emission, add_transition and compute_probs_from_counts instead).

set_feature_score($feature, $score, $label, $label_prev)

Sets the specified emission score (if label_prev is not set) or transition score (if it is) to the given value ($score).

update_feature_score($feature, $update, $label, $label_prev)

Updates the specified emission score (if label_prev is not set) or transition score (if it is) by the given value ($update), i.e. adds that value to the current value.
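
The difference between setting and updating a score, and the role of the optional previous label, can be illustrated with two plain hashes. The functions set_score and update_score are hypothetical stand-ins written for this sketch. They assume the feature-based form of transitions and are not the module's code.

```perl
use strict;
use warnings;

my ( %emissions, %transitions );

# Overwrite a score: a transition score if $label_prev is given,
# an emission score otherwise.
sub set_score {
    my ( $feature, $score, $label, $label_prev ) = @_;
    if ( defined $label_prev ) {
        $transitions{$feature}{$label_prev}{$label} = $score;
    }
    else {
        $emissions{$feature}{$label} = $score;
    }
    return;
}

# Add $update to the current score (a missing score counts as 0).
sub update_score {
    my ( $feature, $update, $label, $label_prev ) = @_;
    if ( defined $label_prev ) {
        $transitions{$feature}{$label_prev}{$label} += $update;
    }
    else {
        $emissions{$feature}{$label} += $update;
    }
    return;
}

set_score( 'POS|noun', 1, 'Subj' );
update_score( 'POS|noun', 0.5,   'Subj' );
update_score( 'POS|noun', -0.25, 'Obj', 'Subj' );

print "$emissions{'POS|noun'}{Subj}\n";           # prints "1.5"
print "$transitions{'POS|noun'}{Subj}{Obj}\n";    # prints "-0.25"
```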

AUTHORS

Rudolf Rosa <rosa@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.