The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

MaxEntropy - Perl module for Maximum Entropy Modeling

SYNOPSIS

  use Statistics::MaxEntropy;

  # debugging messages; default 0
  $Statistics::MaxEntropy::debug = 0;

  # normalise frequencies in events file; default 1
  $Statistics::MaxEntropy::normalise = 1;

  # maximum number of iterations for IIS; default 100
  $Statistics::MaxEntropy::NEWTON_max_it = 100;

  # minimal distance between new and old x for Newton's method; 
  # default 0.001
  $Statistics::MaxEntropy::NEWTON_min = 0.001;

  # maximum number of iterations for Newton's method; default 100
  $Statistics::MaxEntropy::KL_max_it = 100;

  # minimal distance between new and old x; default 0.001
  $Statistics::MaxEntropy::KL_min = 0.001;

  # configuration of Statistics::MaxEntropy::random_init
  # maximum number of candidates randomly generated
  $Statistics::MaxEntropy::rand_max_candidates = 20;

  # maximum number of features randomly generated
  $Statistics::MaxEntropy::rand_max_features = 20;

  # initialisation for GIS, IIS, or FI
  Statistics::MaxEntropy::init($events_file,
                       $parameters_file,
                       $candidates_file);

  # random initialisation for GIS, IIS, or FI
  Statistics::MaxEntropy::random_init($random_events_file,
                              $random_parameters_file,
                              $random_candidates_file);

  # Generalised Iterative Scaling
  Statistics::MaxEntropy::GIS();

  # Improved Iterative Scaling
  Statistics::MaxEntropy::IIS();

  # Feature Induction algorithm
  Statistics::MaxEntropy::FI($nr_candidates_to_add, $iterative_scaler);

  # writes new events, candidates, and parameters files
  Statistics::MaxEntropy::done($new_events_file,
                       $new_parameters_file,
                       $new_candidates_file,
                       $information_file);

DESCRIPTION

This module is an implementation of the Generalised and Improved Iterative Scaling (GIS, IIS) algorithms and the Feature Induction (FI) algorithm as defined in (Darroch and Ratcliff 1972, (Della Pietra et al. 1997)). The purpose of the scaling algorithms is to find the maximum entropy distribution given a set of events and (optionally) an initial distribution. Also a set of candidate features may be specified; then the FI algorithm may be applied to find and add the candidate feature(s) that give the largest `gain' in terms of Kullback Leibler divergence when it is added to the current set of features.

Events are specified in terms of a set of feature functions (properties) f_1...f_k that map each event to {0,1}: an event is a string of bits. In addition of each event its frequency is given. We assume the event space to have a probability distribution that can be described by

VARIABLES

  • $Statistics::MaxEntropy::debug

    If set to "1", lots of debug information, and intermediate results will be output. Default: "0".

  • $Statistics::MaxEntropy::normalise

    If set to "1", frequencies in the events file are normalised; this is required if they not sum to "1"; default: "1".

  • $Statistics::MaxEntropy::NEWTON_max_it

    Sets the maximum number of iterations in Newton's method. Newton's method is applied to find the new parameters \alpha_i of the features f_i. Default: "100".

  • $Statistics::MaxEntropy::NEWTON_min

    Sets the minimum difference between x' and x in Newton's method; if either the maximum number of iterations is reached or the difference between x' and x is small enough, the iteration is stopped. Default: "0.001".

  • $Statistics::MaxEntropy::KL_max_it

    Sets the maximum number of iterations applied in the IIS algorithm. Default: "100".

  • $Statistics::MaxEntropy::KL_min

    Sets the minimum difference between KL divergences of two distributions in the IIS algorithm; if either the maximum number of iterations is reached or the difference between the divergences is enough, the iteration is stopped. Default: "0.001".

  • $Statistics::MaxEntropy::rand_max_candidates

    Configures Statistics::MaxEntropy::random_init; it sets the maximum number candidate features that are randomly generated for each event. Values "2" and higher are allowed.

  • $Statistics::MaxEntropy::rand_max_features

    Configures Statistics::MaxEntropy::random_init; it sets the maximum number of features that are randomly generated for each event.

    The minimum number of events that are randomly generates is "2". The maximum number of events is bounded from above by 2^{number of features randomly chosen}.

FUNCTIONS

  • Statistics::MaxEntropy::init($events_file, $parameters_file, $candidates_file);

    The events file is required. The candidate and initial parameter files are optional. If the initial parameter file is specified as the empty string then random parameters from [0,1] are chosen.

  • Statistics::MaxEntropy::random_init($random_events_file, $random_parameters_file, $random_candidates_file);

    All three files are required. Your program dies, if you leave one or more unspecified. A caveat:

  • Statistics::MaxEntropy::GIS();

    Calls the Generalised Iterative Scaling algorithm. See (Darroch and Ratcliff 1972).

  • Statistics::MaxEntropy::IIS();

    Calls the Improved Iterative Scaling algorithm. If no events were loaded or generated, your program exits. See (Della Pietra et al. 1997).

  • Statistics::MaxEntropy::FI($nr_candidates_to_add, $iterative_scaler);

    Calls the Feature Induction algorithm. The parameter $nr_candidates_to_add is for the number of candidates it should add. If this number is greater than the number of candidates, all candidates are added. If it is not specified, it "1" is used. The parameter $iterative_scaler can be used to provide one of scaling algorithm GIS or IIS. The references to the sub's should be used as follows:

        $iterative_scaler = \&GIS;
        $nr_candidates_to_add = 2;
        FI($nr_candidates_to_add, $iterative_scaler);

    If it is not specified, the improved algorithm is used.

  • Statistics::MaxEntropy::done($new_events_file, $new_parameters_file, $new_candidates_file, $information_file);

    All parameters are optional. The candidate numbers that were added are printed to the information file. The new events file contains the events extended with the new features, the new candidates file contains the set of candidates minus the candidates added. To the distribution file the new parameters are printed. The information file is used to print the indices of the candidates that were added to. Before a new call of init or random_init, calling Statistics::MaxEntropy::done is a good thing to do.

SYNTAX OF INPUT/OUTPUT FILES

Lines that start with a `#' and empty lines are ignored.

Below we give the syntax of in and output files.

EVENTS FILE (input/output)

Syntax of the event file (n features, and m events); the following holds for features:

  • each line is an event;

  • each column represents a feature function; the co-domain of a feature function is {0,1};

  • no space between feature columns;

  • constant features (i.e. columns that are completely 0 or 1) are forbidden;

  • 2 or more events should be specified (this is in fact a consequence of the previous requirement;

The frequency of each event precedes the feature columns. Features are indexed from right to left. This is a consequence of how Bit::Vector reads bit strings. Each fij is a bit and freqi a float in the following schema:

    freq1 <white> f1n ... f13 f12 f11 <newline>
      .                     .
      .                     .
      .                     .
    freqi <white> fin ... fi3 fi2 fi1 <newline>
      .                     .
      .                     .
      .                     .
    freqm <white> fmn ... fm3 fm2 fm1

(m events, n features)

PARAMETERS FILE (input/output)

Syntax of the initial parameters file; one parameter per line:

    par1 <newline>
     .
     .
     .
    pari <newline>
     .
     .
     .
    parn

(n features)

The syntax of the output distribution is the same.

CANDIDATES FILE (input)

The syntax of the candidate feature file is more or less the same as that for the events file:

  • each line is an event (events specified in the same order as the events file);

  • each column is a feature;

  • constant feature functions are forbidden;

  • values are 0 or 1;

  • no space between features;

fij are bits:

    f1c ... f13 f12 f11 <newline>
             .
             .
             .
    fic ... fi3 fi2 fi1 <newline>
             .
             .
             .
    fmc ... fm3 fm2 fm1

(m events, c candidate features)

INFO FILE (output)

The indices of the candidates added to the field, one per line. Bit strings are always indexed right to left.

BUGS

  • It's slow.

  • Statistics::MaxEntropy communicates through files mainly. The reason for is a practical one: for my own purposes I needed communication through files; I'm using the module for large event bases (corpora), and I'm not interested in (large) arrays that tell me what candidates have been added, what parameters were found, or how the events file looks like in an array. My (other) software components will read the files again.

ENVIRONMENT

No environment variables are used.

AUTHOR

SEE ALSO

perl(1), Bit::Vector, and POSIX.

DIAGNOSTICS

The module dies with an appropriate message if

  • it cannot open a specified events file;

  • if you specified a constant feature function (in the events file or the candidates file); this may happen sometimes if you call Statistics::MaxEntropy::random_init;

  • if Statistics::MaxEntropy::FI is called and no candidates are given;

  • if Statistics::MaxEntropy:IIS is called and no events are given;

  • if the events file, candidates file, or the parameters file is not consistent; causes (a.o.): insufficient features, or too many features for some event; inconsistent candidate lines; insufficient, or to many event lines in the candidates file.

REFERENCES

  • (Darroch and Ratcliff 1972) J. Darroch and D. Ratcliff, Generalised Iterative Scaling for log-linear models, Ann. Math. Statist., 43, 1470-1480, 1972.

  • (Jaynes 1997) E.T. Jaynes, Probability theory: the logic of science, 1997, unpublished manuscript.

  • (Della Pietra, Della Pietra and Lafferty 1997) Stephen Della Pietra, Vincent Della Pietra, and John Lafferty, Inducing features of random fields, In: Transactions Pattern Analysis and Machine Intelligence, 19(4), April 1997.

COPYRIGHT

Statistics::MaxEntropy comes with ABSOLUTELY NO WARRANTY and may be copied only under the terms of the GNU General Public License (version 2, or later), which may be found in the distribution.