Ken Williams
and 1 contributors


Lingua::CollinsParser - Head-driven syntactic sentence parser


  use Lingua::CollinsParser;
  my $p = Lingua::CollinsParser->new();
  my $cp_home = '/path/to/COLLINS-PARSER';
  $p->load_events( "$cp_home/models/model1/events");
  my @words = qw(The bird flies);
  my @tags  = qw(DT NN VBZ);
  my $tree = $p->parse_sentence(\@words, \@tags);


Syntactic parsing is the act of constructing a phrase-structure tree (or several alternative trees) from a natural-language sentence.

There are many ways to do this, producing different styles of output and requiring different amounts of space and time. One of the most successful recent methods was developed by Michael Collins as part of his 1999 Ph.D. work at the University of Pennsylvania. It uses "head-driven" statistical models, in which one word from each subtree is designated as the "head" of that subtree. The head words can be very useful when analyzing the tree output.

This module, Lingua::CollinsParser, is a Perl wrapper around Collins' parser. The parser itself is written in C.


Because the internal C code of the parser uses many global variables to maintain state, it is currently impossible to have more than one parser instance at a time. The class therefore behaves as a "Singleton": repeated calls to new() return the same parser object rather than new ones.

However, if a cleanup effort is undertaken in the parser's C code in the future, it may be possible to remove its reliance on global variables, and the new() method could start returning new instances with each call. Therefore, please don't rely on future versions of Lingua::CollinsParser behaving as singletons.
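The practical effect of the singleton behavior can be illustrated with a minimal pure-Perl sketch. My::SingletonDemo is a hypothetical class written for this example only; it is not how Lingua::CollinsParser is implemented, and the beamsize hash key is just a stand-in for parser state.

```perl
use strict;
use warnings;

package My::SingletonDemo;   # hypothetical class, for illustration only

my $instance;                # one shared object for the whole process

sub new {
    my ($class) = @_;
    $instance ||= bless {}, $class;   # created on the first call only
    return $instance;
}

package main;

my $first  = My::SingletonDemo->new;
my $second = My::SingletonDemo->new;

# Both variables hold the same reference, so state set through one
# is visible through the other.
$first->{beamsize} = 1000;
print "same object\n" if $first == $second;
print "beamsize seen via second: $second->{beamsize}\n";
```

Code that needs two differently configured parsers therefore cannot simply call new() twice under the current behavior; it would have to reconfigure the one shared object between uses.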


The following methods are available in the Lingua::CollinsParser class:


new()

Creates a new Lingua::CollinsParser object and returns it. For initialization, new() accepts a list of key-value pairs corresponding to the five accessor methods below (beamsize, punc_flag, distaflag, distvflag, npflag). For each pair given, the matching accessor is called with the supplied value.

beamsize( [value] )

A real number specifying the size of the "beam". The beam controls how aggressively low-probability candidates are pruned during the search. Default value is 10000. Smaller numbers like 1000 may be used to increase speed at a slight cost in accuracy.

punc_flag( [value] )

A boolean flag indicating whether to use the "punctuation constraint". A description of this constraint comes from Collins' Ph.D. thesis:

    If for any constituent Z in the chart Z -> <..X Y..> two of its children X and Y are separated by a comma, then the last word in Y must be directly followed by a comma, or must be the last word in the sentence. In training data 96% of commas follow this rule. The rule also has the benefit of improving efficiency by reducing the number of constituents in the chart.

The default is true, i.e. to use the constraint.
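To make the quoted rule concrete, here is a simplified sketch that checks it for a single candidate. It assumes the sentence is a flat token list in which commas are separate tokens, and that the caller already knows the index of the last word of Y. The function name punc_constraint_ok is hypothetical, and this is not the parser's actual C implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: given the tokens of a sentence and the index of
# the last word of constituent Y (where X and Y are separated by a
# comma), report whether the punctuation constraint holds.
sub punc_constraint_ok {
    my ($tokens, $y_end) = @_;
    return 1 if $y_end == $#$tokens;                # Y ends the sentence
    return $tokens->[ $y_end + 1 ] eq ',' ? 1 : 0;  # or a comma follows Y
}

my @tokens = ('The', 'man', ',', 'who', 'arrived', 'late', ',', 'left');

# Y = "who arrived late" ends at index 5 and is followed by a comma.
print "Y ending at 5: ", punc_constraint_ok(\@tokens, 5), "\n";

# Y = "who arrived" ends at index 4 and is followed by "late".
print "Y ending at 4: ", punc_constraint_ok(\@tokens, 4), "\n";
```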

distaflag( [value] )

A boolean flag indicating whether the "adjacency condition" in the distance measure should be used. This is explained somewhere in Collins' Ph.D. thesis, though I couldn't quite figure out where. Default is true.

distvflag( [value] )

A boolean flag indicating whether the "verb condition" in the distance measure should be used. This is explained somewhere in Collins' Ph.D. thesis, though I couldn't quite figure out where. Default is true.

npflag( [value] )

A boolean flag indicating whether noun phrases should always include NP and NPB levels, or whether the extra NP level may be omitted when superfluous. The default is to omit, i.e. the flag is true by default. For example, with npflag=1 you may get the following structure:

  (TOP (S (NPB the man) (VP saw (NPB the dog))))

whereas with npflag=0 you might get the following:

  (TOP (S (NP (NPB the man)) (VP saw (NP (NPB the dog)))))

(This example comes from the README in Collins' parser distribution.)
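As a rough usage sketch, the options above can be passed to new() as key-value pairs. The particular values are illustrative only. Calling beamsize() with no argument to read the current value is an assumption based on the optional [value] in its signature. The load is guarded so the sketch still runs where the module is not installed.

```perl
use strict;
use warnings;

# Option names come from the accessor list above; values are examples.
my %options = (
    beamsize => 1000,   # smaller beam: faster, slightly less accurate
    npflag   => 0,      # always emit both NP and NPB levels
);

# Guarded load so this sketch runs even without the module installed.
if ( eval { require Lingua::CollinsParser; 1 } ) {
    my $parser = Lingua::CollinsParser->new(%options);
    # ASSUMPTION: the accessor returns the current value when called
    # without an argument.
    printf "beamsize is now %s\n", $parser->beamsize;
}
else {
    print "Lingua::CollinsParser is not installed; skipping\n";
}
```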


load_grammar($file)

Loads a grammar file (a few sample grammar files ship with Collins' parser distribution) into the parser. This must be done before calling parse_sentence().


load_events($file)

Loads an events file (a few sample events files ship with Collins' parser distribution) into the parser. Either this or undump_events_hash() must be done before calling parse_sentence().

parse_sentence(\@words, \@tags);

Invokes the parser on the given sentence. The first argument must be an array reference containing the words of the sentence. The second argument must be an array reference containing those words' corresponding part-of-speech tags. A Lingua::CollinsParser::Node object is returned, representing a syntax tree for the sentence.

To generate the array of part-of-speech tags, you may be interested in Lingua::BrillTagger, InXight (http://www.inxight.com/), or GATE (http://gate.ac.uk/).
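Many taggers emit text in a "word/TAG" form, which has to be split into the two parallel arrays that parse_sentence() expects. The sketch below shows one way to do that. The split_tagged helper and the "word/TAG" input format are assumptions for illustration, not part of this module or of any particular tagger's documented output.

```perl
use strict;
use warnings;

# Hypothetical helper: split "word/TAG" tokens into parallel arrays.
sub split_tagged {
    my ($tagged) = @_;
    my (@words, @tags);
    for my $token ( split ' ', $tagged ) {
        # Split on the LAST slash so words containing "/" stay intact.
        my ($word, $tag) = $token =~ m{^(.*)/([^/]+)$}
            or die "Token '$token' is not in word/TAG form\n";
        push @words, $word;
        push @tags,  $tag;
    }
    return (\@words, \@tags);
}

my ($words, $tags) = split_tagged('The/DT bird/NN flies/VBZ');
print "words: @$words\n";
print "tags:  @$tags\n";
```

The two array references returned here have the same length and order, so they can be passed directly as the first and second arguments of parse_sentence().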


dump_events_hash($file)

Calling load_events() takes a long time, so this method is provided to "freeze" the loaded events hash to a file, which can be "thawed" later with undump_events_hash(). Thawing is much faster than loading the original events file. For instance, if you run the regression tests twice in a row during installation, the second run is noticeably faster because the first run dumped the hash.


undump_events_hash($file)

Loads an events hash from a file that was previously created using dump_events_hash().
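The freeze/thaw pair lends itself to a simple caching pattern: thaw the dump if it exists, otherwise load the events file and dump it for next time. The helper below is a hypothetical sketch, not part of the module. It assumes that dump_events_hash() and undump_events_hash() each take a single file path argument, and it works with any object that provides the three methods.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::CollinsParser.
# ASSUMPTION: dump_events_hash() and undump_events_hash() each take
# a file path as their only argument.
sub load_events_cached {
    my ($parser, $events_file, $cache_file) = @_;

    if ( -e $cache_file ) {
        $parser->undump_events_hash($cache_file);   # fast path: thaw
        return 'cache';
    }

    $parser->load_events($events_file);             # slow path: full load
    $parser->dump_events_hash($cache_file);         # freeze for next run
    return 'events';
}

# Example call (paths are placeholders):
#   my $source = load_events_cached(
#       $p, "$cp_home/models/model1/events", '/tmp/events.dump');
```

The return value only reports which path was taken, which can be handy for logging. A more careful version might also compare file modification times so that a stale dump is not reused after the events file changes.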


Ken Williams, ken.williams@thomson.com


The Lingua::CollinsParser Perl interface is copyright (C) 2004 Thomson Legal & Regulatory, and written by Ken Williams. It is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The Collins Parser is copyright (C) 1999 by Michael Collins - you will find full copyright and license information in its distribution. The Parser.patch file distributed here is granted under the same license terms as the parser code itself.




http://www.ai.mit.edu/people/mcollins/code.html (The Collins Parser)