NAME
Lingua::Treebank - Perl extension for manipulating the Penn Treebank format
SYNOPSIS
  use Lingua::Treebank;
  my @utterances = Lingua::Treebank->from_penn_file($filename);
  foreach (@utterances) {
    # $_ is a Lingua::Treebank::Const now
    foreach ($_->get_all_terminals) {
      # $_ is a Lingua::Treebank::Const that is a terminal (word)
      print $_->word(), ' ', $_->tag(), "\n";
    }
    print "\n\n";
  }
ABSTRACT
Modules for abstracting out the "natural" objects in the Penn
Treebank format.
DESCRIPTION
This class knows how to read two treebank formats: the Penn format and the Chomsky Normal Form (CNF) format. The two differ in how they handle terminal nodes. The Penn format places a pre-terminal part-of-speech tag in the left-hand position of a parenthesis-delimited pair, just as it does for non-terminal nodes. The CNF format attaches the pre-terminal tag to the word with an underscore. For example, the sentence "I spoke" would be rendered in each format as follows:
Penn:

  (S
    (NP
      (N I))
    (VP
      (V spoke)))

Chomsky Normal Form:

  (S
    (NP
      I_N)
    (VP
      spoke_V))
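The difference between the two terminal notations can be illustrated with a small standalone sketch. This is not how Lingua::Treebank converts or parses trees; it is only a regular-expression rewrite of Penn-style terminal pairs into the underscore form, and the function name penn_to_cnf_terminals is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Treebank): rewrite every
# Penn-style terminal pair "(TAG word)" as the CNF-style "word_TAG".
# Non-terminal nodes are untouched because their second element
# starts with "(" rather than a word.
sub penn_to_cnf_terminals {
    my ($tree) = @_;
    $tree =~ s/\(([^\s()]+)\s+([^\s()]+)\)/${2}_${1}/g;
    return $tree;
}

my $penn = '(S (NP (N I)) (VP (V spoke)))';
my $cnf  = penn_to_cnf_terminals($penn);

print "$cnf\n";    # prints: (S (NP I_N) (VP spoke_V))
```

The sketch assumes one POS tag per word, which matches the requirement noted below for .mrg input.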
Almost all of the interesting tree functionality is in the constituent-forming package (included in this distribution; see Lingua::Treebank::Const).
PLEASE NOTE: The format expected here is the .mrg format, not the .psd format. In other words, one POS tag per word is required. (In response to CPAN bug 15079.)
Variables
- CONST_CLASS
-
The value of $Lingua::Treebank::CONST_CLASS indicates what class should be used as the class for constituents. The default is Lingua::Treebank::Const; it will generate an error to use a value for $Lingua::Treebank::CONST_CLASS that is not a subclass of Lingua::Treebank::Const.
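As a rough sketch of how this variable might be used, the following defines a subclass and asks Lingua::Treebank to build trees with it. The package name My::Const and its word_count method are hypothetical, and the example assumes that the variable is consulted at read time, so it is set before any file is read. A temporary file stands in for a real .mrg file.

```perl
use strict;
use warnings;

# Hypothetical subclass; the name and the extra method are illustrative only.
package My::Const;
use base 'Lingua::Treebank::Const';    # must inherit from the default class

# Hypothetical convenience method: number of terminals under this node.
sub word_count {
    my ($self) = @_;
    my @terminals = $self->get_all_terminals;
    return scalar @terminals;
}

package main;
use Lingua::Treebank;
use File::Temp qw(tempfile);

# Write a one-tree Penn-format file to stand in for a real .mrg file.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} "(S (NP (N I)) (VP (V spoke)))\n";
close $out or die "close failed: $!";

# Assumption: the setting takes effect for trees read after this point.
$Lingua::Treebank::CONST_CLASS = 'My::Const';

my @roots = Lingua::Treebank->from_penn_file($filename);
print ref($_), ' has ', $_->word_count, " words\n" for @roots;
```

The method is called only on the returned roots, since those are the objects the documentation says are built from the configured class.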
Methods
Class methods
- from_penn_file
-
Given the name of a Penn treebank file, open it, extract the constituents, and return the roots.
- from_penn_fh
-
Given a filehandle open on Penn treebank data, extract the constituents and return the roots.
- from_cnf_file
-
Given the name of a Chomsky Normal Form file, open it, extract the constituents, and return the roots.
- from_cnf_fh
-
Given a filehandle open on Chomsky Normal Form data, extract the constituents and return the roots.
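The filehandle variants are useful when the trees do not live in a plain file. The sketch below assumes that from_penn_fh accepts any readable Perl filehandle passed as its argument, and uses an in-memory filehandle purely for illustration; the variable names are not part of the module.

```perl
use strict;
use warnings;
use Lingua::Treebank;

# One Penn-format tree held in a string rather than in a file.
my $text = "(S (NP (N I)) (VP (V spoke)))\n";

# Open a read-only, in-memory filehandle on the string.
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

# Assumption: from_penn_fh reads everything available on the handle.
my @roots = Lingua::Treebank->from_penn_fh($fh);
close $fh;

# Collect "word/tag" pairs using the methods shown in the SYNOPSIS.
my @pairs;
foreach my $root (@roots) {
    foreach my $terminal ($root->get_all_terminals) {
        push @pairs, $terminal->word() . '/' . $terminal->tag();
    }
}
print join(' ', @pairs), "\n";
```

The same pattern should apply to from_cnf_fh with CNF-format input, and to handles opened on pipes or sockets.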
EXPORT
None by default.
HISTORY
- 0.01
-
Original version; created by h2xs 1.22 with options -CAX Lingua::Treebank
- 0.02
-
Improved documentation.
- 0.03
-
Added a VERBOSE variable that can be set.
- 0.09
-
A variety of additional features
- 0.10
-
More features still; also some bugfixes.
- 0.11
-
Removed references to Text::Balanced, which is slow and not uniformly available.
- 0.12
-
Corrected bug in Makefile.PL pointed out by Vassilii Khachaturov.
Added some documentation clarifying that .mrg files (and not .psd files) are supported.
- 0.13
-
text() method now suppresses anything with a -NONE- tag.
$VERSION for Lingua::Treebank and Lingua::Treebank::Const now tied.
- 0.14
-
Actually include patch intended for 0.13. *sheesh*.
- 0.15
-
Include Lingua::Treebank::HeadFinder class in distro. Modify Lingua::Treebank::Const to support head-child annotation.
Also support 64-bit systems much better.
SEE ALSO
TO DO: mention documentation of Penn Treebank
AUTHOR
Jeremy Gillmor Kahn, <kahn@cpan.org>
COPYRIGHT AND LICENSE
Copyright 2003-2008 by Jeremy Gillmor Kahn with additional support from Bill McNeill
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.