Lingua::TreeTagger::TaggedText - Storing and manipulating the output of TreeTagger.
This documentation refers to Lingua::TreeTagger::TaggedText version 0.01.
    use Lingua::TreeTagger;

    # Create a Tagger object.
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
    );

    # Tag some text and get a new TaggedText object.
    my $tagged_text = $tagger->tag_file( 'path/to/some/file.txt' );

    # A TaggedText object is a sequence of Lingua::TreeTagger::Token objects,
    # which can be stringified as raw text...
    print $tagged_text->as_text();

    # ... or in XML format.
    print $tagged_text->as_XML();

    # Token objects may be accessed directly for more specific purposes.
    foreach my $token ( @{ $tagged_text->sequence() } ) {
        print $token->original(), '|', $token->tag(), "\n";
    }
This module is part of the Lingua::TreeTagger distribution. It defines a class for storing and manipulating the output of TreeTagger in an object-oriented way. See also Lingua::TreeTagger and Lingua::TreeTagger::Token.
new()
Creates a new TaggedText object. This is normally called by a Lingua::TreeTagger object rather than directly by the user. It requires two arguments:
A reference to a list containing the textual output of TreeTagger: each item in the list is a carriage-return-terminated line containing either (i) exactly one part-of-speech tag and possibly a token and a lemma (tab-delimited) or (ii) an SGML tag.
A reference to the Lingua::TreeTagger object that has generated the previous list.
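The line format described above can be illustrated with a minimal sketch. It is not the module's actual parsing code: the helper parse_tagged_line(), the hash keys, and the rule used to recognize SGML tags are all assumptions made for illustration, and the caller is assumed to know which columns the tagger was asked to output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::TreeTagger): turn one line of
# tagger output into a hash reference describing a token or an SGML tag.
sub parse_tagged_line {
    my ( $line, $fields ) = @_;
    $line =~ s/[\r\n]+\z//;    # Strip the line terminator.

    # Assumption: a tab-free line enclosed in angle brackets is an SGML tag.
    return { is_SGML_tag => 1, tag => $line }
        if $line =~ /\A<[^\t]*>\z/;

    # Otherwise the line is tab-delimited, one column per requested field.
    my %token = ( is_SGML_tag => 0 );
    @token{ @{$fields} } = split /\t/, $line;
    return \%token;
}

my @lines  = ( "<s>\n", "The\tDT\tthe\n", "men\tNNS\tman\n" );
my @parsed = map { parse_tagged_line( $_, [qw( original tag lemma )] ) } @lines;

foreach my $item (@parsed) {
    if ( $item->{is_SGML_tag} ) {
        print "SGML: $item->{tag}\n";
    }
    else {
        print "$item->{original}|$item->{tag}|$item->{lemma}\n";
    }
}
```

In normal use none of this is needed, since the Lingua::TreeTagger object builds the TaggedText object itself.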
as_text()
    # Outputs token sequence in standard TreeTagger format.
    print $tagged_text->as_text();

    # Custom formatting.
    print $tagged_text->as_text( {
        'fields'          => [ qw( lemma original ) ],
        'field_delimiter' => q{:},
        'token_delimiter' => q{ },
    } );
Outputs the sequence of tokens in a TaggedText object as raw text. The only (optional) argument is a reference to a hash containing the following optional named parameters:
fields
A reference to the list of token attributes to be included in the output, in the requested appearance order. Three such attributes are supported: original (the original word token), tag (the part-of-speech tag), and lemma (the lemma). Inclusion of other attributes (or attributes not present in the TaggedText because they are not part of the output of the creator TreeTagger object) raises a fatal exception. The value of this parameter defaults to [ qw( original tag lemma ) ], which corresponds to the standard output of TreeTagger.
field_delimiter
The string that will be inserted between token attributes. Defaults to "\t", which corresponds to the standard output of TreeTagger.
token_delimiter
The string that will be inserted between tokens. Defaults to "\n", which corresponds to the standard output of TreeTagger.
NB: if SGML tags are present in the token sequence, they receive no particular formatting beyond the concatenation of the requested token delimiter.
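To make the interaction of the three parameters concrete, here is a minimal stand-alone sketch. It does not call the module: format_as_text() is a hypothetical stand-in, tokens are plain hash references rather than Lingua::TreeTagger::Token objects, SGML tags are ignored, the error message is invented, and appending the token delimiter after every token (including the last) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for as_text(), operating on plain hash references.
sub format_as_text {
    my ( $tokens, $options ) = @_;
    $options //= {};

    # Defaults mirror the documented ones.
    my $fields          = $options->{fields}          // [qw( original tag lemma )];
    my $field_delimiter = $options->{field_delimiter} // "\t";
    my $token_delimiter = $options->{token_delimiter} // "\n";

    my $text = q{};
    foreach my $token ( @{$tokens} ) {
        # Requesting an attribute the token lacks is fatal.
        my @missing = grep { !exists $token->{$_} } @{$fields};
        die "Unavailable attribute(s): @missing\n" if @missing;

        # Join the requested fields in order, then append the token delimiter.
        $text .= join( $field_delimiter, @{$token}{ @{$fields} } )
               . $token_delimiter;
    }
    return $text;
}

my @tokens = (
    { original => 'The', tag => 'DT',  lemma => 'the' },
    { original => 'men', tag => 'NNS', lemma => 'man' },
);

# Default layout: one tab-delimited token per line.
print format_as_text( \@tokens );

# Custom layout: lemma:original pairs separated by spaces.
print format_as_text( \@tokens, {
    fields          => [qw( lemma original )],
    field_delimiter => q{:},
    token_delimiter => q{ },
} ), "\n";
```

With the custom options, the two tokens come out as "the:The man:men " under the assumptions above.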
as_XML()
    # Outputs token sequence in XML format.
    print $tagged_text->as_XML();

    # Custom XML formatting (e.g. <foo bar="men" baz="man">NN</foo>).
    print $tagged_text->as_XML( {
        'element'    => 'foo',
        'attributes' => {
            'original' => 'bar',
            'lemma'    => 'baz',
        },
        'content'    => 'tag',
    } );
Outputs the sequence of tokens in a TaggedText object as a list of XML tags, with one tag per line. The only (optional) argument is a reference to a hash containing the following optional named parameters:
element
The string that will be used as the name of the XML tag. Defaults to 'w'.
attributes
A reference to a hash where (i) each key is a token attribute to be included in the output as an XML attribute and (ii) each value is the desired name for this XML attribute. As with method as_text(), three token attributes are supported: original (the original word token), tag (the part-of-speech tag), and lemma (the lemma). Inclusion of other token attributes (or attributes not present in the TaggedText because they are not part of the output of the creator TreeTagger object) raises a fatal exception. The value of this parameter defaults to { 'lemma' => 'lemma', 'tag' => 'type' }.
content
A string specifying the token attribute that will be used as the content of the XML element. Defaults to 'original'.
NB: if SGML tags are present in the token sequence, they receive no particular formatting.
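The mapping from token attributes to XML can be sketched as follows. This is an illustration only, not the module's implementation: format_as_xml() and escape_xml() are hypothetical helpers, tokens are plain hash references, SGML tags are ignored, and both the attribute order (original, tag, lemma) and the escaping of special characters are assumptions that the documentation does not confirm.

```perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are special in XML.
sub escape_xml {
    my ($string) = @_;
    $string =~ s/&/&amp;/g;
    $string =~ s/</&lt;/g;
    $string =~ s/>/&gt;/g;
    $string =~ s/"/&quot;/g;
    return $string;
}

# Hypothetical stand-in for as_XML(), operating on plain hash references.
sub format_as_xml {
    my ( $tokens, $options ) = @_;
    $options //= {};

    # Defaults mirror the documented ones.
    my $element    = $options->{element}    // 'w';
    my $attributes = $options->{attributes} // { lemma => 'lemma', tag => 'type' };
    my $content    = $options->{content}    // 'original';

    # Assumption: attributes are emitted in a fixed order.
    my @ordered = grep { exists $attributes->{$_} } qw( original tag lemma );

    my @lines;
    foreach my $token ( @{$tokens} ) {
        my $attribute_string = join q{}, map {
            sprintf( ' %s="%s"', $attributes->{$_}, escape_xml( $token->{$_} ) )
        } @ordered;
        push @lines, sprintf( '<%s%s>%s</%s>',
            $element, $attribute_string,
            escape_xml( $token->{$content} ), $element );
    }

    # One element per line.
    return join( "\n", @lines ) . "\n";
}

my @tokens = ( { original => 'men', tag => 'NNS', lemma => 'man' } );

print format_as_xml( \@tokens );
print format_as_xml( \@tokens, {
    element    => 'foo',
    attributes => { original => 'bar', lemma => 'baz' },
    content    => 'tag',
} );
```

Since attribute order carries no meaning in XML, the fixed order is only a convenience for reproducible output.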
sequence()
Read-only accessor for the sequence of tokens in a TaggedText object. Returns a reference to an array of tokens, and thus should be de-referenced (see "Synopsis").
length()
Read-only accessor for the 'length' attribute of a TaggedText object.
This exception is raised by the class constructor when a new TaggedText object is created without passing a reference to a list of tagged lines.
This exception is raised by the class constructor when a new TaggedText object is created without passing a reference to the creator TreeTagger object (see "Bugs and limitations").
This exception is raised when method as_text() is called with a reference to an empty list as value for parameter 'fields'.
This exception is raised when method as_XML() is called with a value for parameter 'attributes' such that one or more attributes are associated with an empty string.
This exception is raised when method as_XML() is called with an empty string as value for the 'element' parameter.
This exception is raised when the 'fields' parameter of method as_text() or the 'attributes' or 'content' parameters of method as_XML() specify one or more token attributes that are not available for this TaggedText object (because they were not part of the creator TreeTagger object's output).
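Because these exceptions are fatal, callers who build the 'fields' or 'attributes' parameters dynamically may want to trap them. The sketch below shows the general pattern with a block eval. The check_attributes() function and its error message are hypothetical and stand in for the module's internal validation, whose exact wording is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical check: die if any requested attribute is not available.
sub check_attributes {
    my ( $requested, $available ) = @_;
    my %is_available = map { $_ => 1 } @{$available};
    my @missing = grep { !$is_available{$_} } @{$requested};
    die "Unavailable token attribute(s): @missing\n" if @missing;
    return 1;
}

# Suppose the tagger only produced original tokens and tags, not lemmas.
my @available = qw( original tag );

# Trap the fatal exception instead of letting the program terminate.
my $ok = eval {
    check_attributes( [qw( original lemma )], \@available );
    1;
};
print "Caught: $@" if !$ok;
```

The same eval pattern applies to real calls to as_text() and as_XML(): wrap the call, then inspect $@ if the block did not complete.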
This module is part of the Lingua::TreeTagger distribution. It is not intended to be used as an independent module. In particular, it uses module Lingua::TreeTagger::Token, version 0.01.
It requires module Moose and was developed using version 1.09. Please report incompatibilities with earlier versions to the author.
There are no known bugs in this module.
Please report problems to Aris Xanthos (aris.xanthos@unil.ch).
Patches are welcome.
The current version has the limitation that every TaggedText object must hold a reference to the TreeTagger object that created it. Methods as_text() and as_XML() use this reference internally to determine whether the token attributes requested for their output are actually available for this TaggedText object. This results in a tight coupling of the TaggedText and TreeTagger classes, which is obviously not desirable. In a future version, I expect to implement a better solution based on Moose's introspection capabilities.
Aris Xanthos (aris.xanthos@unil.ch)
Copyright (c) 2010 Aris Xanthos (aris.xanthos@unil.ch).
This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Lingua::TreeTagger, Lingua::TreeTagger::Token