NAME

Lingua::TreeTagger::TaggedText - Storing and manipulating the output of TreeTagger.

VERSION

This documentation refers to Lingua::TreeTagger::TaggedText version 0.01.

SYNOPSIS

    use Lingua::TreeTagger;

    # Create a Tagger object.
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
    );

    # Tag some text and get a new TaggedText object.
    my $tagged_text = $tagger->tag_file( 'path/to/some/file.txt' );

    # A TaggedText object is a sequence of Lingua::TreeTagger::Token objects,
    # which can be stringified as raw text...
    print $tagged_text->as_text();

    # ... or in XML format.
    print $tagged_text->as_XML();
    
    # Token objects may be accessed directly for more specific purposes.
    foreach my $token ( @{ $tagged_text->sequence() } ) {
        print $token->original(), '|', $token->tag(), "\n";
    }

DESCRIPTION

This module is part of the Lingua::TreeTagger distribution. It defines a class for storing and manipulating the output of TreeTagger in an object-oriented way. See also Lingua::TreeTagger and Lingua::TreeTagger::Token.

METHODS

new()

Creates a new TaggedText object. This is normally called by a Lingua::TreeTagger object rather than directly by the user. It requires two arguments:

  1. A reference to a list containing the textual output of TreeTagger: each item in the list is a carriage-return-terminated line containing either (i) exactly one part-of-speech tag and possibly a token and a lemma (tab-delimited) or (ii) an SGML tag.

  2. A reference to the Lingua::TreeTagger object that has generated the previous list.
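The shape of the first argument can be illustrated with a small, self-contained sketch. It assumes that TreeTagger was asked to output all three columns (token, tag, lemma); the sample lines and the helper parse_line() are hypothetical and are not part of this module.

```perl
use strict;
use warnings;

# Hypothetical sample of TreeTagger output lines (token, tag, lemma),
# as they might be passed to the constructor in an array reference.
my @tagged_lines = (
    "<s>\n",
    "The\tDT\tthe\n",
    "men\tNNS\tman\n",
    "</s>\n",
);

# Illustrative helper (not part of the module): split one line into fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;

    # A line that starts with '<' and ends with '>' is treated as an SGML tag.
    return { sgml => $line } if $line =~ m{\A<.*>\z};

    # Assumption: all three tab-delimited columns are present.
    my ( $original, $tag, $lemma ) = split /\t/, $line;
    return { original => $original, tag => $tag, lemma => $lemma };
}

my @parsed = map { parse_line($_) } @tagged_lines;

for my $item (@parsed) {
    print exists $item->{sgml}
        ? "SGML:  $item->{sgml}\n"
        : "token: $item->{original} ($item->{tag})\n";
}
```

In normal use this parsing is done internally, since the constructor is called by the Lingua::TreeTagger object.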

as_text()

    # Outputs token sequence in standard TreeTagger format.
    print $tagged_text->as_text();

    # Custom formatting.
    print $tagged_text->as_text( {
        'fields'          => [ qw( lemma original ) ],
        'field_delimiter' => q{:},
        'token_delimiter' => q{ },
    } );

Outputs the sequence of tokens in a TaggedText object as raw text. The only (optional) argument is a reference to a hash containing the following optional named parameters:

fields

A reference to the list of token attributes to be included in the output, in the order in which they should appear. Three such attributes are supported: original (the original word token), tag (the part-of-speech tag), and lemma (the lemma). Requesting any other attribute, or an attribute that is not present in the TaggedText because it is not part of the output of the creator TreeTagger object, raises a fatal exception. The value of this parameter defaults to [ qw( original tag lemma ) ], which corresponds to the standard output of TreeTagger.

field_delimiter

The string that will be inserted between token attributes. Defaults to "\t", which corresponds to the standard output of TreeTagger.

token_delimiter

The string that will be inserted between tokens. Defaults to "\n", which corresponds to the standard output of TreeTagger.

NB: if SGML tags are present in the token sequence, they receive no particular formatting: they are output as they are, followed only by the requested token delimiter.
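How the three parameters combine can be sketched without TreeTagger installed. The helper format_tokens() below is a hypothetical re-implementation of the documented behaviour for illustration only; it operates on plain hashes rather than Token objects and ignores SGML tags. It joins tokens with the token delimiter, and whether the module also emits a trailing delimiter is not assumed here.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator (//)

# Hypothetical stand-in for a token sequence: plain hashes rather than
# Lingua::TreeTagger::Token objects.
my @tokens = (
    { original => 'The', tag => 'DT',  lemma => 'the' },
    { original => 'men', tag => 'NNS', lemma => 'man' },
);

# Illustrative helper (not the module's code): applies the three documented
# parameters and their documented defaults.
sub format_tokens {
    my ( $tokens_ref, $param_ref ) = @_;
    $param_ref //= {};
    my $fields_ref      = $param_ref->{fields}          // [qw( original tag lemma )];
    my $field_delimiter = $param_ref->{field_delimiter} // "\t";
    my $token_delimiter = $param_ref->{token_delimiter} // "\n";

    # A hash slice picks the requested attributes in the requested order.
    my @formatted = map {
        join( $field_delimiter, @{$_}{ @{$fields_ref} } )
    } @{$tokens_ref};

    return join $token_delimiter, @formatted;
}

# Same parameters as in the custom formatting example above.
print format_tokens(
    \@tokens,
    {
        fields          => [qw( lemma original )],
        field_delimiter => q{:},
        token_delimiter => q{ },
    }
), "\n";
```

With these parameters each token is rendered as lemma:original and tokens are separated by single spaces.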

as_XML()

    # Outputs token sequence in XML format.
    print $tagged_text->as_XML();

    # Custom XML formatting (e.g. <foo bar="men" baz="man">NN</foo>).
    print $tagged_text->as_XML( {
        'element'       => 'foo',
        'attributes'    => {
            'original'      => 'bar',
            'lemma'         => 'baz',
        },
        'content'       => 'tag',
    } );

Outputs the sequence of tokens in a TaggedText object as a list of XML elements, with one element per line. The only (optional) argument is a reference to a hash containing the following optional named parameters:

element

The string that will be used as the name of the XML element. Defaults to 'w'.

attributes

A reference to a hash where (i) each key is a token attribute to be included in the output as an XML attribute and (ii) each value is the desired name for this XML attribute. As with method as_text(), three token attributes are supported: original (the original word token), tag (the part-of-speech tag), and lemma (the lemma). Inclusion of other token attributes (or attributes not present in the TaggedText because they are not part of the output of the creator TreeTagger object) raises a fatal exception. The value of this parameter defaults to { 'lemma' => 'lemma', 'tag' => 'type' }.

content

A string specifying the token attribute that will be used as the content of the XML element. Defaults to 'original'.

NB: if SGML tags are present in the token sequence, they receive no particular formatting.
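The mapping from token attributes to an XML element can be sketched as follows. The helper token_as_xml() is hypothetical and only mimics the documented parameters and defaults for a single token stored in a plain hash. As simplifying assumptions, it sorts XML attributes by name to obtain a stable order and performs no XML escaping; the module itself may behave differently on both points.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator (//)

# Illustrative helper (not the module's code): render one token, given as a
# plain hash, using the documented parameters and defaults.
sub token_as_xml {
    my ( $token_ref, $param_ref ) = @_;
    $param_ref //= {};
    my $element        = $param_ref->{element}    // 'w';
    my $attributes_ref = $param_ref->{attributes} // { lemma => 'lemma', tag => 'type' };
    my $content        = $param_ref->{content}    // 'original';

    # Map each XML attribute name to the value of the matching token attribute.
    my %value_for = map { $attributes_ref->{$_} => $token_ref->{$_} }
        keys %{$attributes_ref};

    # Assumption: sort by XML attribute name so that the output is stable.
    my $attribute_string = join q{},
        map { qq{ $_="$value_for{$_}"} } sort keys %value_for;

    return "<$element$attribute_string>$token_ref->{$content}</$element>";
}

# Hypothetical token, using the values of the example above.
my %token = ( original => 'men', tag => 'NN', lemma => 'man' );

print token_as_xml(
    \%token,
    {
        element    => 'foo',
        attributes => { original => 'bar', lemma => 'baz' },
        content    => 'tag',
    }
), "\n";
print token_as_xml( \%token ), "\n";
```

The first call uses the custom parameters shown above; the second relies on the documented defaults only.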

ACCESSORS

sequence()

Read-only accessor for the sequence of tokens in a TaggedText object. Returns a reference to an array of tokens, which must therefore be dereferenced before iterating over it (see "SYNOPSIS").
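As an illustration of working with the dereferenced sequence, the sketch below counts how often each part-of-speech tag occurs. To remain runnable without TreeTagger, it uses a hypothetical MockToken class that imitates only the original() and tag() accessors shown in the synopsis; with a real TaggedText object, $sequence_ref would be the value returned by sequence(). SGML tags, if present in a real sequence, are not handled here.

```perl
use strict;
use warnings;

# Minimal mock standing in for Lingua::TreeTagger::Token (assumption: only
# the original() and tag() accessors are needed).
package MockToken;
sub new      { my ( $class, %args ) = @_; return bless {%args}, $class }
sub original { return $_[0]{original} }
sub tag      { return $_[0]{tag} }

package main;

# Stand-in for the array reference returned by $tagged_text->sequence().
my $sequence_ref = [
    MockToken->new( original => 'The', tag => 'DT' ),
    MockToken->new( original => 'men', tag => 'NNS' ),
    MockToken->new( original => 'the', tag => 'DT' ),
];

# Count how often each part-of-speech tag occurs.
my %count_for;
$count_for{ $_->tag() }++ for @{$sequence_ref};

# Print one line per tag, sorted alphabetically.
printf "%s\t%d\n", $_, $count_for{$_} for sort keys %count_for;
```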

length()

Read-only accessor for the 'length' attribute of a TaggedText object.

DIAGNOSTICS

Attempt to create TaggedText object without array reference argument

This exception is raised by the class constructor when a new TaggedText object is created without passing a reference to a list of tagged lines.

Attempt to create TaggedText object without reference to creator

This exception is raised by the class constructor when a new TaggedText object is created without passing a reference to the creator TreeTagger object (see "BUGS AND LIMITATIONS").

Attempt to call as_text with empty 'field' parameter

This exception is raised when method as_text() is called with a reference to an empty list as value for parameter 'fields'.

Empty attribute names are not allowed

This exception is raised when method as_XML() is called with a value for parameter 'attributes' such that one or more attributes are associated with an empty string.

Attempt to call as_XML with empty 'element' parameter

This exception is raised when method as_XML() is called with an empty string as value for the 'element' parameter.

Unavailable field(s) (...) requested

This exception is raised when the 'fields' parameter of method as_text() or the 'attributes' or 'content' parameters of method as_XML() specify one or more token attributes that are not available for this TaggedText object (because they were not part of the creator TreeTagger object's output).
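Since all of these diagnostics are fatal exceptions, a caller that wants to recover from them can trap them with eval. The sketch below shows the pattern with a hypothetical check_fields() function that stands in for the module's internal availability check; the text placed between the parentheses of the message is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's availability check: dies if any
# requested field is missing from the list of available fields.
sub check_fields {
    my ( $requested_ref, $available_ref ) = @_;
    my %is_available = map { $_ => 1 } @{$available_ref};
    my @missing = grep { !$is_available{$_} } @{$requested_ref};

    # Assumption: missing field names are listed, comma-separated.
    die 'Unavailable field(s) (' . join( ', ', @missing ) . ") requested\n"
        if @missing;
    return 1;
}

# Trap the fatal exception instead of letting it terminate the program.
my $ok = eval {
    check_fields( [qw( original lemma )], [qw( original tag )] );
    1;
};
my $error = $ok ? q{} : $@;

print "Caught: $error" if !$ok;
```

The same eval pattern would apply to a call to as_text() or as_XML() on a real TaggedText object.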

DEPENDENCIES

This module is part of the Lingua::TreeTagger distribution. It is not intended to be used as an independent module. In particular, it uses module Lingua::TreeTagger::Token, version 0.01.

It requires module Moose and was developed using version 1.09. Please report incompatibilities with earlier versions to the author.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

The current version has the limitation that every TaggedText object must hold a reference to the TreeTagger object that created it. Methods as_text() and as_XML() use this reference internally to determine whether the token attributes requested for their output are actually available for this TaggedText object. This results in a tight coupling of the TaggedText and TreeTagger classes, which is not desirable. In a future version, I expect to implement a better solution based on Moose's introspection capabilities.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2010 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::TreeTagger, Lingua::TreeTagger::Token