NAME

Lingua::Diversity::Utils - utility subroutines for users of classes derived from Lingua::Diversity

VERSION

This documentation refers to Lingua::Diversity::Utils version 0.05.

SYNOPSIS

    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'      => \$text,
        'regexp'    => qr{[^a-zA-Z]+},
    );

    # Alternatively, tag the text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # ... get a reference to an array of words...
    $word_array_ref = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
    );

    # ... or get a reference to an array of wordforms and an array of lemmas.
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Conditions may be imposed on the selection of tokens...
    ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
        'condition'     => {
            'tag'       => qr{^NNS?$},
        },
    );

DESCRIPTION

This module provides utility subroutines intended to facilitate the use of a class derived from Lingua::Diversity.

SUBROUTINES

split_text()

Split a text into units (typically words), delete empty units, and return a reference to the array of units.

The subroutine takes one required and one optional named parameter.

text (required)

A reference to the text to be split.

regexp

A reference to a regular expression describing unit delimiter sequences. Default is qr{\s+}.
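
The behaviour described above can be approximated in plain Perl with split and grep. The following is only a sketch of the documented semantics, not the module's actual implementation; the helper name sketch_split_text is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper emulating the documented behaviour of split_text();
# this is NOT the module's own code.
sub sketch_split_text {
    my (%parameter) = @_;
    my $text_ref = $parameter{'text'};
    my $regexp   = $parameter{'regexp'} // qr{\s+};    # documented default

    # Split on delimiter sequences, then drop empty units (e.g. the one
    # produced when the text starts with a delimiter).
    my @units = grep { length } split $regexp, ${$text_ref};
    return \@units;
}

my $text      = ' of the people, by the people';
my $units_ref = sketch_split_text(
    'text'   => \$text,
    'regexp' => qr{[^a-zA-Z]+},
);
print join( '|', @{$units_ref} ), "\n";    # of|the|people|by|the|people
```

Note that the leading space in the sample text would yield an empty first unit if the empty units were not filtered out.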

split_tagged_text()

Given a Lingua::TreeTagger::TaggedText object, return a reference to the array of units (e.g. wordforms). Optionally, return a second reference to the array of categories (e.g. lemmas).

The subroutine takes two required and two optional named parameters.

tagged_text (required)

The Lingua::TreeTagger::TaggedText object to be split.

unit (required)

The Lingua::TreeTagger::Token attribute (either original, lemma, or tag) that should be used to build the unit array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!

category

The Lingua::TreeTagger::Token attribute (either lemma or tag) that should be used to build the category array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!

condition

A reference to a hash specifying conditional inclusion or exclusion of tokens. The hash may have a mode key, a logical key and up to three keys among original, lemma, and tag:

mode

A string indicating whether the condition specifies which tokens should be included (value include) or excluded (value exclude). Default is include.

logical

A string indicating whether the conditions set with the original, lemma, and tag keys (see below) must all be satisfied (value and) or whether it suffices that one of them be satisfied (value or). Default is and.

original

A regular expression specifying the original attribute of tokens to be in-/excluded.

lemma

A regular expression specifying the lemma attribute of tokens to be in-/excluded.

tag

A regular expression specifying the tag attribute of tokens to be in-/excluded.
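
To make the interplay of mode, logical, and the attribute keys more concrete, here is a self-contained sketch of the selection logic as documented above. It uses plain hashes in place of Lingua::TreeTagger::Token objects, and the helpers keep_token and sketch_select are hypothetical; the treatment of a condition hash without any attribute key (every token is kept) is an assumption, not something this documentation specifies.

```perl
use strict;
use warnings;

# Plain hashes standing in for Lingua::TreeTagger::Token objects
# (an assumption made to keep the sketch self-contained).
my @tokens = (
    { original => 'the',    lemma => 'the',    tag => 'DT'  },
    { original => 'people', lemma => 'people', tag => 'NNS' },
    { original => 'voted',  lemma => 'vote',   tag => 'VVD' },
    { original => 'laws',   lemma => 'law',    tag => 'NNS' },
);

# Hypothetical helper: true if the token passes the condition hash.
sub keep_token {
    my ( $token, $condition ) = @_;
    my $mode    = $condition->{'mode'}    // 'include';    # documented default
    my $logical = $condition->{'logical'} // 'and';        # documented default

    my ( $tested, $matched ) = ( 0, 0 );
    for my $attribute (qw( original lemma tag )) {
        my $regexp = $condition->{$attribute};
        next if !defined $regexp;
        $tested++;
        $matched++ if $token->{$attribute} =~ $regexp;
    }
    return 1 if !$tested;    # assumption: no attribute test keeps every token

    my $satisfied = $logical eq 'and' ? $matched == $tested : $matched > 0;
    return $mode eq 'include' ? $satisfied : !$satisfied;
}

# Hypothetical helper returning parallel unit and category arrays.
sub sketch_select {
    my ( $unit, $category, $condition ) = @_;
    my @kept = grep { keep_token( $_, $condition ) } @tokens;
    return ( [ map { $_->{$unit} } @kept ], [ map { $_->{$category} } @kept ] );
}

# Include only nouns (default mode 'include', default logical 'and').
my ( $nouns, $noun_lemmas )
    = sketch_select( 'original', 'lemma', { tag => qr{^NNS?$} } );
print "@{$nouns} / @{$noun_lemmas}\n";    # people laws / people law

# Exclude tokens that are determiners OR whose lemma is 'vote'.
my ($others) = sketch_select( 'original', 'tag',
    { mode => 'exclude', logical => 'or', tag => qr{^DT$}, lemma => qr{^vote$} } );
print "@{$others}\n";                     # people laws
```

With logical set to and, the second call would exclude only tokens that are both determiners and have the lemma vote, so all four tokens would be kept.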

DIAGNOSTICS

Missing parameter 'text' in call to subroutine split_text()

This exception is raised when subroutine split_text() is called without a parameter named text (whose value should be a reference to a string).

Missing parameter 'tagged_text' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named tagged_text.

Parameter 'tagged_text' in call to subroutine split_tagged_text() must be a Lingua::TreeTagger::TaggedText object

This exception is raised when subroutine split_tagged_text() is called with a parameter named tagged_text whose value is not a Lingua::TreeTagger::TaggedText object.

Missing parameter 'unit' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named unit.

Parameter 'unit' in call to subroutine split_tagged_text() must be either 'original', 'lemma', or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named unit whose value is not original, lemma, or tag.

Parameter 'category' in call to subroutine split_tagged_text() must be either 'lemma' or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named category whose value is not lemma or tag.

Key 'mode' of hash 'condition' in call to subroutine split_tagged_text() must have value either 'include' or 'exclude'

This exception is raised when subroutine split_tagged_text() is called with a parameter named condition referring to a hash whose key mode has a value other than include or exclude.

Key 'logical' of hash 'condition' in call to subroutine split_tagged_text() must have value either 'and' or 'or'

This exception is raised when subroutine split_tagged_text() is called with a parameter named condition referring to a hash whose key logical has a value other than and or or.

DEPENDENCIES

This module is part of the Lingua::Diversity distribution. Some subroutines are designed to operate on Lingua::TreeTagger::TaggedText objects.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity, Lingua::TreeTagger.