NAME

Lingua::Diversity::Utils - utility subroutines for users of classes derived from Lingua::Diversity

VERSION

This documentation refers to Lingua::Diversity::Utils version 0.03.

SYNOPSIS

    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'      => \$text,
        'regexp'    => qr{[^a-zA-Z]+},
    );

    # Alternatively, tag the text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # ... get a reference to an array of words...
    $word_array_ref = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
    );

    # ... or get a reference to an array of wordforms and an array of lemmas.
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Conditions may be imposed on the selection of tokens...
    ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
        'condition'     => {
            'tag'       => qr{^NNS?$},
        },
    );

DESCRIPTION

This module provides utility subroutines intended to facilitate the use of a class derived from Lingua::Diversity.

SUBROUTINES

split_text()

Split a text into units (typically words), delete empty units, and return a reference to the array of units.

The subroutine requires one named parameter and may take up to two of them.

text (required)

A reference to the text to be split.

regexp

A reference to a regular expression describing unit delimiter sequences. Default is qr{\s+}.
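The behaviour described above (split on the delimiter, then discard empty units) can be approximated in core Perl. The following is a minimal sketch written only from this description; sketch_split_text() is a hypothetical stand-in, not the module's source code, and the sample text is the one from the SYNOPSIS with two leading spaces added.

```perl
use strict;
use warnings;

# Hypothetical stand-in for split_text(), written only from the description
# above; it is not the module's source code.
sub sketch_split_text {
    my (%parameter) = @_;

    die "Missing parameter 'text'\n" if !defined $parameter{'text'};

    # Fall back on the documented default delimiter.
    my $regexp = $parameter{'regexp'} // qr{\s+};

    # Split the referenced string, then drop empty units (e.g. the one that
    # a leading delimiter produces).
    my @units = grep { length $_ } split $regexp, ${ $parameter{'text'} };

    return \@units;
}

my $text = '  of the people, by the people, for the people';

# Default delimiter: punctuation stays attached to the words.
my $default_units_ref = sketch_split_text( 'text' => \$text );

# Any run of non-letters is a delimiter: punctuation is discarded.
my $letter_units_ref = sketch_split_text(
    'text'   => \$text,
    'regexp' => qr{[^a-zA-Z]+},
);

print scalar( @{$default_units_ref} ), " units: @{$default_units_ref}\n";
print scalar( @{$letter_units_ref} ),  " units: @{$letter_units_ref}\n";
```

In this sketch both calls yield nine units; the difference is that the default delimiter leaves 'people,' with its comma, which is why the SYNOPSIS passes a custom 'regexp'.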

split_tagged_text()

Given a Lingua::TreeTagger::TaggedText object, return a reference to the array of units (e.g. wordforms). Optionally, return a second reference to the array of categories (e.g. lemmas).

The subroutine requires two named parameters and may take up to four of them.

tagged_text (required)

The Lingua::TreeTagger::TaggedText object to be split.

unit (required)

The Lingua::TreeTagger::Token attribute (either 'original', 'lemma', or 'tag') that should be used to build the unit array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!

category

The Lingua::TreeTagger::Token attribute (either 'lemma' or 'tag') that should be used to build the category array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!

condition

A reference to a hash specifying conditional inclusion or exclusion of tokens. The hash may have a 'mode' key, a 'logical' key and up to three keys among 'original', 'lemma', and 'tag':

mode

A string indicating whether the condition specifies which tokens should be included (value 'include') or excluded (value 'exclude'). Default value is 'include'.

logical

A string indicating whether the conditions set with the 'original', 'lemma', and 'tag' keys (see below) must all be satisfied (value 'and') or whether it suffices that one of them is satisfied (value 'or'). Default value is 'and'.

original

A regular expression specifying the 'original' attribute of tokens to be included/excluded.

lemma

A regular expression specifying the 'lemma' attribute of tokens to be included/excluded.

tag

A regular expression specifying the 'tag' attribute of tokens to be included/excluded.
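One plausible reading of how 'unit', 'category', and 'condition' interact is sketched below in core Perl. Everything here is an assumption made for illustration: the tokens are plain hashes with made-up tags standing in for Lingua::TreeTagger::Token objects, and token_is_kept() and sketch_split_tokens() are hypothetical helpers, not part of the module.

```perl
use strict;
use warnings;

# Made-up tokens: plain hashes standing in for Lingua::TreeTagger::Token
# objects (the tags are illustrative, not actual TreeTagger output).
my @tokens = (
    { 'original' => 'the',     'lemma' => 'the',   'tag' => 'DT'  },
    { 'original' => 'votes',   'lemma' => 'vote',  'tag' => 'NNS' },
    { 'original' => 'were',    'lemma' => 'be',    'tag' => 'VBD' },
    { 'original' => 'counted', 'lemma' => 'count', 'tag' => 'VVN' },
);

# Hypothetical helper: is this token kept under the given condition hash?
sub token_is_kept {
    my ( $token, $condition ) = @_;

    my $mode    = $condition->{'mode'}    // 'include';
    my $logical = $condition->{'logical'} // 'and';

    # Only attributes for which a regular expression was supplied count.
    my @attributes = grep { defined $condition->{$_} } qw( original lemma tag );
    my $num_matches = grep { $token->{$_} =~ $condition->{$_} } @attributes;

    # 'and': every supplied pattern must match; 'or': one match suffices.
    my $is_selected = $logical eq 'and'
                    ? $num_matches == @attributes
                    : $num_matches > 0;

    return $mode eq 'include' ? $is_selected : !$is_selected;
}

# Hypothetical helper: build the unit array and the parallel category array.
sub sketch_split_tokens {
    my (%parameter) = @_;

    my $condition = $parameter{'condition'} // {};
    my $unit      = $parameter{'unit'};
    my $category  = $parameter{'category'};

    my @kept       = grep { token_is_kept( $_, $condition ) } @{ $parameter{'tokens'} };
    my @units      = map  { $_->{$unit} }     @kept;
    my @categories = map  { $_->{$category} } @kept;

    return ( \@units, \@categories );
}

# Keep nouns only.
my ( $noun_units_ref, $noun_lemmas_ref ) = sketch_split_tokens(
    'tokens'    => \@tokens,
    'unit'      => 'original',
    'category'  => 'lemma',
    'condition' => { 'tag' => qr{^NNS?$} },
);

# Drop every token that is a determiner OR whose lemma is 'be'.
my ( $other_units_ref, $other_lemmas_ref ) = sketch_split_tokens(
    'tokens'    => \@tokens,
    'unit'      => 'original',
    'category'  => 'lemma',
    'condition' => {
        'mode'    => 'exclude',
        'logical' => 'or',
        'tag'     => qr{^DT$},
        'lemma'   => qr{^be$},
    },
);

print "nouns:  @{$noun_units_ref} (@{$noun_lemmas_ref})\n";
print "others: @{$other_units_ref} (@{$other_lemmas_ref})\n";
```

The two arrays are aligned by position: the category at index i belongs to the unit at index i, which is what allows a unit array and a category array to be used together.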

DIAGNOSTICS

Missing parameter 'text' in call to subroutine split_text()

This exception is raised when subroutine split_text() is called without a parameter named 'text' (whose value should be a reference to a string).

Missing parameter 'tagged_text' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named 'tagged_text'.

Parameter 'tagged_text' in call to subroutine split_tagged_text() must be a Lingua::TreeTagger::TaggedText object

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'tagged_text' whose value is not a Lingua::TreeTagger::TaggedText object.

Missing parameter 'unit' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named 'unit'.

Parameter 'unit' in call to subroutine split_tagged_text() must be either 'original', 'lemma', or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'unit' whose value is not 'original', 'lemma', or 'tag'.

Parameter 'category' in call to subroutine split_tagged_text() must be either 'lemma' or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'category' whose value is not 'lemma' or 'tag'.
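The exceptions listed above can be trapped with a block eval. The sketch below is hedged in two ways: it loads the module at run time so that it still runs where Lingua::Diversity is not installed, and it only stringifies whatever was raised, since the exact form of the exception (plain string or object) is not specified here.

```perl
use strict;
use warnings;

# Load the module at run time so that this sketch still runs (and says so)
# on a system where Lingua::Diversity is not installed.
my $is_available = eval { require Lingua::Diversity::Utils; 1 };

my $report;
if ( !$is_available ) {
    $report = 'Lingua::Diversity::Utils is not installed';
}
else {
    # Deliberately omit the required 'text' parameter.
    my $units_ref = eval { Lingua::Diversity::Utils::split_text() };
    $report = defined $units_ref
            ? 'no exception raised'
            : "caught: $@";
}

print "$report\n";
```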

DEPENDENCIES

This module is part of the Lingua::Diversity distribution. Some subroutines are designed to operate on Lingua::TreeTagger::TaggedText objects.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity, Lingua::TreeTagger.