NAME

Lingua::Diversity - Measuring diversity of text units

VERSION

This documentation refers to Lingua::Diversity version 0.03.

SYNOPSIS

    use Lingua::Diversity::MTLD;
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Create a Diversity object (here using method 'MTLD')...
    my $diversity = Lingua::Diversity::MTLD->new();

    # Given some text, get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'      => \$text,
        'regexp'    => qr{[^a-zA-Z]+},
    );

    # Measure lexical diversity...
    my $result = $diversity->measure( $word_array_ref );
    
    # Display results...
    print "Lexical diversity:       ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

    # Tag text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # Get references to an array of wordforms and an array of lemmas...
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Measure morphological diversity...
    $result = $diversity->measure_per_category(
        $wordform_array_ref,
        $lemma_array_ref,
    );

    # Display results...
    print "Morphological diversity: ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

DESCRIPTION

This distribution provides a simple object-oriented interface for applying various measures of diversity to text units. At present, the only implemented measure is MTLD (see Lingua::Diversity::MTLD); further measures are planned.

Note that the Lingua::Diversity class is meant to serve as a base class for classes such as Lingua::Diversity::MTLD, which implement specific diversity measures. Clients should always instantiate the specific classes instead of this one (see "SYNOPSIS").

METHODS

measure()

Apply the selected diversity measure and return the result in a new Lingua::Diversity::Result object.

The method requires a reference to a non-empty array of text units (typically words) as argument.

Units should be in the text's order, since some measures (e.g. MTLD) take it into account. Specific measures may set conditions on the minimal or maximal number of units and raise exceptions when these conditions are not met (see subroutine _validate_size() in Lingua::Diversity::Internals).

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the array of units.
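As a rough illustration of the kind of argument measure() expects, the following sketch builds an ordered, non-empty array of words using core Perl only. It reuses the text and regular expression from the "SYNOPSIS", but it is an approximation for illustration and not the implementation of split_text(); the variable names are assumptions.

```perl
use strict;
use warnings;

my $text = 'of the people, by the people, for the people';

# Split on runs of non-letters and drop empty strings (which appear,
# for instance, when the text starts with punctuation).
my @units = grep { length } split /[^a-zA-Z]+/, $text;

# measure() requires a reference to a non-empty array, in text order.
die "No units found in text\n" if !@units;
my $unit_array_ref = \@units;

print scalar( @units ), " units: @units\n";

# With the distribution installed, the reference could then be passed on:
# my $result = Lingua::Diversity::MTLD->new()->measure( $unit_array_ref );
```

The units are kept in their original order and repeated words are not removed, since an order-sensitive measure such as MTLD needs the full sequence.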

measure_per_category()

Apply the selected diversity measure per category and return the result in a new Lingua::Diversity::Result object. For instance, units might be wordforms and categories might be lemmas, so that the result would correspond to the diversity of wordforms per lemma (i.e. an estimate of the text's morphological diversity).

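The method requires two arguments: a reference to a non-empty array of units and a reference to an array of the corresponding categories (see "SYNOPSIS"). The two arrays must contain the same number of items, the category at a given position being that of the unit at the same position.

Units should be in the text's order, since some measures (e.g. MTLD) take it into account. Specific measures may set conditions on the minimal or maximal number of units and raise exceptions when these conditions are not met (see subroutine _validate_size() in Lingua::Diversity::Internals).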

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the arrays of units and categories.
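Since the unit and category arrays must line up item by item, it may be convenient to check them before calling measure_per_category(). The sketch below shows one way of doing so in core Perl; check_parallel_arrays() is a hypothetical helper that is not part of this distribution, and the sample wordforms and lemmas are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Diversity): dies unless both
# references point to non-empty arrays of the same length.
sub check_parallel_arrays {
    my ( $unit_array_ref, $category_array_ref ) = @_;
    my $num_units      = @{ $unit_array_ref };
    my $num_categories = @{ $category_array_ref };
    die "Unit array is empty\n" if $num_units == 0;
    die "Got $num_units units but $num_categories categories\n"
        if $num_units != $num_categories;
    return 1;
}

# Position i in one array describes the same token as position i in the other.
my @wordforms = qw( the cats were sleeping );
my @lemmas    = qw( the cat  be   sleep    );

check_parallel_arrays( \@wordforms, \@lemmas );
print "Arrays are aligned\n";
```

A check of this kind only compares lengths; it cannot detect a category array that has the right size but is shifted relative to the units.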

DIAGNOSTICS

Call to abstract method CLASS::METHOD

This exception is raised when method measure() or method measure_per_category() is called on a measure that does not support it.
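The general pattern behind this diagnostic can be sketched in plain Perl as follows. The classes My::Measure::Base and My::Measure::TypeCount are hypothetical and only imitate the message format given above; the distribution itself relies on Moose and Exception::Class, so its actual exception objects will differ from the plain strings used here.

```perl
use strict;
use warnings;

# Hypothetical base class: every measure method is abstract here.
package My::Measure::Base;
sub new { my ( $class ) = @_; return bless {}, $class; }
sub measure {
    my ( $self ) = @_;
    die 'Call to abstract method ' . ref( $self ) . "::measure\n";
}
sub measure_per_category {
    my ( $self ) = @_;
    die 'Call to abstract method ' . ref( $self ) . "::measure_per_category\n";
}

# Hypothetical subclass that supports measure() only (it counts distinct units).
package My::Measure::TypeCount;
our @ISA = ( 'My::Measure::Base' );
sub measure {
    my ( $self, $unit_array_ref ) = @_;
    my %seen;
    $seen{ $_ }++ for @{ $unit_array_ref };
    return scalar keys %seen;
}

package main;

my $measure = My::Measure::TypeCount->new();
my @units   = qw( of the people by the people );

print "Types: ", $measure->measure( \@units ), "\n";

# The unsupported method raises an exception, trapped here with eval.
my $ok    = eval { $measure->measure_per_category( \@units, \@units ); 1 };
my $error = $@;
print "Caught: $error" if !$ok;
```

Wrapping the call in eval lets a client fall back to another measure or report the problem instead of terminating.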

CONFIGURATION AND ENVIRONMENT

Some subroutines in module Lingua::Diversity::Utils require a working version of TreeTagger (available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger).

DEPENDENCIES

This is the base module of the Lingua::Diversity distribution, which comprises modules Lingua::Diversity::Result, Lingua::Diversity::Utils, Lingua::Diversity::Internals, Lingua::Diversity::X, and Lingua::Diversity::MTLD.

The Lingua::Diversity distribution uses CPAN modules Moose, Exception::Class, and optionally Lingua::TreeTagger.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity::MTLD, Lingua::Diversity::Result, Lingua::Diversity::Utils, Lingua::Diversity::Internals, Lingua::Diversity::X, and Lingua::TreeTagger.