NAME

Lingua::Diversity::MTLD - 'MTLD' method for measuring diversity of text units

VERSION

This documentation refers to Lingua::Diversity::MTLD version 0.04.

SYNOPSIS

    use Lingua::Diversity::MTLD;
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Create a Diversity object...
    my $diversity = Lingua::Diversity::MTLD->new(
        'threshold'         => 0.71,
        'weighting_mode'    => 'within_and_between',
    );

    # Given some text, get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'          => \$text,
        'unit_regexp'   => qr{[^a-zA-Z]+},
    );

    # Measure lexical diversity...
    my $result = $diversity->measure( $word_array_ref );
    
    # Display results...
    print "Lexical diversity:       ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

    # Tag text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # Get references to an array of wordforms and an array of lemmas...
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Measure morphological diversity...
    $result = $diversity->measure_per_category(
        $wordform_array_ref,
        $lemma_array_ref,
    );

    # Display results...
    print "Morphological diversity: ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

DESCRIPTION

This module implements the MTLD method for measuring the diversity of text units. MTLD stands for Measure of Textual Lexical Diversity, and is also known as LDAT (Lexical Diversity Assessment Tool); see McCarthy, P.M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2): 381-392.

The MTLD method is based on the type-token ratio (TTR) of a text, i.e. the ratio of the number of distinct words (or, more generally, distinct text units) to the total number of units. Leaving the details aside, the idea is to compute the average length of a sequence of contiguous text units that maintains a type-token ratio above a specified threshold, which McCarthy and Jarvis (2010) set to 0.72. They call such a sequence a 'factor' of the text.
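To make the notion of a factor concrete, here is a minimal sketch of a single left-to-right pass in plain Perl. It follows the usual textbook reading of McCarthy and Jarvis (2010), including their partial factor for leftover units, and is not taken from this module's source: the helper name mtld_one_pass() is hypothetical, and the sketch ignores the variance and the right-to-left pass.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Diversity): one left-to-right
# MTLD pass. Returns ( mtld_value, factor_count ); factor_count may be
# fractional because of the final partial factor.
sub mtld_one_pass {
    my ( $units_ref, $threshold ) = @_;

    my %seen;           # Types seen in the current factor.
    my $tokens  = 0;    # Tokens in the current factor.
    my $factors = 0;    # Completed factors, plus the final partial factor.

    for my $unit ( @{$units_ref} ) {
        $seen{$unit}++;
        $tokens++;
        my $ttr = scalar( keys %seen ) / $tokens;
        if ( $ttr <= $threshold ) {
            # TTR is no longer above the threshold: close this factor.
            $factors++;
            %seen   = ();
            $tokens = 0;
        }
    }

    # Leftover units form a partial factor, credited in proportion to how
    # far their TTR has fallen from 1 towards the threshold.
    if ( $tokens > 0 ) {
        my $ttr = scalar( keys %seen ) / $tokens;
        $factors += ( 1 - $ttr ) / ( 1 - $threshold );
    }

    # No factor at all (TTR never dropped below 1): undefined in this sketch.
    return ( undef, 0 ) if $factors == 0;

    return ( scalar( @{$units_ref} ) / $factors, $factors );
}

my @units = qw( of the people by the people for the people );
my ( $mtld, $factors ) = mtld_one_pass( \@units, 0.72 );
printf "MTLD: %.3f (factors: %.3f)\n", $mtld, $factors;
```

On this short example the TTR first drops to 4/6 at the sixth unit, which closes one factor; the remaining three units are all distinct and so contribute nothing to the factor count.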

The present implementation also returns the variance of factor length, as well as the number of observations, which in most cases will not be an integer (see the notion of partial factor in McCarthy and Jarvis (2010) for a detailed explanation of why this is so).

This implementation also attempts to generalize the authors' original idea to the computation of morphological diversity (see method measure_per_category() below).

CREATOR

The creator (new()) returns a new Lingua::Diversity::MTLD object. It takes two optional named parameters:

threshold

The TTR value which a sequence of contiguous text units must maintain to constitute a 'factor'. It must lie strictly between 0 and 1 (both bounds excluded). Default is 0.72.

weighting_mode

The computation of MTLD is performed twice, once in left-to-right text order and once in right-to-left text order. Each pass yields a weighted average (and variance), and the two averages are in turn averaged to get the value that is finally reported (the two variances are also averaged). This attribute indicates whether the reported average should itself be weighted according to the potentially different number of observations in the two passes (value 'within_and_between'), or not (value 'within_only'). The default value is 'within_only', to conform with McCarthy and Jarvis (2010), although the author of this implementation finds it more consistent to select 'within_and_between'.
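The difference between the two modes can be illustrated with a small sketch. The helper combine_passes() and the pass results below are made up for illustration and are not the module's internals; the sketch only assumes that 'within_and_between' weights each pass average by its number of observations, as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: combine the averages of the two passes. Each pass
# is a hash ref with keys 'mean' and 'count' (number of observations).
sub combine_passes {
    my ( $forward, $backward, $weighting_mode ) = @_;

    if ( $weighting_mode eq 'within_and_between' ) {
        # Weight each pass average by its number of observations.
        my $weighted_sum = $forward->{mean}  * $forward->{count}
                         + $backward->{mean} * $backward->{count};
        my $total_count  = $forward->{count} + $backward->{count};
        return $weighted_sum / $total_count;
    }

    # 'within_only': plain, unweighted mean of the two pass averages.
    my $sum = $forward->{mean} + $backward->{mean};
    return $sum / 2;
}

# Made-up pass results, for illustration only.
my %forward  = ( mean => 10, count => 3 );
my %backward = ( mean => 20, count => 1 );

my $within_only        = combine_passes( \%forward, \%backward, 'within_only' );
my $within_and_between = combine_passes( \%forward, \%backward, 'within_and_between' );

print "within_only:        $within_only\n";
print "within_and_between: $within_and_between\n";
```

With these numbers the unweighted mean is 15, whereas the weighted mean is 12.5, because the forward pass contributes three observations and the backward pass only one.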

ACCESSORS

get_threshold() and set_threshold()

Getter and setter for the threshold attribute.

get_weighting_mode() and set_weighting_mode()

Getter and setter for the weighting_mode attribute.

METHODS

measure()

Apply the MTLD measure and return the result in a new Lingua::Diversity::Result object. The result includes the average, variance, and number of observations.

The method requires a reference to a non-empty array of text units (typically words) as argument. Units should be in the text's order.
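Since the argument is just a reference to an array of units in text order, such an array can also be built with core Perl alone. The following is a minimal sketch that assumes units are runs of ASCII letters; it is an illustration and not a substitute for the utilities shipped with the distribution.

```perl
use strict;
use warnings;

my $text = 'of the people, by the people, for the people';

# Split on runs of non-letters and drop any empty leading field, keeping
# the units in the order in which they occur in the text.
my @words          = grep { length } split /[^a-zA-Z]+/, $text;
my $word_array_ref = \@words;

print scalar( @{$word_array_ref} ), " units: @{$word_array_ref}\n";
```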

The Lingua::Diversity::Utils module contained within the Lingua::Diversity distribution provides tools for helping with the creation of the array of units.

measure_per_category()

Apply the diversity measure per category and return the result in a new Lingua::Diversity::Result object. For instance, units might be wordforms and categories might be lemmas, so that the result would correspond to the diversity of wordforms per lemma (i.e. an estimate of the text's morphological diversity). The result includes the average, variance, and number of observations.

The original method described by McCarthy and Jarvis (2010) is modified by replacing the type count in the type-token ratio with the number of unit types (e.g. wordform types) divided by the number of category types (e.g. lemma types). First experiments suggest that the default threshold value (0.72) is inappropriate in this case, and should be replaced by a much smaller value (e.g. 0.1).
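As an illustration of the modified ratio, the sketch below computes it for a single run of units and categories. The helper per_category_ratio() is hypothetical and reflects one reading of the sentence above (unit types divided by category types, then divided by the number of tokens); how the module tracks these counts within each factor is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: modified ratio for one run of units/categories,
# i.e. ( unit types / category types ) / number of tokens.
sub per_category_ratio {
    my ( $units_ref, $categories_ref ) = @_;

    my ( %unit_types, %category_types );
    $unit_types{$_}++     for @{$units_ref};
    $category_types{$_}++ for @{$categories_ref};

    my $types_per_category
        = scalar( keys %unit_types ) / scalar( keys %category_types );

    return $types_per_category / scalar( @{$units_ref} );
}

# Made-up example: five distinct wordforms belonging to two lemmas.
my @wordforms = qw( walks walked walking runs ran );
my @lemmas    = qw( walk  walk   walk    run  run );

my $ratio = per_category_ratio( \@wordforms, \@lemmas );
print "Modified ratio: $ratio\n";
```

Here the plain type-token ratio would be 5/5 = 1, while the modified ratio is (5/2)/5 = 0.5. The modified ratio is thus lower from the start and keeps shrinking as tokens accumulate, which is consistent with the remark that a much smaller threshold is needed.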

The method requires a reference to a non-empty array of text units and a reference to a non-empty array of categories as arguments. Units and categories should be in the text's order and in one-to-one correspondence (so that there should be the same number of items in the unit and category arrays).

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the array of units and categories.

DIAGNOSTICS

Method [measure()/measure_per_category()] must be called with a reference to an array as 1st argument

This exception is raised when either method measure() or method measure_per_category() is called without a reference to an array as a first argument.

Method measure_per_category() must be called with a reference to an array as 2nd argument

This exception is raised when method measure_per_category() is called without a reference to an array as a second argument.

Method [measure()/measure_per_category()] was called with an array containing 0 item(s) while this measure requires at least 1 item(s)

This exception is raised when either method measure() or method measure_per_category() is called with an empty array as argument.

DEPENDENCIES

This module is part of the Lingua::Diversity distribution, and extends Lingua::Diversity.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity