NAME

Lingua::Diversity::VOCD - 'VOCD' method for measuring diversity of text units

VERSION

This documentation refers to Lingua::Diversity::VOCD version 0.03

SYNOPSIS

    use Lingua::Diversity::VOCD;
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Create a Diversity object...
    my $diversity = Lingua::Diversity::VOCD->new(
        'length_range'      => [ 35..50 ],
        'num_subsamples'    => 100,
        'min_value'         => 1,
        'max_value'         => 200,
        'precision'         => 0.01,
        'num_trials'        => 3
    );

    # Given some text, get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'          => \$text,
        'unit_regexp'   => qr{[^a-zA-Z]+},
    );

    # Measure lexical diversity...
    my $result = $diversity->measure( $word_array_ref );
    
    # Display results...
    print "Lexical diversity:       ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

    # Tag text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # Get references to an array of wordforms and an array of lemmas...
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Measure morphological diversity...
    $result = $diversity->measure_per_category(
        $wordform_array_ref,
        $lemma_array_ref,
    );

    # Display results...
    print "Morphological diversity: ", $result->get_diversity(), "\n";
    print "Variance:                ", $result->get_variance(),  "\n";

DESCRIPTION

This module implements the 'VOCD' method for measuring the diversity of text units; cf. McKee, G., Malvern, D., & Richards, B. (2000). Measuring Vocabulary Diversity Using Dedicated Software. Literary and Linguistic Computing, 15(3): 323-337.

In a nutshell, the method works as follows. A number of subsamples of 35, 36, ..., 49, and 50 tokens are drawn at random from the data, and the average type-token ratio is computed for each of these lengths. The resulting empirical type-token ratio curve is then compared with a family of theoretical curves, generated by expressions that differ only in the value of a single parameter, and the best-fitting curve is selected. The parameter value corresponding to the best-fitting curve is reported as the diversity measurement. The whole procedure can be repeated several times and the results averaged.
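To make the sampling step described above concrete, here is a minimal sketch in plain Perl. It does not reproduce this module's internals: the helper names type_token_ratio() and average_ttr(), the toy data, and the choice to draw subsamples without replacement are all assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util qw( shuffle sum );

# Type-token ratio of a list of tokens: distinct tokens / total tokens.
sub type_token_ratio {
    my @tokens = @_;
    my %seen;
    $seen{$_} = 1 for @tokens;
    my $num_types = keys %seen;
    return $num_types / scalar( @tokens );
}

# Average TTR over $num_subsamples random subsamples of $length tokens.
# Subsamples are drawn without replacement (an assumption of this sketch).
sub average_ttr {
    my ( $tokens_ref, $length, $num_subsamples ) = @_;
    my @ttrs;
    for ( 1 .. $num_subsamples ) {
        my @subsample = ( shuffle @$tokens_ref )[ 0 .. $length - 1 ];
        push @ttrs, type_token_ratio( @subsample );
    }
    return sum( @ttrs ) / scalar( @ttrs );
}

# Hypothetical toy data: 60 tokens over a 12-word vocabulary.
my @vocab  = qw( a b c d e f g h i j k l );
my @tokens = map { $vocab[ $_ % 12 ] } 0 .. 59;

# Empirical TTR curve for lengths 35..50 (the default length_range).
my %empirical_ttr = map { $_ => average_ttr( \@tokens, $_, 100 ) } 35 .. 50;

printf "%d\t%.4f\n", $_, $empirical_ttr{$_}
    for sort { $a <=> $b } keys %empirical_ttr;
```

The printed table is the empirical curve to which a theoretical curve would then be fitted. Because the subsamples are random, the values vary slightly from run to run.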

This implementation also attempts to generalize the authors' original idea to the computation of morphological diversity (see method measure_per_category() below).

NB: The curve-fitting procedure used in this implementation assumes that there is no use in trying larger values of the parameter once the sum of squared residuals has stopped decreasing. This assumption is based on a limited number of tests (as opposed to function analysis), so while it speeds up processing greatly, it might prove wrong. If you have analytical or empirical reasons to think the assumption is wrong, please let the author know; he will be glad to fix the code accordingly.
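The early-stopping idea can be sketched as follows. This is not the module's code: the function names are hypothetical, and the curve family TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1), with D as the single parameter, is assumed here to be the one from McKee et al. (2000), since this documentation does not spell it out.

```perl
use strict;
use warnings;

# Theoretical TTR at sample length $n for parameter $d (assumed model).
sub model_ttr {
    my ( $d, $n ) = @_;
    return ( $d / $n ) * ( sqrt( 1 + 2 * $n / $d ) - 1 );
}

# Sum of squared residuals between observed and theoretical TTR values.
sub sum_squared_residuals {
    my ( $d, $observed_ref ) = @_;
    my $ssr = 0;
    for my $n ( keys %$observed_ref ) {
        $ssr += ( $observed_ref->{$n} - model_ttr( $d, $n ) ) ** 2;
    }
    return $ssr;
}

# Scan candidate values from $min upward in steps of $step, and stop as
# soon as the sum of squared residuals stops decreasing.
sub fit_parameter {
    my ( $observed_ref, $min, $max, $step ) = @_;
    my $best_d    = $min;
    my $best_ssr  = sum_squared_residuals( $min, $observed_ref );
    my $num_steps = int( ( $max - $min ) / $step + 0.5 );
    for my $i ( 1 .. $num_steps ) {
        my $d   = $min + $i * $step;
        my $ssr = sum_squared_residuals( $d, $observed_ref );
        last if $ssr >= $best_ssr;    # early stop: no further improvement
        ( $best_d, $best_ssr ) = ( $d, $ssr );
    }
    return $best_d;
}

# Synthetic check: noise-free observations generated with D = 20.
my %observed = map { $_ => model_ttr( 20, $_ ) } 35 .. 50;
my $fitted   = fit_parameter( \%observed, 1, 200, 0.01 );
printf "Fitted parameter: %.2f\n", $fitted;
```

On noise-free data such as this, the residuals decrease steadily up to the true value, so the scan stops just past it. With real, noisy averages the residual curve could in principle have a local minimum, which is exactly the risk the note above describes.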

CREATOR

The creator (new()) returns a new Lingua::Diversity::VOCD object. It takes six optional named parameters:

length_range

A reference to an array specifying the lengths at which the data should be sampled to estimate the growth of type-token ratio. Default is [ 35..50 ].

num_subsamples

The number of subsamples to be drawn for each length in length_range. Default is 100.

min_value

The minimal parameter value that can be tried during curve-fitting (a positive number). Default is 0.01.

max_value

The maximal parameter value that can be tried during curve-fitting (a positive number). Default is 200.

precision

The amount by which the parameter value is changed at each iteration of the curve-fitting procedure (a positive number). Default is 0.01.

num_trials

The number of times that the whole procedure is repeated (a positive number). Default is 3.

ACCESSORS

get_length_range() and set_length_range()

Getter and setter for the length_range attribute.

get_num_subsamples() and set_num_subsamples()

Getter and setter for the num_subsamples attribute.

get_min_value() and set_min_value()

Getter and setter for the min_value attribute.

get_max_value() and set_max_value()

Getter and setter for the max_value attribute.

get_precision() and set_precision()

Getter and setter for the precision attribute.

get_num_trials() and set_num_trials()

Getter and setter for the num_trials attribute.

METHODS

measure()

Apply the VOCD algorithm and return the result in a new Lingua::Diversity::Result object. The result includes the average, variance, and number of observations (i.e. trials in this case).

The method requires a reference to a non-empty array of text units (typically words) as argument. Units need not be in the text's order.

The Lingua::Diversity::Utils module contained within the Lingua::Diversity distribution provides tools for helping with the creation of the array of units.

measure_per_category()

Apply the VOCD algorithm per category and return the result in a new Lingua::Diversity::Result object. For instance, units might be wordforms and categories might be lemmas, so that the result would correspond to the diversity of wordforms per lemma (i.e. an estimate of the text's morphological diversity). The result includes the average, variance, and number of observations (i.e. trials in this case).

The original method described by McKee, Malvern, & Richards (2000) is modified by replacing the type count in the type-token ratio with the number of unit types (e.g. wordform types) divided by the number of category types (e.g. lemma types).

The method requires a reference to a non-empty array of text units and a reference to a non-empty array of categories as arguments. Units and categories need not be in the text's order, but they must be in one-to-one correspondence, i.e. the unit and category arrays must contain the same number of items, with each unit at the same position as its category.
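As a rough illustration of the modified ratio described above, the following sketch computes it for a single sample. The function name per_category_ratio() and the toy wordform/lemma data are hypothetical, and the sketch assumes that unit types are counted as distinct unit strings; the module itself may count them differently.

```perl
use strict;
use warnings;

# Modified ratio: (unit types / category types) / number of tokens.
sub per_category_ratio {
    my ( $units_ref, $categories_ref ) = @_;

    # Units and categories must be in one-to-one correspondence.
    die "Unit and category arrays differ in length\n"
        if scalar( @$units_ref ) != scalar( @$categories_ref );

    my ( %unit_types, %category_types );
    $unit_types{$_}     = 1 for @$units_ref;
    $category_types{$_} = 1 for @$categories_ref;

    my $num_unit_types     = keys %unit_types;
    my $num_category_types = keys %category_types;

    return ( $num_unit_types / $num_category_types ) / scalar( @$units_ref );
}

# Hypothetical sample: 6 tokens, 5 wordform types, 2 lemma types.
my @wordforms = qw( walk walks walked talk talks walk );
my @lemmas    = qw( walk walk  walk   talk talk  walk );

printf "Wordform types per lemma type, per token: %.4f\n",
    per_category_ratio( \@wordforms, \@lemmas );
```

In the sampling procedure, a ratio of this kind would take the place of the plain type-token ratio for each subsample.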

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the array of units and categories.

DIAGNOSTICS

Method [measure()/measure_per_category()] must be called with a reference to an array as 1st argument

This exception is raised when either method measure() or method measure_per_category() is called without a reference to an array as a first argument.

Method measure_per_category() must be called with a reference to an array as 2nd argument

This exception is raised when method measure_per_category() is called without a reference to an array as a second argument.

Method [measure()/measure_per_category()] was called with an array containing M item(s) while this measure requires at least N item(s)

This exception is raised when either method measure() or method measure_per_category() is called with an array argument that contains fewer items than the upper limit of the length_range attribute (50 with the default length_range).
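To avoid triggering this last exception, the length of the unit array can be checked before measuring. The sketch below is plain Perl and does not call the module; the helper name has_enough_units() is hypothetical, and the comparison simply mirrors the condition stated above.

```perl
use strict;
use warnings;
use List::Util qw( max );

# True if the unit array holds at least as many items as the largest
# sampling length in the length range.
sub has_enough_units {
    my ( $units_ref, $length_range_ref ) = @_;
    return scalar( @$units_ref ) >= max( @$length_range_ref );
}

my @length_range = ( 35 .. 50 );
my @units = split /[^a-zA-Z]+/,
    'of the people, by the people, for the people';

if ( has_enough_units( \@units, \@length_range ) ) {
    print "Enough units to measure diversity\n";
}
else {
    printf "Only %d unit(s); at least %d required\n",
        scalar( @units ), max( @length_range );
}
```

Note that the short sample sentence used here has only nine words, so with a length range of 35..50 it would be too short to measure.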

DEPENDENCIES

This module is part of the Lingua::Diversity distribution, and extends Lingua::Diversity.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity