NAME

Lingua::Diversity::Variety - measuring the variety of text units

VERSION

This documentation refers to Lingua::Diversity::Variety version 0.03.

SYNOPSIS

    use Lingua::Diversity::Variety;
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Create a Diversity object...
    my $diversity = Lingua::Diversity::Variety->new();

    # Given some text, get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'          => \$text,
        'unit_regexp'   => qr{[^a-zA-Z]+},
    );

    # Measure lexical diversity...
    my $result = $diversity->measure( $word_array_ref );
    
    # Display results...
    print "Lexical diversity:       ", $result->get_diversity(), "\n";

    # Tag text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # Get references to an array of wordforms and an array of lemmas...
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Measure morphological diversity...
    $result = $diversity->measure_per_category(
        $wordform_array_ref,
        $lemma_array_ref,
    );

    # Display results...
    print "Morphological diversity: ", $result->get_diversity(), "\n";

DESCRIPTION

This module computes the variety of text units, which is the number of distinct units (i.e. unit "types") in a text, or the average number of unit types per category.

It provides independent controls for weighting units according to their relative frequency (yielding the so-called perplexity measure, i.e. the exponential of the entropy) and for weighting categories according to their relative frequency. In this documentation, unless specified otherwise, the term variety will be used to cover all four possible combinations of unit and category weighting.

The module includes a number of predefined transforms that can be applied to variety to obtain derived indices such as type-token ratio, mean frequency, and so on. Users may also define custom transforms.

Furthermore, the user may request variety to be computed on subsamples of the text, which yields an estimate of the average variety per subsample (see e.g. Xanthos, A., & Gillis, S. (2010). Quantifying the development of inflectional diversity. First Language, 30(2): 175-198).

CREATOR

The creator (new()) returns a new Lingua::Diversity::Variety object. It takes four optional named parameters:

unit_weighting

A boolean value indicating whether the relative frequency of unit types should be taken into account in the computation. If so, the module will compute the perplexity of units instead of their variety, i.e. the exponential of the Shannon entropy (in nats), which is expressed on the same scale as variety. Default is 0 (no unit weighting).

Perplexity tends toward a minimal value of 1 when the text contains very few occurrences of all unit types but one very frequent one; its maximal value is the variety, which is attained when all types have the same frequency.

Intuitively, using perplexity instead of variety (i.e. setting unit_weighting to 1 instead of 0) amounts to considering that a text with 10 occurrences of word a, 1 occurrence of word b, and 1 occurrence of word c, has a lesser diversity than a text where each of these words has a frequency of 4.
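The comparison above can be checked with a minimal, self-contained sketch that does not call the module. The helper names variety() and perplexity() are hypothetical; they simply follow the definitions given in this section (number of distinct types, and exponential of the Shannon entropy in nats).

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration, not part of Lingua::Diversity.

# Variety: the number of distinct unit types.
sub variety {
    my ($unit_array_ref) = @_;
    my %count;
    $count{$_}++ for @{$unit_array_ref};
    return scalar( keys %count );
}

# Perplexity: exponential of the Shannon entropy (in nats) of unit types.
sub perplexity {
    my ($unit_array_ref) = @_;
    my %count;
    $count{$_}++ for @{$unit_array_ref};
    my $num_tokens = @{$unit_array_ref};
    my $entropy    = 0;
    for my $frequency ( values %count ) {
        my $p = $frequency / $num_tokens;
        $entropy -= $p * log($p);
    }
    return exp($entropy);
}

# 10 x 'a', 1 x 'b', 1 x 'c' versus 4 occurrences of each word.
my @skewed  = ( ('a') x 10, 'b', 'c' );
my @uniform = ( ('a') x 4, ('b') x 4, ('c') x 4 );

printf "skewed:  variety %d, perplexity %.2f\n",
    variety( \@skewed ), perplexity( \@skewed );
printf "uniform: variety %d, perplexity %.2f\n",
    variety( \@uniform ), perplexity( \@uniform );
```

Both texts have a variety of 3, but the perplexity of the skewed text is roughly 1.76, whereas that of the uniform text reaches the maximum of 3.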

category_weighting

A boolean value indicating whether the relative frequency of category types should be taken into account in the computation (this is relevant only for method measure_per_category()). If so, the module will compute the weighted average of the variety per category instead of the unweighted average. Default is 0 (no category weighting).

Intuitively, using a weighted average (i.e. setting category_weighting to 1 instead of 0) amounts to considering that the variety of units within a category that has relative frequency 0.9 should contribute more to the overall diversity (per category) than the variety of units within a category with relative frequency 0.1.
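The difference between the two averages can be illustrated with a small, self-contained sketch. The helper average_variety_per_category() is hypothetical and is not the module's implementation; it assumes that units and categories are given as two parallel arrays, and that the weight of a category is its share of the tokens.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Diversity: returns the unweighted
# and the frequency-weighted average number of unit types per category.
sub average_variety_per_category {
    my ( $unit_array_ref, $category_array_ref ) = @_;

    # Record which unit types occur in each category, and how many tokens.
    my ( %types_in, %tokens_in );
    for my $i ( 0 .. $#{$unit_array_ref} ) {
        my $category = $category_array_ref->[$i];
        $types_in{$category}{ $unit_array_ref->[$i] } = 1;
        $tokens_in{$category}++;
    }

    my $num_tokens     = @{$unit_array_ref};
    my $num_categories = keys %types_in;
    my ( $unweighted, $weighted ) = ( 0, 0 );
    for my $category ( keys %types_in ) {
        my $variety = keys %{ $types_in{$category} };
        $unweighted += $variety / $num_categories;
        $weighted   += $variety * $tokens_in{$category} / $num_tokens;
    }
    return ( $unweighted, $weighted );
}

# Lemma 'be' has relative frequency 0.9 and 3 wordforms;
# lemma 'cat' has relative frequency 0.1 and 1 wordform.
my @wordforms = qw( is are was is is are was is is cat );
my @lemmas    = ( ('be') x 9, 'cat' );

my ( $unweighted, $weighted )
    = average_variety_per_category( \@wordforms, \@lemmas );
printf "unweighted %.1f, weighted %.1f\n", $unweighted, $weighted;
```

Here the unweighted average is (3 + 1) / 2 = 2, while the weighted average is 0.9 * 3 + 0.1 * 1 = 2.8, since the frequent category contributes more.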

transform

This parameter specifies the transform that should be applied to variety, if any. Transforms are various functions of variety and text length (i.e. number of types and tokens respectively).

The value of this parameter can be a reference to a (possibly anonymous) subroutine. The subroutine will be passed the variety as first argument and the number of tokens as second argument. It should return the transformed variety.

Alternatively, the value of this parameter can be a string referring to one of the following predefined transforms, where M stands for the variety and N stands for the number of tokens:

none (default value)

No transform.

type_token_ratio

M / N

mean_frequency

N / M = 1 / type_token_ratio

guiraud

M / sqrt( N )

herdan

ln( M ) / ln( N )

rubet

ln( M ) / ln( ln( N ) )

maas

( ln( N ) - ln( M ) ) / ln( N )^2

dugast

ln( N )^2 / ( ln( N ) - ln( M ) ) = 1 / maas

lukjanenkov_nesitoj

( 1 - ln( M )^2 ) / ( ln( N ) * ln( M )^2 )
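As a rough illustration, some of the formulas listed above can be written as a table of subroutines taking the variety M and the number of tokens N. The hash %transform_for is a hypothetical name used for this sketch only; it covers a subset of the predefined transforms and is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical dispatch table mirroring some of the formulas listed above;
# each sub receives the variety M and the number of tokens N.
my %transform_for = (
    none             => sub { my ( $m, $n ) = @_; return $m },
    type_token_ratio => sub { my ( $m, $n ) = @_; return $m / $n },
    mean_frequency   => sub { my ( $m, $n ) = @_; return $n / $m },
    guiraud          => sub { my ( $m, $n ) = @_; return $m / sqrt($n) },
    herdan           => sub { my ( $m, $n ) = @_; return log($m) / log($n) },
    rubet            => sub {
        my ( $m, $n ) = @_;
        return log($m) / log( log($n) );
    },
    maas => sub {
        my ( $m, $n ) = @_;
        return ( log($n) - log($m) ) / log($n)**2;
    },
);

# 'of the people, by the people, for the people': 5 types, 9 tokens.
my ( $variety, $num_tokens ) = ( 5, 9 );
for my $name ( sort keys %transform_for ) {
    printf "%-17s %.4f\n", $name,
        $transform_for{$name}->( $variety, $num_tokens );
}
```

A subroutine with the same two-argument signature is what the transform parameter expects when a custom transform is supplied.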

sampling_scheme

This parameter indicates which form of resampling should be applied, if any. Default is undef, i.e. no resampling. Otherwise, the value must be a Lingua::Diversity::SamplingScheme object. In this case, the reported diversity will be the average number of distinct unit types in subsamples of a given size (possibly per category); cf. Xanthos, A., & Gillis, S. (2010). Quantifying the development of inflectional diversity. First Language, 30(2): 175-198.

Note that resampling and averaging (if any) are applied after any transform specified by the transform parameter, so that the reported variety is the average transformed diversity per subsample and the reported variance is the variance of the transformed diversity. The number of tokens used in the transform is set to the value of the subsample_size attribute of the sampling scheme (see Lingua::Diversity::SamplingScheme).
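The order of operations described above (transform each subsample, then average) can be sketched as follows. This is not the module's implementation: the helper average_transformed_variety() is hypothetical, the subsamples are assumed to be consecutive, non-overlapping segments with any incomplete tail dropped, and the variance is assumed to be the population variance. The actual behavior is defined by Lingua::Diversity::SamplingScheme and may differ.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Hypothetical helper: cut the unit array into consecutive, non-overlapping
# subsamples of $subsample_size units (dropping any incomplete tail), apply
# the transform to each subsample's variety, then average the results.
sub average_transformed_variety {
    my ( $unit_array_ref, $subsample_size, $transform_ref ) = @_;

    my @values;
    my $last_start = @{$unit_array_ref} - $subsample_size;
    for ( my $start = 0; $start <= $last_start; $start += $subsample_size ) {
        my $end = $start + $subsample_size - 1;
        my %seen;
        $seen{$_} = 1 for @{$unit_array_ref}[ $start .. $end ];

        # The transform comes first, with N set to the subsample size.
        push @values, $transform_ref->( scalar( keys %seen ), $subsample_size );
    }
    die "Not enough units for a single subsample\n" if !@values;

    # Assumption: population variance (division by the number of subsamples).
    my $mean     = sum(@values) / @values;
    my $variance = sum( map { ( $_ - $mean )**2 } @values ) / @values;
    return ( $mean, $variance, scalar(@values) );
}

my @units = qw( of the people by the people for the people );
my ( $mean, $variance, $count ) = average_transformed_variety(
    \@units, 4, sub { my ( $m, $n ) = @_; return $m / $n },
);
print "mean $mean, variance $variance, $count subsamples\n";
```

With a subsample size of 4, the nine-word example yields two complete subsamples with type-token ratios of 1 and 0.75, hence a mean of 0.875.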

ACCESSORS, PREDICATES, AND CLEARERS

get_unit_weighting() and set_unit_weighting()

Getter and setter for the unit_weighting attribute.

get_category_weighting() and set_category_weighting()

Getter and setter for the category_weighting attribute.

get_transform() and set_transform()

Getter and setter for the transform attribute.

get_sampling_scheme(), set_sampling_scheme(), has_sampling_scheme(), and clear_sampling_scheme()

Getter, setter, predicate, and clearer for the sampling_scheme attribute.

METHODS

measure()

Compute the variety of text units and return the result in a new Lingua::Diversity::Result object.

If no resampling is applied (which is the default behavior), the result includes only the diversity field (no variance and count).

If resampling is applied, the result includes the average, variance, and number of observations (the latter being the value of the sampling scheme's num_subsamples attribute, cf. Lingua::Diversity::SamplingScheme).

The method requires a reference to a non-empty array of text units (typically words) as argument. Units don't need to be in the text's order (unless you are using a sampling scheme with mode segmental, cf. Lingua::Diversity::SamplingScheme).

The Lingua::Diversity::Utils module contained within the Lingua::Diversity distribution provides tools for helping with the creation of the array of units.

measure_per_category()

Compute the average variety of text units per category and return the result in a new Lingua::Diversity::Result object. For instance, units might be wordforms and categories might be lemmas, so that the result would correspond to the variety of wordforms per lemma (i.e. an estimate of the text's morphological diversity).

If no resampling is applied (which is the default behavior), the result includes only the diversity field (no variance and count).

If resampling is applied, the result includes the average, variance, and number of observations (the latter being the value of the sampling scheme's num_subsamples attribute, cf. Lingua::Diversity::SamplingScheme).

The method requires a reference to a non-empty array of text units and a reference to a non-empty array of categories as arguments. Units and categories don't need to be in the text's order (unless you are using a sampling scheme with mode segmental, cf. Lingua::Diversity::SamplingScheme). They must be in one-to-one correspondence, so the unit and category arrays must contain the same number of items.

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the array of units and categories.

DIAGNOSTICS

Method [measure()/measure_per_category()] must be called with a reference to an array as 1st argument

This exception is raised when either method measure() or method measure_per_category() is called without a reference to an array as a first argument.

Method measure_per_category() must be called with a reference to an array as 2nd argument

This exception is raised when method measure_per_category() is called without a reference to an array as a second argument.

Method [measure()/measure_per_category()] was called with an array containing M item(s) while this measure requires at least N item(s)

This exception is raised when either method measure() or method measure_per_category() is called with an array argument that does not contain enough items. In practice, it may be that the array is empty, or that it contains fewer items than the value of the subsample_size attribute of the sampling scheme, if any (see Lingua::Diversity::SamplingScheme).

DEPENDENCIES

This module is part of the Lingua::Diversity distribution, and extends Lingua::Diversity.

BUGS AND LIMITATIONS

There are no known bugs in this module.

There is a known problem with the 'dugast' transform (see above): if a text (or subsample) has maximal variety (i.e. the number of types is equal to the number of tokens), the denominator of this transform becomes 0, which raises an "Illegal division by zero" error.
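One possible workaround is a custom transform that checks the denominator before dividing. The subroutine below is a hypothetical sketch, not part of the module; it follows the 'dugast' formula given under CREATOR and returns undef when the variety equals the number of tokens.

```perl
use strict;
use warnings;

# Hypothetical guarded variant of the 'dugast' formula: returns undef
# instead of dying when the variety M equals the number of tokens N.
my $safe_dugast = sub {
    my ( $variety, $num_tokens ) = @_;
    my $denominator = log($num_tokens) - log($variety);
    return undef if $denominator == 0;
    return log($num_tokens)**2 / $denominator;
};

# 5 types in 9 tokens: defined value; 9 types in 9 tokens: undef.
for my $case ( [ 5, 9 ], [ 9, 9 ] ) {
    my $value = $safe_dugast->( @{$case} );
    printf "M=%d N=%d: %s\n", @{$case},
        defined $value ? sprintf( '%.3f', $value ) : 'undefined';
}
```

Such a subroutine reference could be supplied as the value of the transform parameter. How an undefined transformed value is handled afterwards (for instance when averaging over subsamples) is not covered by this documentation and would need to be checked.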

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity and Lingua::Diversity::SamplingScheme