NAME

Lingua::Diversity - measuring the diversity of text units

VERSION

This documentation refers to Lingua::Diversity version 0.06.

SYNOPSIS

    use Lingua::Diversity;
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Create a Diversity object (here using method 'Variety')...
    my $diversity = Lingua::Diversity::Variety->new();

    # Given some text, get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'      => \$text,
        'regexp'    => qr{[^a-zA-Z]+},
    );

    # Measure lexical diversity...
    my $result = $diversity->measure( $word_array_ref );
    
    # Display results...
    print "Lexical diversity:       ", $result->get_diversity(), "\n";

    # Tag text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # Get references to an array of wordforms and an array of lemmas...
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text'   => $tagged_text,
        'unit'          => 'original',
        'category'      => 'lemma',
    );

    # Measure morphological diversity...
    $result = $diversity->measure_per_category(
        $wordform_array_ref,
        $lemma_array_ref,
    );

    # Display results...
    print "Morphological diversity: ", $result->get_diversity(), "\n";

DESCRIPTION

This is the base module of distribution Lingua::Diversity, which provides a simple object-oriented interface for applying various measures of diversity to text units. At present, it implements the VOCD algorithm (see Lingua::Diversity::VOCD), the MTLD algorithm (see Lingua::Diversity::MTLD), and many variants of variety (see Lingua::Diversity::Variety).

The documentation of this module is designed as a tutorial exposing the main features of this distribution. For details about measure-specific features, please refer to Lingua::Diversity::VOCD, Lingua::Diversity::MTLD, and Lingua::Diversity::Variety (which itself is related to Lingua::Diversity::SamplingScheme). For details about utility subroutines, please refer to Lingua::Diversity::Utils. Finally, for details about the format in which results are stored, please refer to Lingua::Diversity::Result.

Basics

This module is all about measuring diversity. While it is able to deal with any kind of nominal (as opposed to numeric) data, its design has been guided by a particular interest in text data. In this context, diversity often means lexical diversity, i.e. an estimate of the rate at which new words appear in a text.

It turns out that there are many different ways to measure lexical diversity, quite a few of which are implemented in Lingua::Diversity. In this framework, all measures operate on the same kind of data, namely arrays, e.g.:

    my @data = qw( say you say me );

In general, we will speak of units to refer to the elements of such an array. In this example, they are words, but they might as well be other kinds of linguistic units, such as letters (although in that case we would be measuring a graphemic rather than a lexical sort of diversity).

As shown in the previous example, units in the array need not be unique. In fact, arrays consisting only of unique items are a very special case, the case where diversity is maximal. In most cases, arrays will contain repeated items, and thus have a less than maximal diversity.

In order to measure the diversity of units in a given array, we must first create a Lingua::Diversity object. More precisely, we should create an object from a class derived from Lingua::Diversity, such as Lingua::Diversity::Variety (other options are Lingua::Diversity::MTLD and Lingua::Diversity::VOCD). Since this module (Lingua::Diversity) imports all derived classes, we may simply use it and call the new() method of a derived class as follows:

    use Lingua::Diversity;
    my $diversity = Lingua::Diversity::Variety->new();

This creates a new object for measuring the variety of units in an array. In its most basic form, variety is simply the number of distinct units in the array, i.e. 3 in our example (say, you, and me). Distinct units are also called unit types, while unit tokens refer to the possibly repeated units found in the array (so the number of tokens is the size of the array, i.e. 4 in our example).
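
To make the distinction between types and tokens concrete, here is a minimal sketch in plain Perl that counts both for the example array. It does not use Lingua::Diversity at all and is not meant to reflect how the module is implemented; the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my @data = qw( say you say me );

# Count how many times each distinct unit (type) occurs.
my %count_of;
$count_of{$_}++ for @data;

my $num_tokens = scalar @data;             # size of the array
my $num_types  = scalar keys %count_of;    # number of distinct units

print "Tokens: $num_tokens\n";    # Tokens: 4
print "Types:  $num_types\n";     # Types:  3
```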

With this new object at hand, we may measure the variety of words in our array by calling the measure() method on the object:

    my $result = $diversity->measure( \@data );

This method takes a single argument, namely a reference to an array of units. It is important to note that it uses a reference (\@data) and not the array itself (@data), because this is the way all diversity measures in the distribution operate.

Now the return value of method measure() is a Lingua::Diversity::Result object, and such objects store the measured diversity in a field called diversity which may be accessed like this:

    print $result->get_diversity();

To sum up, here's how to compute and display the variety of units in an array:

    use Lingua::Diversity;

    my @data      = qw( say you say me );

    my $diversity = Lingua::Diversity::Variety->new();
    my $result    = $diversity->measure( \@data );

    print $result->get_diversity();

This will print 3, the number of types in the array. If you're not impressed, hold on, this was just the basics.

Tweaking a diversity measure

All diversity measures in Lingua::Diversity can be parameterized in a number of ways. For instance, rather than plain variety, you might be interested in the so-called type-token ratio, i.e. the ratio of the number of types to the number of tokens. As it happens, Lingua::Diversity::Variety objects have a transform attribute which is set to none by default, but which can be set to type_token_ratio (among others). This can be done either at object creation:

    my $diversity = Lingua::Diversity::Variety->new(
        'transform' => 'type_token_ratio',
    );

or using the set_transform() method on a previously created object:

    $diversity->set_transform( 'type_token_ratio' );

To display the type-token ratio of an array, you may proceed as before:

    my $result = $diversity->measure( \@data );
    print $result->get_diversity();

By the way, if you don't plan to re-use the result, you can also chain method calls like this:

    print $diversity->measure( \@data )->get_diversity();

Both approaches will display the type-token ratio, i.e. 0.75 in our example (3 types divided by 4 tokens).
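
For readers who want to check the arithmetic by hand, the type-token ratio can be sketched in a few lines of plain Perl. The subroutine type_token_ratio() below is a hypothetical helper written for this illustration; it is not part of Lingua::Diversity and says nothing about how the module computes its result.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Diversity): number of distinct
# units divided by the total number of units.
sub type_token_ratio {
    my ($units_ref) = @_;
    die "Empty array\n" if !@{$units_ref};

    my %seen;
    $seen{$_} = 1 for @{$units_ref};

    return scalar( keys %seen ) / scalar @{$units_ref};
}

my @data = qw( say you say me );
my $ttr  = type_token_ratio( \@data );

print "$ttr\n";    # 0.75
```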

To take a more sophisticated example, suppose that you are not merely interested in the type-token ratio, but in the average type-token ratio over segments of N tokens in the array (sometimes called mean segmental type-token ratio). Setting N to 2, and reading from left to right, there are two such segments in our example, namely say you and say me. Each has a type-token ratio of 1 (2 types divided by 2 tokens), so the average is 1. This is what you get with the following piece of code:

    use Lingua::Diversity;

    my @data        = qw( say you say me );

    my $diversity   = Lingua::Diversity::Variety->new(
        'transform'       => 'type_token_ratio',
        'sampling_scheme' => Lingua::Diversity::SamplingScheme->new(
            'mode'           => 'segmental',
            'subsample_size' => 2,
        ),
    );
    my $result      = $diversity->measure( \@data );

    print 'Average type-token ratio: ', $result->get_diversity(), "\n";
    print 'Variance:                 ', $result->get_variance(),  "\n";
    print 'Number of observations:   ', $result->get_count(),     "\n";

As a bonus, you also get the variance of type-token ratio over segments (0 in this case, since both segments have the same type-token ratio) as well as the number of segments over which the average was computed, i.e. 2. This extra information is available because we have specified a sampling scheme at object construction, so that method measure() knows that it must work on a number of subsamples and return a Lingua::Diversity::Result object storing an average (accessed with method get_diversity()), variance (accessed with method get_variance()), and number of observations (accessed with method get_count()).
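
The three figures reported for the segmental example can be reproduced by hand with the following plain-Perl sketch. It rests on assumptions that the text above does not confirm: segments are consecutive and non-overlapping, leftover tokens at the end are ignored, and the variance is the population variance (dividing by the number of segments). The helper type_token_ratio() is hypothetical and not part of the distribution; please refer to Lingua::Diversity::SamplingScheme for the module's actual behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: type-token ratio of a single segment.
sub type_token_ratio {
    my ($units_ref) = @_;
    my %seen;
    $seen{$_} = 1 for @{$units_ref};
    return scalar( keys %seen ) / scalar @{$units_ref};
}

my @data         = qw( say you say me );
my $segment_size = 2;

# Cut the array into consecutive segments of $segment_size tokens
# (assumption: incomplete trailing segments are dropped).
my @ratios;
for ( my $start = 0; $start + $segment_size <= @data; $start += $segment_size ) {
    my @segment = @data[ $start .. $start + $segment_size - 1 ];
    push @ratios, type_token_ratio( \@segment );
}

my $count = scalar @ratios;

# Mean and population variance (assumption) over the segments.
my $mean = 0;
$mean += $_ / $count for @ratios;
my $variance = 0;
$variance += ( $_ - $mean )**2 / $count for @ratios;

print "Average type-token ratio: $mean\n";        # 1
print "Variance:                 $variance\n";    # 0
print "Number of observations:   $count\n";       # 2
```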

This fairly involved example gives an idea of how versatile Lingua::Diversity can be. The reader is invited to refer to Lingua::Diversity::Variety and Lingua::Diversity::SamplingScheme for detailed explanations on how to parameterize a variety measure; other measures have yet other sets of parameters, as documented in Lingua::Diversity::MTLD and Lingua::Diversity::VOCD.

Average diversity per category

Suppose that you do not only have an array of units, but also an array of corresponding categories. For instance, categories might be part-of-speech tags:

    my @units      = qw( say  you     say  me      );
    my @categories = qw( VERB PRONOUN VERB PRONOUN );

Categories can be anything that can be put in one-to-one correspondence with units. Indeed, the only constraint here is that the number of elements be the same in the two arrays, so that you might as well use letters and letter categories:

    my @units      = qw( l e t t e r s );
    my @categories = qw( C V C C V C C );

If this extra bit of information is available, we can estimate the diversity of units per category. What this means exactly depends on the diversity measure being considered. In the case of variety, it would be the average variety (or type-token ratio, etc.) per category. From the example above, it can be seen that there is one unit type in category VERB (namely say) and two unit types in category PRONOUN (namely you and me), so the average variety per category is 1.5. This is what you will obtain by calling method measure_per_category() with references to the two arrays as arguments:

    use Lingua::Diversity;

    my @units      = qw( say  you     say  me      );
    my @categories = qw( VERB PRONOUN VERB PRONOUN );

    my $diversity  = Lingua::Diversity::Variety->new();
    my $result     = $diversity->measure_per_category(
        \@units,
        \@categories,
    );

    print $result->get_diversity();

Furthermore, you may request that the average be weighted according to the relative frequency of categories. Consider the example of letters and letter categories above. Category C has a variety of 4 and a relative frequency of 5/7, while category V has a variety of 1 and a relative frequency of 2/7. Thus, the unweighted average is 2.5, but the weighted average is 3.143, which reflects the greater weight of the category with the highest variety. To compute the weighted variant with Lingua::Diversity::Variety, simply set the category_weighting parameter to true at object creation (or using method category_weighting()):

    my $diversity = Lingua::Diversity::Variety->new(
        'category_weighting' => 1,
    );

Of course, this can be parameterized with the transform or sampling_scheme parameters seen above, or any of the parameters documented in Lingua::Diversity::Variety. Classes Lingua::Diversity::VOCD and Lingua::Diversity::MTLD also support method measure_per_category(), with their own semantics and parameters.
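
The unweighted and weighted averages quoted for the letter example (2.5 and 3.143) can be verified with a short plain-Perl sketch. It only reproduces the arithmetic described above and is not taken from the module's source; all variable names are illustrative.

```perl
use strict;
use warnings;

my @units      = qw( l e t t e r s );
my @categories = qw( C V C C V C C );

# For each category, record its distinct units and its number of tokens.
my ( %types_in, %tokens_in );
for my $i ( 0 .. $#units ) {
    $types_in{ $categories[$i] }{ $units[$i] } = 1;
    $tokens_in{ $categories[$i] }++;
}

my $num_categories = scalar keys %types_in;
my $num_tokens     = scalar @units;

my ( $unweighted, $weighted ) = ( 0, 0 );
for my $category ( keys %types_in ) {
    my $variety = scalar keys %{ $types_in{$category} };

    # Every category counts equally...
    $unweighted += $variety / $num_categories;

    # ...or in proportion to its relative frequency.
    $weighted += $variety * $tokens_in{$category} / $num_tokens;
}

printf "Unweighted average: %.3f\n", $unweighted;    # 2.500
printf "Weighted average:   %.3f\n", $weighted;      # 3.143
```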

Utility subroutines

The Lingua::Diversity distribution includes a couple of utility subroutines intended to facilitate the creation of unit and category arrays. These subroutines are exported by module Lingua::Diversity::Utils.

Subroutine split_text() splits a text based on a regular expression describing delimiter sequences (just like the built-in split() function), removes empty elements (if any), and returns a reference to the resulting array, which can then be used as the argument of a call to method measure():

    use Lingua::Diversity;
    use Lingua::Diversity::Utils qw( split_text );
    
    my $text           = 'of the people, by the people, for the people';
    my $word_array_ref = split_text(
        'text'      => \$text,
        'regexp'    => qr{[^a-zA-Z]+},
    );

    my $diversity      = Lingua::Diversity::Variety->new();
    my $result         = $diversity->measure( $word_array_ref );
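
For comparison, the behaviour described for split_text() (split on delimiter sequences, then drop empty elements) can be approximated with Perl built-ins alone. This is only a sketch of that description, not the subroutine's actual implementation; see Lingua::Diversity::Utils for the authoritative details.

```perl
use strict;
use warnings;

my $text = 'of the people, by the people, for the people';

# Split on runs of non-letters, then remove empty strings (which split()
# may produce, for instance when the text starts with a delimiter).
my @words = grep { length $_ } split( /[^a-zA-Z]+/, $text );

# Count distinct words (types) among the resulting tokens.
my %seen;
$seen{$_} = 1 for @words;

print scalar(@words),     " tokens\n";    # 9 tokens
print scalar( keys %seen ), " types\n";   # 5 types
```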

This module also exports a subroutine (split_tagged_text()) to build both a unit array and a category array on the basis of the output of the Lingua::TreeTagger module; see "SYNOPSIS" for an example and Lingua::Diversity::Utils for detailed explanations.

METHODS

measure()

Apply the selected diversity measure and return the result in a new Lingua::Diversity::Result object.

The method requires a reference to a non-empty array of text units (typically words) as argument.

Some measures, in particular Lingua::Diversity::MTLD (as well as Lingua::Diversity::Variety under a segmental sampling scheme), take the order of units into account. Specific measures may set conditions on the minimal or maximal number of units and raise exceptions when these conditions are not met.

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the array of units.

measure_per_category()

Apply the selected diversity measure per category and return the result in a new Lingua::Diversity::Result object. For instance, units might be wordforms and categories might be lemmas, so that the result would correspond to the diversity of wordforms per lemma (i.e. an estimate of the text's morphological diversity).

The method requires two arguments: a reference to an array of units and a reference to an array of corresponding categories. There should always be the same number of items in both arrays.

Some measures, in particular Lingua::Diversity::MTLD (as well as Lingua::Diversity::Variety under a segmental sampling scheme), take the order of units into account. Specific measures may set conditions on the minimal or maximal number of units and raise exceptions when these conditions are not met.

The Lingua::Diversity::Utils module contained within this distribution provides tools for helping with the creation of the arrays of units and categories.

DIAGNOSTICS

Call to abstract method CLASS::_measure()

This exception is raised when either method measure() or method measure_per_category() is called while internal method _measure() is not implemented in a class derived from Lingua::Diversity.

Method [measure()/measure_per_category()] must be called with a reference to an array as 1st argument

This exception is raised when either method measure() or method measure_per_category() is called without a reference to an array as a first argument.

Method measure_per_category() must be called with a reference to an array as 2nd argument

This exception is raised when method measure_per_category() is called without a reference to an array as a second argument.

Method [measure()/measure_per_category()] was called with an array containing N item(s) while this measure requires [at least/at most] M item(s)

This exception is raised when either method measure() or method measure_per_category() is called with an argument array that is either too small or too large relative to conditions set by the selected measure.

CONFIGURATION AND ENVIRONMENT

Some subroutines in module Lingua::Diversity::Utils require a working version of TreeTagger (available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger).

DEPENDENCIES

This is the base module of the Lingua::Diversity distribution, which comprises modules Lingua::Diversity::Result, Lingua::Diversity::SamplingScheme, Lingua::Diversity::Internals, Lingua::Diversity::Utils, Lingua::Diversity::Variety, Lingua::Diversity::MTLD, Lingua::Diversity::VOCD, and Lingua::Diversity::X.

The Lingua::Diversity distribution uses CPAN modules Moose, Exception::Class, and optionally Lingua::TreeTagger.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity::Result, Lingua::Diversity::SamplingScheme, Lingua::Diversity::Internals, Lingua::Diversity::Utils, Lingua::Diversity::Variety, Lingua::Diversity::MTLD, Lingua::Diversity::VOCD, Lingua::Diversity::X, and Lingua::TreeTagger.