
NAME

Cluster::Similarity - compute the similarity of two classifications.

VERSION

Version 0.02

SYNOPSIS

Compute similarity of two classifications following various cluster similarity evaluation schemes based on contingency tables.

    use Cluster::Similarity;


    my $sim_calculator = Cluster::Similarity->new( $classification_1, $classification_2 );


    my $pair_wise_recall = $sim_calculator->pair_wise_recall();
    my $pair_wise_precision = $sim_calculator->pair_wise_precision();
    my $pair_wise_f_score = $sim_calculator->pair_wise_fscore();

    my $mutual_information = $sim_calculator->mutual_information();
    
    my $rand_index = $sim_calculator->rand_index();

    my $rand_adj = $sim_calculator->rand_adjusted($max_index);
    
    my $matching = $sim_calculator->matching_index();


    my $contingency_table = $sim_calculator->contingency();
    
    my $pairs_matrix = $sim_calculator->pairs_matrix();

    my $pair_of_cell_12 = $sim_calculator->pairs(1,2);

DESCRIPTION

Computes the similarity of two word clusterings using several clustering similarity measures.

Consider, for example, the following groupings:

    clustering_1: { {a, b, c}, {d, e, f} }
    clustering_2: { {a, b}, {c, d, e}, {f} }

Cluster similarity measures provide a numerical value helping to assess the alikeness of two such groupings.

All cluster similarity measures implemented in this module are based on the so-called contingency table of the two classifications (clusterings). The contingency table is a matrix with a cell for each pair of classes (one from each classification), containing the number of objects present in both classes.

The similarity measures (and also the examples and tests) are taken from Chapter 4 of Sabine Schulte im Walde's PhD thesis:

Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2003. Published as AIMS Report 9(2) http://www.schulteimwalde.de/phd-thesis.html

Please see the thesis for a more in-depth description of the similarity measures and further details.

INTERFACE

Constructor

new()

Builds a new Cluster::Similarity object.

FUNCTIONS

Providing the Data

load_data(\@classification_1, \@classification_2)
load_data(\%classification_1, \%classification_2)
set_classification_1(\@classification_1), set_classification_2(\@classification_2)
set_classification_1(\%classification_1), set_classification_2(\%classification_2)

When calling these methods, the contingency tables and all previously computed similarity values are reset.

objects, object_number

Return the objects (or the number of objects) occurring in either classification.

contingency

Compute the contingency table for two classifications. The contingency table is a matrix with a cell for each pair of classes (one class from each classification). Each cell contains the number of objects present in both classes.

For example, for the classifications

  •  { {a, b, c}, {d, e, f} }
  •  { {a, b}, {c, d, e}, {f} }

the returned contingency table is:

 {
   'c_1' => {
             'c_1' => 2,
             'c_2' => 0
            },
   'c_2' => {
             'c_1' => 1,
             'c_2' => 2
            },
   'c_3' => {
             'c_1' => 0,
             'c_2' => 1
            }
 }

This is a hash representation of the matrix:

      2  0
      1  2
      0  1

with the columns indexed by the classes of the first classification and the rows by the classes of the second classification.
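
The construction of the table can be sketched in a few lines of plain Perl. This is only an illustration of the definition above, not the module's implementation: the helper name contingency_table and the array-of-arrays result are assumptions made for the example (the module itself returns a hash of hashes).

```perl
use strict;
use warnings;

# The two example classifications, each given as a list of classes.
my @classification_1 = ( [qw(a b c)], [qw(d e f)] );
my @classification_2 = ( [qw(a b)], [qw(c d e)], [qw(f)] );

# Hypothetical helper: one row per class of $rows, one column per
# class of $cols, each cell counting the objects the two classes share.
sub contingency_table {
    my ( $rows, $cols ) = @_;
    my @table;
    for my $row_class (@$rows) {
        my %in_row = map { $_ => 1 } @$row_class;
        my @row;
        for my $col_class (@$cols) {
            my $common = grep { $in_row{$_} } @$col_class;    # count in scalar context
            push @row, $common;
        }
        push @table, \@row;
    }
    return \@table;
}

# Rows follow classification 2, columns follow classification 1.
my $table = contingency_table( \@classification_2, \@classification_1 );
print join( ' ', @$_ ), "\n" for @$table;
# prints:
# 2 0
# 1 2
# 0 1
```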

pairs_contingency

Compute the contingency table for the number of common element pairs in the two classifications. A cell holding n common objects contributes n*(n-1)/2 common pairs.

For the example above this would be:

   1 0
   0 1
   0 0

true_positives

True positives are the number of object pairs which occur together in both classifications.
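
As a worked illustration for the example contingency table: objects that share a cell are together in both classifications, so the true positives can be obtained by summing n*(n-1)/2 over all cells. The sketch below is independent of the module, and pairs_of is a hypothetical helper name.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Contingency table of the running example (rows: classification 2).
my @table = ( [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] );

# Hypothetical helper: number of unordered pairs among $n objects.
sub pairs_of { my ($n) = @_; return $n * ( $n - 1 ) / 2 }

# Flatten the table and sum the per-cell pair counts.
my $true_positives = sum( map { pairs_of($_) } map { @$_ } @table );
print "$true_positives\n";    # 2: the pairs (a, b) and (d, e)
```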

pairs_classification_1, pairs_classification_2

Number of object pairs that occur together in a class of classification 1 (classification 2, respectively).

pair_wise_precision, pair_wise_recall, pair_wise_fscore

Pair-wise recall is the number of true positives divided by the number of pairs in classification 1

Pair-wise precision is the number of true positives divided by the number of pairs in classification 2

Pair-wise F-score is the harmonic mean of precision and recall, i.e. 2*precision*recall / (precision + recall)
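
For the example classifications, the three values can be worked out by hand. The sketch below follows the definitions given here and does not call the module. The class sizes and the true-positive count are taken from the running example, and pairs_of is a hypothetical helper.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: number of unordered pairs among $n objects.
sub pairs_of { my ($n) = @_; return $n * ( $n - 1 ) / 2 }

my @sizes_1        = ( 3, 3 );       # class sizes in classification 1
my @sizes_2        = ( 2, 3, 1 );    # class sizes in classification 2
my $true_positives = 2;              # pairs (a, b) and (d, e)

my $pairs_1 = sum( map { pairs_of($_) } @sizes_1 );    # 3 + 3 = 6
my $pairs_2 = sum( map { pairs_of($_) } @sizes_2 );    # 1 + 3 + 0 = 4

my $recall    = $true_positives / $pairs_1;
my $precision = $true_positives / $pairs_2;
my $f_score   = 2 * $precision * $recall / ( $precision + $recall );

printf "recall=%.3f precision=%.3f f=%.3f\n", $recall, $precision, $f_score;
# prints: recall=0.333 precision=0.500 f=0.400
```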

mutual_information

Mutual information is a symmetric measure for the degree of dependency between two classifications, used here as introduced by Strehl et al. (2000).
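
To give a feeling for the quantity, the sketch below computes the plain textbook mutual information (in nats) of a contingency table. This is an assumption made for illustration: Strehl et al. use a normalized variant, so the value returned by the module may differ from the one computed here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Contingency table of the running example (rows: classification 2).
my @table = ( [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] );

# Textbook (unnormalized) mutual information; not the module's code.
sub plain_mutual_information {
    my ($table) = @_;
    my @row_sums = map { sum(@$_) } @$table;
    my @col_sums;
    for my $row (@$table) {
        $col_sums[$_] += $row->[$_] for 0 .. $#$row;
    }
    my $n  = sum(@row_sums);
    my $mi = 0;
    for my $i ( 0 .. $#$table ) {
        for my $j ( 0 .. $#{ $table->[$i] } ) {
            my $t = $table->[$i][$j] or next;    # empty cells contribute 0
            $mi += $t / $n * log( $t * $n / ( $row_sums[$i] * $col_sums[$j] ) );
        }
    }
    return $mi;
}

my $mi = plain_mutual_information( \@table );
printf "%.3f\n", $mi;    # about 0.375 nats for the example
```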

rand_index

The Rand index (defined by Rand, 1971) is based on the agreement vs. disagreement between object pairs in clusterings.
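
A minimal sketch of the standard definition, assuming the index is the fraction of object pairs on which the two classifications agree (joined in both or separated in both). It enumerates all pairs of the running example directly and does not use the module. The label hashes are an illustrative encoding, not the module's input format.

```perl
use strict;
use warnings;

# Class label of each object under the two example classifications.
my %class_1 = ( a => 1, b => 1, c => 1, d => 2, e => 2, f => 2 );
my %class_2 = ( a => 1, b => 1, c => 2, d => 2, e => 2, f => 3 );

my @objects = sort keys %class_1;
my ( $agreements, $pairs ) = ( 0, 0 );

for my $i ( 0 .. $#objects - 1 ) {
    for my $j ( $i + 1 .. $#objects ) {
        my ( $x, $y ) = @objects[ $i, $j ];
        my $together_1 = $class_1{$x} == $class_1{$y};
        my $together_2 = $class_2{$x} == $class_2{$y};

        # A pair disagrees if exactly one classification joins it.
        $agreements++ unless ( $together_1 xor $together_2 );
        $pairs++;
    }
}

my $rand_index = $agreements / $pairs;
print "$agreements of $pairs pairs agree: $rand_index\n";
# prints: 9 of 15 pairs agree: 0.6
```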

rand_adjusted

Rand index adjusted for chance (Hubert and Arabie, 1985). The adopted model for randomness assumes that the two classifications are picked at random, given the original number of classes and objects, so that the contingency table is constructed from the hyper-geometric distribution. The general form of an index corrected for chance is:

  Index_adj = (Index - Expected Index) / (Maximum Index - Expected Index)

As the maximum index I use the smaller of the two numbers of possible pairs, i.e. the minimum of the pair counts of the two classifications.
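
The formula can be traced on the running example. The sketch below rests on two assumptions that the text does not spell out: that Index is the number of pairs occurring together in both classifications, and that the expected index is the product of the two pair counts divided by the total number of pairs, as in Hubert and Arabie. It is not taken from the module's source.

```perl
use strict;
use warnings;
use List::Util qw(min);

my $n       = 6;    # objects in the example
my $pairs_1 = 6;    # pairs within the classes of classification 1
my $pairs_2 = 4;    # pairs within the classes of classification 2
my $index   = 2;    # pairs together in both classifications (assumed Index)

my $all_pairs = $n * ( $n - 1 ) / 2;                # 15
my $expected  = $pairs_1 * $pairs_2 / $all_pairs;   # 1.6 (assumed Expected Index)
my $maximum   = min( $pairs_1, $pairs_2 );          # 4, the maximum index as described

my $rand_adjusted = ( $index - $expected ) / ( $maximum - $expected );
printf "%.3f\n", $rand_adjusted;    # prints: 0.167
```

Hubert and Arabie take the mean of the two pair counts as the maximum index, which would give a smaller value for the same example.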

matching_index

Matching index (Fowlkes and Mallows, 1983).
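
A short worked example, assuming the usual Fowlkes and Mallows definition: the number of pairs common to both classifications divided by the geometric mean of the two pair counts. The numbers come from the running example, and the module is not called.

```perl
use strict;
use warnings;

my $true_positives = 2;    # pairs together in both classifications
my $pairs_1        = 6;    # pairs in classification 1
my $pairs_2        = 4;    # pairs in classification 2

# Geometric mean of the pair counts in the denominator.
my $matching_index = $true_positives / sqrt( $pairs_1 * $pairs_2 );
printf "%.3f\n", $matching_index;    # prints: 0.408
```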

DIAGNOSTICS

<Need reference to classification>

When a "Providing the data" method is called without enough arguments.

<Classifications must be passed as array or hash references>

Argument of wrong type.

<Please set/load classifications before calling ... method>

The method was called without providing classification data first. Call one of the "Providing the data" methods beforehand.

<Need data for classification 1/2>

Data for classification 1 (2 resp.) is missing.

CONFIGURATION AND ENVIRONMENT

Cluster::Similarity requires no configuration files or environment variables.

DEPENDENCIES

Carp
Class::Std
List::Util qw(sum min)
Math::Combinatorics

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-cluster-similarity@rt.cpan.org, or through the web interface at http://rt.cpan.org.

TO DO

  • find more suitable return values for when a given similarity measure is not applicable.

  • for the Rand adjusted measure make the maximum index configurable.

AUTHOR

Ingrid Falk, <ingrid dot falk at loria dot fr>

BUGS

Please report any bugs or feature requests to bug-cluster-similarity at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Cluster-Similarity. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Cluster::Similarity


SEE ALSO

  • For the description of the implemented clustering similarity measures:

    Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2003. Published as AIMS Report 9(2), http://www.schulteimwalde.de/phd-thesis.html

  • For building clusterings or classifications:

    Algorithm::Cluster

    a Perl interface to the C Clustering Library.

    Text::SenseClusters

    Clusters similar contexts using co-occurrence matrices and Latent Semantic Analysis.

COPYRIGHT & LICENSE

Copyright 2008 Ingrid Falk, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
