Clusterize - clustering text documents.
Version 0.02
use Clusterize;

my %pairs = (
    key1 => [ string1, string2, ...stringN ],
    key2 => [ string5, string6, ...stringM ],
    ...
    keyN => [ ... ],
);

my $clusterize = Clusterize->new();

# Register every document (key => array of strings) with the cluster set.
while (my @pair = each %pairs) {
    $clusterize->add_pair(@pair);
}

# Print a summary line for each cluster, followed by its pairs.
foreach my $c ( $clusterize->list ) {
    printf "# /%s/ (digest=%s) (accuracy=%.3f) (size=%d)\n",
        $c->pattern, $c->digest, $c->accuracy, $c->size;
    my $pairs = $c->pairs;
    for ( keys %{$pairs} ) {
        print $_." ".$pairs->{$_}."\n";
    }
}
The Clusterize module implements a specific algorithm for clustering text documents.
new() is the constructor. It takes no parameters.
add_pair() adds a new document to the cluster set:
$clusterize->add_pair($key, [$string1, $string2, ...]);
$key is the unique name of the document (e.g. a filename); [$string1, $string2, ...] is the text of the document.
remove_pair() removes a document from the cluster set:
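As a minimal sketch of preparing the arguments for add_pair from a file on disk, the helper below reads a file into an array reference. The name read_lines is hypothetical and not part of Clusterize, and the sketch assumes that the text of a document is passed as one array element per line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clusterize): read a file and return
# an array reference holding one chomped line per element.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# Assumed usage, with the file name serving as the unique document key:
# $clusterize->add_pair($path, read_lines($path));
```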
$clusterize->remove_pair($key);
$key is the name of the document (e.g. a filename).
list() returns the list of built clusters:
my @clusters = $clusterize->list();
Returns a list of Clusterize::Pattern objects with the following attributes:
$c->pattern - a regexp that matches all strings in the given cluster;
$c->accuracy - a value from 0 to 1 that reflects how similar the strings in the cluster are;
$c->size - the number of documents in the cluster;
$c->digest - an MD5 digest of the cluster, used to identify duplicate clusters;
$c->pairs - a list of { key => $key1, val => $val1 } hash pairs, where key is the name of a document and val is a string from that document.
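Since the digest identifies duplicate clusters, one way to use it is to filter the result of list() so that each digest appears only once. The following is a sketch under that assumption; unique_clusters is a hypothetical helper, not a Clusterize method, and it relies only on the objects providing a digest accessor.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clusterize): keep the first cluster
# seen for each digest and drop any later cluster with the same digest.
sub unique_clusters {
    my @clusters = @_;
    my %seen;
    return grep { !$seen{ $_->digest }++ } @clusters;
}

# Assumed usage:
# my @unique = unique_clusters($clusterize->list);
# printf "/%s/ (size=%d)\n", $_->pattern, $_->size for @unique;
```

The order of the input list is preserved, so the first cluster carrying a given digest is the one that is kept.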
Slava Moiseev, <slava.moiseev@yahoo.com>