NAME
Clusterize - clustering text documents.
VERSION
Version 0.02
SYNOPSIS
use Clusterize;

my %pairs = (
    key1 => [ string1, string2, ...stringN ],
    key2 => [ string5, string6, ...stringM ],
    ...
    keyN => [ ... ],
);

my $clusterize = Clusterize->new();
while ( my @pair = each %pairs ) { $clusterize->add_pair(@pair) }

foreach my $c ( $clusterize->list ) {
    printf "# /%s/ (digest=%s) (accuracy=%.3f) (size=%d)\n",
        $c->pattern, $c->digest, $c->accuracy, $c->size;
    my $pairs = $c->pairs;
    for ( keys %{$pairs} ) { print $_ . " " . $pairs->{$_} . "\n" }
}
DESCRIPTION
The Clusterize module implements a specific algorithm for clustering text documents.
PUBLIC METHODS
new
This is the constructor. No parameters are required.
add_pair
This method adds a new document to the cluster set:
$clusterize->add_pair($key, [$string1, $string2, ...]);
$key is the unique name of the document (e.g. a filename); [$string1, $string2, ...] is the text of the document.
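The following is a minimal sketch of how a ($key, [strings]) pair could be prepared from raw document text. The helper name document_pair is hypothetical and not part of Clusterize. Treating each non-empty line as one string is an assumption, since the module does not state how a document should be split.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clusterize): turn a document's raw text
# into the ($key, \@strings) pair that add_pair() takes.
# Assumption: each non-empty line of the text becomes one string.
sub document_pair {
    my ($key, $text) = @_;
    my @strings = grep { length } split /\r?\n/, $text;
    return ($key, \@strings);
}

my ($key, $strings) = document_pair('report.txt', "first line\n\nsecond line\n");
print "$key: ", scalar(@$strings), " strings\n";

# With the module installed, the pair would then be added like this:
#   $clusterize->add_pair($key, $strings);
```

The add_pair call is left as a comment so that the sketch runs without the module installed.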
remove_pair
This method removes a document from the cluster set:
$clusterize->remove_pair($key);
$key is the name of the document (e.g. a filename).
list
This method returns the list of clusters that have been built:
my @clusters = $clusterize->list();
Returns a list of Clusterize::Pattern objects with the following attributes:
$c->pattern - a regexp that matches all strings in the given cluster;
$c->accuracy - a value from 0 to 1 that reflects how similar the strings in the cluster are;
$c->size - the number of documents in the cluster;
$c->digest - an MD5 digest of the cluster, used to identify duplicate clusters;
$c->pairs - a list of { key => $key1, val => $val1 } hash pairs, where key is the name of a document and val is a string from that document.
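The attributes above can be combined to post-process the result of list(). The sketch below drops clusters whose digest has already been seen, keeps only those at or above an accuracy threshold, and orders them by size. The helper select_clusters and the stand-in class Demo::Pattern are hypothetical. They exist only so the example runs without the module, and the digest values are placeholders rather than real MD5 digests.

```perl
use strict;
use warnings;

# Stand-in for Clusterize::Pattern so the sketch runs without the module.
# Only the accessors used below are provided.
{
    package Demo::Pattern;
    sub new      { my ($class, %args) = @_; return bless {%args}, $class }
    sub pattern  { $_[0]{pattern} }
    sub accuracy { $_[0]{accuracy} }
    sub size     { $_[0]{size} }
    sub digest   { $_[0]{digest} }
}

# Hypothetical helper: drop clusters whose digest was already seen, keep
# those at or above $min_accuracy, and return them largest first.
sub select_clusters {
    my ($min_accuracy, @clusters) = @_;
    my %seen;
    return sort { $b->size <=> $a->size }
           grep { $_->accuracy >= $min_accuracy }
           grep { !$seen{ $_->digest }++ } @clusters;
}

# Made-up sample data; with the module installed this would be
# my @clusters = $clusterize->list();
my @clusters = (
    Demo::Pattern->new(pattern => 'error \d+', accuracy => 0.90, size => 3, digest => 'aaa'),
    Demo::Pattern->new(pattern => 'error \d+', accuracy => 0.90, size => 3, digest => 'aaa'),
    Demo::Pattern->new(pattern => 'warn .*',   accuracy => 0.40, size => 5, digest => 'bbb'),
    Demo::Pattern->new(pattern => 'login \w+', accuracy => 0.75, size => 7, digest => 'ccc'),
);

my @kept = select_clusters(0.5, @clusters);
printf "/%s/ size=%d accuracy=%.2f\n", $_->pattern, $_->size, $_->accuracy for @kept;
```

The 0.5 threshold is arbitrary; a suitable value depends on how similar the strings in a cluster need to be for the task at hand.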
AUTHOR
Slava Moiseev, <slava.moiseev@yahoo.com>