NAME
Algorithm::ClusterPoints - find clusters inside a set of points
SYNOPSIS
use Algorithm::ClusterPoints;
my $clp = Algorithm::ClusterPoints->new( dimension => 3,
radius => 1.0,
minimum_size => 2,
ordered => 1 );
for my $p (@points) {
$clp->add_point($p->{x}, $p->{y}, $p->{z});
}
my @clusters = $clp->clusters_ix;
for my $i (0..$#clusters) {
print( join( ' ',
"cluster $i:",
map {
my ($x, $y, $z) = $clp->point_coords($_);
"($_: $x, $y, $z)"
} @{$clusters[$i]}
), "\n"
);
}
DESCRIPTION
This module implements an algorithm to find clusters of points inside a set.
Clusters are defined as sets of points where it is possible to stablish a way between any pair of points moving from point to point inside the cluster in steps smaller than a given radius.
Points can have any dimension from one to infinitum, though the algorithm performance degrades quickly as the dimension increases (it has O((2*D)^D) complexity).
The algorithm input parameters are:
- $dimension
-
Dimension of the problem space. For instance, for finding clusters on a geometric plane, dimension will be 2.
- $radius
-
A point is part of a cluster when there is at least another point from the cluster that is at a distance smaller than $radius from it.
- $minimum_size
-
Minimum_number of points required to form a cluster, the default is one.
- @points
-
The coordinates of the points
- $ordered
-
Order the points inside the clusters by their indexes and also order the clusters by the index of the contained points.
Ordering the output data is optional because it can be an computational expensive operation.
API
This module has an object oriented interface with the following methods:
- Algorithm::ClusterPoints->new(%args)
-
returns a new object.
The accepted arguments are:
- dimension => $dimension
-
number of dimensions of the points (defaul is 2).
- radius => $radius
-
maximum aceptable distance between two points to form a cluster (default is 1.0).
- minimum_size => $minimum_size
-
minimun cluster size (default is 1).
- ordered => $ordered
-
sort the returned data structures (default is false).
- scales => [$x_scale, $y_scale, ...]
-
point coordinates are scaled by the coefficients given.
- dimensional_groups => \@dimension_groups
-
See the "Using hypercylindrical distances" chapter below.
- $clp->add_point($x, $y, $z, ...)
- $clp->add_points($x0, $y0, $z0..., $x1, $y1, $z1..., ...);
-
These methods register points into the algorithm.
They return the index of the (first) point added.
- $clp->radius
- $clp->radius($radius)
- $clp->minimum_size
- $clp->minimum_size($minimum_size)
- $clp->ordered
- $clp->ordered($ordered)
-
These methods get or set the algorithm parameters.
- @scales = $clp->scales;
- @scales = $clp->scales($xs, $ys, $zs, ...);
-
gets/sets the scales for all the dimensions.
- @coords = $clp->point_coords($index)
-
returns the coordinates of the point at index
$index
. - @clusters_ix = $clp->clusters_ix
-
returns a list of clusters defined by the indexes of the points inside
The data estructure returned is a list of arrays. Every array represents a cluster and contains the indexes of the points inside.
For instance:
@clusters_ix = ( [ 0, 1, 5, 10, 13, 15, 17, 31, 32, 38 ], [ 2, 12, 20, 26, 27, 29, 33 ], [ 3, 22, 39 ], [ 4, 11, 16, 30, 36 ], [ 6, 14 ], [ 7, 23, 24 ], [ 18, 25 ], [ 21, 40 ] );
You can get back the coordinates of the points using the method
point_coords
, as for instance:for my $c (@clusters_ix) { for my $index (@$c) { my ($x, $y, $z) = $clp->point_coords($index); ...
Or you can use the method
clusters
described below that already returns point coordinates. - @clusters = $clp->clusters
-
returns a list of clusters defined by the coordinates of the points inside.
This method is similar to
clusters_ix
but instead of the point indexes, it includes the point coordinates inside the cluster arrays.This is a sample of the returned structure:
@clusters = ( [ 0.49, 0.32, 0.55, 0.32, 0.66, 0.33 ], [ 0.95, 0.20, 0.83, 0.27, 0.90, 0.20 ], [ 0.09, 0.09, 0.01, 0.08, 0.12, 0.15 ], [ 0.72, 0.42, 0.67, 0.47 ], [ 0.83, 0.11, 0.77, 0.13, 0.73, 0.07 ], [ 0.37, 0.38, 0.36, 0.44 ], [ 0.16, 0.79, 0.14, 0.74 ] );
Note that the coordinate values for all the dimensions are interleaved inside the arrays.
Using hypercylindrical distances
By default distances between points are meassured as euclidean distances. That means that two points A and B form a cluster when B is inside the hypersphere of radius $radius and center A. We will call this hypersphere the clustering limit surface for point A.
Sometimes, specially when the dimensions represent unrelated entities, it is desirable to use hypercylinders as the clustering limit surfaces.
For instance, suppose we have a set of three dimensional points ($x, $y, $t) where the first two dimensions represent coordinates over a geometrical plane and the third coordinate represents time.
It doesn't make sense to mix space and time to calculate a unique distance, and so to have a spherical clustering limit surface. What we need is to set independent limits for geometrical and temporal dimensions, for instance $geo_distance < $geo_radius
and $temp_distance < $temp_radius
and these pair of constraints define a cylinder on our three-dimensional problem space.
In the general multidimensional case, instead of cylinders, we talk about hypercylinders but the logic behind is the same, dimensions are divided in several groups (d-groups) following some problem defined relation and two points form a cluster when all the subdistances are smaller than the radius (where subdistance is the euclidean distance considering only the dimensions in a d-group). Note that every d-group defines a hypercylinder base.
The method that allows to define the hypercylindrical shape is as follows:
- $clp->dimensional_groups(\@group0, \@group1, ...)
-
where @group0, @group1, ... are lists of dimension indexes.
For instance, for a three dimensional problem with dimensions X, Y and T (in that order), to form a group with the dimensions X and Y and another with the dimension T, the following call has to be used:
$clp->dimensional_groups([0, 1], [2]);
The dimensional groups can also be set when the constructor is called:
my $clp = Algoritm::ClusterPoints->new(
dimensional_groups => [[0, 1], [2]],
...);
Usually, when using dimensional groups, you would also want to use the scales
method to set different scales for every dimension group.
Following with the previous example, supposing X and Y are given in meters and T in seconds, to find the clusters with radius between points of 1Km and 2 days, the following scales should be used:
my $spc_scl = 1/1000;
my $tmp_scl = 1/(2 * 24 * 60 * 60);
$clp = Algorithm::ClusterPoints->new(
dimensional_groups => [[0, 1], [2]],
scales => [$spc_scl, $spc_scl, $tmp_scl],
...);
SEE ALSO
All began on this PerlMonks discussion: http://perlmonks.org/?node_id=694892.
Algorithm::Cluster is a Perl wrapper for the C Clustering Library.
COPYRIGHT AND LICENSE
Copyright (C) 2008 by Salvador Fandiño (sfandino@yahoo.com)
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.