The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Algorithm::DBSCAN - (ALFA code) Perl implementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm

SYNOPSIS

This module can be used to find clusters of points in a multidimensional space. More information can be found on Wikipedia: DBSCAN

The simple usage:

    use Algorithm::DBSCAN;
    
    my $points_data_file =     
        'point_1 56.514307478581514 37.146118456702034
        point_2 34.02049221667614 46.024651786417536
        point_3 23.473087508078684 60.62328221968349
        point_4 10.418513808840482 24.59808378533684
        point_5 10.583414831970764 25.902459835735534
        point_6 9.756855426925464 24.062840099892146
        point_7 10.567067873860672 22.32511341184489
        point_8 11.070046359352189 25.91278382647844
        point_9 9.537780590838175 25.000630928726288
        point_10 10.507367338512058 27.637356924097915
        point_11 11.949089580614444 30.67843911922257
        point_12 10.373548645248105 25.699863108892945
        point_13 47.061169019689615 12.482585189174058
        point_14 47.00269836645959 12.04880276389404
        point_15 47.197663384856476 12.899232975457025
        point_16 44.3719178488551 15.41709269630616
        point_17 46.31921200316786 12.556849509965417
        point_18 44.128763621333135 14.657970021594974
        point_19 48.89953587475758 15.183892607591467
        point_20 52.15333345222132 16.354597634497154
        point_21 50.03978361242539 14.85901473647285';

    my $dataset = Algorithm::DBSCAN::Dataset->new();
    my @lines = split(/\n\s+/, $points_data_file);
    foreach my $line (@lines) {
        $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
    }

    my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

    $dbscan->FindClusters();
    $dbscan->PrintClustersShort();
    

If you have huge datasets and want to use multiple CPUs in a optimal way you can build the region index with an external tool (will soon be available). En axample of code that uses a region index would be as follow.

Given the dataset:

    point_1 56 37
    point_2 34 46
    point_3 23 60
    point_4 10 24
    point_5 10 25
    point_6 9 24
    point_7 10 22
    point_8 11 25
    point_9 9 25
    point_10 10 27
    point_11 11 30
    point_12 10 25
    point_13 47 12
    point_14 47 12
    point_15 47 12
    point_16 44 15
    point_17 46 12
    point_18 44 14
    point_19 48 15
    point_20 52 16
    point_21 50 14

The region index with $eps = 4 x 4 and $min_distance = 2 would look like this:

    0 0
    1 1
    2 2
    3 3 4 5 6 7 8 9 11
    4 3 4 5 6 7 8 9 11
    5 3 4 5 6 7 8 9 11
    10 9 10
    12 12 13 14 16 17 18 20
    11 3 4 5 6 7 8 9 11
    13 12 13 14 16 17 18 20
    14 12 13 14 16 17 18 20
    15 15 16 17
    16 12 13 14 15 16 17 18
    18 12 13 14 16 18 20
    7 3 4 5 6 7 8 9 11
    20 12 13 14 18 19 20
    6 3 4 5 6 7 8 11
    17 12 13 14 15 16 17
    19 19 20
    8 3 4 5 6 7 8 9 11
    9 3 4 5 7 8 9 10 11

To use this index you can use the following code:

    use Algorithm::DBSCAN;
    
    my $points_data_file =     
        'point_1 56 37
        point_2 34 46
        point_3 23 60
        point_4 10 24
        point_5 10 25
        point_6 9 24
        point_7 10 22
        point_8 11 25
        point_9 9 25
        point_10 10 27
        point_11 11 30
        point_12 10 25
        point_13 47 12
        point_14 47 12
        point_15 47 12
        point_16 44 15
        point_17 46 12
        point_18 44 14
        point_19 48 15
        point_20 52 16
        point_21 50 14';

    my $dataset = Algorithm::DBSCAN::Dataset->new();
    my @lines = split(/\n\s+/, $points_data_file);
    foreach my $line (@lines) {
        $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
    }

    my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

    $dbscan->UseRegionIndex(the filename containg the previous region index);
    $dbscan->FindClusters();
    $dbscan->PrintClustersShort();
  

SUBROUTINES/METHODS

new

The constructor takes 3 parameters:

    $dataset: The Algorithm::DBSCAN::Dataset dataset object
        
        Create the Dataset object:
            my $dataset = Algorithm::DBSCAN::Dataset->new();
        
        Add points (the first parameter is a point_id the other are point coordinates)
            $dataset->AddPoint(new Algorithm::DBSCAN::Point('point_1', 1, 2, 3, 4, 5);
            
    $eps: The epsilon parameter used for region density computation
        WARNING: This implementation uses the sqare distance between the points to avoid 
        a useless square root call. If you want to use the euclidian distance you need to 
        convert it to the right value yourself.
        
        For example for the previous point with 5 dimensions 
        $eps = $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance; 
        
    $min_points: the minimal number of points in a region with a radius of $eps. $eps 
    and $min_points are the 2 parameters used to compute the denisty of a region. If 
    the number of points in a region with radius $eps is lower than $min_points the 
    point is considered as an outlier point that can't be included in any cluster.

FindClusters

The main method that will run the DBSCAN algorithm on the Dataset.

ExpandCluster

This method will expand the cluster starting by the neighborhood of point $point

GetRegion

Find all points in the dataset that are in the neighborhood of $point

UseRegionIndex

For huge datasets a region index can be generated separately (and using multiple cores). The index is a list of regions for each point in the dataset.

PrintClusters

Will print the contents of the clusters

PrintClustersShort

Will print the contents of the clusters (abreviated version)

_one_more_point_visited

Simple method used to display progress

AUTHOR

Michal TOMA, <mtoma at cpan.org>

BUGS

Please report any bugs or feature requests on github: https://github.com/mtoma/Algorithm-DBSCAN

By e-mail to bug-algorithm-dbscan at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-DBSCAN. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Algorithm::DBSCAN

You can also look for information at:

LICENSE AND COPYRIGHT

Copyright 2016 Michal TOMA.

This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.