NAME

Algorithm::KMeans - for clustering multidimensional data

SYNOPSIS

  # You now have four different choices for clustering your data with this module:
  #
  #     1)  With Euclidean distances and with random cluster seeding
  #    
  #     2)  With Mahalanobis distances and with random cluster seeding
  #   
  #     3)  With Euclidean distances and with smart cluster seeding
  #
  #     4)  With Mahalanobis distances and with smart cluster seeding
  #
  # Despite the qualifier 'smart' in 'smart cluster seeding', it may not always
  # produce results that are superior to those obtained with random seeding.  (If you
  # also factor in the choice regarding variance normalization, you actually have
  # eight different choices for data clustering with this module.)
  #
  # In all cases, you'd obviously begin with

  use Algorithm::KMeans;

  # You'd then name the data file:

  my $datafile = "mydatafile.csv";

  # Next, set the mask to indicate which columns of the datafile to use for
  # clustering and which column contains a symbolic ID for each data record. For
  # example, if the symbolic name is in the first column, the second column is to
  # be ignored, and the next three columns are to be used for 3D clustering, you'd
  # set the mask to:

  my $mask = "N0111";

  # Now construct an instance of the clusterer.  The parameter K controls the number
  # of clusters.  If you know how many clusters you want (let's say 3), call

  my $clusterer = Algorithm::KMeans->new( datafile        => $datafile,
                                          mask            => $mask,
                                          K               => 3,
                                          cluster_seeding => 'random',
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

  # By default, this constructor call will set you up for clustering based on
  # Euclidean distances.  If you want the module to use Mahalanobis distances, your
  # constructor call will look like:

  my $clusterer = Algorithm::KMeans->new( datafile        => $datafile,
                                          mask            => $mask,
                                          K               => 3,
                                          cluster_seeding => 'random',
                                          use_mahalanobis_metric => 1,
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

  # For both constructor calls shown above, you can use smart seeding of the clusters
  # by changing 'random' to 'smart' for the cluster_seeding option.  See the
  # explanation of smart seeding in the Methods section of this documentation.

  # If your data is such that its variability along the different dimensions of the
  # data space is significantly different, you may get better clustering if you first
  # normalize your data by setting the constructor parameter
  # do_variance_normalization as shown below:

  my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                          mask     => $mask,
                                          K        => 3,
                                          cluster_seeding => 'smart',    # or 'random'
                                          terminal_output => 1,
                                          do_variance_normalization => 1,
                                          write_clusters_to_files => 1,
                                        );

  # But bear in mind that such data normalization may actually decrease the
  # performance of the clusterer if the variability in the data is more a result of
  # the separation between the means than a consequence of intra-cluster variance.

  # Set K to 0 if you want the module to figure out the optimum number of clusters
  # from the data. (It is best to run this option with the terminal_output set to 1
  # so that you can see the QoC values for the different choices of K):

  my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                          mask     => $mask,
                                          K        => 0,
                                          cluster_seeding => 'random',    # or 'smart'
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

  # Although not shown above, you can obviously set the 'do_variance_normalization'
  # flag here also if you wish.

  # For very large data files, setting K to 0 will result in searching through too
  # many values for K.  For such cases, you can range limit the values of K to search
  # through by

  my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                          mask     => "N111",
                                          Kmin     => 3,
                                          Kmax     => 10,
                                          cluster_seeding => 'random',    # or 'smart'
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

  # FOR ALL CASES ABOVE, YOU'D NEED TO MAKE THE FOLLOWING CALLS ON THE CLUSTERER
  # INSTANCE TO ACTUALLY CLUSTER THE DATA:

  $clusterer->read_data_from_file();
  $clusterer->kmeans();

  # If you want to directly access the clusters and the cluster centers in your own
  # top-level script, replace the above two statements with:

  $clusterer->read_data_from_file();
  my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();

  # You can subsequently access the clusters directly in your own code, as in:

  foreach my $cluster_id (sort keys %{$clusters_hash}) {
      print "\n$cluster_id   =>   @{$clusters_hash->{$cluster_id}}\n";
  }
  foreach my $cluster_id (sort keys %{$cluster_centers_hash}) {
      print "\n$cluster_id   =>   @{$cluster_centers_hash->{$cluster_id}}\n";
  }


  # CLUSTER VISUALIZATION:

  # You must first set the mask for cluster visualization. This mask tells the module
  # which 2D or 3D subspace of the original data space you wish to visualize the
  # clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);


  # SYNTHETIC DATA GENERATION:

  # The module has been provided with a class method for generating multivariate data
  # for experimenting with clustering.  The data generation is controlled by the
  # contents of the parameter file that is supplied as an argument to the data
  # generator method.  The mean and covariance matrix entries in the parameter file
  # must be according to the syntax shown in the param.txt file in the examples
  # directory. It is best to edit this file as needed:

  my $parameter_file = "param.txt";
  my $out_datafile = "mydatafile.dat";
  Algorithm::KMeans->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          number_data_points_per_cluster => $N );

CHANGES

Version 2.05 removes the restriction on the version of Perl that is required. This is based on Srezic's recommendation. He had no problem building and testing the previous version with Perl 5.8.9. Version 2.05 also includes a small augmentation of the code in the method read_data_from_file_csv() for guarding against user errors in the specification of the mask that tells the module which columns of the data file are to be used for clustering.

Version 2.04 allows you to use CSV data files for clustering.

Version 2.03 incorporates minor code cleanup. The main implementation of the module remains unchanged.

Version 2.02 downshifts the version of Perl that is required for this module. The module should work with versions 5.10 and higher of Perl. The implementation code for the module remains unchanged.

Version 2.01 removes many errors in the documentation. The changes made to the module in Version 2.0 were not reflected properly in the documentation page for that version. The implementation code remains unchanged.

Version 2.0 includes significant additional functionality: (1) You now have the option to cluster using the Mahalanobis distance metric (the default is the Euclidean metric); and (2) With the two which_cluster methods that have been added to the module, you can now determine the best cluster for a new data sample after you have created the clusters with the previously available data. Finding the best cluster for a new data sample can be done using either the Euclidean metric or the Mahalanobis metric.

Version 1.40 includes a smart option for seeding the clusters. This option, supplied through the constructor parameter cluster_seeding, means that the clusterer will (1) Subject the data to principal components analysis in order to determine the maximum variance direction; (2) Project the data onto this direction; (3) Find peaks in a smoothed histogram of the projected points; and (4) Use the locations of the highest peaks as initial guesses for the cluster centers. If you don't want to use this option, set cluster_seeding to random. That should work as in the previous version of the module.

Version 1.30 includes a bug fix for the case when the datafile contains empty lines, that is, lines with no data records. Another bug fix in Version 1.30 deals with the case when you want the module to figure out how many clusters to form (this is the K=0 option in the constructor call) and the number of data records is close to the minimum.

Version 1.21 includes fixes to handle the possibility that, when clustering the data for a fixed number of clusters, a cluster may become empty during iterative calculation of cluster assignments of the data elements and the updating of the cluster centers. The code changes are in the assign_data_to_clusters() and update_cluster_centers() subroutines.

Version 1.20 includes an option to normalize the data with respect to its variability along the different coordinates before clustering is carried out.

Version 1.1.1 allows for range limiting the values of K to search through. K stands for the number of clusters to form. This version also declares the module dependencies in the Makefile.PL file.

Version 1.1 is an object-oriented version of the implementation presented in version 1.0. The current version should lend itself more easily to code extension. You could, for example, create your own class by subclassing from the class presented here and, in your subclass, use your own criteria for the similarity distance between the data points and for the QoC (Quality of Clustering) metric, and, possibly, a different rule to stop the iterations. Version 1.1 also allows you to directly access the clusters formed and the cluster centers in your calling script.

SPECIAL USAGE NOTE

If you were directly accessing in your own scripts the clusters produced by the older versions of this module, you'd need to make changes to your code if you wish to use Version 2.0 or higher. Instead of returning arrays of clusters and cluster centers, Versions 2.0 and higher return hashes. This change was made necessary by the logic required for implementing the two new which_cluster methods that were introduced in Version 2.0. These methods return the best cluster for a new data sample from the clusters you created using the existing data. One of the which_cluster methods is based on the Euclidean metric for finding the cluster that is closest to the new data sample, and the other on the Mahalanobis metric. Another point of incompatibility with the previous versions is that you must now explicitly set the cluster_seeding parameter in the call to the constructor to either random or smart. This parameter does not have a default associated with it starting with Version 2.0.
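If your pre-2.0 scripts accessed the clusters as arrays, the change amounts to switching from array dereferencing to hash dereferencing, as in the following minimal sketch (the commented-out array-style line is only an assumption about what typical older client code looked like):

    # Pre-2.0 clients typically did something like (assumed; no longer works):
    #     foreach my $cluster (@$clusters) { ... }
    # With Version 2.0 and higher, iterate over the hash keyed by cluster ID:
    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
    foreach my $cluster_id (sort keys %{$clusters_hash}) {
        print "$cluster_id => @{$clusters_hash->{$cluster_id}}\n";
    }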

DESCRIPTION

Clustering with K-Means takes place iteratively and involves two steps: 1) assignment of data samples to clusters on the basis of how far the data samples are from the cluster centers; and 2) Recalculation of the cluster centers (and cluster covariances if you are using the Mahalanobis distance metric for clustering).

Obviously, before the two-step approach can proceed, we need to initialize the cluster centers. How this initialization is carried out is important. The module gives you two very different ways of carrying out this initialization. One option, called the smart option, consists of subjecting the data to principal components analysis to discover the direction of maximum variance in the data space. The data points are then projected onto this direction and a histogram constructed from the projections. The peaks of the smoothed histogram are used to seed the clustering operation. The other option is to choose the cluster centers purely randomly. You get the first option if you set cluster_seeding to smart in the constructor, and you get the second option if you set it to random.
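To make the histogram smoothing and peak finding concrete, here is a minimal, self-contained sketch of how peaks in a smoothed histogram of 1D projections could be located. It is an illustration of the idea only, not the module's internal code, and the made-up data, the bin count, and the 3-bin smoothing window are all arbitrary assumptions:

    # Illustrative only: locate peaks in a smoothed histogram of 1D projections.
    my @projections = (1.1, 1.3, 0.9, 5.2, 5.0, 4.8, 9.1, 9.3);  # made-up projections
    my $nbins = 10;
    my ($min, $max) = (sort { $a <=> $b } @projections)[0, -1];
    my $bin_width = ($max - $min) / $nbins;
    my @hist = (0) x $nbins;
    for my $p (@projections) {
        my $bin = int( ($p - $min) / $bin_width );
        $bin = $nbins - 1 if $bin >= $nbins;              # clamp the max value
        $hist[$bin]++;
    }
    my @smoothed;                                         # 3-bin moving average
    for my $i (0 .. $nbins - 1) {
        my $lo = $i > 0 ? $i - 1 : 0;
        my $hi = $i < $nbins - 1 ? $i + 1 : $i;
        my $sum = 0;
        $sum += $hist[$_] for $lo .. $hi;
        push @smoothed, $sum / ($hi - $lo + 1);
    }
    # A bin that rises above its neighbors is a peak; its center is a seed guess:
    my @seed_guesses;
    for my $i (1 .. $nbins - 2) {
        push @seed_guesses, $min + ($i + 0.5) * $bin_width
            if $smoothed[$i] > $smoothed[$i-1] && $smoothed[$i] >= $smoothed[$i+1];
    }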

How to specify the number of clusters, K, is one of the most vexing issues in any approach to clustering. In some cases, we can set K on the basis of prior knowledge. But, more often than not, no such prior knowledge is available. When the programmer does not explicitly specify a value for K, the approach taken in the current implementation is to try all possible values between 2 and some largest possible value that makes statistical sense. We then choose that value for K which yields the best value for the QoC (Quality of Clustering) metric. It is generally believed that the largest value for K should not exceed sqrt(N/2) where N is the number of data samples to be clustered.
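As a quick worked illustration of that rule of thumb (the sample count below is made up):

    my $N    = 200;                    # made-up number of data samples
    my $Kmax = int( sqrt($N / 2) );    # sqrt(100) = 10
    # With K => 0 in the constructor, the module would then examine
    # K = 2 .. 10 and pick the K that yields the best QoC value.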

What to use for the QoC metric is obviously a critical issue unto itself. In the current implementation, the value of QoC is the ratio of the average radius of the clusters and the average distance between the cluster centers.
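From that definition, smaller QoC values indicate clusters that are tight relative to their mutual separation. The following self-contained sketch computes the ratio for made-up 2D clusters; it illustrates the formula only and is not the module's internal code:

    # Illustrative only: QoC = (average cluster radius) / (average distance
    # between cluster centers), for made-up 2D data.
    sub euclidean {
        my ($p, $q) = @_;
        return sqrt( ($p->[0] - $q->[0])**2 + ($p->[1] - $q->[1])**2 );
    }
    my %clusters = ( c0 => [ [1, 1], [1, 2] ],  c1 => [ [8, 8], [9, 8] ] );
    my %centers  = ( c0 => [1, 1.5],            c1 => [8.5, 8] );
    my $radius_sum = 0;
    for my $id (keys %clusters) {
        my $r = 0;
        $r += euclidean($_, $centers{$id}) for @{ $clusters{$id} };
        $radius_sum += $r / @{ $clusters{$id} };   # average radius of this cluster
    }
    my @ids = sort keys %centers;
    my ($center_dist_sum, $npairs) = (0, 0);
    for my $i (0 .. $#ids - 1) {
        for my $j ($i + 1 .. $#ids) {
            $center_dist_sum += euclidean($centers{$ids[$i]}, $centers{$ids[$j]});
            $npairs++;
        }
    }
    my $QoC = ($radius_sum / (scalar keys %clusters)) / ($center_dist_sum / $npairs);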

Every iterative algorithm requires a stopping criterion. The criterion implemented here is that we stop iterations when there is no re-assignment of the data points during the assignment step.

Ordinarily, the output produced by a K-Means clusterer will correspond to a local minimum for the QoC values, as opposed to a global minimum. The current implementation protects against that when the module constructor is called with the random option for cluster_seeding by trying different randomly selected initial cluster centers and then selecting the one that gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap between the clusters is minimal and if each cluster exhibits variability that is uniform in all directions. When the data variability is different along the different directions in the data space, the results you obtain with a K-Means clusterer may be improved by first normalizing the data appropriately, as can be done in this module when you set the do_variance_normalization option in the constructor. However, as pointed out elsewhere in this documentation, such normalization may actually decrease the performance of the clusterer if the overall data variability along any dimension is more a result of separation between the means than a consequence of intra-cluster variability.

METHODS

The module provides the following methods for clustering, for cluster visualization, for data visualization, for the generation of data for testing a clustering algorithm, and for determining the cluster membership of a new data sample:

new():
    my $clusterer = Algorithm::KMeans->new(datafile        => $datafile,
                                           mask            => $mask,
                                           K               => $K,
                                           cluster_seeding => 'random',     # also try 'smart'
                                           use_mahalanobis_metric => 1,     # also try '0'
                                           terminal_output => 1,     
                                           write_clusters_to_files => 1,
                                          );

A call to new() constructs a new instance of the Algorithm::KMeans class. When $K is a non-zero positive integer, the module will construct exactly that many clusters. However, when $K is 0, the module will find the best number of clusters to partition the data into. As explained in the Description, setting cluster_seeding to smart causes PCA (principal components analysis) to be used for discovering the best choices for the initial cluster centers. If you want purely random decisions to be made for the initial choices for the cluster centers, set cluster_seeding to random.

The data file is expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each data record and the rest of the columns contain the numerical information. Which columns are actually used for clustering is decided by the string value of the mask. For example, if we wanted to cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns above, the mask value would be N0111, where the character N indicates that the ID tag is in the first column, the character 0 that the second column is to be ignored, and the 1's that follow that the 3rd, the 4th, and the 5th columns are to be used for clustering.

If you wish for the clusterer to search through a (Kmin,Kmax) range of values for K, the constructor should be called in the following fashion:

    my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
                                           mask     => $mask,
                                           Kmin     => 3,
                                           Kmax     => 10,
                                           cluster_seeding => 'smart',   # try 'random' also
                                           terminal_output => 1,     
                                          );

where obviously you can choose any reasonable values for Kmin and Kmax. If you choose a value for Kmax that is statistically too large, the module will let you know. Again, you may choose either random or smart for cluster_seeding; as stated in the Special Usage Note, this parameter has no default in Version 2.0 and higher, so it must be set explicitly.

If you believe that the variability of the data is very different along the different dimensions of the data space, you may get better clustering by first normalizing the data coordinates by the standard-deviations along those directions. When you set the constructor option do_variance_normalization as shown below, the module uses the overall data standard-deviation along a direction for the normalization in that direction. (As mentioned elsewhere in the documentation, such a normalization could backfire on you if the data variability along a dimension is more a result of the separation between the means than a consequence of the intra-cluster variability.):

    my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                            mask     => "N111",   
                                            K        => 2,        
                                            cluster_seeding => 'smart',   # try 'random' also
                                            terminal_output => 1,
                                            do_variance_normalization => 1,
                    );

Constructor Parameters

datafile:

This parameter names the data file that contains the multidimensional data records you want the module to cluster.

mask:

This parameter supplies the mask to be applied to the columns of your data file. See the explanation in Synopsis for what this mask looks like.

K:

This parameter supplies the number of clusters you are looking for. If you set this option to 0, the module will search for the best value of K. (Keep in mind that searching for the best K may take a long time for large data files.)

Kmin:

If you supply an integer value for Kmin, the search for the best K will begin with that value.

Kmax:

If you supply an integer value for Kmax, the search for the best K will end at that value.

cluster_seeding:

This parameter must be set to either random or smart. Depending on your data, you may get superior clustering with the random option. The choice smart means that the clusterer will (1) subject the data to principal components analysis to determine the maximum variance direction; (2) project the data onto this direction; (3) find peaks in a smoothed histogram of the projected points; and (4) use the locations of the highest peaks as seeds for cluster centers. If the smart option produces bizarre results, try random.

use_mahalanobis_metric:

When set to 1, this option causes Mahalanobis distances to be used for clustering. The default is 0 for this parameter. By default, the module uses Euclidean distances for clustering. In general, Mahalanobis-distance-based clustering will fail if your data resides on a lower-dimensional hyperplane in the data space, if you seek too many clusters, or if you do not have a sufficient number of samples in your data file. A necessary requirement for the module to be able to compute Mahalanobis distances is that the cluster covariance matrices be non-singular. (Let's say your data dimensionality is D and the module is considering a cluster that has only d samples in it, where d is less than D. In this case, the covariance matrix will be singular since its rank will not exceed d. For the covariance matrix to be non-singular, it must be of full rank, that is, its rank must be D.)

do_variance_normalization:

When set, the module will first normalize the data variance along the different dimensions of the data space before attempting clustering. Depending on your data, this option may or may not result in better clustering.

terminal_output:

This boolean parameter, when not supplied in the call to new(), defaults to 0. When set, you will see in your terminal window the different clusters as lists of the symbolic IDs and their cluster centers. You will also see the QoC (Quality of Clustering) values for the clusters displayed.

write_clusters_to_files:

This parameter is also boolean. When set to 1, the clusters are written out to files that are named in the following manner:

     cluster0.txt 
     cluster1.txt 
     cluster2.txt
     ...
     ...

Before the clusters are written to these files, the module destroys all files with such names in the directory in which you call the module.

read_data_from_file()
    $clusterer->read_data_from_file();
kmeans()
    $clusterer->kmeans();

    or 

    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();

The first call above works solely by side effect. The second call also returns the clusters and the cluster centers. See the cluster_and_visualize.pl script in the examples directory for how you can extract the clusters and the cluster centers in your own code from the variables $clusters_hash and $cluster_centers_hash.

get_K_best()
    $clusterer->get_K_best();

This call makes sense only if you supply either the K=0 option to the constructor, or if you specify values for the Kmin and Kmax options. The K=0 and the (Kmin,Kmax) options cause the module to determine the best value for K. Remember, K is the number of clusters the data is partitioned into.

show_QoC_values()
    $clusterer->show_QoC_values();

presents a table with K values in the left column and the corresponding QoC (Quality-of-Clustering) values in the right column. Note that this call makes sense only if you either supply the K=0 option to the constructor, or if you specify values for the Kmin and Kmax options.
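Putting the two calls together, a typical session for letting the module find the best K might look like the following sketch (the datafile name and the mask are placeholders):

    my $clusterer = Algorithm::KMeans->new( datafile => "mydatafile.csv",   # placeholder
                                            mask     => "N111",
                                            K        => 0,
                                            cluster_seeding => 'random',    # or 'smart'
                                            terminal_output => 1,
                                          );
    $clusterer->read_data_from_file();
    $clusterer->kmeans();
    $clusterer->get_K_best();
    $clusterer->show_QoC_values();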

visualize_clusters()
    $clusterer->visualize_clusters( $visualization_mask )

The visualization mask here does not have to be identical to the one used for clustering, but must be a subset of that mask. This is convenient for visualizing the clusters in two- or three-dimensional subspaces of the original space.
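For example, if the clustering was carried out with the mask N0111 shown in the Synopsis, so that three data columns were used, the following would display the clusters in the 2D subspace formed by the first two of those three dimensions:

    my $visualization_mask = "110";    # a subset of the three clustering dimensions
    $clusterer->visualize_clusters($visualization_mask);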

visualize_data()
    $clusterer->visualize_data($visualization_mask, 'original');

    $clusterer->visualize_data($visualization_mask, 'normed');

This method requires a second argument and, as shown, it must be either the string original or the string normed, the former for the visualization of the raw data and the latter for the visualization of the data after its different dimensions are normalized by the standard-deviations along those directions. If you call the method with the second argument set to normed, but do so without turning on the do_variance_normalization option in the KMeans constructor, it will let you know.

which_cluster_for_new_data_element()

If you wish to determine the cluster membership of a new data sample after you have created the clusters with the existing data samples, you would need to call this method. The which_cluster_for_new_data.pl script in the examples directory shows how to use this method.

which_cluster_for_new_data_element_mahalanobis()

This does the same thing as the previous method, except that it determines the cluster membership using the Mahalanobis distance metric. As for the previous method, see the which_cluster_for_new_data.pl script in the examples directory for how to use this method.
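As a rough sketch of how these two methods might be invoked (the array-reference argument shown here is an assumption; consult which_cluster_for_new_data.pl for the authoritative calling syntax):

    # Assumed calling convention, for illustration only:
    my $new_sample = [10.5, 9.6, 10.9];   # made-up coordinates in the clustering subspace
    my $cluster_euclid = $clusterer->which_cluster_for_new_data_element($new_sample);
    my $cluster_mahal  = $clusterer->which_cluster_for_new_data_element_mahalanobis($new_sample);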

cluster_data_generator()
    Algorithm::KMeans->cluster_data_generator(
                            input_parameter_file => $parameter_file,
                            output_datafile => $out_datafile,
                            number_data_points_per_cluster => 20 );

for generating multivariate data if you wish to play with synthetic data for clustering. The input parameter file contains the means and the variances for the different Gaussians you wish to use for the synthetic data. See the file param.txt provided in the examples directory. It will be easiest for you to just edit this file for your data generation needs. In addition to the format of the parameter file, the main constraint you need to observe in specifying the parameters is that the dimensionality of the covariance matrix must correspond to the dimensionality of the mean vectors. The multivariate random numbers are generated by calling the Math::Random module. As you would expect, this module requires that the covariance matrices you specify in your parameter file be symmetric and positive definite. Should the covariances in your parameter file not obey this condition, the Math::Random module will let you know.

HOW THE CLUSTERS ARE OUTPUT

When the option terminal_output is set in the call to the constructor, the clusters are displayed on the terminal screen.

When the option write_clusters_to_files is set in the call to the constructor, the module dumps the clusters in files named

    cluster0.txt
    cluster1.txt
    cluster2.txt
    ...
    ...

in the directory in which you execute the module. The number of such files will equal the number of clusters formed. All such existing files in the directory are destroyed before any fresh ones are created. Each cluster file contains the symbolic ID tags of the data samples in that cluster.

The module also leaves in your directory a printable .png file that is a point plot of the different clusters. The name of this file is always clustering_results.png.

Also, as mentioned previously, a call to kmeans() in your own code will return the clusters and the cluster centers.

REQUIRED

This module requires the following three modules:

   Math::Random
   Graphics::GnuplotIF
   Math::GSL

With regard to the third item above, what is actually required is the Math::GSL::Matrix module. However, that module is a part of the Math::GSL distribution. The acronym GSL stands for the GNU Scientific Library. Math::GSL is a Perl interface to the GSL C-based library.

THE examples DIRECTORY

The examples directory contains several scripts to help you become familiar with this module. The following script is an example of how the module can be expected to be used most of the time. It calls for clustering to be carried out with a fixed K:

        cluster_and_visualize.pl

The more time you spend with this script, the more comfortable you will become with the use of this module. The script file contains a large comment block that mentions six locations in the script where you have to make decisions about how to use the module.

See the following script if you do not know what value to use for K for clustering your data:

        find_best_K_and_cluster.pl

This script uses the K=0 option in the constructor that causes the module to search for the best K for your data. Since this search is virtually unbounded --- limited only by the number of samples in your data file --- the script may take a long time to run for a large data file. Hence the next script.

If your datafile is too large, you may need to range limit the values of K that are searched through, as in the following script:

        find_best_K_in_range_and_cluster.pl

If you also want to include data normalization (it may reduce the performance of the clusterer in some cases), see the following script:

        cluster_after_data_normalization.pl

When you include the data normalization step and you would like to visualize the data before and after normalization, see the following script:

        cluster_and_visualize_with_data_visualization.pl

After you are done clustering, let's say you want to find the cluster membership of a new data sample. To see how you can do that, see the script:

        which_cluster_for_new_data.pl

This script returns two answers for which cluster a new data sample belongs to: one using the Euclidean metric to calculate the distances between the new data sample and the cluster centers, and the other using the Mahalanobis metric. If the clusters are strongly elliptical in shape, you are likely to get better results with the Mahalanobis metric. (To see that you can get two different answers using the two different distance metrics, run the which_cluster_for_new_data.pl script on the data in the file mydatafile3.dat. To make this run, note that you have to comment out and uncomment the lines at four different locations in the script.)

The examples directory also contains the following support scripts:

For generating the data for experiments with clustering:

        data_generator.pl

For cleaning up the examples directory:

        cleanup_directory.pl

The examples directory also includes a parameter file, param.txt, for generating synthetic data for clustering. Just edit this file if you would like to generate your own multivariate data for clustering. The parameter file is for the 3D case, but you can generate data with any dimensionality through appropriate entries in the parameter file.

EXPORT

None by design.

CAVEATS

K-Means based clustering usually does not work well when the clusters are strongly overlapping and when the extent of variability along the different dimensions is different for the different clusters. The module does give you the ability to normalize the variability in your data with the constructor option do_variance_normalization. However, as described elsewhere, this may actually reduce the performance of the clusterer if the data variability along a direction is more a result of the separation between the means than because of intra-cluster variability. For better clustering with difficult-to-cluster data, you could try using the author's Algorithm::ExpectationMaximization module.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'KMeans' in the subject line.

INSTALLATION

Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:

    tar zxvf Algorithm-KMeans-2.05.tar.gz

This will create an installation directory for you whose name will be Algorithm-KMeans-2.05. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the module in any directory of your choice by:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the other modules it requires. How you do that would depend on what platform you are working on. In order to install this module on a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

THANKS

I thank Slaven for pointing out that I needed to downshift the required version of Perl for this module. Fortunately, I had access to an old machine still running Perl 5.10.1. The current version, 2.02, is based on my testing the module on that machine.

I added two which_cluster methods in Version 2.0 as a result of an email from Jerome White who expressed a need for such methods in order to determine the best cluster for a new data record after you have successfully clustered your existing data. Thanks Jerome for your feedback!

It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this module. Working with Version 1.30, Nadeem noticed that occasionally the module would produce variable clustering results on the same dataset. I believe that this variability was caused (at least partly) by the purely random mode that was used in Version 1.30 for the seeding of the cluster centers. Version 1.40 now includes a smart mode. With the new mode the clusterer uses a PCA (Principal Components Analysis) of the data to make good guesses for the cluster centers. However, depending on how the data is jumbled up, it is possible that the new mode will not produce uniformly good results in all cases. So you can still use the old mode by setting cluster_seeding to random in the constructor. Thanks Nadeem for your feedback!

Version 1.30 resulted from Martin Kalin reporting problems with a very small data set. Thanks Martin!

Version 1.21 came about in response to the problems encountered by Luis Fernando D'Haro with version 1.20. Although the module would yield the clusters for some of its runs, more frequently than not the module would abort with an "empty cluster" message for his data. Luis Fernando has also suggested other improvements (such as clustering directly from the contents of a hash) that I intend to make in future versions of this module. Thanks Luis Fernando.

Chad Aeschliman was kind enough to test out the interface of this module and to give suggestions for its improvement. His key slogan: "If you cannot figure out how to use a module in under 10 minutes, it's not going to be used." That should explain the longish Synopsis included here.

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "KMeans" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2014 Avinash Kak