NAME

Algorithm::LinearManifoldDataClusterer --- for clustering data that resides on a low-dimensional manifold in a high-dimensional measurement space

SYNOPSIS

  #  You'd begin with:

  use Algorithm::LinearManifoldDataClusterer;

  #  You'd next name the data file:

  my $datafile = "mydatafile.csv";

  #  Your data must be in the CSV format, with one of the columns containing a unique
  #  symbolic tag for each data record. You tell the module which column has the
  #  symbolic tag and which columns to use for clustering through a mask such as

  my $mask = "N111";

  #  which says that the symbolic tag is in the first column and that the numerical
  #  data in the next three columns is to be used for clustering.  If your data file
  #  had, say, five columns and you wanted only the last three columns to be
  #  clustered, the mask would become `N0111' assuming that the symbolic tag is
  #  still in the first column.

  #  Now you must construct an instance of the clusterer through a call such as:

  my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     
                                    P        => 2,     
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                  );

  #  where the parameter K specifies the number of clusters you expect to find in
  #  your data and the parameter P is the dimensionality of the manifold on which the
  #  data resides.  The parameter cluster_search_multiplier is for increasing the
  #  odds that the random seeds chosen initially for clustering will populate all the
  #  clusters.  Set this parameter to a low number like 2 or 3. The parameter
  #  max_iterations places a hard limit on the number of iterations that the
  #  algorithm is allowed.  The actual number of iterations is controlled by the
  #  parameter delta_reconstruction_error.  The iterations stop when the change in
  #  the total "reconstruction error" from one iteration to the next is smaller than
  #  the value specified by delta_reconstruction_error.

  #  Next, you must get the module to read the data for clustering:

  $clusterer->get_data_from_csv();

  #  Finally, you invoke linear manifold clustering by:

  my $clusters = $clusterer->linear_manifold_clusterer();

  #  The value returned by this call is a reference to an array of anonymous arrays,
  #  with each anonymous array holding one cluster.  If you wish, you can have the
  #  module write the clusters to individual files by the following call:

  $clusterer->write_clusters_to_files($clusters);

  #  If you want to see how the reconstruction error changes with the iterations, you
  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and when the clusters reside on a surface that
  #  is more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument is the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

  #  The module includes an embedded class, DataGenerator, for generating synthetic
  #  three-dimensional data that can be used to experiment with the clustering code.
  #  The synthetic data, written out to a CSV file, consists of Gaussian clusters on
  #  the surface of a sphere.  You can control the number of clusters, the width of
  #  each cluster, and the number of samples in the clusters by giving appropriate
  #  values to the constructor parameters as shown below:

  use strict;
  use Algorithm::LinearManifoldDataClusterer;

  my $output_file = "4_clusters_on_a_sphere_1000_samples.csv";

  my $training_data_gen = DataGenerator->new(
                             output_file => $output_file,
                             cluster_width => 0.015,
                             total_number_of_samples_needed => 1000,
                             number_of_clusters_on_sphere => 4,
                             show_hidden_in_3D_plots => 0,
                          );
  $training_data_gen->gen_data_and_write_to_csv();
  $training_data_gen->visualize_data_on_sphere($output_file);

CHANGES

Version 1.01: Typos and other errors removed from the documentation. Also included in the documentation is a link to a tutorial on data processing on manifolds.

DESCRIPTION

If you are new to machine learning and data clustering on linear and nonlinear manifolds, your first question is likely to be: What is a manifold? A manifold is a space that is locally Euclidean. A space is locally Euclidean if the points in a small neighborhood can be represented by, say, Cartesian coordinates and if the distances between those points are given by the Euclidean metric. For example, the set of all points on the surface of a sphere does NOT constitute a Euclidean space. Nonetheless, if you confined your attention to a small enough neighborhood around a point, the space would seem to be locally Euclidean. The surface of a sphere is a 2-dimensional manifold embedded in a 3-dimensional space. A plane in a 3-dimensional space is also a 2-dimensional manifold. You would think of the surface of a sphere as a nonlinear manifold, whereas a plane would be a linear manifold. However, note that any nonlinear manifold is locally a linear manifold. That is, given a sufficiently small neighborhood on a nonlinear manifold, you can always think of it as a locally flat surface.

As to why we need machine learning and data clustering on manifolds, there exist many important applications in which the measured data resides on a nonlinear manifold. For example, when you record images of a human face from different angles, all the image pixels taken together fall on a low-dimensional surface in a high-dimensional measurement space. The same is believed to be true for the satellite images of a land mass that are recorded with the sun at different angles with respect to the direction of the camera.

Reducing the dimensionality of the sort of data mentioned above is critical to the proper functioning of downstream classification algorithms, and the most popular traditional method for dimensionality reduction is the Principal Components Analysis (PCA) algorithm. However, using PCA is tantamount to passing a linear least-squares hyperplane through the surface on which the data actually resides. As to why that might be a bad thing to do, just imagine the consequences of assuming that your data falls on a straight line when, in reality, it falls on a strongly curving arc. This is exactly what happens with PCA --- it gives you a linear manifold approximation to your data that may actually reside on a curved surface.

That brings us to the purpose of this module, which is to cluster data that resides on a nonlinear manifold. Since a nonlinear manifold is locally linear, we can think of each data cluster on a nonlinear manifold as falling on a locally linear portion of the manifold, meaning on a hyperplane. The logic of the module is based on finding a set of hyperplanes that best describes the data, with each hyperplane derived from a local data cluster. This is like constructing a piecewise linear approximation to data that falls on a curve as opposed to constructing a single straight line approximation to all of the data. So whereas the frequently used PCA algorithm gives you a single hyperplane approximation to all your data, what this module returns is a set of hyperplane approximations, with each hyperplane derived by applying the PCA algorithm locally to a data cluster.

That brings us to the problem of how to actually discover the best set of hyperplane approximations to the data. What is probably the most popular algorithm today for that purpose is based on the following key idea: Given a set of subspaces to which a data element can be assigned, you assign it to that subspace for which the reconstruction error is the least. But what do we mean by a subspace and what is reconstruction error?

To understand the notions of subspace and reconstruction-error, let's revisit the traditional approach of dimensionality reduction by the PCA algorithm. The PCA algorithm consists of: (1) Subtracting from each data element the global mean of the data; (2) Calculating the covariance matrix of the data; (3) Carrying out an eigendecomposition of the covariance matrix and ordering the eigenvectors according to decreasing values of the corresponding eigenvalues; (4) Forming a subspace by discarding the trailing eigenvectors whose corresponding eigenvalues are relatively small; and, finally, (5) projecting all the data elements into the subspace so formed. The error incurred in representing a data element by its projection into the subspace is known as the reconstruction error. This error is the projection of the data element into the space spanned by the discarded trailing eigenvectors.
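
To make the notion of reconstruction error concrete, here is a minimal Perl sketch --- not the module's internal code, and with purely illustrative names --- that computes the reconstruction error of a single data element with respect to a subspace, given the data mean and the trailing eigenvectors as plain array references:

    sub reconstruction_error {
        my ($x, $mean, $trailing_eigenvectors) = @_;      # all array references
        # project the mean-subtracted data element onto each discarded direction
        my @centered = map { $x->[$_] - $mean->[$_] } 0 .. $#$x;
        my $error = 0;
        foreach my $evec (@$trailing_eigenvectors) {
            my $projection = 0;
            $projection += $evec->[$_] * $centered[$_] for 0 .. $#centered;
            $error += $projection ** 2;                   # squared component along a discarded eigenvector
        }
        return $error;
    }

The smaller this value, the better the retained subspace explains the data element.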

In linear-manifold based machine learning, instead of constructing a single subspace in the manner described above, we construct a set of subspaces, one for each data cluster on the nonlinear manifold. After the subspaces have been constructed, a data element is assigned to that subspace for which the reconstruction error is the least. On the face of it, this sounds like a chicken-and-egg sort of a problem. You need to have already clustered the data in order to construct the subspaces at different places on the manifold so that you can figure out which cluster to place a data element in.

Such problems, when they do possess a solution, are best tackled through iterative algorithms in which you start with a guess for the final solution, you rearrange the measured data on the basis of the guess, and you then use the new arrangement of the data to refine the guess. Subsequently, you iterate through the second and the third steps until you do not see any discernible changes in the new arrangements of the data. This forms the basis of the clustering algorithm that is described under Phase 1 in the section that follows. This algorithm was first proposed in the article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and Leen that appeared in the journal Neural Computation in 1997.

Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen is much too sensitive to how the clusters are seeded initially. To get around this limitation of the basic clustering-by-minimization-of-reconstruction-error, this module implements a two-phase approach. In Phase 1, we introduce a multiplier effect in our search for clusters by looking for M*K clusters instead of just the K clusters you actually expect. In this manner, we increase the odds that each original cluster will be visited by one or more of the M*K randomly selected seeds at the beginning, where M is the integer value given to the constructor parameter cluster_search_multiplier. Subsequently, we merge the clusters that belong together in order to form the final K clusters. That work is done in Phase 2 of the algorithm.

For the cluster merging operation in Phase 2, we model the M*K clusters as the nodes of an attributed graph in which the weight given to an edge connecting a pair of nodes is a measure of the similarity between the two clusters corresponding to the two nodes. Subsequently, we use spectral clustering to merge the most similar nodes in our quest to partition the data into K clusters. For that purpose, we use the Shi-Malik normalized cuts algorithm. The pairwise node similarity required by this algorithm is measured by the pairwise_cluster_similarity() method of the LinearManifoldDataClusterer class. The smaller the overall reconstruction error when all of the data elements in one cluster are projected into the other's subspace and vice versa, the greater the similarity between two clusters. Additionally, the smaller the distance between the mean vectors of the clusters, the greater the similarity between two clusters. The overall similarity between a pair of clusters is a combination of these two similarity measures.

For additional information regarding the theoretical underpinnings of the algorithm implemented in this module, visit https://engineering.purdue.edu/kak/Tutorials/ClusteringDataOnManifolds.pdf

SUMMARY OF THE ALGORITHM

We now present a summary of the two phases of the algorithm implemented in this module. Note particularly the important role played by the constructor parameter cluster_search_multiplier. It is only when the integer value given to this parameter is greater than 1 that Phase 2 of the algorithm kicks in.

Phase 1:

Through iterative minimization of the total reconstruction error, this phase of the algorithm returns M*K clusters where K is the actual number of clusters you expect to find in your data and where M is the integer value given to the constructor parameter cluster_search_multiplier. As previously mentioned, on account of the sensitivity of the reconstruction-error based clustering to how the clusters are initially seeded, our goal is to look for M*K clusters with the idea of increasing the odds that each of the K clusters will see at least one seed at the beginning of the algorithm.

Step 1:

Randomly choose M*K data elements to serve as the seeds for that many clusters.

Step 2:

Construct initial M*K clusters by assigning each data element to that cluster whose seed it is closest to.

Step 3:

Calculate the mean and the covariance matrix for each of the M*K clusters and carry out an eigendecomposition of the covariance matrix. Order the eigenvectors in decreasing order of the corresponding eigenvalues. The first P eigenvectors define the subspace for that cluster. Use the space spanned by the remaining eigenvectors --- we refer to them as the trailing eigenvectors --- for calculating the reconstruction error.

Step 4:

Taking into account the mean associated with each cluster, re-cluster the entire data set on the basis of the least reconstruction error. That is, assign each data element to that subspace for which it has the smallest reconstruction error. Calculate the total reconstruction error associated with all the data elements. (See the definition of the reconstruction error in the Description section; a code sketch of this re-assignment rule follows Step 5.)

Step 5:

Stop iterating if the change in the total reconstruction error from the previous iteration to the current iteration is less than the value specified by the constructor parameter delta_reconstruction_error. Otherwise, go back to Step 3.
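
Expressed as code, the re-assignment rule of Step 4 might look like the following hypothetical sketch. The data structures shown are purely illustrative (they are not the module's internal representation), and the sketch relies on the reconstruction_error() function sketched earlier in the Description section:

    sub reassign_data_elements {
        # $data: {tag => [coordinates]};  each cluster model: {mean => [...], trailing => [[...], ...]}
        my ($data, $cluster_models) = @_;
        my @new_clusters = map { [] } @$cluster_models;
        foreach my $tag (keys %$data) {
            my ($best, $smallest_error);
            for my $i (0 .. $#$cluster_models) {
                my $err = reconstruction_error($data->{$tag},
                                               $cluster_models->[$i]{mean},
                                               $cluster_models->[$i]{trailing});
                ($best, $smallest_error) = ($i, $err)
                    if !defined($smallest_error) or $err < $smallest_error;
            }
            push @{ $new_clusters[$best] }, $tag;         # assign to the least-error subspace
        }
        return \@new_clusters;
    }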

Phase 2:

This phase of the algorithm uses graph partitioning to merge the M*K clusters back into the K clusters you expect to see in your data. Since the algorithm whose steps are presented below is invoked recursively, let's assume that we have N clusters that need to be merged by invoking the Shi-Malik spectral clustering algorithm as described below:

Step 1:

Form a graph whose N nodes represent the N clusters. (For the very first invocation of this step, we have N = M*K.)

Step 2:

Construct an NxN similarity matrix for the nodes in the graph. The (i,j)-th element of this matrix is the pairwise similarity between the clusters indexed i and j. Calculate this similarity on the basis of the following two criteria: (1) The total reconstruction error when the data elements in the cluster indexed j are projected into the subspace for the cluster indexed i and vice versa. And (2) The distance between the mean vectors associated with the two clusters.

Step 3:

Calculate the symmetric normalized Laplacian of the similarity matrix; we use A to denote it. In standard notation, A = I - D^(-1/2) * W * D^(-1/2), where W is the NxN similarity matrix constructed in the previous step and D is the diagonal matrix whose i-th diagonal element is the sum of the i-th row of W.

Step 4:

Carry out an eigendecomposition of the A matrix and choose the eigenvector corresponding to the second smallest eigenvalue for bipartitioning the graph on the basis of the sign of the values in the eigenvector. (A small illustration of this sign-based bipartition follows Step 6.)

Step 5:

If the bipartition of the previous step yields a one-versus-the-rest kind of partition, add the `one' cluster to the output list of clusters and invoke graph partitioning recursively on the `rest' by going back to Step 1. On the other hand, if the cardinality of both partitions returned by Step 4 exceeds 1, invoke graph partitioning recursively on both partitions. Stop when the number of clusters in the output list equals K.

Step 6:

In general, the K clusters obtained by recursive graph partitioning will not cover all of the data. So, for the final step, assign each data element not covered by the K clusters to that cluster for which its reconstruction error is the least.
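
As a hypothetical illustration of the sign-based bipartition in Step 4, here is a small Perl snippet in which @eigvec is assumed to hold the components of the eigenvector corresponding to the second smallest eigenvalue, with the i-th component corresponding to the i-th node of the graph:

    # split the graph nodes into two groups according to the sign of their
    # entries in the chosen eigenvector
    my (@group_A, @group_B);
    for my $i (0 .. $#eigvec) {
        if ($eigvec[$i] >= 0) { push @group_A, $i }
        else                  { push @group_B, $i }
    }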

FAIL-FIRST BIAS OF THE MODULE

As you would expect for all such iterative algorithms, the module carries no theoretical guarantee that it will give you correct results. But what does that mean? Suppose you create synthetic data that consists of Gaussian-looking disjoint clusters on the surface of a sphere: would the module always succeed in separating out the clusters? The module carries no guarantees to that effect --- especially considering that Phase 1 of the algorithm is sensitive to how the clusters are seeded at the beginning. Although this sensitivity is mitigated by the cluster merging step when a greater-than-1 value is given to the constructor option cluster_search_multiplier, a plain-vanilla implementation of the steps in Phase 1 and Phase 2 would nonetheless carry a significant risk that you'll end up with incorrect clustering results.

To further reduce this risk, the module has been programmed so that it terminates immediately if it suspects that the cluster solution being developed is unlikely to be fruitful. The heuristics used for such terminations are conservative, since the cost of a termination is only that the user has to run the code again --- at worst an annoyance. The three "Fail First" heuristics currently programmed into the module are based on simple "unimodality testing", testing for "congruent clusters", and testing for dominant cluster support in the final stage of the recursive invocation of the graph partitioning step. The unimodality testing is as elementary as it can be --- it only checks the number of data samples within a certain radius of the mean in relation to the total number of data samples in the cluster.

When the program terminates under such conditions, it prints out the following message in your terminal window:

    Bailing out!  

Given the very simple nature of the testing that is carried out for the "Fail First" bias, do not be surprised if the results you get for your data simply look wrong. If most runs of the module produce wrong results for your application, that means that the module logic needs to be strengthened further. The author of this module would love to hear from you if that is the case.

METHODS

The module provides the following methods for linear-manifold based clustering, for cluster visualization, and for the generation of data for testing the clustering algorithm:

new():
    my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                        datafile                    => $datafile,
                                        mask                        => $mask,
                                        K                           => $K,
                                        P                           => $P,     
                                        cluster_search_multiplier   => $C,
                                        max_iterations              => $max_iter,
                                        delta_reconstruction_error  => 0.001,
                                        terminal_output             => 1,
                                        write_clusters_to_files     => 1,
                                        visualize_each_iteration    => 1,
                                        show_hidden_in_3D_plots     => 1,
                                        make_png_for_each_iteration => 1,
                    );

A call to new() constructs a new instance of the Algorithm::LinearManifoldDataClusterer class.

Constructor Parameters

datafile:

This parameter names the data file that contains the multidimensional data records you want the module to cluster. This file must be in CSV format and each record in the file must include a symbolic tag for the record. Here are the first few rows of one such CSV file in the examples directory:

    d_161,0.0739248630173239,0.231119293395665,-0.970112873251437
    a_59,0.459932215884786,0.0110216469739639,0.887885623314902
    a_225,0.440503220903039,-0.00543366086464691,0.897734586447273
    a_203,0.441656364946433,0.0437191337788422,0.896118459046532
    ...
    ...

What you see in the first column --- d_161, a_59, a_225, a_203 --- are the symbolic tags associated with four 3-dimensional data records.

mask:

This parameter supplies the mask to be applied to the columns of your data file. For the data file whose first few records are shown above, the mask should be N111 since the symbolic tag is in the first column of the CSV file and since, presumably, you want to cluster the data in the next three columns.

K:

This parameter supplies the number of clusters you are looking for.

P:

This parameter specifies the dimensionality of the manifold on which the data resides.

cluster_search_multiplier:

As should be clear from the Summary of the Algorithm section, this parameter plays a very important role in the successful clustering of your data. As explained in the Description section, the basic algorithm used for clustering in Phase 1 --- clustering by the minimization of the reconstruction error --- is sensitive to the choice of the cluster seeds that are selected randomly at the beginning of the algorithm. Should it happen that the seeds miss one or more of the clusters, the clustering produced is unlikely to be correct. By giving an integer value to cluster_search_multiplier that is greater than 1, you'll increase the odds that the randomly selected seeds will see all clusters. When you set cluster_search_multiplier to M, you ask Phase 1 of the algorithm to construct M*K clusters as opposed to just K clusters. Subsequently, in Phase 2, the module uses inter-cluster similarity based graph partitioning to merge the M*K clusters into K clusters.

max_iterations:

This parameter places a hard limit on the number of iterations in Phase 1 of the algorithm. Ordinarily, the iterations stop automatically when the change in the total reconstruction error from one iteration to the next is less than the value specified by the parameter delta_reconstruction_error.

delta_reconstruction_error:

It is this parameter that determines when the iterations will actually terminate in Phase 1 of the algorithm. When the difference in the total reconstruction error from one iteration to the next is less than the value given to this parameter, the iterations come to an end. IMPORTANT: I have noticed that the larger the number of data samples that need to be clustered, the larger must be the value given to this parameter. That makes intuitive sense since the total reconstruction error is the sum of all such errors for all of the data elements. Unfortunately, the best value for this parameter does NOT appear to depend linearly on the total number of data records to be clustered.

terminal_output:

When this parameter is set, you will see in your terminal window the different clusters as lists of the symbolic tags associated with the data records. You will also see in your terminal window the output produced by the different steps of the graph partitioning algorithm as smaller clusters are merged to form the final K clusters --- assuming that you set the parameter cluster_search_multiplier to an integer value that is greater than 1.

visualize_each_iteration:

As its name implies, when this option is set to 1, you'll see 3D plots of the clustering results for each iteration --- but only if your data is 3-dimensional.

show_hidden_in_3D_plots:

This parameter is important for controlling the visualization of the clusters on the surface of a sphere. If the clusters are too spread out, seeing all of the clusters all at once can be visually confusing. When you set this parameter, the clusters on the back side of the sphere will not be visible. Note that no matter how you set this parameter, you can interact with the 3D plot of the data and rotate it with your mouse pointer to see all of the data that is output by the clustering code.

make_png_for_each_iteration:

If you set this option to 1, the module will output a Gnuplot-generated PNG image for each iteration in Phase 1 of the algorithm. In Phase 2, the module will output a PNG image for the clustering result produced by the graph partitioning algorithm.

get_data_from_csv():
    $clusterer->get_data_from_csv();

As you can guess from its name, the method extracts the data from the CSV file named in the constructor.

linear_manifold_clusterer():
    $clusterer->linear_manifold_clusterer();   

    or 

    my $clusters = $clusterer->linear_manifold_clusterer();

This is the main call to the linear-manifold based clusterer. The first call works by side-effect, meaning that you will see the clusters in your terminal window and they will be written out to disk files (depending on the constructor options you have set). The second call additionally returns the clusters as a reference to an array of anonymous arrays, each holding the symbolic tags for one cluster.
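
For instance, a minimal sketch of walking through the returned structure --- assuming only what is documented above, namely that each anonymous array holds the symbolic tags for one cluster --- might look like:

    my $clusters = $clusterer->linear_manifold_clusterer();
    for my $i (0 .. $#$clusters) {
        my @tags = @{ $clusters->[$i] };
        print "Cluster $i contains ", scalar(@tags), " data elements: @tags\n";
    }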

display_reconstruction_errors_as_a_function_of_iterations():
    $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

This method would normally be called after the clustering is completed to see how the reconstruction errors decreased with the iterations in Phase 1 of the overall algorithm.

write_clusters_to_files():
    $clusterer->write_clusters_to_files($clusters);

As its name implies, when you call this method, the final clusters will be written out to disk files. The files have names like:

     cluster0.txt 
     cluster1.txt 
     cluster2.txt
     ...
     ...

Before the clusters are written to these files, the module destroys all files with such names in the directory in which you call the module.

visualize_clusters_on_sphere():
    $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

or

    $clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

If your data is 3-dimensional and it resides on the surface of a sphere (or in the vicinity of such a surface), you may be able to use these calls for the visualization of the clusters produced by the algorithm. The first invocation produces an interactive Gnuplot display in a terminal window that you can rotate with your mouse pointer. The second invocation produces a `.png' image of the plot.

auto_retry_clusterer():
    $clusterer->auto_retry_clusterer();

or

    my $clusters = $clusterer->auto_retry_clusterer();

As mentioned earlier, the module is programmed in such a way that it is more likely to fail than to give you a wrong answer. If manually trying the clusterer repeatedly on a data file is frustrating, you can use auto_retry_clusterer() to automatically make repeated attempts for you. See the script example4.pl for how you can use auto_retry_clusterer() in your own code.
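
Here is a hedged sketch of that usage pattern, modeled loosely on what example4.pl does (the constructor options shown are only indicative and should be adjusted for your data):

    use Algorithm::LinearManifoldDataClusterer;

    my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                        datafile                   => $datafile,
                        mask                       => $mask,
                        K                          => 3,
                        P                          => 2,
                        cluster_search_multiplier  => 1,
                        max_iterations             => 15,
                        delta_reconstruction_error => 0.001,
                        terminal_output            => 1,
                    );
    $clusterer->get_data_from_csv();
    my $clusters = $clusterer->auto_retry_clusterer();
    $clusterer->write_clusters_to_files($clusters);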

GENERATING SYNTHETIC DATA FOR EXPERIMENTING WITH THE CLUSTERER

The module file also contains a class named DataGenerator for generating synthetic data for experimenting with linear-manifold based clustering. At this time, only 3-dimensional data that resides in the form of Gaussian clusters on the surface of a sphere is generated. The generated data is placed in a CSV file. You construct an instance of the DataGenerator class by a call like:

new():
    my $training_data_gen = DataGenerator->new(
                                 output_file => $output_file,
                                 cluster_width => 0.0005,
                                 total_number_of_samples_needed => 1000,
                                 number_of_clusters_on_sphere => 4,
                                 show_hidden_in_3D_plots => 0,
                            );

Parameters for the DataGenerator constructor:

output_file:

This parameter names the CSV file to which the generated data samples are written. The numeric values are generated using a bivariate Gaussian distribution whose two independent variables are the azimuth and the elevation angles on the surface of a unit sphere. The mean of each cluster is chosen randomly and its covariance is set in proportion to the value supplied for the cluster_width parameter. (A small sketch illustrating how azimuth and elevation map onto the sphere follows these parameter descriptions.)

cluster_width:

This parameter controls the spread of each cluster on the surface of the unit sphere.

total_number_of_samples_needed:

As its name implies, this parameter specifies the total number of data samples that will be written out to the output file --- provided this number is divisible by the number of clusters you asked for. If the divisibility condition is not satisfied, the number of data samples actually written out will be the closest it can be to the number you specify for this parameter under the condition that an equal number of samples is created for each cluster.

number_of_clusters_on_sphere:

Again as its name implies, this parameter specifies the number of clusters that will be produced on the surface of a unit sphere.

show_hidden_in_3D_plots:

This parameter is important for the visualization of the clusters and it controls whether you will see the generated data on the back side of the sphere. If the clusters are not too spread out, you can set this parameter to 0 and see all the clusters all at once. However, when the clusters are spread out, it can be visually confusing to see the data on the back side of the sphere. Note that no matter how you set this parameter, you can interact with the 3D plot of the data and rotate it with your mouse pointer to see all of the data that is generated.
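
As promised under the output_file entry above, here is a hypothetical illustration (not the DataGenerator's actual code) of how a point sampled in (azimuth, elevation) coordinates maps onto the surface of the unit sphere:

    sub spherical_to_cartesian {
        my ($azimuth, $elevation) = @_;              # both angles in radians
        my $x = cos($elevation) * cos($azimuth);
        my $y = cos($elevation) * sin($azimuth);
        my $z = sin($elevation);
        return ($x, $y, $z);                         # a point on the unit sphere
    }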

gen_data_and_write_to_csv():
    $training_data_gen->gen_data_and_write_to_csv();

As its name implies, this method generates the data according to the parameters specified in the DataGenerator constructor.

visualize_data_on_sphere():
    $training_data_gen->visualize_data_on_sphere($output_file);

You can use this method to visualize the clusters produced by the data generator. Since the clusters are located at randomly selected points on a unit sphere, by looking at the output visually, you can quickly reject what the data generator has produced and try again.

HOW THE CLUSTERS ARE OUTPUT

When the option terminal_output is set in the constructor of the LinearManifoldDataClusterer class, the clusters are displayed on the terminal screen.

And, when the option write_clusters_to_files is set in the same constructor, the module dumps the clusters in files named

    cluster0.txt
    cluster1.txt
    cluster2.txt
    ...
    ...

in the directory in which you execute the module. The number of such files will equal the number of clusters formed. All such existing files in the directory are destroyed before any fresh ones are created. Each cluster file contains the symbolic tags of the data samples in that cluster.

Assuming that the data dimensionality is 3, if you have set the constructor parameter visualize_each_iteration, the module will deposit in your directory printable PNG files that are point plots of the different clusters in the different iterations of the algorithm. Such printable files are also generated for the initial clusters at the beginning of the iterations and for the final clusters in Phase 1 of the algorithm. You will also see in your directory a PNG file for the clustering result produced by graph partitioning in Phase 2.

Also, as mentioned previously, a call to linear_manifold_clusterer() in your own code will return the clusters to you directly.

REQUIRED

This module requires the following modules:

    List::Util
    File::Basename
    Math::Random
    Graphics::GnuplotIF
    Math::GSL::Matrix
    POSIX

THE examples DIRECTORY

The examples directory contains the following four scripts that you might want to play with in order to become familiar with the module:

    example1.pl

    example2.pl

    example3.pl

    example4.pl

These scripts demonstrate linear-manifold based clustering on the 3-dimensional data in the following three CSV files:

    3_clusters_on_a_sphere_498_samples.csv            (used in example1.pl and example4.pl)

    3_clusters_on_a_sphere_3000_samples.csv           (used in example2.pl)

    4_clusters_on_a_sphere_1000_samples.csv           (used in example3.pl)

Note that even though the first two of these files both contain exactly three clusters, the clusters look very different in the two data files. The clusters are much more spread out in 3_clusters_on_a_sphere_3000_samples.csv.

The code in example4.pl is special because it shows how you can call the auto_retry_clusterer() method of the module for automatic repeated invocations of the clustering program until success is achieved. The value of the constructor parameter cluster_search_multiplier is set to 1 in example4.pl, implying that when you execute example4.pl you will not be invoking Phase 2 of the algorithm. You might wish to change the value given to the parameter cluster_search_multiplier to see how it affects the number of attempts needed to achieve success.

The examples directory also includes PNG image files that show some intermediate and the best final results that can be achieved by the first three examples, that is, for the scripts example1.pl, example2.pl, and example3.pl. If you are on a Linux machine and if you have the ImageMagick library installed, you can use the display command to see the results in the PNG images. The results you get with your own runs of the three example scripts may or may not look like what you see in the outputs shown in the PNG files depending on how the module seeds the clusters. Your best results should look like what you see in the PNG images.

The examples directory also contains the following utility scripts:

For generating the data for experiments with clustering:

    generate_data_on_a_sphere.pl

For visualizing the data generated by the above script:

    data_visualizer.pl

For cleaning up the examples directory:

    cleanup_directory.pl

Invoking the cleanup_directory.pl script will get rid of all the PNG image files that are generated by the module when you run it with the constructor option make_png_for_each_iteration set to 1.

EXPORT

None by design.

CAVEATS

The performance of the algorithm depends much on the values you choose for the constructor parameters. And, even for the best choices for the parameters, the algorithm is not theoretically guaranteed to return the best results.

Even after you have discovered the best choices for the constructor parameters, the best way to use this module is to try it multiple times on any given data and accept only those results that make the best intuitive sense.

The module was designed with the hope that it would rather fail than give you wrong results. So if you see it failing a few times before it returns a good answer, that's a good thing.

However, if the module fails too often or is too quick to give you wrong answers, that means the module is not working on your data. If that happens, I'd love to hear from you. That might indicate to me how I should enhance the power of this module for its future versions.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'LinearManifoldDataClusterer' in the subject line.

INSTALLATION

Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:

    tar zxvf Algorithm-LinearManifoldDataClusterer-1.01.tar.gz

This will create an installation directory for you whose name will be Algorithm-LinearManifoldDataClusterer-1.01. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the module in any directory of your choice by:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the other modules it requires. How you do that would depend on what platform you are working on. In order to install this module on a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

THANKS

I have learned much from my conversations with Donghun Kim whose research on face recognition in the wild involves clustering image data on manifolds. I have also had several fruitful conversations with Bharath Kumar Comandur and Tanmay Prakash with regard to the ideas that are incorporated in this module.

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "LinearManifoldDataClusterer" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2015 Avinash Kak