NAME
Microarray::DataMatrix::SmallDataMatrix - abstraction for matrices that fit in memory
Abstract
smallDataMatrix is an abstract class that provides an abstraction to an in-memory matrix of data. It should not be subclassed directly by concrete classes. Instead, its subclass, anySizeDataMatrix, can be subclassed by concrete classes, which provide abstractions to dataMatrices stored in particular file formats, such as pcl files.
Overall Logic
Internally, all data are read into memory, and the actual data matrix is stored as a two-dimensional array. The indexes of the array that are still valid are stored in internal hashes, so that subsequent manipulations of the data only consider the data that have not been filtered out. As rows or columns are filtered out by some of the methods, the entries for these rows or columns are deleted from the hashes that track valid data. Thus, when data are redumped to a file, only those data that have not been filtered out are printed.
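The bookkeeping described above can be illustrated with a minimal sketch. It is not the module's code: the variable names and the dumpValidData helper are hypothetical, and it only assumes a 2-D array plus one hash of valid row indexes and one of valid column indexes.

```perl
use strict;
use warnings;

# Hypothetical in-memory layout: a 2-D array plus hashes of still-valid indexes.
my @matrix = (
    [ 1, 2, 3 ],
    [ 4, 5, 6 ],
    [ 7, 8, 9 ],
);
my %validRows    = map { $_ => undef } 0 .. $#matrix;
my %validColumns = map { $_ => undef } 0 .. $#{ $matrix[0] };

# Filtering out a row or column only deletes its index from the tracking hash.
delete $validRows{1};
delete $validColumns{2};

# Dumping walks only the indexes that remain valid (hypothetical helper).
sub dumpValidData {
    my ($matrixRef, $validRowsRef, $validColumnsRef) = @_;
    my @columns = sort { $a <=> $b } keys %{$validColumnsRef};
    my @lines;
    foreach my $row (sort { $a <=> $b } keys %{$validRowsRef}) {
        my @cells = map { $matrixRef->[$row][$_] } @columns;
        push @lines, join("\t", @cells);
    }
    return @lines;
}

my @lines = dumpValidData(\@matrix, \%validRows, \%validColumns);
print "$_\n" foreach @lines;
```

Only rows 0 and 2 and columns 0 and 1 are printed, because the other indexes are no longer present in the tracking hashes.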
Construction
As smallDataMatrix is an abstract class, it has no constructor. However, its subclass, anySizeDataMatrix, MUST call the _init method once it has determined that a matrix will indeed fit into memory; this results in all data being read into memory.
_init
This protected method will read in all the data for the matrix, and store it in memory. It MUST be called during initialization of a subclass object, before any other methods can be called by the client.
Usage:
$self->_init;
or:
$self->SUPER::_init;
private utility methods
__readInAndStoreAllData
This private method uses the _dataLine method from a concrete subclass to request all data from the matrix, reads them into memory, and stores them internally.
Usage :
$self->__readInAndStoreAllData;
__columnAverage
This private method calculates and returns the average value for a column, using either the mean or the median, depending on which was requested.
Usage:
my $average = $self->__columnAverage($column, 'mean');
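As an illustration of the mean/median choice, here is a minimal stand-alone sketch. The columnAverage function, its argument list, and the treatment of undef as missing data are assumptions made for this example and are not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper, not the module's code: average the defined
# values of one column over the rows that are still valid.
sub columnAverage {
    my ($matrixRef, $validRowsRef, $column, $method) = @_;
    my @values = sort { $a <=> $b }
                 grep { defined $_ }
                 map  { $matrixRef->[$_][$column] } keys %{$validRowsRef};
    return unless @values;    # no data in this column
    if ($method eq 'mean') {
        my $sum = 0;
        $sum += $_ foreach @values;
        return $sum / @values;
    }
    # Otherwise treat the request as 'median': the middle value, or the mean
    # of the two middle values when the count is even.
    my $middle = int(@values / 2);
    return @values % 2
        ? $values[$middle]
        : ($values[$middle - 1] + $values[$middle]) / 2;
}

my @matrix    = ([1, 10], [3, undef], [8, 30], [100, 100]);
my %validRows = (0 => undef, 1 => undef, 2 => undef);   # row 3 was filtered out

my $mean   = columnAverage(\@matrix, \%validRows, 0, 'mean');
my $median = columnAverage(\@matrix, \%validRows, 0, 'median');
print "mean=$mean median=$median\n";
```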
__centerColumn
This private method centers the data for a single column, by subtracting the average from every valid value.
Usage :
$self->__centerColumn($column, $average);
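Centering a column can be sketched as follows. The centerColumn function and its explicit matrix and valid-row arguments are hypothetical; the real method works on the object's internal state.

```perl
use strict;
use warnings;

# Hypothetical sketch: subtract the average from every defined value in the
# column, touching only rows that are still valid.
sub centerColumn {
    my ($matrixRef, $validRowsRef, $column, $average) = @_;
    foreach my $row (keys %{$validRowsRef}) {
        next unless defined $matrixRef->[$row][$column];   # skip missing data
        $matrixRef->[$row][$column] -= $average;
    }
    return;
}

my @matrix    = ([1], [undef], [5], [9]);
my %validRows = (0 => undef, 1 => undef, 2 => undef);   # row 3 is invalid

centerColumn(\@matrix, \%validRows, 0, 3);
```

In this sketch the invalid row keeps its original value, and the missing value stays undefined.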
__filterRowsByCount
This private method filters out rows whose count for some particular property is below a threshold; rows with a count above or equal to the threshold are kept. It accepts a reference to a hash that maps row numbers to counts, and a threshold value. Note that not all rows are necessarily entered into the hash, so this method iterates over all rows, looks up the count for each valid one in the hash, and invalidates those whose count is too low.
Usage:
$self->__filterRowsByCount(\%count, $numColumns);
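The logic can be sketched as below. The filterRowsByCount function and the explicit %validRows hash are hypothetical stand-ins for the object's internal state, and treating a row that is absent from the count hash as having a count of zero is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical sketch: invalidate every valid row whose count is below the
# threshold. Rows missing from the count hash are assumed to have a count of 0.
sub filterRowsByCount {
    my ($countRef, $threshold, $validRowsRef) = @_;
    foreach my $row (keys %{$validRowsRef}) {
        my $count = exists $countRef->{$row} ? $countRef->{$row} : 0;
        delete $validRowsRef->{$row} if $count < $threshold;
    }
    return;
}

my %validRows = map { $_ => undef } 0 .. 3;
my %count     = (0 => 2, 2 => 1);    # rows 1 and 3 were never counted

filterRowsByCount(\%count, 2, \%validRows);
my @remaining = sort { $a <=> $b } keys %validRows;
print "remaining rows: @remaining\n";
```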
__validColumnsStdDevAndMeanHashRefs
This private method calculates the standard deviation and the mean for each valid column, and returns references to two hashes. Both are keyed by column index; one has the standard deviations as its values, the other has the column means.
mean = (sum of x) / n
std dev = square root (((n * sum of (x^2)) - (sum of x)^2) / (n * (n - 1)))
Usage:
my ($stddevHashRef, $meansHashRef) = $self->__validColumnsStdDevAndMeanHashRefs($lineEnding);
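The two formulas above translate directly into a single pass over the values of one column. The following is a sketch only: the stdDevAndMean function is hypothetical, and the guard against a slightly negative variance caused by floating-point rounding is an addition of this example.

```perl
use strict;
use warnings;

# Hypothetical sketch of the formulas above for the values of one column.
# Returns (standard deviation, mean), or an empty list for fewer than 2 values.
sub stdDevAndMean {
    my @values = grep { defined $_ } @_;
    my $n = @values;
    return unless $n > 1;
    my ($sum, $sumSquares) = (0, 0);
    foreach my $x (@values) {
        $sum        += $x;
        $sumSquares += $x * $x;
    }
    my $variance = (($n * $sumSquares) - $sum ** 2) / ($n * ($n - 1));
    $variance = 0 if $variance < 0;    # rounding guard (added in this sketch)
    return (sqrt($variance), $sum / $n);
}

my ($stddev, $mean) = stdDevAndMean(1, 2, 3);
print "stddev=$stddev mean=$mean\n";
```

For the values 1, 2, 3 the sums are n = 3, sum of x = 6 and sum of x^2 = 14, so the variance is (42 - 36) / 6 = 1.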
__dieIfNegativeDataExistInMatrix
This private method dies, with an appropriate error message, if any negative data value is found within the matrix.
Usage:
$self->__dieIfNegativeDataExistInMatrix;
private setter methods
__setMatrix
This private setter method receives a reference to an array of array references (which contains the matrix itself), and stores it as a private attribute of the object.
Usage :
$self->__setMatrix(\@matrix);
__setPercentiles
This private setter method receives a reference to a hash of hashes that stores the percentiles of the data. The first key in the hash is the row from which an element of data came, and the second is the column. The value is the percentile for that piece of data in the column in which it is found.
Usage :
$self->__setPercentiles(\%percentiles);
__pareDownPercentiles
This private method deletes entries in the percentiles hash that are no longer needed, purely to save memory.
Usage :
$self->__pareDownPercentiles;
__invalidateMatrixRow
This private mutator method makes a row invalid. The invalidation itself is done by the superclass, but here, to save memory, we also delete the data from the in-memory matrix. This method cannot be undone, because the invalidation also deletes the data for the row.
Usage :
$self->__invalidateMatrixRow($row);
private getter methods
__matrixArrayRef
This private method returns a reference to the 2-D array of data owned by the self object.
Usage :
my $matrixArrayRef = $self->__matrixArrayRef;
Protected data transformation/filtering methods
Note: These methods provide the backend nuts and bolts for a transformation or filtering. They should only be called by the immediate subclass, anySizeDataMatrix, and not directly by the concrete subclasses of anySizeDataMatrix. In addition, note that the companion bigDataMatrix must (and does) provide identical interfaces to these methods (obviously with different underlying implementations), such that anySizeDataMatrix can call the methods without regard to the size of the underlying matrix.
_centerColumns
This protected method centers each column of data, and returns the largest absolute value that was used in the centering. The caller of the method must specify whether to center by means or medians.
Usage :
$self->_centerColumns('mean', $lineEnding, $numColumnsToReport);
_centerRows
This protected method actually centers the row data, by calculating the average (mean or median, depending on what was requested) for each row, and then subtracting that value from each valid datapoint in the row.
Usage :
$self->_centerRows('median', $lineEnding, $numRowsToReport);
_filterRowsByPercentPresentData
This protected method invalidates rows that do not have greater than the requested percentage of present data.
Usage :
$self->_filterRowsByPercentPresentData($percent, $lineEnding, $numRowsToReport);
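A minimal sketch of this filter follows. The filterRowsByPercentPresent function is hypothetical, and the sketch assumes that a missing value is represented by undef or an empty string. It uses a strict "greater than" comparison, following the wording above.

```perl
use strict;
use warnings;

# Hypothetical sketch: invalidate rows whose percentage of present values,
# over the valid columns, is not greater than $percent.
sub filterRowsByPercentPresent {
    my ($matrixRef, $validRowsRef, $validColumnsRef, $percent) = @_;
    my @columns = keys %{$validColumnsRef};
    return unless @columns;
    foreach my $row (keys %{$validRowsRef}) {
        # grep in scalar context returns the number of matching columns
        my $present = grep {
            defined $matrixRef->[$row][$_] && $matrixRef->[$row][$_] ne ''
        } @columns;
        my $percentPresent = 100 * $present / @columns;
        delete $validRowsRef->{$row} unless $percentPresent > $percent;
    }
    return;
}

my @matrix = (
    [ 1, 2,     3,     4 ],    # 100% present
    [ 1, undef, undef, 4 ],    #  50% present
    [ undef, undef, undef, 1 ],    #  25% present
);
my %validRows    = map { $_ => undef } 0 .. 2;
my %validColumns = map { $_ => undef } 0 .. 3;

filterRowsByPercentPresent(\@matrix, \%validRows, \%validColumns, 50);
my @remaining = sort { $a <=> $b } keys %validRows;
print "remaining rows: @remaining\n";
```

The same approach applies to columns by swapping the roles of the two index hashes.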
_filterColumnsByPercentPresentData
This protected method invalidates columns that do not have greater than the requested percentage of present data.
Usage :
$self->_filterColumnsByPercentPresentData($percent, $lineEnding, $numColumnsToReport);
_filterRowsOnColumnPercentile
This protected method filters out rows based on their column percentile, when all data are known to be in memory, and optionally allows for the percentiles of each datapoint to be displayed in the output file.
Usage:
$self->_filterRowsOnColumnPercentile($lineEnding, $numColumnsToReport, $percentile, $numColumns, $showPercentile);
_filterRowsOnColumnDeviation
This protected method will filter out rows whose values do not deviate from the column mean by a specified number of standard deviations, in at least numColumns columns.
Usage:
$self->_filterRowsOnColumnDeviation($lineEnding, $numRowsToReport, $deviations, $numColumns);
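The filter can be sketched by counting, for each row, the columns in which its value deviates enough from the column mean. Everything named here is hypothetical: the function names, the use of ">=" for the deviation test, and the decision to skip columns with zero standard deviation are assumptions of this sketch, not documented behavior of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: mean and sample standard deviation of a list.
sub meanAndStdDev {
    my @values = @_;
    my $n = @values;
    my ($sum, $sumSquares) = (0, 0);
    foreach my $x (@values) { $sum += $x; $sumSquares += $x * $x }
    my $variance = ($n * $sumSquares - $sum ** 2) / ($n * ($n - 1));
    $variance = 0 if $variance < 0;    # rounding guard
    return ($sum / $n, sqrt($variance));
}

# Hypothetical sketch: return the rows that deviate from the column mean by
# at least $deviations standard deviations in at least $numColumns columns.
sub rowsPassingDeviationFilter {
    my ($matrixRef, $deviations, $numColumns) = @_;
    my $lastRow    = $#{$matrixRef};
    my $lastColumn = $#{ $matrixRef->[0] };
    my %count;
    foreach my $column (0 .. $lastColumn) {
        my @values = grep { defined $_ } map { $_->[$column] } @{$matrixRef};
        next if @values < 2;
        my ($mean, $stddev) = meanAndStdDev(@values);
        next if $stddev == 0;          # constant column: nothing deviates
        foreach my $row (0 .. $lastRow) {
            my $value = $matrixRef->[$row][$column];
            next unless defined $value;
            $count{$row}++ if abs($value - $mean) >= $deviations * $stddev;
        }
    }
    my @passing = grep { ($count{$_} || 0) >= $numColumns } 0 .. $lastRow;
    return @passing;
}

my @matrix = ([0, -5], [0, 5], [0, 5], [0, 5], [10, 5]);
my @passOne = rowsPassingDeviationFilter(\@matrix, 1.5, 1);
my @passTwo = rowsPassingDeviationFilter(\@matrix, 1.5, 2);
print "one column: @passOne; two columns: @passTwo\n";
```

In this data each column has a mean of 2 or 3 and a standard deviation of about 4.47, so only the outlying value in each column (rows 4 and 0) deviates by 1.5 standard deviations, and no row does so in two columns.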
_filterRowsOnValues
This protected method filters out rows whose values do not pass a specified criterion, in at least numColumns columns.
Usage :
$self->_filterRowsOnValues($value, $method, $lineEnding, $numRowsToReport, $numColumns);
_filterRowsOnVectorLength
This protected method filters rows based on whether the vector defined by their values has a length greater than the specified length.
Usage :
$self->_filterRowsOnVectorLength($requiredLength, $lineEnding, $numRowsToReport);
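Assuming that "length" means the Euclidean length of the row vector (the square root of the sum of the squared values), the test can be sketched as follows. The vectorLength function is hypothetical, and ignoring missing values is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical sketch: Euclidean length of a row, ignoring missing values.
sub vectorLength {
    my @values = grep { defined $_ } @_;
    my $sumSquares = 0;
    $sumSquares += $_ ** 2 foreach @values;
    return sqrt($sumSquares);
}

my $requiredLength = 4;
my @rows = ([3, 4], [1, undef, 2]);

# Keep the indexes of rows whose vector is longer than the required length.
my @kept = grep { vectorLength(@{ $rows[$_] }) > $requiredLength } 0 .. $#rows;
print "kept rows: @kept\n";
```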
_logTransformData
This protected method log-transforms the contents of the data matrix, using the specified base for the log transformation.
Usage:
$self->_logTransformData($logBase, $lineEnding, $numRowsToReport);
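Perl's built-in log function returns the natural logarithm, so a transformation to an arbitrary base divides by the natural log of the base. The sketch below is hypothetical: the logTransform function is not the module's code, and dying on zero or negative values is a choice made for this example only.

```perl
use strict;
use warnings;

# Hypothetical sketch: replace each defined value with its logarithm in the
# requested base, using log_b(x) = ln(x) / ln(b).
sub logTransform {
    my ($matrixRef, $logBase) = @_;
    my $denominator = log($logBase);
    foreach my $rowRef (@{$matrixRef}) {
        foreach my $value (@{$rowRef}) {
            next unless defined $value;    # leave missing data alone
            die "Cannot log transform non-positive value '$value'\n"
                if $value <= 0;
            # $value is an alias to the array element, so this updates the
            # matrix in place.
            $value = log($value) / $denominator;
        }
    }
    return;
}

my @matrix = ([1, 8], [undef, 4]);
logTransform(\@matrix, 2);
```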
_scaleColumnData
This protected method scales the data for particular columns as specified by the client, when all data are in memory.
Usage :
$self->_scaleColumnData($columnsToFactorsHashRef, $lineEnding, $numColumnsToReport);
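The hash passed to this method maps columns to scaling factors. The sketch below assumes that each value is multiplied by its column's factor; the text above does not say whether the module multiplies or divides, so treat that as an assumption. The scaleColumns function is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: multiply every defined value in the listed columns by
# that column's factor (multiplication is an assumption of this example).
sub scaleColumns {
    my ($matrixRef, $columnsToFactorsRef) = @_;
    foreach my $rowRef (@{$matrixRef}) {
        foreach my $column (keys %{$columnsToFactorsRef}) {
            next unless defined $rowRef->[$column];
            $rowRef->[$column] *= $columnsToFactorsRef->{$column};
        }
    }
    return;
}

my @matrix = ([1, 2, 3], [4, undef, 6]);
my %columnsToFactors = (0 => 2, 2 => 0.5);    # column 1 is left unscaled

scaleColumns(\@matrix, \%columnsToFactors);
```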
public methods
dumpData
This method dumps the current contents of the dataMatrix object to a file: either the file whose name was provided as the single argument or, if no argument is given, the file whose name was used to construct the object.
Usage:
$self->dumpData($file);
AUTHOR
Gavin Sherlock
sherlock@genome.stanford.edu