NAME
Algorithm::AM::DataSet - Manage data used by Algorithm::AM
VERSION
version 3.13
SYNOPSIS
use Algorithm::AM::DataSet 'dataset_from_file';
use Algorithm::AM::DataSet::Item 'new_item';
my $dataset = Algorithm::AM::DataSet->new(cardinality => 10);
# or
$dataset = dataset_from_file(path => 'finnverb', format => 'nocommas');
$dataset->add_item(
    new_item(features => [qw(a b c d e f g h i j)]));
my $item = $dataset->get_item(2);
DESCRIPTION
This package contains a list of items that can be used by Algorithm::AM or Algorithm::AM::Batch for classification. DataSets can be made one item at a time via the "add_item" method, or they can be read from files via the "dataset_from_file" function.
new
Creates a new DataSet object. You must provide a cardinality argument giving the number of features in each item's feature vector. You can then add items via the add_item method. Each item contains a feature vector and, optionally, a class label and a comment (also called a "spec").
cardinality
Returns the number of features contained in the feature vector of a single item.
size
Returns the number of items in the data set.
classes
Returns the list of all unique class labels in the data set.
add_item
Adds a new item to the data set. The input may be either an Algorithm::AM::DataSet::Item object, or the arguments to create one via its constructor (features, class, comment). This method will croak if the cardinality of the item does not match the data set's "cardinality".
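As a minimal sketch of the two calling styles described above, the following adds one pre-built item and one item given as constructor arguments, then traps the error raised for a mismatched feature count. The feature values and class labels are invented for illustration. The wording of the error message is not specified in this document, so the example only checks that an error occurred.

```perl
use strict;
use warnings;
use Algorithm::AM::DataSet;
use Algorithm::AM::DataSet::Item 'new_item';

my $dataset = Algorithm::AM::DataSet->new(cardinality => 3);

# Style 1: pass an existing Item object.
$dataset->add_item(
    new_item(features => [qw(a b c)], class => 'x', comment => 'first'));

# Style 2: pass the Item constructor arguments directly.
$dataset->add_item(features => [qw(d e f)], class => 'y');

# An item with the wrong number of features makes add_item croak;
# eval traps the exception so the script can continue.
my $ok = eval {
    $dataset->add_item(features => [qw(g h)], class => 'x');
    1;
};
print "rejected: $@" unless $ok;

print 'items stored: ', $dataset->size, "\n";
```

The rejected item is not stored, so only the two valid items are counted.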
get_item
Returns the item at the given index. This will be an Algorithm::AM::DataSet::Item object.
num_classes
Returns the number of different classification labels contained in the data set.
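The accessors above can be tried together on a small hand-built data set. This is only a sketch: the data are invented, get_item is assumed to use 0-based indices, and the class and features accessors belong to Algorithm::AM::DataSet::Item rather than to this package (features is assumed to return an array reference).

```perl
use strict;
use warnings;
use Algorithm::AM::DataSet;

my $dataset = Algorithm::AM::DataSet->new(cardinality => 3);
$dataset->add_item(features => [qw(a b c)], class => 'x');
$dataset->add_item(features => [qw(a b d)], class => 'y');
$dataset->add_item(features => [qw(a c d)], class => 'x');

printf "cardinality: %d\n", $dataset->cardinality;
printf "size:        %d\n", $dataset->size;
printf "num_classes: %d\n", $dataset->num_classes;

# Sort the labels so the output does not depend on their internal order.
my @classes = sort $dataset->classes;
print "classes:     @classes\n";

# Assumption: indices are 0-based, so 1 refers to the second item added.
my $item = $dataset->get_item(1);
print 'class of item 1:    ', $item->class, "\n";
print 'features of item 1: ', join(' ', @{ $item->features }), "\n";
```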
dataset_from_file
This function may be exported. Given 'path' and 'format' arguments, it reads a file containing a dataset and returns a new DataSet object with the given data. The 'path' argument should be the path to the file. The 'format' argument should be 'commas' or 'nocommas', indicating one of the following formats. You may also specify 'unknown' and 'null' arguments to indicate the strings that represent an unknown class value and a null feature value. By default these are 'UNK' and '=', respectively.
The 'commas' file format is shown below:
class , f eat u re s , your comment here
The commas separate the class label, feature values, and comment, and the whitespace around the commas is optional. Feature values are separated from one another by whitespace, so here the features are f, eat, u, re, and s.
The 'nocommas' file format is shown below:
class features your comment here
Here the class, feature values, and comments are separated by whitespace. Each feature value must be a single character with no separating characters, so here the features are f, e, a, t, u, r, e, and s.
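To make the difference between the two formats concrete, here is a small stand-alone sketch that splits a single line under either rule. The parse_line helper is hypothetical and is not part of Algorithm::AM; it ignores the 'unknown' and 'null' handling and is meant only to illustrate the field and feature separators described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's parser): split one data line
# into class, features, and comment according to the named format.
sub parse_line {
    my ($line, $format) = @_;
    $line =~ s/\s+\z//;    # drop trailing whitespace and newline

    my ($field_sep, $feature_sep);
    if ($format eq 'commas') {
        $field_sep   = qr/\s*,\s*/;    # commas, optional whitespace
        $feature_sep = qr/\s+/;        # features separated by whitespace
    }
    elsif ($format eq 'nocommas') {
        $field_sep   = qr/\s+/;        # fields separated by whitespace
        $feature_sep = qr//;           # one character per feature
    }
    else {
        die "unknown format: $format\n";
    }

    # A limit of 3 keeps commas or spaces inside the comment intact.
    my ($class, $feats, $comment) = split $field_sep, $line, 3;
    die "line needs a class and features: $line\n" unless defined $feats;

    return {
        class    => $class,
        features => [ split $feature_sep, $feats ],
        comment  => $comment,
    };
}

my $commas   = parse_line('class , f eat u re s , your comment here', 'commas');
my $nocommas = parse_line('class features your comment here', 'nocommas');

print join('|', @{ $commas->{features} }),   "\n";    # f|eat|u|re|s
print join('|', @{ $nocommas->{features} }), "\n";    # f|e|a|t|u|r|e|s
```

The same input text therefore yields five features under 'commas' and eight under 'nocommas', which is why the format must be chosen to match the file.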
Lines beginning with a pound character (#) are ignored.
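The following sketch puts these pieces together by writing a small 'commas' file to a temporary location and loading it. The file contents and the choice of '?' and '-' as the 'unknown' and 'null' strings are invented for this example. It assumes that an item whose class is the 'unknown' string does not contribute to the class count.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Algorithm::AM::DataSet 'dataset_from_file';

# Write a small 'commas' file; the contents are made up for illustration.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} <<'END';
# this line is skipped
x , a b c , first item
y , a - d , second item
? , a c d , class not known
END
close $fh or die "close failed: $!";

my $dataset = dataset_from_file(
    path    => $path,
    format  => 'commas',
    unknown => '?',    # string marking an unknown class
    null    => '-',    # string marking a null feature value
);

printf "%d items, %d features each, %d known classes\n",
    $dataset->size, $dataset->cardinality, $dataset->num_classes;
```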
SEE ALSO
For information on creating data sets, see the appendices in the "red book", Analogical Modeling: An exemplar-based approach to language. See also the "green book", Analogical Modeling of Language, for an explanation of the method in general, and the "blue book", Analogy and Structure, for its mathematical basis.
AUTHOR
Theron Stanford <shixilun@yahoo.com>, Nathan Glenn <garfieldnate@gmail.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2021 by Royal Skousen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.