NAME

Data::FeatureFactory - evaluate features normally or numerically

SYNOPSIS

 # in the module that defines features
 package MyFeatures;
 use base qw(Data::FeatureFactory);
 
 our @features = (
    { name => 'no_of_letters', type => 'int', range => '0 .. 5' },
    { name => 'first_letter',  type => 'cat', 'values' => ['a' .. 'z'] },
 );
 
 sub no_of_letters {
    my ($word) = @_;
    return length $word
 }
 
 sub first_letter {
    my ($word) = @_;
    return substr $word, 0, 1
 }

 # in the main script
 package main;
 use MyFeatures;
 my $f = MyFeatures->new;
 
 # evaluate all the features on all your data and format them numerically
 for my $record (@data) {
     my @values = $f->evaluate('ALL', 'numeric', $record);
     print join(' ', @values), "\n";
 }
 
 # specify the features to evaluate and gather the result in binary form
 my @vector = $f->evaluate([qw(no_of_letters first_letter)], 'binary', 'foo');

DESCRIPTION

Data::FeatureFactory automates evaluation of features of data samples and optionally encodes them as numbers or as binary vectors.

Defining features

The features are defined as subroutines in a package inheriting from Data::FeatureFactory. A subroutine is declared to be a feature by being mentioned in the package array @features. Options for the features are also specified in this array. Its minimum structure is as follows:

 @features = (
    { name => "name of feature 1" },
    { name => "name of feature 2" },
    ...
 )

The elements of the array must be hashrefs and each of them must have a name field. Other fields can specify options for the features. These are:

type

Specifies whether the feature is categorial, numeric, integer or boolean. Only the first three characters are considered, case-insensitively, so you can just as well write cat, Num, integral or Boo!. The default type is categorial.

Integer and numeric features have their values forced to numbers. Boolean features have their values converted to 1 or 0 according to Perl's notion of true and false. With warnings enabled, you get a warning whenever a numeric feature returns a non-numeric string.
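
The three-character rule can be pictured with a small standalone sketch. The helper normalize_type and the %canonical table are hypothetical and not part of Data::FeatureFactory; they only mirror the behavior described above, and the module's own handling of unknown prefixes may differ.

```perl
use strict;
use warnings;

# Hypothetical lookup table: three-letter prefix => canonical type name.
my %canonical = (
    cat => 'categorial',
    num => 'numeric',
    int => 'integer',
    boo => 'boolean',
);

# Hypothetical helper, not part of the module: looks only at the first
# three characters of the type string, ignoring case.
sub normalize_type {
    my ($type) = @_;
    return 'categorial' unless defined $type;    # documented default type
    my $prefix = lc substr $type, 0, 3;
    return $canonical{$prefix};                  # undef for an unknown prefix
}

print normalize_type($_), "\n" for qw(cat Num integral Boo!);
```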

values

Lists the acceptable values for the feature to return. If the subroutine returns a value that is not listed, the whole feature vector is discarded, unless a default value is specified. The values can be given as an arrayref or as a hashref. With an arrayref, the order of the values is preserved and honored wherever it matters, as in conversion to numeric format. With a hashref, the values are ordered pseudo-randomly, but loading is faster and so is conversion to numeric or binary format. In a hashref, the keys are the values of the feature and the hash values should all be 1.

default

Specifies a default value to be substituted when the feature returns something not listed in values.
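
As an illustration of the two ways to list values, here is a sketch of a @features array. The feature names colour and suit and their values are made up for this example; only the option names come from the text above.

```perl
use strict;
use warnings;

our @features = (
    # Arrayref: the order is kept, so 'red' always comes first.
    # Anything outside the list is replaced by the default.
    {   name    => 'colour',
        type    => 'cat',
        values  => [qw(red green blue)],
        default => 'red',
    },
    # Hashref: keys are the acceptable values, hash values are all 1.
    # No fixed order, but faster to load and to look up.
    {   name   => 'suit',
        type   => 'cat',
        values => { map { $_ => 1 } qw(hearts spades clubs diamonds) },
    },
);

print scalar(@features), " features defined\n";
```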

values_file

The values can be listed either directly or in a file; this option specifies the file's name. It must not appear in combination with the values option. Each value shall be on one line, with no headers, no intervening whitespace, no comments and no empty lines.

range

For integer and numeric features, an allowed range can be specified instead of the values. This option cannot appear together with the values or values_file option. The behavior is the same as with the values option. The interval is closed, so returning a limit value is OK. The range shall be specified by two numeric expressions separated by two or more dots with optional surrounding whitespace, for example 2..5 or -0.5 ...... +1.000_005. The expressions around the dots are not checked for validity, but if you use warnings, you should get one when you supply something nonsensical.

You can also specify a range for numeric (non-integer) features. The return value is checked against it but, unlike with integer features, no list of acceptable values is generated. A range alone is therefore not enough for a numeric feature if you want it converted to binary (though converting floating-point values to binary vectors seems rather quirky in itself).
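
The range syntax and the closed-interval check can be sketched as follows. The helpers parse_range and in_range are hypothetical and written only for this illustration; how the module itself parses and evaluates the two expressions is not stated here and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: splits "LOW .. HIGH" on two or more dots with
# optional surrounding whitespace and numifies both ends.
sub parse_range {
    my ($spec) = @_;
    my ($low, $high) = $spec =~ /^\s*(.+?)\s*\.{2,}\s*(.+?)\s*$/
        or die "Malformed range '$spec'";
    return ($low + 0, $high + 0);    # warns on non-numeric input
}

# Hypothetical helper: closed interval, so both limits are acceptable.
sub in_range {
    my ($value, $spec) = @_;
    my ($low, $high) = parse_range($spec);
    return ($low <= $value && $value <= $high) ? 1 : 0;
}

print in_range(5, '0 .. 5'), "\n";              # the limit itself
print in_range(2, '-0.5 ...... +1.5'), "\n";    # outside the interval
```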

postproc

This option defines a subroutine to be used as a filter for the feature's return value. It comes in handy when, for example, a feature returns UTF-8 encoded text that you need to appear ASCII-encoded, while the acceptable values have to be specified in UTF-8. As this use case suggests, the postprocessing takes place after the value is checked against the list of acceptable values. The value of this option shall be either a coderef or the name of the postprocessing function. If the function is not available in the current namespace, Data::FeatureFactory will attempt to find it.

The postprocessing only takes place when the feature is evaluated normally - that is, when its output is not being transformed to numeric or binary format.

code

Normally, the features are defined as subroutines in the package that inherits from Data::FeatureFactory. However, the definition can also be provided as a coderef in this option or in the %features hash of the package. The priority is: 1) the code option, 2) the %features hash, and 3) the package subroutine.
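
The three places where a definition can live, and the order in which they are consulted, can be sketched like this. The resolve_code function is a hypothetical stand-in written for this illustration, not the module's actual lookup, and the package below deliberately does not inherit from Data::FeatureFactory so that the sketch runs on its own.

```perl
use strict;
use warnings;

package MyFeatures;

our @features = (
    { name => 'no_of_letters', type => 'int', code => sub { length $_[0] } },
    { name => 'last_letter' },     # defined in %features below
    { name => 'first_letter' },    # defined as a package subroutine
);
our %features = ( last_letter => sub { substr $_[0], -1 } );
sub first_letter { return substr $_[0], 0, 1 }

package main;

# Hypothetical lookup mirroring the documented priority:
# 1) the code option, 2) the %features hash, 3) the package subroutine.
sub resolve_code {
    my ($package, $feature) = @_;
    return $feature->{code} if ref($feature->{code}) eq 'CODE';
    no strict 'refs';
    my $hash = \%{"${package}::features"};
    return $hash->{ $feature->{name} } if exists $hash->{ $feature->{name} };
    return $package->can( $feature->{name} );
}

for my $feature (@MyFeatures::features) {
    my $code = resolve_code('MyFeatures', $feature);
    print "$feature->{name}: ", $code->('foo'), "\n";
}
```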

format

Features can be output in different ways - see below. The format in which the features are evaluated is normally specified for all features in the call to evaluate. You can override it for specific features with this option.

You'll mostly use this to prevent the target (to-predict) feature from being numified or binarified: { name => 'target', format => 'normal' }.

Note that both the feature and the optional postprocessing routine are evaluated in scalar context.

Creating the features object

Data::FeatureFactory has two methods: new and evaluate. new creates an object that can then be used to evaluate features. Please do *not* override the new method. If you do, then be sure that it calls Data::FeatureFactory::new properly. This method accepts an optional argument - a hashref with options. Currently, only the 'N/A' option is supported. See below for details.

Evaluating features

The evaluate method of Data::FeatureFactory takes these arguments: 1) names of the features to evaluate, 2) the format in which they should be output and 3) arguments for the features themselves.

The first argument can be an arrayref with the names of the features, or it can be the "ALL" string, which denotes that all features defined shall be evaluated. If it contains any other string, then it's interpreted as the name of the only feature to evaluate.
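
The three accepted forms of the first argument can be pictured with a short sketch. The helper requested_features is hypothetical and stands in for whatever the module does internally; the feature names are those from the SYNOPSIS.

```perl
use strict;
use warnings;

# Feature names as declared in the SYNOPSIS.
my @defined_features = qw(no_of_letters first_letter);

# Hypothetical helper, not part of the module: expands the first argument
# of evaluate into a list of feature names.
sub requested_features {
    my ($spec) = @_;
    return @$spec            if ref($spec) eq 'ARRAY';    # explicit list of names
    return @defined_features if $spec eq 'ALL';           # every defined feature
    return ($spec);                                       # a single feature name
}

print join(',', requested_features('ALL')), "\n";
print join(',', requested_features('first_letter')), "\n";
```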

The second argument is normal, numeric or binary. normal means that the features' return values are left alone (but postprocessed if that option is set). numeric and binary mean that the return values are converted into numbers or binary vectors, the kind of input that support vector machines or neural networks expect.

The return value is the list of what the features returned. In case of binary, there can be a different (typically greater) number of elements in the returned list than there were features to evaluate.

Transfer to numeric / binary form

When the features are output in numeric format, integer and numeric features are left alone and categorial ones have a natural number (starting with 1) assigned to every distinct value. If you use this format, it is highly recommended to specify the values for the feature. If you don't, Data::FeatureFactory will attempt to create a mapping from the categories to numbers dynamically as the feature is evaluated. The mapping is saved to a file named .FeatureFactory.package_name__feature_name, located in the directory where Data::FeatureFactory resides if possible, otherwise in your home directory or in /tmp, wherever the script can write. If none of these works, you get a fatal error. The mapping is restored and extended on subsequent runs with the same package and feature name, provided the read/write permissions don't change.
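
Both ways of numbering categories can be sketched in a few lines. The hash names and the helper number_for are hypothetical, and the sketch leaves out saving the mapping to a file; it only shows numbering from 1 in list order and extending a mapping as new categories appear.

```perl
use strict;
use warnings;

# With listed values: numbers follow the order of the list, starting with 1.
my @values = ('a' .. 'z');
my %number_of;
@number_of{@values} = (1 .. scalar @values);

# Without listed values: a hypothetical helper that extends the mapping
# whenever an unseen category turns up.
my %dynamic;
sub number_for {
    my ($category) = @_;
    unless (exists $dynamic{$category}) {
        my $next = scalar(keys %dynamic) + 1;
        $dynamic{$category} = $next;
    }
    return $dynamic{$category};
}

print "$number_of{a} $number_of{c}\n";
print join(' ', map { number_for($_) } qw(noun verb noun adverb)), "\n";
```

Note that the dynamic numbers depend on the order in which categories are first seen, which is one reason why listing the values explicitly is the safer choice.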

In binary format, the return value is converted to a vector of 0's with a single 1. The positions in the vector represent the possible values of the feature, and the 1 is at the position of the value the feature actually has in that particular case. The values always need to be specified for this format to work, and it is highly recommended that they be specified with a fixed order (not by a hash), because otherwise the order can change with different versions of perl and whenever you change the set of values for the feature. When the order changes, the meaning of the vectors changes too.
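
A minimal sketch of such a vector, assuming a fixed, ordered list of values. The helper to_binary is hypothetical and only illustrates the encoding described above.

```perl
use strict;
use warnings;

# Hypothetical helper: one position per listed value, holding 1 at the
# position of the returned value and 0 everywhere else.
sub to_binary {
    my ($value, @values) = @_;
    return map { $_ eq $value ? 1 : 0 } @values;
}

my @vector = to_binary('c', 'a' .. 'e');
print "@vector\n";
```

Because each feature contributes one element per possible value, evaluating several features in binary format yields more elements than there are features.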

N/A values

You can specify a value to be substituted when a feature returns nothing (an undefined value). This is passed as an argument to the new method.

 $f = MyFeatures->new({ 'N/A' => '_' }); # MyFeatures inherits from Data::FeatureFactory
 $v = $f->evaluate('feature1', 'normal', 'unexpected_argument');

If feature1 returns an undefined value, then $v will contain the string '_'. When evaluating in binary format, a vector of the usual length is returned, with all values being the specified N/A. That is, if feature1 has 3 possible values, then

 @v = $f->evaluate('feature1', 'binary', 'unexpected_argument');

results in @v being ('_', '_', '_'), provided that feature1 returns undef.

COPYRIGHT

Copyright (c) 2008 Oldrich Kruza. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.