NAME

Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest

SYNOPSIS

use Algorithm::Classifier::IsolationForest;

my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);

# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees     => 100,
    sample_size => 256,
    seed        => 42,
);
$iforest->fit(\@data);

my $scores = $iforest->score_samples(\@data);  # arrayref, each in (0,1]
my $flags  = $iforest->predict(\@data, 0.6);    # arrayref of 0/1

# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');

# Extended Isolation Forest (oblique hyperplane splits)
my $eif = Algorithm::Classifier::IsolationForest->new(mode => 'extended', seed => 42);
$eif->fit(\@data);

DESCRIPTION

Isolation Forest (Liu, Ting & Zhou, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data at random, and points that become isolated after only a few splits are likely anomalies. The score is the average isolation depth across many trees, normalised so that values approach 1 for anomalies and stay well below 0.5 for normal points.
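The normalisation described above can be sketched as follows. This is a minimal illustration of the score formula from the 2008 paper, s(x, psi) = 2 ** (-E[h(x)] / c(psi)), and not necessarily how this module computes it internally. The helper names c_factor and anomaly_score are hypothetical, and the harmonic number is approximated as ln(n) plus the Euler-Mascheroni constant.

```perl
use strict;
use warnings;

use constant EULER_GAMMA => 0.5772156649;

# c(n): average path length of an unsuccessful binary search tree lookup
# over n points; used to normalise isolation depths (Liu et al., 2008).
# Hypothetical helper, not part of the module's documented API.
sub c_factor {
    my ($n) = @_;
    return 0 if $n <= 1;
    return 1 if $n == 2;                        # exact value, H(1) = 1
    my $harmonic = log($n - 1) + EULER_GAMMA;   # approximation of H(n-1)
    return 2 * $harmonic - 2 * ($n - 1) / $n;
}

# s(x, psi) = 2 ** ( -E[h(x)] / c(psi) ); assumes psi >= 2.
sub anomaly_score {
    my ($mean_depth, $psi) = @_;
    return 2**( -$mean_depth / c_factor($psi) );
}

# A shallow mean depth scores higher than a deep one.
printf "depth  2: %.3f\n", anomaly_score( 2,  256 );
printf "depth 12: %.3f\n", anomaly_score( 12, 256 );
```

A mean depth equal to c(psi) maps to exactly 0.5, which is why scores around 0.5 carry little information either way.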

In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.
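A hyperplane split of this kind can be sketched as below. This is an illustration based on the branching rule in the Extended Isolation Forest paper, (x - p) . n <= 0, and is not taken from this module's source. The function names are hypothetical, and the mapping of extension_level to (extension_level + 1) non-zero components is an assumption drawn from the option description under new.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

use constant PI => 3.14159265358979;

# Standard normal draw via the Box-Muller transform.
sub gaussian {
    my $u1 = rand() || 1e-12;
    my $u2 = rand();
    return sqrt( -2 * log($u1) ) * cos( 2 * PI * $u2 );
}

# Build a random normal vector with ($extension_level + 1) non-zero
# components; the remaining features do not take part in the split.
# Hypothetical helper, assumed semantics.
sub random_normal_vector {
    my ( $n_features, $extension_level ) = @_;
    $extension_level = $n_features - 1 if $extension_level > $n_features - 1;
    $extension_level = 0               if $extension_level < 0;
    my @normal = (0) x $n_features;
    my @active = ( shuffle( 0 .. $n_features - 1 ) )[ 0 .. $extension_level ];
    $normal[$_] = gaussian() for @active;
    return \@normal;
}

# A point x goes to the left child when (x - p) . n <= 0,
# where p is the intercept point and n the normal vector.
sub goes_left {
    my ( $x, $normal, $intercept ) = @_;
    my $dot = 0;
    $dot += ( $x->[$_] - $intercept->[$_] ) * $normal->[$_] for 0 .. $#$x;
    return $dot <= 0 ? 1 : 0;
}

my $normal = random_normal_vector( 2, 1 );
print goes_left( [ 0.1, -0.2 ], $normal, [ 0, 0 ] ) ? "left\n" : "right\n";
```

With extension_level 0 only one component of the normal is non-zero, so the test reduces to comparing a single feature against the intercept, which is the axis-parallel case.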

psi, referenced below, is ψ (the "pitchfork" math symbol) used in the paper: Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

It is the sub-sample size, also called max samples.

https://www.researchgate.net/publication/224384174_Isolation_Forest

GENERAL METHODS

new(%args)

Initializes the object.

- n_trees :: number of isolation trees in the ensemble
    default :: 100

- sample_size :: sub-sample size used to build each tree... max samples
    default :: 256

- max_depth :: per-tree height limit... if not defined, it is set to ceil(log2(psi))
    default :: undef

- seed :: optional integer to seed srand with for reproducible trees...
        see perldoc -f srand for more info. This number is processed via abs(int()).
    default :: undef

- mode :: whether to use IF or EIF
    axis :: classic axis-parallel splits (IF)
    extended :: oblique hyperplane splits (EIF)
    default :: axis

- extension_level :: extended mode only... how many features take part in each
        split. 0 behaves like a single-feature (axis) cut; the
        maximum (n_features - 1) uses every varying feature. undef
        => maximum. Clamped to [0, n_features - 1] at fit time.

- contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
        fit() learns a score threshold that flags this fraction of
        the training set, and predict() uses it by default. undef
        => no learned threshold (predict() falls back to 0.5).
    default :: undef

Note: Perl has no built-in log2 function, so log2(psi) is computed as below...

log($psi) / log(2)
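Putting the two together, the default height limit ceil(log2(psi)) could be computed as in this small sketch. The function name default_max_depth is hypothetical and the module may do this differently. Because the ratio is floating point, a result that should be a whole number can in principle land a hair above it, so treat this as illustrative.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Default per-tree height limit: ceil(log2(psi)).
# Hypothetical helper name; assumes $psi >= 1.
sub default_max_depth {
    my ($psi) = @_;
    return ceil( log($psi) / log(2) );
}

print "psi = $_ -> max_depth = ", default_max_depth($_), "\n" for 100, 256, 1000;
```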

decision_threshold

The score cutoff predict uses by default; undef unless contamination was set.
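One way such a cutoff could be derived from contamination is sketched below: sort the training scores and take the score of the last sample that falls inside the expected anomalous fraction. This is an assumption about the general technique, not the module's confirmed implementation. The function name threshold_for_contamination is hypothetical, and it assumes a sample is flagged when its score is greater than or equal to the cutoff.

```perl
use strict;
use warnings;

# Pick a cutoff so that roughly $contamination of the scores are >= it.
# Hypothetical helper, not part of the module's documented API.
sub threshold_for_contamination {
    my ( $scores, $contamination ) = @_;
    my @sorted = sort { $b <=> $a } @$scores;          # highest score first
    my $n_flag = int( $contamination * @sorted );      # how many to flag
    $n_flag = 1 if $n_flag < 1;                        # always flag at least one
    return $sorted[ $n_flag - 1 ];
}

my @scores    = ( 0.41, 0.72, 0.38, 0.44, 0.81, 0.40, 0.39, 0.43 );
my $threshold = threshold_for_contamination( \@scores, 0.25 );
my @flags     = map { $_ >= $threshold ? 1 : 0 } @scores;
print "threshold: $threshold\n";
print "flags: @flags\n";
```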

fit

Trains the model on the specified data.

The data is passed as a reference to an array of arrays, with each sub array holding the numbers for one sample (two numbers in the examples here).

@training_data = (
    [ 3, 5 ],
    [ 2.3, 1 ],
    [ 5, 9 ],
    ...
);

The example below builds a Gaussian cluster with a handful of outliers and uses it for training.

# so it is reproducible
srand(7);

# build a Gaussian cluster and add a handful of outliers...
my (@data, @truth);

use constant PI => 3.14159265358979;

# Box-Muller transform: one normally distributed draw
sub gaussian {
    my ($mu, $sigma) = @_;
    my $u1 = rand() || 1e-12;                # avoid log(0)
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
    return $mu + $sigma * $z;
}

# add some normal items
for (1 .. 500) {
    push @data,  [ gaussian(0, 1), gaussian(0, 1) ];
    push @truth, 0;
}
# add some outliers
for (1 .. 20) {
    my $angle  = rand() * 2 * PI;
    my $radius = 5 + rand() * 3;             # distance 5..8 from the origin
    push @data,  [ $radius * cos($angle), $radius * sin($angle) ];
    push @truth, 1;
}

$iforest->fit(\@data);
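The @truth labels built alongside the data make it possible to check how well the flags line up with the known outliers. A minimal sketch of that comparison follows. The helper confusion_counts is hypothetical and not part of the module, and the flags used here are made up for illustration.

```perl
use strict;
use warnings;

# Count true/false positives and negatives from two parallel
# arrayrefs of 0/1 values. Hypothetical helper for illustration.
sub confusion_counts {
    my ( $truth, $flags ) = @_;
    my %count = ( tp => 0, fp => 0, fn => 0, tn => 0 );
    for my $i ( 0 .. $#$truth ) {
        my $key = $truth->[$i]
            ? ( $flags->[$i] ? 'tp' : 'fn' )
            : ( $flags->[$i] ? 'fp' : 'tn' );
        $count{$key}++;
    }
    return \%count;
}

# Illustrative labels and flags only; not output from a real model.
my @example_truth = ( 0, 0, 1, 1, 0 );
my @example_flags = ( 0, 1, 1, 0, 0 );
my $counts = confusion_counts( \@example_truth, \@example_flags );
print join( ', ', map { "$_=$counts->{$_}" } qw(tp fp fn tn) ), "\n";
```

With a fitted model, the second argument would be the arrayref returned by predict for the same data.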

path_lengths(\@data)

Returns the mean isolation depth per sample, for inspection.

my @lengths = $forest->path_lengths(\@data);

print "x, y, length\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$lengths[$int]."\n";

    $int++;
}

predict(\@data, $threshold)

Returns an arrayref of 0/1 labels for the specified data.

If threshold is not specified, the default is used: decision_threshold if contamination was set, otherwise 0.5.

my $results = $forest->predict(\@data, $threshold);

print "x, y, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int]."\n";

    $int++;
}

score_samples(\@data)

Returns an arrayref of anomaly scores, between 0 and 1.

Scores near 1 are strong anomalies (isolated quickly).

Scores well below 0.5 are normal.

Scores around 0.5 mean the points are hard to tell apart.

my $scores = $forest->score_samples(\@data);

print "x, y, score\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$scores->[$int]."\n";

    $int++;
}

score_predict_samples

Returns an arrayref of arrays. The first value of each sub array is the score and the second is 0/1 for whether it is an anomaly or not.

my $results = $forest->score_predict_samples(\@data, $threshold);

print "x, y, score, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int][0].', '.$results->[$int][1]."\n";

    $int++;
}

MODEL SAVE/LOAD METHODS

to_json

Returns a JSON representation of the model.

Requires fit to have been called first.

my $json = $iforest->to_json;

from_json($json)

Initializes the object from the model in the specified JSON string.

my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);

save($path)

Saves the model to the specified path.

$iforest->save($path);

load($path)

Initializes the object from the model in the specified file.

my $iforest = Algorithm::Classifier::IsolationForest->load($path);

REFERENCES

Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

https://www.researchgate.net/publication/224384174_Isolation_Forest

https://ieeexplore.ieee.org/abstract/document/4781136

Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner (2020). Extended Isolation Forest. 1479 - 1489. 10.1109/TKDE.2019.2947676

https://ieeexplore.ieee.org/document/8888179
