NAME
Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest
SYNOPSIS
use Algorithm::Classifier::IsolationForest;
my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);
# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
n_trees => 100,
sample_size => 256,
seed => 42,
);
$iforest->fit(\@data);
my $scores = $iforest->score_samples(\@data); # arrayref, each in (0,1]
my $flags = $iforest->predict(\@data, 0.6); # arrayref of 0/1
# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');
# Extended Isolation Forest (oblique hyperplane splits)
my $eif = Algorithm::Classifier::IsolationForest->new(mode => 'extended', seed => 42);
$eif->fit(\@data);
DESCRIPTION
Isolation Forest (Liu, Ting & Zhou, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data at random, and points that are isolated after only a few splits are likely anomalies. The score is derived from the average isolation depth across many trees, normalised so that values approach 1 for anomalies and stay well below 0.5 for normal points.
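The normalisation described above can be sketched with the score definition from the Liu, Ting & Zhou paper, s(x, n) = 2 ** (-E[h(x)] / c(n)). This is a minimal illustration, not this module's code: the helper names average_path_length and anomaly_score are hypothetical, and whether the module uses the same harmonic-number approximation is an assumption.

```perl
use strict;
use warnings;

# Euler-Mascheroni constant, used to approximate harmonic numbers.
use constant EULER_GAMMA => 0.5772156649;

# c(n): average path length of an unsuccessful binary search tree lookup
# over n points, as defined in the paper. Hypothetical helper name.
sub average_path_length {
    my ($n) = @_;
    return 0 if $n <= 1;
    return 1 if $n == 2;
    my $harmonic = log($n - 1) + EULER_GAMMA;    # approximates H(n-1)
    return 2 * $harmonic - 2 * ($n - 1) / $n;
}

# s(x, n) = 2 ** ( -mean_depth / c(n) ). Hypothetical helper name.
sub anomaly_score {
    my ($mean_depth, $sample_size) = @_;
    return 2**( -$mean_depth / average_path_length($sample_size) );
}

# A point isolated after 3 splits on average scores above 0.5.
printf "%.3f\n", anomaly_score(3, 256);
# A point whose mean depth equals c(n) scores exactly 0.5.
printf "%.3f\n", anomaly_score(average_path_length(256), 256);
```

A shallower mean depth pushes the score towards 1, and a mean depth well above c(n) pushes it towards 0.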
In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.
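The difference between the two split types can be sketched with the branching test from the Extended Isolation Forest paper: a point x goes left when (x - p) . n <= 0, where n is a random normal vector and p a random intercept point. The function name side_of_hyperplane is hypothetical and the code is only an illustration of the test, not this module's internals.

```perl
use strict;
use warnings;

# Decide which branch a point takes for one hyperplane split.
# Hypothetical helper: $point, $normal and $intercept are arrayrefs of
# equal length.
sub side_of_hyperplane {
    my ($point, $normal, $intercept) = @_;
    my $dot = 0;
    $dot += ( $point->[$_] - $intercept->[$_] ) * $normal->[$_]
        for 0 .. $#$point;
    return $dot <= 0 ? 'left' : 'right';
}

# Axis-parallel cut on feature 0 at x = 1: the normal has a single
# non-zero component, so feature 1 is ignored.
print side_of_hyperplane( [ 0.5, 9.0 ], [ 1, 0 ], [ 1, 0 ] ), "\n";    # left
# Oblique cut: both features contribute to the decision.
print side_of_hyperplane( [ 2.0, 2.0 ], [ 1, 1 ], [ 1, 1 ] ), "\n";    # right
```

Setting components of the normal vector to zero is how a lower extension level reduces an oblique split towards an axis-parallel one.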
psi, as referenced below, is ψ, the Greek letter used in the paper for the sub-sampling size, also known as max samples. See Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.
https://www.researchgate.net/publication/224384174_Isolation_Forest
GENERAL METHODS
new(%args)
Initializes the object.
- n_trees :: number of isolation trees in the ensemble
default :: 100
- sample_size :: sub-sample size used to build each tree... max samples
default :: 256
- max_depth :: per-tree height limit... if not defined is set to ceil(log2(psi))
default :: undef
- seed :: optional integer to seed srand with for reproducible trees...
see perldoc -f srand for more info. This number is processed via abs(int()).
default :: undef
- mode :: whether to use IF or EIF
axis :: classic axis-parallel splits (IF)
extended :: oblique hyperplane splits (EIF)
default :: axis
- extension_level :: extended mode only... how many features take part in each
split. 0 behaves like a single-feature (axis) cut; the
maximum (n_features - 1) uses every varying feature. undef
=> maximum. Clamped to [0, n_features - 1] at fit time.
- contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
fit() learns a score threshold that flags this fraction of
the training set, and predict() uses it by default. undef
=> no learned threshold (predict() falls back to 0.5).
default :: undef
Note: log2 under Perl is as below...
log($psi) / log(2)
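As a small illustration of the default height limit, the sketch below combines the log2 expression above with ceil from the core POSIX module. The function name default_max_depth is hypothetical, and the module's own rounding may differ.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# ceil(log2(psi)), using log($psi) / log(2) for log2.
# Hypothetical helper name, for illustration only.
sub default_max_depth {
    my ($psi) = @_;
    return ceil( log($psi) / log(2) );
}

# log2(300) is roughly 8.23, so the height limit rounds up to 9.
print default_max_depth(300), "\n";
```

For the default psi of 256 the mathematical value is exactly 8.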
decision_threshold
The score cutoff predict uses by default; undef unless contamination was set.
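One way such a cutoff could be derived from contamination is to take a quantile of the training scores. The sketch below is an assumption about the general approach, not this module's implementation: the helper threshold_for_contamination is hypothetical, the scores are made up, and the choice of >= for flagging is also an assumption.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: pick a cutoff so that roughly $contamination of
# the given scores are at or above it.
sub threshold_for_contamination {
    my ($scores, $contamination) = @_;
    my @sorted = sort { $b <=> $a } @$scores;             # highest first
    my $n_flag = floor( $contamination * scalar(@sorted) );
    $n_flag = 1 if $n_flag < 1;                           # flag at least one
    return $sorted[ $n_flag - 1 ];
}

# Made-up scores standing in for the training set scores.
my @scores = ( 0.41, 0.39, 0.45, 0.82, 0.40, 0.43, 0.77, 0.38, 0.42, 0.44 );
my $cutoff = threshold_for_contamination( \@scores, 0.2 );
my @flags  = map { $_ >= $cutoff ? 1 : 0 } @scores;
print "cutoff=$cutoff flagged=", scalar( grep { $_ } @flags ), "\n";
```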
fit
Trains the model on the specified data.
The data is an array of arrays, with each sub-array holding the feature values of one sample (two numbers per sample in this example).
my @training_data = (
[ 3, 5 ],
[ 2.3, 1 ],
[ 5, 9 ],
...
);
The example below builds a Gaussian cluster with a handful of outliers and uses it for training.
# so it is reproducible
srand(7);
# build a Gaussian cluster and add a handful of outliers...
my (@data, @truth);
use constant PI => 3.14159265358979;
sub gaussian {
my ($mu, $sigma) = @_;
my $u1 = rand() || 1e-12;
my $u2 = rand();
my $z = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
return $mu + $sigma * $z;
}
# add some normal items
for (1 .. 500) {
push @data, [ gaussian(0, 1), gaussian(0, 1) ];
push @truth, 0;
}
# add some outliers
for (1 .. 20) {
my $angle = rand() * 2 * PI;
my $radius = 5 + rand() * 3; # distance 5..8 from the origin
push @data, [ $radius * cos($angle), $radius * sin($angle) ];
push @truth, 1;
}
$iforest->fit(\@data);
path_lengths(\@data)
Returns the mean isolation depth per sample, for inspection.
my @lengths = $forest->path_lengths(\@data);
print "x, y, length\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$lengths[$int]."\n";
$int++;
}
predict(\@data, $threshold)
Returns an arrayref of 0/1 labels for the specified data.
If the threshold is not specified, the default is used: the threshold learned from contamination if it was set, otherwise 0.5.
my $results = $forest->predict(\@data, $threshold);
print "x, y, result\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$results->[$int]."\n";
$int++;
}
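When ground-truth labels are available, such as the @truth array built in the fit example, the 0/1 labels can be checked against them. The sketch below is a minimal illustration: the helper precision_recall is hypothetical and not part of this module, and the two arrays are stand-in values rather than real predict output.

```perl
use strict;
use warnings;

# Hypothetical helper: compare 0/1 predictions against 0/1 ground truth
# and return (precision, recall).
sub precision_recall {
    my ($predicted, $truth) = @_;
    my ( $tp, $fp, $fn ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#$truth ) {
        if    (  $predicted->[$i] &&  $truth->[$i] ) { $tp++ }
        elsif (  $predicted->[$i] && !$truth->[$i] ) { $fp++ }
        elsif ( !$predicted->[$i] &&  $truth->[$i] ) { $fn++ }
    }
    my $precision = ( $tp + $fp ) ? $tp / ( $tp + $fp ) : 0;
    my $recall    = ( $tp + $fn ) ? $tp / ( $tp + $fn ) : 0;
    return ( $precision, $recall );
}

# Stand-in values; in practice pass the arrayref returned by predict
# and a reference to the truth array.
my @labels = ( 0, 0, 1, 1, 0 );
my @flags  = ( 0, 1, 1, 0, 0 );
my ( $precision, $recall ) = precision_recall( \@flags, \@labels );
printf "precision=%.2f recall=%.2f\n", $precision, $recall;
```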
score_samples(\@data)
Returns an arrayref of anomaly scores, between 0 and 1.
Scores near 1 are strong anomalies (isolated quickly).
Scores well below 0.5 are normal.
Scores around 0.5 mean the point is not clearly distinguishable as either.
my $scores = $forest->score_samples(\@data);
print "x, y, score\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$scores->[$int]."\n";
$int++;
}
score_predict_samples
Returns an arrayref of arrays. The first value of each sub-array is the score and the second is 0/1 for whether it is an anomaly or not.
my $results = $forest->score_predict_samples(\@data, $threshold);
print "x, y, score, result\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$results->[$int][0].', '.$results->[$int][1]."\n";
$int++;
}
MODEL SAVE/LOAD METHODS
to_json
Returns a JSON representation of the model.
Requires fit to have been called first.
my $json = $iforest->to_json;
from_json($json)
Initializes the object from the model in the specified JSON string.
my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);
save($path)
Saves the model to the specified path.
$iforest->save($path);
load($path)
Initializes the object from the model in the specified file.
my $iforest = Algorithm::Classifier::IsolationForest->load($path);
REFERENCES
Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.
https://www.researchgate.net/publication/224384174_Isolation_Forest
https://ieeexplore.ieee.org/abstract/document/4781136
Hariri, Sahand & Carrasco Kind, Matias & Brunner, Robert J. (2020). Extended Isolation Forest. 1479 - 1489. 10.1109/TKDE.2019.2947676.
https://ieeexplore.ieee.org/document/8888179