The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

AI::NaiveBayes1 - Naive Bayes Classification

SYNOPSIS

  use AI::NaiveBayes1;
  my $nb = AI::NaiveBayes1->new;
  $nb->add_table(
  "Html  Caps  Free  Spam  count
  -------------------------------
     Y     Y     Y     Y    42   
     Y     Y     Y     N    32   
     Y     Y     N     Y    17   
     Y     Y     N     N     7   
     Y     N     Y     Y    32   
     Y     N     Y     N    12   
     Y     N     N     Y    20   
     Y     N     N     N    16   
     N     Y     Y     Y    38   
     N     Y     Y     N    18   
     N     Y     N     Y    16   
     N     Y     N     N    16   
     N     N     Y     Y     2   
     N     N     Y     N     9   
     N     N     N     Y    11   
     N     N     N     N    91   
  -------------------------------
  ");
  $nb->train;
  print "Model:\n" . $nb->print_model;
  print "Model (with counts):\n" . $nb->print_model('with counts');

  $nb = AI::NaiveBayes1->new;
  $nb->add_instances(attributes=>{model=>'H',place=>'B'},
                     label=>'repairs=Y',cases=>30);
  $nb->add_instances(attributes=>{model=>'H',place=>'B'},
                     label=>'repairs=N',cases=>10);
  $nb->add_instances(attributes=>{model=>'H',place=>'N'},
                     label=>'repairs=Y',cases=>18);
  $nb->add_instances(attributes=>{model=>'H',place=>'N'},
                     label=>'repairs=N',cases=>16);
  $nb->add_instances(attributes=>{model=>'T',place=>'B'},
                     label=>'repairs=Y',cases=>22);
  $nb->add_instances(attributes=>{model=>'T',place=>'B'},
                     label=>'repairs=N',cases=>14);
  $nb->add_instances(attributes=>{model=>'T',place=>'N'},
                     label=>'repairs=Y',cases=> 6);
  $nb->add_instances(attributes=>{model=>'T',place=>'N'},
                     label=>'repairs=N',cases=>84);

  $nb->train;

  print "Model:\n" . $nb->print_model;
  
  # Find results for unseen instances
  my $result = $nb->predict
     (attributes => {model=>'T', place=>'N'});

  foreach my $k (keys(%{ $result })) {
      print "for label $k P = " . $result->{$k} . "\n";
  }

  # export the model into a string
  my $string = $nb->export_to_YAML();

  # create the same model from the string
  my $nb1 = AI::NaiveBayes1->import_from_YAML($string);

  # write the model to a file (shorter than model->string->file)
  $nb->export_to_YAML_file('t/tmp1');

  # read the model from a file (shorter than file->string->model)
  my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');

See Examples for more examples.

DESCRIPTION

This module implements the classic "Naive Bayes" machine learning algorithm.

Data Structure

An object contains the following fields:

{attributes}

List of attribute names.

{attribute_type}{$a}

Attribute types - 'real', or not (e.g., 'nominal')

{labels}

List of labels.

{attvals}{$a}

List of attribute values

{real_stat}{$a}{$v}{$l}{sum}

Statistics for real valued attributes; besides 'sum' also: count, mean, stddev

{numof_instances}

Number of training instances.

{stat_labels}{$l}

Label count in training data.

{stat_attributes}{$a}

Statistics for an attribute: ...{$value}{$label} = count of instances.

{smoothing}{$attribute}

Attribute smoothing. No smoothing if does not exist. Implemented smoothing:

      - /^unseen count=/ followed by number, e.g., 0.5

Attribute Smoothing

For an attribute A one can specify:

    $nb->{smoothing}{A} = 'unseen count=0.5';

to provide a count for unseen data. The count is taken into consideration in training and prediction, when any unseen attribute values are observed. Zero probabilities can be prevented in this way. A count other than 0.5 can be provided, but if it is <=0 it will be set to 0.5. The method is similar to add-one smoothing. A special attribute value '*' is used for all unseen data.

METHODS

Constructor Methods

new()

Constructor. Creates a new AI::NaiveBayes1 object and returns it.

import_from_YAML($string)

Constructor. Creates a new AI::NaiveBayes1 object from a string where it is represented in YAML. Requires YAML module.

import_from_YAML_file($file_name)

Constructor. Creates a new AI::NaiveBayes1 object from a file where it is represented in YAML. Requires YAML module.

Non-Constructor Methods

add_table()

Add instances from a table. The first row are attributes, followed by values. If the name of the last attribute is `count', it is interpreted as a repetition count and used appropriatelly. The last attribute (after optionally removing `count') is the class attribute. The attributes and values are separated by white space.

add_csv_file($filename)

Add instances from a CSV file. Primitive format implementation (e.g., no commas allowed in attribute names or values).

drop_attributes(@attributes)

Delete attributes after adding instances.

set_real(list_of_attributes)

Delares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.

add_instance(attributes=>HASH,label=>STRING|ARRAY)

Adds a training instance to the categorizer.

add_instances(attributes=>HASH,label=>STRING|ARRAY,cases=>NUMBER)

Adds a number of identical instances to the categorizer.

export_to_YAML()

Returns a YAML string representation of an AI::NaiveBayes1 object. Requires YAML module.

export_to_YAML_file( $file_name )

Writes a YAML string representation of an AI::NaiveBayes1 object to a file. Requires YAML module.

Returns a string, human-friendly representation of the model. The model is supposed to be trained before calling this method. One argument 'with counts' can be supplied, in which case explanatory expressions with counts are printed as well.

train()

Calculates the probabilities that will be necessary for categorization using the predict() method.

predict( attributes => HASH )

Use this method to predict the label of an unknown instance. The attributes should be of the same format as you passed to add_instance(). predict() returns a hash reference whose keys are the names of labels, and whose values are corresponding probabilities.

labels

Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.

THEORY

Bayes' Theorem is a way of inverting a conditional probability. It states:

                P(y|x) P(x)
      P(x|y) = -------------
                   P(y)

and so on...

This is a pretty standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Eibe).

The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.

If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value of C=c, Hence, for each C=c we collect the mean value (m) and standard deviation (s) for A during training. During classification, P(A=a|C=c) is estimated using Gaussian distribution, i.e., in the following way:

                    1               (a-m)^2
 P(A=a|C=c) = ------------ * exp( - ------- )
              sqrt(2*Pi)*s           2*s^2

this boils down to the following lines of code:

    $scores{$label} *=
    0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev}*
      exp( -0.5 *
           ( ( $newattrs->{$att} -
               $m->{real_stat}{$att}{$label}{mean})
             / $m->{real_stat}{$att}{$label}{stddev}
           ) ** 2
           );

i.e.,

  P(A=a|C=c) = 0.398942280401433 / s *
    exp( -0.5 * ( ( a-m ) / s ) ** 2 );

EXAMPLES

Example with a real-valued attribute modeled by a Gaussian distribution (from Witten I. and Frank E. book "Data Mining" (the WEKA book), page 86):

 # @relation weather
 # 
 # @attribute outlook {sunny, overcast, rainy}
 # @attribute temperature real
 # @attribute humidity real
 # @attribute windy {TRUE, FALSE}
 # @attribute play {yes, no}
 # 
 # @data
 # sunny,85,85,FALSE,no
 # sunny,80,90,TRUE,no
 # overcast,83,86,FALSE,yes
 # rainy,70,96,FALSE,yes
 # rainy,68,80,FALSE,yes
 # rainy,65,70,TRUE,no
 # overcast,64,65,TRUE,yes
 # sunny,72,95,FALSE,no
 # sunny,69,70,FALSE,yes
 # rainy,75,80,FALSE,yes
 # sunny,75,70,TRUE,yes
 # overcast,72,90,TRUE,yes
 # overcast,81,75,FALSE,yes
 # rainy,71,91,TRUE,no
 
 $nb->set_real('temperature', 'humidity');
 
 $nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
 $nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
 $nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
 $nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
 $nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
 $nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');
 
 $nb->train;
 
 my $printedmodel =  "Model:\n" . $nb->print_model;
 my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});

 YAML::DumpFile('file', $p);
 die unless (abs($p->{'play=no'}  - 0.792) < 0.001);
 die unless(abs($p->{'play=yes'} - 0.208) < 0.001);

HISTORY

Algorithm::NaiveBayes by Ken Williams was not what I needed so I wrote this one. Algorithm::NaiveBayes is oriented towards text categorization, it includes smoothing, and log probabilities. This module is a generic, basic Naive Bayes algorithm.

THANKS

I would like to thank Daniel Bohmer for documentation corrections, Yung-chung Lin (cpan:xern) for the implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in no particular order):

Michael Stevens, Tom Dyson, Dan Von Kohorn, Craig Talbert, Andrew Brian Clegg,

and CPAN-testers, including: Andreas Koenig, Alexandr Ciornii, jlatour, Jost.Krieger, tvmaly, Matthew Musgrove, Michael Stevens, Nigel Horne, Graham Crookham, David Cantrell (dcantrell).

AUTHOR

Copyright 2003-21 Vlado Keselj https://web.cs.dal.ca/~vlado. In 2004 Yung-chung Lin provided implementation of the Gaussian model for continous variables.

This script is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The module is available on CPAN (https://metacpan.org/author/VLADO), and https://web.cs.dal.ca/~vlado/srcperl/. The latter site is updated more frequently.

SEE ALSO

Algorithm::NaiveBayes, perl.