NAME

Algorithm::Classifier::NaiveBayes - A multinomial naive Bayes text classifier with Laplace smoothing.

VERSION

Version 0.0.1

SYNOPSIS

use Algorithm::Classifier::NaiveBayes;

my $nb = Algorithm::Classifier::NaiveBayes->new;

# train it with examples of each class
$nb->train( 'spam', 'buy cheap pills now' );
$nb->train( 'spam', 'cheap watches for sale' );
$nb->train( 'ham',  'meeting at noon tomorrow' );
$nb->train( 'ham',  'lunch with the team' );

# classify some new text
my $class = $nb->classify('cheap pills for sale');
# $class is now 'spam'

# or get the score and probability for every class as well
my ( $best, $scores, $probs ) = $nb->classify('cheap pills for sale');

# save the model for later and load it again
$nb->save('model.json');

my $loaded = Algorithm::Classifier::NaiveBayes->new;
$loaded->load('model.json');

DESCRIPTION

This module implements a multinomial naive Bayes classifier. Strings are broken into tokens, and each class is scored as the log of its prior probability (by default, how often the class was trained) plus the sum of the log probabilities of each token appearing in that class. Token probabilities are smoothed so that a token never seen for a class does not zero out the whole score. The default is Laplace (add-one) smoothing, but Lidstone (add-alpha) smoothing with a configurable alpha may be selected instead. Smaller alphas, such as 0.1 to 0.5, often perform better on small training sets.

By default token occurrences are weighted by their raw counts, but binary weighting, counting each unique token once per document, may be selected instead via token_weighting. Class priors default to how often each class was trained, but may be set to uniform via priors.
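
The scoring described above can be sketched in a few lines of plain Perl. This is an illustration only, not the module's actual code: the %model hash is a hand-built stand-in for the SYNOPSIS training data, score_class and the vocab_size key are hypothetical names, and it assumes tokens unseen in training are still scored with a count of 0, which the module may or may not do.

```perl
use strict;
use warnings;

# Hand-built stand-in for the four documents trained in the SYNOPSIS.
# vocab_size is a hypothetical key: the number of distinct tokens seen.
my %model = (
    class_counts => { spam => 2, ham => 2 },
    class_totals => { spam => 8, ham => 8 },
    token_counts => {
        spam => { buy => 1, cheap => 2, pills => 1, now => 1, watches => 1, for => 1, sale => 1 },
        ham  => { meeting => 1, at => 1, noon => 1, tomorrow => 1, lunch => 1, with => 1, the => 1, team => 1 },
    },
    vocab_size => 15,
    total_docs => 4,
);

# log prior + sum of log smoothed token probabilities (add-alpha)
sub score_class {
    my ( $model, $class, $alpha, @tokens ) = @_;
    my $score       = log( $model->{class_counts}{$class} / $model->{total_docs} );
    my $denominator = $model->{class_totals}{$class} + $alpha * $model->{vocab_size};
    foreach my $token (@tokens) {
        my $count = $model->{token_counts}{$class}{$token} // 0;
        $score += log( ( $count + $alpha ) / $denominator );
    }
    return $score;
}

my @tokens = qw(cheap pills for sale);
my %scores = map { $_ => score_class( \%model, $_, 1, @tokens ) } qw(spam ham);

# highest score wins, ties broken by class name
my ($best) = sort { $scores{$b} <=> $scores{$a} || $a cmp $b } keys %scores;
print "$best\n";    # spam
```

Passing an alpha of 1 gives Laplace smoothing. Any other positive alpha gives Lidstone smoothing with the same formula.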

Classes are not predefined. A class exists once something has been trained for it and stops existing if everything for it is untrained.

The model may be saved to a JSON file or string and loaded back later, allowing training and classification to happen in different processes.

METHODS

new

Creates a new classifier object.

my $nb = Algorithm::Classifier::NaiveBayes->new(%args);

The following args are supported.

lc_tokens - Lowercase tokens when tokenizing.
    Default: 1

token_splitter - Regex to use for splitting a string into tokens.
    Default: \s+

stop_regex - If defined, tokens matching this regex are dropped.
    Matched anchored, so it must match the entire token.
    Default: undef

smoothing - The smoothing to use for token probabilities. Either
    "laplace" (add-one) or "lidstone" (add-alpha).
    Default: laplace

alpha - The alpha to use for lidstone smoothing. Must be a number
    greater than 0. May only be specified when smoothing is set to
    lidstone. Laplace smoothing is lidstone with an alpha of 1.
    Default: 0.5

ngrams - Max size of n-grams to generate from adjacent tokens when
    tokenizing. 1 means single tokens only. 2 means also generate
    each adjacent pair of tokens joined by a space. 3 also adds
    triplets and so on.
    Default: 1

token_weighting - How token occurrences are weighted. "count" uses
    raw counts, so a token appearing three times in a document
    counts three times. "binary" counts each unique token once per
    document, both when training and classifying, which often works
    better for short texts. Also known as binarized multinomial
    naive Bayes.
    Default: count

priors - How class priors are computed when classifying. "trained"
    uses how often each class was trained, so classes with more
    documents are favored. "uniform" gives every class an equal
    prior, useful when the training set is unbalanced in a way real
    usage will not be.
    Default: trained

token_splitter and stop_regex may be either a string or a qr// Regexp.

Will die if passed an unknown arg or if token_splitter or stop_regex is an empty string, a ref other than a qr// Regexp, or does not compile as a regex.

Some examples...

# split on commas instead of whitespace
my $nb = Algorithm::Classifier::NaiveBayes->new( 'token_splitter' => ',' );

# keep the case of tokens
my $nb = Algorithm::Classifier::NaiveBayes->new( 'lc_tokens' => 0 );

# drop some common stop words
my $nb = Algorithm::Classifier::NaiveBayes->new( 'stop_regex' => qr/a|an|and|the|of|to/ );

# use lidstone smoothing with an alpha of 0.1
my $nb = Algorithm::Classifier::NaiveBayes->new( 'smoothing' => 'lidstone', 'alpha' => 0.1 );

# also generate bigrams, so phrases like "free cruise" become tokens
my $nb = Algorithm::Classifier::NaiveBayes->new( 'ngrams' => 2 );

# count each unique token once per document
my $nb = Algorithm::Classifier::NaiveBayes->new( 'token_weighting' => 'binary' );

# give every class an equal prior regardless of training balance
my $nb = Algorithm::Classifier::NaiveBayes->new( 'priors' => 'uniform' );
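
To make the difference between the two token_weighting settings concrete, here is a minimal sketch in plain Perl. weigh_tokens is a hypothetical helper written for this illustration. It is not part of the module and only shows how the per-document counts differ.

```perl
use strict;
use warnings;

# Turn one document's tokens into per-token weights.
sub weigh_tokens {
    my ( $weighting, @tokens ) = @_;
    my %counts;
    foreach my $token (@tokens) {
        if ( $weighting eq 'binary' ) {
            $counts{$token} = 1;    # each unique token counts once
        }
        else {
            $counts{$token}++;      # every occurrence counts
        }
    }
    return \%counts;
}

my @tokens = qw(cheap cheap cheap pills);
my $raw    = weigh_tokens( 'count',  @tokens );    # cheap => 3, pills => 1
my $binary = weigh_tokens( 'binary', @tokens );    # cheap => 1, pills => 1
```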

tokenize

Tokenizes the specified string. This is used internally by train, untrain, and classify, but may also be called directly to see how a string will be broken up.

my @tokens = $nb->tokenize($string);

The string is split via the token_splitter regex. Empty tokens are dropped. If lc_tokens is true, tokens are lowercased. If stop_regex is defined, tokens entirely matching it are dropped.

If ngrams is greater than 1, n-grams up to that size are generated from adjacent tokens and appended, joined by a space. This happens after lowercasing and stop word removal, so stop words do not appear inside n-grams.

my $nb = Algorithm::Classifier::NaiveBayes->new( 'ngrams' => 2 );
my @tokens = $nb->tokenize('Free Cruise Inside');
# ( 'free', 'cruise', 'inside', 'free cruise', 'cruise inside' )
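
The steps above can be approximated with the following self-contained sketch. sketch_tokenize is a hypothetical stand-in, not the module's implementation. It takes the same option names as new, and the order in which n-grams of different sizes are appended is an assumption.

```perl
use strict;
use warnings;

sub sketch_tokenize {
    my ( $string, %opts ) = @_;
    my $splitter = $opts{token_splitter} // qr/\s+/;
    my $lc       = $opts{lc_tokens}      // 1;
    my $max_n    = $opts{ngrams}         // 1;

    # split, drop empty tokens, then optionally lowercase
    my @tokens = grep { length $_ } split( $splitter, $string );
    @tokens = map { lc $_ } @tokens if $lc;

    # drop tokens that match the stop regex in full
    if ( defined $opts{stop_regex} ) {
        my $stop = $opts{stop_regex};
        @tokens = grep { $_ !~ /\A(?:$stop)\z/ } @tokens;
    }

    # append n-grams of size 2 up to $max_n, joined by a space
    my @out = @tokens;
    foreach my $n ( 2 .. $max_n ) {
        foreach my $start ( 0 .. $#tokens - $n + 1 ) {
            push @out, join( ' ', @tokens[ $start .. $start + $n - 1 ] );
        }
    }
    return @out;
}

my @tokens = sketch_tokenize( 'Free Cruise Inside', ngrams => 2 );
# ( 'free', 'cruise', 'inside', 'free cruise', 'cruise inside' )
```

Because stop words are removed before n-grams are built, 'the free cruise' with 'the' as a stop word yields the bigram 'free cruise' and never 'the free'.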

Will die if the string is undef. As train, untrain, and classify all use this, passing undef text to any of those will also die.

my $nb = Algorithm::Classifier::NaiveBayes->new;
my @tokens = $nb->tokenize('Buy Cheap  Pills');
# ( 'buy', 'cheap', 'pills' )

train

Train a specific class on the specified string.

$nb->train($class, $string);

Will die if the class or string is undef.

The class does not need to exist prior to this being called. Training a new class name brings that class into existence.

$nb->train( 'spam', 'buy cheap pills now' );
$nb->train( 'ham',  'meeting at noon tomorrow' );

untrain

Untrain a specific class on the specified string, reversing a previous call to train with the same class and string.

$nb->untrain($class, $string);

Will die if the class or string is undef.

If the class in question has not been trained, this is a noop. Token counts will not be decremented below zero and classes with no remaining trained documents are removed from the model.

# trained into the wrong class, so move it
$nb->untrain( 'ham',  'buy cheap pills now' );
$nb->train(   'spam', 'buy cheap pills now' );

Note that there is no way to verify the string was actually trained for that class previously. Untraining a string that differs from what was trained will still decrement the document count for the class, along with the counts of whatever tokens overlap.
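
The bookkeeping described here can be sketched against a plain hash laid out like the one under MODEL FORMAT. sketch_untrain is a hypothetical helper for illustration only. It assumes the module behaves as documented above and, for brevity, skips maintenance of the global tokens vocabulary.

```perl
use strict;
use warnings;

sub sketch_untrain {
    my ( $model, $class, @tokens ) = @_;

    # unknown class: nothing to do
    return 0 unless exists $model->{class_counts}{$class};

    foreach my $token (@tokens) {
        my $have = $model->{token_counts}{$class}{$token} // 0;
        next if $have < 1;    # never decrement below zero
        $model->{class_totals}{$class}--;
        if ( $have == 1 ) {
            delete $model->{token_counts}{$class}{$token};
        }
        else {
            $model->{token_counts}{$class}{$token}--;
        }
    }

    # the document count drops even if no tokens overlapped
    $model->{class_counts}{$class}--;
    $model->{total_docs}--;

    # a class with no documents left stops existing
    if ( $model->{class_counts}{$class} < 1 ) {
        delete $model->{$_}{$class} for qw(class_counts class_totals token_counts);
    }
    return 1;
}

my %model = (
    class_counts => { ham => 2, spam => 1 },
    class_totals => { ham => 2, spam => 1 },
    token_counts => { ham => { buy => 1, lunch => 1 }, spam => { cheap => 1 } },
    total_docs   => 3,
);
sketch_untrain( \%model, 'ham', qw(buy cheap) );    # 'cheap' was never trained for ham
```

After the call, ham has one document and one token left: 'buy' was removed and 'cheap' was ignored because ham never had it.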

prune

Removes all tokens trained fewer than the specified number of times, totaled across all classes.

my $pruned = $nb->prune($min_count);

Real world training data tends to accumulate a long tail of tokens only seen once or twice. Those add noise and bloat the saved model, so pruning them can be useful after a large amount of training.

# remove all tokens only trained once
my $pruned = $nb->prune(2);

Returns the number of tokens removed. Removed tokens are dropped from the vocabulary and the per class token totals are decremented, but document counts are untouched, so class priors are unchanged.

Will die if min count is undef or not a whole number greater than 0. A min count of 1 is a noop as every trained token has a count of at least 1.
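
As a rough illustration of the pruning rule, the sketch below works on a plain hash laid out like the one under MODEL FORMAT. sketch_prune is a hypothetical helper, not the module's code, and it omits the argument validation described above.

```perl
use strict;
use warnings;

sub sketch_prune {
    my ( $model, $min_count ) = @_;
    my $removed = 0;
    my @classes = keys %{ $model->{token_counts} };

    foreach my $token ( keys %{ $model->{tokens} } ) {
        # total the token across every class
        my $total = 0;
        $total += $model->{token_counts}{$_}{$token} // 0 for @classes;
        next if $total >= $min_count;

        # drop it from each class and adjust the per class totals
        foreach my $class (@classes) {
            my $count = delete $model->{token_counts}{$class}{$token};
            $model->{class_totals}{$class} -= $count if defined $count;
        }
        delete $model->{tokens}{$token};
        $removed++;
    }
    return $removed;    # class_counts and total_docs are never touched
}

my %model = (
    class_counts => { spam => 2, ham => 1 },
    class_totals => { spam => 3, ham => 2 },
    token_counts => { spam => { cheap => 2, pills => 1 }, ham => { cheap => 1, lunch => 1 } },
    tokens       => { cheap => 1, pills => 1, lunch => 1 },
    total_docs   => 3,
);
my $pruned = sketch_prune( \%model, 2 );    # removes 'pills' and 'lunch'
```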

classes

Returns a sorted list of all currently trained classes.

my @classes = $nb->classes;

If nothing has been trained yet, an empty list is returned.

class_tokens

Returns a sorted list of all tokens trained for the specified class.

my @tokens = $nb->class_tokens($class);

Will die if no class is specified or if the class in question does not exist.

foreach my $class ( $nb->classes ) {
    print $class . ': ' . join( ', ', $nb->class_tokens($class) ) . "\n";
}

classify

Classify the text in question.

my $class = $nb->classify($text);

In scalar context, returns the name of the class the text most likely belongs to. In list context, also returns a hash ref of the score for every class as well as a hash ref of the probability of every class.

my ( $class, $scores, $probs ) = $nb->classify($text);
foreach my $possible ( sort { $scores->{$b} <=> $scores->{$a} } keys %{$scores} ) {
    print $possible . ': ' . $scores->{$possible} . ', ' . $probs->{$possible} . "\n";
}

The scores are log probabilities, so they are negative numbers with the one closest to zero being the most likely.

The probabilities are the scores normalized to sum to 1, so they may be used for things like requiring a minimum confidence.

my ( $class, $scores, $probs ) = $nb->classify($text);
if ( $probs->{$class} < 0.8 ) {
    $class = 'unsure';
}
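
One common way to turn log scores into probabilities that sum to 1 is to subtract the largest score before exponentiating, which avoids underflow on long texts. The sketch below assumes that approach. Whether the module normalizes in exactly this way is not stated here, and scores_to_probs is a hypothetical helper.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

sub scores_to_probs {
    my ($scores) = @_;
    my %probs;
    return \%probs unless %{$scores};

    # shift by the best score so the largest exponent is exp(0) = 1
    my $max   = max( values %{$scores} );
    my %exp   = map { $_ => exp( $scores->{$_} - $max ) } keys %{$scores};
    my $total = sum( values %exp );

    $probs{$_} = $exp{$_} / $total for keys %exp;
    return \%probs;
}

# made-up scores for illustration
my $probs = scores_to_probs( { spam => -9.2, ham => -11.5 } );
printf "%s: %.4f\n", $_, $probs->{$_} for sort keys %{$probs};
```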

Note that naive Bayes probabilities tend to be overconfident because the model assumes tokens are independent of each other. Longer texts commonly produce probabilities very close to 1 or 0. The probabilities are good for ranking and thresholding, but should not be taken as calibrated.

If nothing has been trained yet, undef is returned in scalar context and ( undef, {}, {} ) in list context.

Ties are broken by sorting the tied class names, making the result deterministic.

explain

Classifies the text in question like classify, but returns a hash ref breaking down how the result was arrived at.

my $explanation = $nb->explain($text);

The returned hash ref is as below.

class - The best matching class, as classify would return.

scores - Hash ref of the log score of every class, as classify
    would return.

probs - Hash ref of the probability of every class, as classify
    would return.

priors - Hash ref of the log prior probability of every class,
    the part of the score that comes from how often the class was
    trained rather than from the tokens.

tokens - Hash ref of every token in the tokenized text. Each value
    is a hash ref with "count", how many times the token appeared
    in the text, and "contributions", a hash ref of the log
    probability that token added to each class per appearance.

For any class, the score is the prior plus count * contribution summed over every token. A token pushes towards the class it has the highest, closest to zero, contribution for. So finding the tokens most responsible for a classification can be done like below.

my $explanation = $nb->explain($text);
my ( $first, $second ) =
    sort { $explanation->{'scores'}{$b} <=> $explanation->{'scores'}{$a} }
    keys %{ $explanation->{'scores'} };
foreach my $token ( keys %{ $explanation->{'tokens'} } ) {
    my $contribs = $explanation->{'tokens'}{$token}{'contributions'};
    my $pull = ( $contribs->{$first} - $contribs->{$second} )
        * $explanation->{'tokens'}{$token}{'count'};
    print $token . ' pushed towards ' . $first . ' by ' . $pull . "\n";
}
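
The relationship between priors, counts, contributions, and scores can be checked with a few lines of Perl. The hash below is hand-written with made-up numbers in the shape explain is documented to return. rebuild_score is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hand-written stand-in for the hash ref returned by explain.
my $explanation = {
    priors => { spam => log(0.5), ham => log(0.5) },
    tokens => {
        cheap => { count => 2, contributions => { spam => log( 3 / 23 ), ham => log( 1 / 23 ) } },
        pills => { count => 1, contributions => { spam => log( 2 / 23 ), ham => log( 1 / 23 ) } },
    },
};

# score = prior + sum over tokens of count * contribution
sub rebuild_score {
    my ( $explanation, $class ) = @_;
    my $score = $explanation->{priors}{$class};
    foreach my $token ( values %{ $explanation->{tokens} } ) {
        $score += $token->{count} * $token->{contributions}{$class};
    }
    return $score;
}

my $spam = rebuild_score( $explanation, 'spam' );
my $ham  = rebuild_score( $explanation, 'ham' );
```

With a real explanation hash, the rebuilt values should agree with the matching entries under scores, up to floating point rounding.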

Will die if the text is undef. If nothing has been trained yet, undef is returned.

tweak

Changes scoring settings on an existing model. Takes the args below. All are optional, but at least one must be specified.

smoothing - The smoothing to use... laplace or lidstone.

alpha - The alpha to use for lidstone smoothing. Must be a number
    greater than 0. May only be specified when the resulting
    smoothing is lidstone.

priors - How class priors are computed... trained or uniform.

# switch to lidstone smoothing with an alpha of 0.1
$nb->tweak( 'smoothing' => 'lidstone', 'alpha' => 0.1 );

# switch to uniform priors
$nb->tweak( 'priors' => 'uniform' );

These are safe to change after training as they only affect scoring, not the trained counts. Settings that shape the trained data, such as ngrams, token_weighting, and the tokenizer settings, may not be changed here as that would make the model inconsistent with what was trained... for those, create a new object and retrain.

Only args specified with a defined value are changed. Args passed with an undef value are ignored, so it is safe to pass through possibly unset values.

Switching smoothing to laplace sets alpha to 1, as laplace is add-one. Switching to lidstone without specifying alpha keeps the current alpha.

Will die if passed an unknown arg, no args with defined values, or an invalid value. If it dies, the model is left unchanged.

to_string

Returns the model as a JSON string. See the section MODEL FORMAT for what the JSON looks like.

my $json = $nb->to_string;

The JSON is generated with canonical set, so the keys are sorted. Two calls against the same model will always produce identical output, which makes saved models diffable.

If token_splitter or stop_regex was set to a qr// Regexp, it is stringified, so the result is always JSON safe.
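
The effect of canonical output can be seen with the core JSON::PP module, used here only as a stand-in since which JSON backend the module uses is not stated. The two hashes are made-up fragments, not complete models.

```perl
use strict;
use warnings;
use JSON::PP ();

# canonical(1) sorts hash keys, so output does not depend on hash order
my $encoder = JSON::PP->new->canonical(1)->pretty;

my %model_a = ( version => 1, alpha => 1, smoothing => 'laplace' );
my %model_b = ( smoothing => 'laplace', version => 1, alpha => 1 );

my $json_a = $encoder->encode( \%model_a );
my $json_b = $encoder->encode( \%model_b );

print $json_a eq $json_b ? "identical\n" : "different\n";    # identical
```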

from_string

Loads the model from the specified JSON string, replacing the current model, including any settings passed to new for the object it is being called on.

$nb->from_string($json);

Will die on failure to parse the string as JSON, if "format" in the JSON is not the name of this module, if "version" is newer than the supported model format version, or if the parsed JSON does not look like a saved model.

If it dies, the current model is left unchanged.
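
The checks listed above, and the guarantee that a failed load leaves the current model alone, can be sketched as parse and validate first, swap only on success. sketch_parse_model and the exact set of required keys are assumptions for illustration. The core JSON::PP module stands in for whichever JSON backend the module uses.

```perl
use strict;
use warnings;
use JSON::PP ();

my $FORMAT      = 'Algorithm::Classifier::NaiveBayes';
my $MAX_VERSION = 1;

sub sketch_parse_model {
    my ($json) = @_;

    my $data;
    eval { $data = JSON::PP->new->decode($json); 1 }
        or die "failed to parse JSON: $@";

    die "not a saved model\n" unless ref($data) eq 'HASH';
    die "unexpected format\n"
        unless defined $data->{format} && $data->{format} eq $FORMAT;
    die "unsupported version\n"
        unless defined $data->{version}
        && $data->{version} =~ /\A[0-9]+\z/
        && $data->{version} <= $MAX_VERSION;

    # assumed minimum set of keys a saved model must carry
    foreach my $key (qw(class_counts class_totals token_counts tokens)) {
        die "missing or malformed $key\n" unless ref( $data->{$key} ) eq 'HASH';
    }
    return $data;
}

my $current = { total_docs => 0 };    # stand-in for the model already held
my $parsed  = eval { sketch_parse_model('{"format":"something else"}') };
$current = $parsed if defined $parsed;    # only replaced when parsing succeeded
```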

save

Saves the model to the specified file as JSON via to_string. The write is done atomically, written to a temporary file and then renamed into place, so the file will never contain a partially written model.

$nb->save('model.json');

Will die if no file is specified or on failure to write the file.
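
The write to a temporary file and rename pattern looks roughly like the sketch below, built on the core File::Temp module. atomic_write is a hypothetical helper, and the module's own handling of temporary file names and permissions may differ.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp ();

sub atomic_write {
    my ( $file, $content ) = @_;

    # same directory as the target, so the rename stays on one filesystem
    my $tmp      = File::Temp->new( DIR => dirname($file), UNLINK => 0 );
    my $tmp_name = $tmp->filename;

    print {$tmp} $content or die "write to $tmp_name failed: $!";
    close($tmp) or die "close of $tmp_name failed: $!";

    # readers see either the old file or the complete new one
    rename( $tmp_name, $file ) or die "rename to $file failed: $!";
    return 1;
}

my $dir = File::Temp::tempdir( CLEANUP => 1 );
atomic_write( "$dir/model.json", '{"version":1}' );
```

File::Temp creates the temporary file readable only by its owner, so a real implementation may want to adjust permissions before the rename.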

load

Loads the model from the specified file via from_string, replacing the current model.

$nb->load('model.json');

Will die if no file is specified, on failure to read the file, or for any of the reasons from_string dies, such as a failure to parse the JSON or the parsed JSON not looking like a saved model.

If it dies, the current model is left unchanged.

MODEL FORMAT

The model as produced by to_string and save is a JSON hash like the below.

{
   "format" : "Algorithm::Classifier::NaiveBayes",
   "version" : 1,
   "smoothing" : "laplace",
   "alpha" : 1,
   "ngrams" : 1,
   "token_weighting" : "count",
   "priors" : "trained",
   "class_counts" : {
      "ham" : 1,
      "spam" : 1
   },
   "class_totals" : {
      "ham" : 4,
      "spam" : 4
   },
   "token_counts" : {
      "ham" : {
         "at" : 1,
         "meeting" : 1,
         "noon" : 1,
         "tomorrow" : 1
      },
      "spam" : {
         "buy" : 1,
         "cheap" : 1,
         "now" : 1,
         "pills" : 1
      }
   },
   "tokens" : {
      "at" : 1,
      "buy" : 1,
      "cheap" : 1,
      "meeting" : 1,
      "noon" : 1,
      "now" : 1,
      "pills" : 1,
      "tomorrow" : 1
   },
   "total_docs" : 2,
   "lc_tokens" : 1,
   "token_splitter" : "\\s+",
   "stop_regex" : null
}

The keys are as below.

format - The name of this module. Used by from_string to make sure
    the JSON is actually a saved model.

version - The version of the model format. Currently 1. from_string
    will refuse to load a model with a version newer than it
    understands. Models missing any of the optional tunables,
    smoothing, alpha, ngrams, token_weighting, or priors, are
    loaded with those keys defaulted.

class_counts - Per class count of how many documents have been
    trained.

class_totals - Per class count of how many tokens have been
    trained.

token_counts - Per class hash of token to how many times that
    token has been trained.

tokens - A hash of every token trained across all classes. The
    size of this is the vocabulary size used for smoothing.

total_docs - Total number of documents trained across all classes.

lc_tokens, token_splitter, stop_regex, ngrams - The tokenizer
    settings as documented under new.

smoothing, alpha - The smoothing settings as documented under new.

token_weighting - The token weighting setting as documented under
    new.

priors - The class prior setting as documented under new.

AUTHOR

Zane C. Bowers-Hadley, <vvelox at vvelox.net>

BUGS

Please report any bugs or feature requests to bug-algorithm-classifier-naivebayes at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-Classifier-NaiveBayes. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Algorithm::Classifier::NaiveBayes

You can also look for information at:

RT: CPAN's request tracker (report bugs here)

https://rt.cpan.org/NoAuth/Bugs.html?Dist=Algorithm-Classifier-NaiveBayes

CPAN Ratings

https://cpanratings.perl.org/d/Algorithm-Classifier-NaiveBayes

Search CPAN

https://metacpan.org/release/Algorithm-Classifier-NaiveBayes

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

This is free software, licensed under:

The GNU Lesser General Public License, Version 2.1, February 1999
