NAME
Algorithm::Classifier::NaiveBayes - A multinomial naive Bayes text classifier with Laplace smoothing.
VERSION
Version 0.0.1
SYNOPSIS
use Algorithm::Classifier::NaiveBayes;
my $nb = Algorithm::Classifier::NaiveBayes->new;
# train it with examples of each class
$nb->train( 'spam', 'buy cheap pills now' );
$nb->train( 'spam', 'cheap watches for sale' );
$nb->train( 'ham', 'meeting at noon tomorrow' );
$nb->train( 'ham', 'lunch with the team' );
# classify some new text
my $class = $nb->classify('cheap pills for sale');
# $class is now 'spam'
# or get the score and probability for every class as well
my ( $best, $scores, $probs ) = $nb->classify('cheap pills for sale');
# save the model for later and load it again
$nb->save('model.json');
my $loaded = Algorithm::Classifier::NaiveBayes->new;
$loaded->load('model.json');
DESCRIPTION
This module implements a multinomial naive Bayes classifier. Strings are broken into tokens, and each class is scored as the log of its prior probability, based on how often the class was trained, plus the sum of the log probabilities of each token appearing in that class. Token probabilities are smoothed so that a token never seen for a class does not zero out the whole score. The default is Laplace (add-one) smoothing, but Lidstone (add-alpha) smoothing with a configurable alpha may be selected instead. Smaller alphas, such as 0.1 to 0.5, often perform better on small training sets.
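As a rough illustration of the scoring described above, the sketch below computes the log score by hand from the counts shown under MODEL FORMAT. It assumes the textbook add-alpha form, ( count + alpha ) / ( class total + alpha * vocabulary size ), for the smoothed token probability. sketch_score is a hypothetical helper and not part of this module, and the module's internals may differ in detail.

```perl
use strict;
use warnings;

# Counts copied from the example under MODEL FORMAT.
my %class_counts = ( ham => 1, spam => 1 );
my %class_totals = ( ham => 4, spam => 4 );
my %token_counts = (
    ham  => { at  => 1, meeting => 1, noon => 1, tomorrow => 1 },
    spam => { buy => 1, cheap   => 1, now  => 1, pills    => 1 },
);
my $vocab_size = 8;    # number of keys under "tokens"
my $total_docs = 2;

# Hypothetical helper, not part of the module's API.
sub sketch_score {
    my ( $class, $alpha, @tokens ) = @_;

    # log prior, from how often the class was trained
    my $score = log( $class_counts{$class} / $total_docs );

    foreach my $token (@tokens) {
        my $count = $token_counts{$class}{$token} // 0;

        # add-alpha smoothing, assumed textbook form
        $score += log( ( $count + $alpha )
            / ( $class_totals{$class} + $alpha * $vocab_size ) );
    }
    return $score;
}

# alpha of 1 is Laplace smoothing
foreach my $class ( sort keys %class_counts ) {
    printf "%s: %.4f\n", $class, sketch_score( $class, 1, qw(cheap pills) );
}
```

Both scores are negative, and the one for spam is closer to zero as both tokens were trained for it.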
By default, token occurrences are weighted by their raw counts, but binary weighting, which counts each unique token once per document, may be selected instead via token_weighting. Class priors default to how often each class was trained, but may be made uniform via priors.
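The difference between the two weightings can be sketched as below. sketch_weights is a hypothetical helper used only for illustration. It is not how the module is known to store weights internally.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a document's tokens into per-token weights.
sub sketch_weights {
    my ( $weighting, @tokens ) = @_;
    my %weights;
    foreach my $token (@tokens) {
        if ( $weighting eq 'binary' ) {
            # each unique token counts once per document
            $weights{$token} = 1;
        }
        else {
            # raw counts, every occurrence counts
            $weights{$token}++;
        }
    }
    return \%weights;
}

my @tokens = qw(cheap cheap cheap pills);
my $count  = sketch_weights( 'count',  @tokens );    # cheap => 3, pills => 1
my $binary = sketch_weights( 'binary', @tokens );    # cheap => 1, pills => 1
```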
Classes are not predefined. A class exists once something has been trained for it and stops existing if everything for it is untrained.
The model may be saved to a JSON file or string and loaded back later, allowing training and classification to happen in different processes.
METHODS
new
Creates a new classifier object.
my $nb = Algorithm::Classifier::NaiveBayes->new(%args);
The following args are supported.
lc_tokens - Lowercase tokens when tokenizing.
Default: 1
token_splitter - Regex to use for splitting a string into tokens.
Default: \s+
stop_regex - If defined, tokens matching this regex are dropped.
Matched anchored, so it must match the entire token.
Default: undef
smoothing - The smoothing to use for token probabilities. Either
"laplace" (add-one) or "lidstone" (add-alpha).
Default: laplace
alpha - The alpha to use for lidstone smoothing. Must be a number
greater than 0. May only be specified when smoothing is set to
lidstone. Laplace smoothing is lidstone with an alpha of 1.
Default: 0.5
ngrams - Max size of n-grams to generate from adjacent tokens when
tokenizing. 1 means single tokens only. 2 means also generate
each adjacent pair of tokens joined by a space. 3 also adds
triplets and so on.
Default: 1
token_weighting - How token occurrences are weighted. "count" uses
raw counts, so a token appearing three times in a document
counts three times. "binary" counts each unique token once per
document, both when training and classifying, which often works
better for short texts. Also known as binarized multinomial
naive Bayes.
Default: count
priors - How class priors are computed when classifying. "trained"
uses how often each class was trained, so classes with more
documents are favored. "uniform" gives every class an equal
prior, useful when the training set is unbalanced in a way real
usage will not be.
Default: trained
token_splitter and stop_regex may be either a string or a qr// Regexp.
Will die if passed an unknown arg, or if token_splitter or stop_regex is an empty string, a ref other than a qr// Regexp, or does not compile as a regex.
Some examples...
# split on commas instead of whitespace
my $nb = Algorithm::Classifier::NaiveBayes->new( 'token_splitter' => ',' );
# keep the case of tokens
my $nb = Algorithm::Classifier::NaiveBayes->new( 'lc_tokens' => 0 );
# drop some common stop words
my $nb = Algorithm::Classifier::NaiveBayes->new( 'stop_regex' => qr/a|an|and|the|of|to/ );
# use lidstone smoothing with an alpha of 0.1
my $nb = Algorithm::Classifier::NaiveBayes->new( 'smoothing' => 'lidstone', 'alpha' => 0.1 );
# also generate bigrams, so phrases like "free cruise" become tokens
my $nb = Algorithm::Classifier::NaiveBayes->new( 'ngrams' => 2 );
# count each unique token once per document
my $nb = Algorithm::Classifier::NaiveBayes->new( 'token_weighting' => 'binary' );
# give every class an equal prior regardless of training balance
my $nb = Algorithm::Classifier::NaiveBayes->new( 'priors' => 'uniform' );
tokenize
Tokenizes the specified string. This is used internally by train, untrain, and classify, but may also be called directly to see how a string will be broken up.
my @tokens = $nb->tokenize($string);
The string is split via the token_splitter regex. Empty tokens are dropped. If lc_tokens is true, tokens are lowercased. If stop_regex is defined, tokens entirely matching it are dropped.
If ngrams is greater than 1, n-grams up to that size are generated from adjacent tokens and appended, joined by a space. This happens after lowercasing and stop word removal, so stop words do not appear inside n-grams.
my $nb = Algorithm::Classifier::NaiveBayes->new( 'ngrams' => 2 );
my @tokens = $nb->tokenize('Free Cruise Inside');
# ( 'free', 'cruise', 'inside', 'free cruise', 'cruise inside' )
Will die if the string is undef. As train, untrain, and classify all use this, passing undef text to any of those will also die.
my $nb = Algorithm::Classifier::NaiveBayes->new;
my @tokens = $nb->tokenize('Buy Cheap Pills');
# ( 'buy', 'cheap', 'pills' )
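The steps above can be sketched in plain Perl as below. sketch_tokenize is a hypothetical stand-in whose option names mirror the args to new. It is meant to show the order of operations described here, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for tokenize, for illustration only.
sub sketch_tokenize {
    my ( $string, %opts ) = @_;
    die "string is undef\n" if !defined $string;

    my $splitter = $opts{token_splitter} // qr/\s+/;
    my $ngrams   = $opts{ngrams}         // 1;
    my $lc       = $opts{lc_tokens}      // 1;

    # split, then drop empty tokens
    my @tokens = grep { length($_) } split( $splitter, $string );

    # lowercase if asked to
    @tokens = map { lc($_) } @tokens if $lc;

    # drop tokens the stop regex matches in full
    if ( defined $opts{stop_regex} ) {
        my $stop = $opts{stop_regex};
        @tokens = grep { $_ !~ /\A(?:$stop)\z/ } @tokens;
    }

    # append n-grams of size 2 up to $ngrams, built from adjacent tokens
    my @out = @tokens;
    foreach my $n ( 2 .. $ngrams ) {
        foreach my $start ( 0 .. $#tokens - $n + 1 ) {
            push @out, join( ' ', @tokens[ $start .. $start + $n - 1 ] );
        }
    }
    return @out;
}

my @bigrams = sketch_tokenize( 'Free Cruise Inside', ngrams => 2 );
# ( 'free', 'cruise', 'inside', 'free cruise', 'cruise inside' )
```

As the n-grams are built after stop word removal, a dropped stop word makes its neighbors adjacent.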
train
Train a specific class on the specified string.
$nb->train($class, $string);
Will die if the class or string is undef.
The class does not need to exist prior to this being called. Training a new class name brings that class into existence.
$nb->train( 'spam', 'buy cheap pills now' );
$nb->train( 'ham', 'meeting at noon tomorrow' );
untrain
Untrain a specific class on the specified string, reversing a previous call to train with the same class and string.
$nb->untrain($class, $string);
Will die if the class or string is undef.
If the class in question has not been trained, this is a no-op. Token counts will not be decremented below zero, and classes with no remaining trained documents are removed from the model.
# trained into the wrong class, so move it
$nb->untrain( 'ham', 'buy cheap pills now' );
$nb->train( 'spam', 'buy cheap pills now' );
Note that there is no way to verify the string in question was actually trained for that class. Untraining a string that differs from what was trained will still decrement the document count for the class, along with the counts of whatever tokens overlap.
prune
Removes all tokens trained fewer than the specified number of times, totaled across all classes.
my $pruned = $nb->prune($min_count);
Real world training data tends to accumulate a long tail of tokens only seen once or twice. Those add noise and bloat the saved model, so pruning them can be useful after a large amount of training.
# remove all tokens only trained once
my $pruned = $nb->prune(2);
Returns the number of tokens removed. Removed tokens are dropped from the vocabulary and the per class token totals are decremented, but document counts are untouched, so class priors are unchanged.
Will die if the min count is undef or not a whole number greater than 0. A min count of 1 is a no-op, as every trained token has a count of at least 1.
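A minimal sketch of the pruning described above, run against a hand-built hash that borrows the key names from MODEL FORMAT. sketch_prune and the sample counts are hypothetical and only illustrate the bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical model fragment using key names from MODEL FORMAT.
my %model = (
    class_totals => { ham => 3, spam => 4 },
    token_counts => {
        ham  => { lunch => 2, cheap => 1 },
        spam => { cheap => 3, rolex => 1 },
    },
    tokens => { lunch => 1, cheap => 1, rolex => 1 },
);

# Hypothetical helper, not the module's prune.
sub sketch_prune {
    my ( $model, $min_count ) = @_;
    my $pruned = 0;
    foreach my $token ( sort keys %{ $model->{tokens} } ) {

        # total the token across all classes
        my $total = 0;
        foreach my $class ( keys %{ $model->{token_counts} } ) {
            $total += $model->{token_counts}{$class}{$token} // 0;
        }
        next if $total >= $min_count;

        # remove it and decrement the per class token totals
        foreach my $class ( keys %{ $model->{token_counts} } ) {
            my $count = delete $model->{token_counts}{$class}{$token};
            $model->{class_totals}{$class} -= $count if defined $count;
        }
        delete $model->{tokens}{$token};
        $pruned++;
    }
    return $pruned;
}

# "cheap" totals 4 and "lunch" totals 2, so only "rolex" is removed
my $removed = sketch_prune( \%model, 2 );
```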
classes
Returns a sorted list of all currently trained classes.
my @classes = $nb->classes;
If nothing has been trained yet, an empty list is returned.
class_tokens
Returns a sorted list of all tokens trained for the specified class.
my @tokens = $nb->class_tokens($class);
Will die if no class is specified or if the class in question does not exist.
foreach my $class ( $nb->classes ) {
print $class . ': ' . join( ', ', $nb->class_tokens($class) ) . "\n";
}
classify
Classify the text in question.
my $class = $nb->classify($text);
In scalar context, returns the name of the class the text most likely belongs to. In list context, also returns a hash ref of the score for every class as well as a hash ref of the probability of every class.
my ( $class, $scores, $probs ) = $nb->classify($text);
foreach my $possible ( sort { $scores->{$b} <=> $scores->{$a} } keys %{$scores} ) {
print $possible . ': ' . $scores->{$possible} . ', ' . $probs->{$possible} . "\n";
}
The scores are log probabilities, so they are negative numbers with the one closest to zero being the most likely.
The probabilities are the exponentiated scores normalized to sum to 1, so they may be used for things like requiring a minimum confidence.
my ( $class, $scores, $probs ) = $nb->classify($text);
if ( $probs->{$class} < 0.8 ) {
$class = 'unsure';
}
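One common way to turn log scores into probabilities is to subtract the largest score before exponentiating, which avoids underflow on long texts. The sketch below assumes that approach. sketch_probs is a hypothetical helper and the module may normalize differently.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: normalize log scores into probabilities.
sub sketch_probs {
    my ($scores) = @_;

    # shift by the max so the largest exponent is exp(0) = 1
    my $max = max( values %{$scores} );
    my %exp;
    foreach my $class ( keys %{$scores} ) {
        $exp{$class} = exp( $scores->{$class} - $max );
    }

    my $sum = sum( values %exp );
    my %probs;
    foreach my $class ( keys %exp ) {
        $probs{$class} = $exp{$class} / $sum;
    }
    return \%probs;
}

my %scores = (
    spam => log(0.5) + 2 * log( 2 / 12 ),
    ham  => log(0.5) + 2 * log( 1 / 12 ),
);
my $probs = sketch_probs( \%scores );
printf "%s: %.2f\n", $_, $probs->{$_} foreach sort keys %{$probs};
# ham: 0.20
# spam: 0.80
```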
Note that naive Bayes probabilities tend to be overconfident because tokens are assumed to be independent of each other, with longer texts commonly producing probabilities very close to 1 or 0. They are good for ranking and thresholding, but should not be taken as calibrated probabilities.
If nothing has been trained yet, undef is returned in scalar context and ( undef, {}, {} ) in list context.
Ties are broken by sorting the tied class names, making the result deterministic.
explain
Classifies the text in question like classify, but returns a hash ref breaking down how the result was arrived at.
my $explanation = $nb->explain($text);
The returned hash ref is as below.
class - The best matching class, as classify would return.
scores - Hash ref of the log score of every class, as classify
would return.
probs - Hash ref of the probability of every class, as classify
would return.
priors - Hash ref of the log prior probability of every class,
the part of the score that comes from how often the class was
trained rather than from the tokens.
tokens - Hash ref of every token in the tokenized text. Each value
is a hash ref with "count", how many times the token appeared
in the text, and "contributions", a hash ref of the log
probability that token added to each class per appearance.
For any class, the score is the prior plus count * contribution summed over every token. A token pushes towards the class for which its contribution is highest, meaning closest to zero. So finding the tokens most responsible for a classification can be done like below.
my $explanation = $nb->explain($text);
my ( $first, $second ) =
sort { $explanation->{'scores'}{$b} <=> $explanation->{'scores'}{$a} }
keys %{ $explanation->{'scores'} };
foreach my $token ( keys %{ $explanation->{'tokens'} } ) {
my $contribs = $explanation->{'tokens'}{$token}{'contributions'};
my $pull = ( $contribs->{$first} - $contribs->{$second} )
* $explanation->{'tokens'}{$token}{'count'};
print $token . ' pushed towards ' . $first . ' by ' . $pull . "\n";
}
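The relationship between priors, contributions, and scores can be checked by hand. The sketch below uses a hand-written hash in the shape documented above rather than real output from explain, and sketch_rebuild_score is a hypothetical helper.

```perl
use strict;
use warnings;

# Hand-written data in the documented shape, values are illustrative only.
my $explanation = {
    class  => 'spam',
    priors => { ham => log(0.5), spam => log(0.5) },
    tokens => {
        cheap => {
            count         => 2,
            contributions => { ham => log( 1 / 12 ), spam => log( 2 / 12 ) },
        },
        pills => {
            count         => 1,
            contributions => { ham => log( 1 / 12 ), spam => log( 2 / 12 ) },
        },
    },
};

# Hypothetical helper: prior plus count * contribution over every token.
sub sketch_rebuild_score {
    my ( $data, $class ) = @_;
    my $score = $data->{priors}{$class};
    foreach my $token ( keys %{ $data->{tokens} } ) {
        my $entry = $data->{tokens}{$token};
        $score += $entry->{count} * $entry->{contributions}{$class};
    }
    return $score;
}

foreach my $class ( sort keys %{ $explanation->{priors} } ) {
    printf "%s: %.4f\n", $class, sketch_rebuild_score( $explanation, $class );
}
```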
Will die if the text is undef. If nothing has been trained yet, undef is returned.
tweak
Changes scoring settings on an existing model. Takes the args below, all optional, but at least one must be specified.
smoothing - The smoothing to use... laplace or lidstone.
alpha - The alpha to use for lidstone smoothing. Must be a number
greater than 0. May only be specified when the resulting
smoothing is lidstone.
priors - How class priors are computed... trained or uniform.
# switch to lidstone smoothing with an alpha of 0.1
$nb->tweak( 'smoothing' => 'lidstone', 'alpha' => 0.1 );
# switch to uniform priors
$nb->tweak( 'priors' => 'uniform' );
These are safe to change after training as they only affect scoring, not the trained counts. Settings that shape the trained data, such as ngrams, token_weighting, and the tokenizer settings, may not be changed here, as that would make the model inconsistent with what was trained. For those, create a new object and retrain.
Only args specified with a defined value are changed. Args passed with an undef value are ignored, so it is safe to pass through possibly unset values.
Switching smoothing to laplace sets alpha to 1, as laplace is add-one. Switching to lidstone without specifying alpha keeps the current alpha.
Will die if passed an unknown arg, no args with defined values, or an invalid value. If it dies, the model is left unchanged.
to_string
Returns the model as a JSON string. See the section MODEL FORMAT for what the JSON looks like.
my $json = $nb->to_string;
The JSON is generated with the canonical option set, so the keys are sorted. Two calls against the same model will therefore always produce identical output, which makes saved models diffable.
If token_splitter or stop_regex was set to a qr// Regexp, it is stringified, so the result is always JSON safe.
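The effect of canonical output can be seen with the core JSON::PP module, as sketched below. Which JSON module this distribution uses internally is not stated here, so JSON::PP is an assumption made for the example, and the hash is a cut-down stand-in for a real model.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical(1) sorts hash keys, so repeated encodes are byte identical
my $json = JSON::PP->new->canonical(1)->pretty;

# cut-down stand-in for a model, not the full format
my %model = (
    total_docs   => 2,
    class_counts => { spam => 1, ham => 1 },
    alpha        => 1,
);

my $first  = $json->encode( \%model );
my $second = $json->encode( \%model );
print $first;

# a qr// Regexp becomes a plain string once stringified
my $splitter    = qr/\s+/;
my $as_a_string = "$splitter";
```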
from_string
Loads the model from the specified JSON string, replacing the current model, including any settings passed to new for the object it is being called on.
$nb->from_string($json);
Will die on failure to parse the string as JSON, if "format" in the JSON is not the name of this module, if "version" is newer than the supported model format version, or if the parsed JSON does not look like a saved model.
If it dies, the current model is left unchanged.
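Leaving the current model unchanged on failure is typically done by validating the decoded data in full before swapping it in. The sketch below shows that pattern with the core JSON::PP module. sketch_validate, the exact checks, and the error messages are all hypothetical, chosen to match the failure cases listed above.

```perl
use strict;
use warnings;
use JSON::PP;

my $SUPPORTED_VERSION = 1;    # assumed, from "Currently 1" under MODEL FORMAT

# Hypothetical validator: returns the decoded model or dies.
sub sketch_validate {
    my ($json) = @_;
    my $model = eval { JSON::PP->new->decode($json) };
    die "failed to parse JSON: $@" if $@;
    die "top level is not a hash\n" if ref($model) ne 'HASH';
    die "unexpected format\n"
        if ( $model->{format} // '' ) ne 'Algorithm::Classifier::NaiveBayes';
    die "version is newer than supported\n"
        if ( $model->{version} // 0 ) > $SUPPORTED_VERSION;
    foreach my $key (qw(class_counts class_totals token_counts tokens)) {
        die "missing or malformed $key\n" if ref( $model->{$key} ) ne 'HASH';
    }
    return $model;
}

# only replace the current model once validation has passed
my $current   = { total_docs => 0 };
my $candidate = eval { sketch_validate('{"format":"Something::Else"}') };
$current = $candidate if defined $candidate;
```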
save
Saves the model to the specified file as JSON via to_string. The write is done atomically, written to a temporary file and then renamed into place, so the file will never contain a partially written model.
$nb->save('model.json');
Will die if no file is specified or on failure to write the file.
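The write then rename approach can be sketched with the core File::Temp module as below. sketch_atomic_write is a hypothetical helper and not the module's own code. The temporary file is created in the destination directory, as rename is only atomic within a single filesystem.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp ();

# Hypothetical helper: write content to a file atomically.
sub sketch_atomic_write {
    my ( $file, $content ) = @_;
    die "no file specified\n" if !defined $file;

    # temp file beside the target, kept on disk so it can be renamed
    my $tmp = File::Temp->new( DIR => dirname($file), UNLINK => 0 );
    print {$tmp} $content or die "write failed: $!\n";
    close($tmp) or die "close failed: $!\n";

    # readers see either the old file or the new one, never a partial one
    rename( $tmp->filename, $file ) or die "rename failed: $!\n";
    return 1;
}

# write into a scratch directory that is cleaned up on exit
my $dir  = File::Temp::tempdir( CLEANUP => 1 );
my $path = "$dir/model.json";
sketch_atomic_write( $path, "{}\n" );
```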
load
Loads the model from the specified file via from_string, replacing the current model.
$nb->load('model.json');
Will die if no file is specified, on failure to read the file, or on any of the failures documented under from_string, such as failure to parse the contents as JSON or the parsed JSON not looking like a saved model.
If it dies, the current model is left unchanged.
MODEL FORMAT
The model as produced by to_string and save is a JSON hash like the one below.
{
"format" : "Algorithm::Classifier::NaiveBayes",
"version" : 1,
"smoothing" : "laplace",
"alpha" : 1,
"ngrams" : 1,
"token_weighting" : "count",
"priors" : "trained",
"class_counts" : {
"ham" : 1,
"spam" : 1
},
"class_totals" : {
"ham" : 4,
"spam" : 4
},
"token_counts" : {
"ham" : {
"at" : 1,
"meeting" : 1,
"noon" : 1,
"tomorrow" : 1
},
"spam" : {
"buy" : 1,
"cheap" : 1,
"now" : 1,
"pills" : 1
}
},
"tokens" : {
"at" : 1,
"buy" : 1,
"cheap" : 1,
"meeting" : 1,
"noon" : 1,
"now" : 1,
"pills" : 1,
"tomorrow" : 1
},
"total_docs" : 2,
"lc_tokens" : 1,
"token_splitter" : "\\s+",
"stop_regex" : null
}
The keys are as below.
format - The name of this module. Used by from_string to make sure
the JSON is actually a saved model.
version - The version of the model format. Currently 1. from_string
will refuse to load a model with a version newer than it
understands. Models missing any of the optional tunables,
smoothing, alpha, ngrams, token_weighting, or priors, are
loaded with those keys defaulted.
class_counts - Per class count of how many documents have been
trained.
class_totals - Per class count of how many tokens have been
trained.
token_counts - Per class hash of token to how many times that
token has been trained.
tokens - A hash of every token trained across all classes. The
size of this is the vocabulary size used for smoothing.
total_docs - Total number of documents trained across all classes.
lc_tokens, token_splitter, stop_regex, ngrams - The tokenizer
settings as documented under new.
smoothing, alpha - The smoothing settings as documented under new.
token_weighting - The token weighting setting as documented under
new.
priors - The class prior setting as documented under new.
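The key descriptions above imply a couple of invariants: total_docs is the sum of class_counts, and each entry in class_totals is the sum of that class's token_counts. The sketch below checks them against the example model. sketch_check_model is a hypothetical helper for inspecting a decoded model and is not provided by this module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: returns a list of keys that look inconsistent.
sub sketch_check_model {
    my ($model) = @_;
    my @problems;

    # documents per class should add up to total_docs
    my $docs = sum( 0, values %{ $model->{class_counts} } );
    push @problems, 'total_docs' if $docs != $model->{total_docs};

    # token counts per class should add up to that class's total
    foreach my $class ( sort keys %{ $model->{class_counts} } ) {
        my $tokens = sum( 0, values %{ $model->{token_counts}{$class} // {} } );
        push @problems, "class_totals for $class"
            if $tokens != ( $model->{class_totals}{$class} // 0 );
    }
    return @problems;
}

# the counts from the example above
my %model = (
    class_counts => { ham => 1, spam => 1 },
    class_totals => { ham => 4, spam => 4 },
    token_counts => {
        ham  => { at  => 1, meeting => 1, noon => 1, tomorrow => 1 },
        spam => { buy => 1, cheap   => 1, now  => 1, pills    => 1 },
    },
    total_docs => 2,
);
my @problems = sketch_check_model( \%model );
```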
AUTHOR
Zane C. Bowers-Hadley, <vvelox at vvelox.net>
BUGS
Please report any bugs or feature requests to bug-algorithm-classifier-naivebayes at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-Classifier-NaiveBayes. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Algorithm::Classifier::NaiveBayes
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
https://rt.cpan.org/NoAuth/Bugs.html?Dist=Algorithm-Classifier-NaiveBayes
CPAN Ratings
https://cpanratings.perl.org/d/Algorithm-Classifier-NaiveBayes
Search CPAN
https://metacpan.org/release/Algorithm-Classifier-NaiveBayes
LICENSE AND COPYRIGHT
This software is Copyright (c) 2026 by Zane C. Bowers-Hadley.
This is free software, licensed under:
The GNU Lesser General Public License, Version 2.1, February 1999