NAME

Mail::SpamTest::Bayesian - Perl extension for Bayesian spam-testing

SYNOPSIS

  use Mail::SpamTest::Bayesian;

  my $j=Mail::SpamTest::Bayesian->new(dir => '.');
  $j->init_db;
  $j->merge_mbox_spam($scalar_spam_box);
  $j->merge_mbox_nonspam($scalar_nonspam_box);
  $message=$j->markup_message($message);

DESCRIPTION

This module implements the Bayesian spam-testing algorithm described by Paul Graham at:

http://www.paulgraham.com/spam.html

In short: the system is trained by exposure to mailboxes of known spam and non-spam messages. Each message is (1) MIME-decoded, with non-text parts deleted, and (2) tokenised. The database files spam.db and nonspam.db hold the tokens and, for each token, the number of messages in which it has occurred; general.db holds a message count.
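
As a rough illustration of the token-counting step (not the module's actual code), the sketch below tokenises already-decoded message texts and records, for each token, the number of messages in which it occurs. The helper names tokenise and count_tokens and the in-memory hash standing in for spam.db are assumptions made for this example; the tokenising rule follows the one suggested in Graham's essay and may differ from what the module does.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lower-cased tokens. Letters, digits,
# apostrophes, dollar signs and dashes count as token characters; tokens
# made up only of digits are skipped.
sub tokenise {
    my ($text) = @_;
    my @tokens = map { lc $_ } $text =~ /([A-Za-z0-9'\$-]+)/g;
    return grep { !/^\d+$/ } @tokens;
}

# Hypothetical helper: for each token, count the messages that contain it.
# Returns a hashref (token => message count) and the number of messages.
sub count_tokens {
    my (@messages) = @_;
    my %count;
    for my $message (@messages) {
        # A token counts once per message, however often it is repeated.
        my %seen = map { $_ => 1 } tokenise($message);
        $count{$_}++ for keys %seen;
    }
    return (\%count, scalar @messages);
}

my ($spam_count, $spam_total) = count_tokens(
    "Buy cheap pills now, cheap cheap!",
    "Cheap loans, click now",
);
printf "%s: %d of %d messages\n", $_, $spam_count->{$_}, $spam_total
    for sort keys %$spam_count;
```

Counting messages rather than raw occurrences matches the description above: a token repeated many times within one message still adds only one to its count.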

This module is in early development; it is functional but basic. More mailbox parsing routines are expected to be added, probably using Mail::Box, and ancillary programs are expected to be supplied so that the module can be used as a personal mail filter.

METHODS

new()

Standard constructor. Pass a hash or hashref with parameters.

Useful parameters:

  dir         -> database directory (.)
  significant -> number of significant tokens to consider (15)
  threshold   -> spam threshold (0.9)
  fudgefactor -> non-spam priority (2)
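
To show how the three numeric parameters above could enter the calculation, here is a minimal sketch of the scoring scheme from Graham's essay. It is an assumption that the module works this way internally: the helper names token_probability and spam_probability are hypothetical, fudgefactor is taken to be the multiplier applied to non-spam counts, and the neutral value 0.4, the 0.01/0.99 limits and the minimum count of 5 come from the essay rather than from this module.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Values taken from the parameter list above.
my %opt = (significant => 15, threshold => 0.9, fudgefactor => 2);

# Per-token spam probability in the style of Graham's essay.
# $spam and $nonspam map token => count; $nspam and $nnonspam are the
# numbers of messages seen in each corpus.
sub token_probability {
    my ($token, $spam, $nspam, $nonspam, $nnonspam) = @_;
    return 0.4 unless $nspam && $nnonspam;      # nothing learnt yet
    my $bad  = $spam->{$token} // 0;
    my $good = $opt{fudgefactor} * ($nonspam->{$token} // 0);
    return 0.4 if $good + $bad < 5;             # too rarely seen to judge
    my $pb = min(1, $bad / $nspam);
    my $pg = min(1, $good / $nnonspam);
    return max(0.01, min(0.99, $pb / ($pg + $pb)));
}

# Combine the probabilities of the most significant tokens, i.e. those
# lying farthest from the neutral value 0.5.
sub spam_probability {
    my (@p) = @_;
    my @sorted = sort { abs($b - 0.5) <=> abs($a - 0.5) } @p;
    my $n   = min(scalar @sorted, $opt{significant});
    my @top = @sorted[0 .. $n - 1];
    my ($spam, $nonspam) = (1, 1);
    for my $p (@top) {
        $spam    *= $p;
        $nonspam *= 1 - $p;
    }
    return $spam / ($spam + $nonspam);
}

# Made-up counts, for illustration only.
my %spam    = (viagra => 20, meeting => 1);
my %nonspam = (meeting => 10);
my @p = map { token_probability($_, \%spam, 100, \%nonspam, 100) }
        qw(viagra meeting hello);
my $score = spam_probability(@p);
printf "%s (%.3f)\n", ($score > $opt{threshold} ? 'YES' : 'NO'), $score;
```

Under this reading, raising fudgefactor biases the filter further against false positives, while threshold sets how sure the combined score must be before a message is called spam.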

init_db()

Deletes and re-initialises the databases, discarding any existing training data. Call this only once, when you first set up the databases.

merge_mbox_spam()

Train the system by giving it a mailbox full of spam.

Pass a scalar, an array, or an arrayref containing raw messages.

merge_mbox_nonspam()

Train the system by giving it a mailbox full of legitimate email.

Pass a scalar, an array, or an arrayref containing raw messages.

merge_stream_spam()

As merge_mbox_spam, but the messages are read from a stream. Pass a stream opened on an mbox file, for example an IO::File object.

merge_stream_nonspam()

As merge_mbox_nonspam, but the messages are read from a stream. Pass a stream opened on an mbox file, for example an IO::File object.
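
A minimal sketch of training from two mbox files through the stream methods follows. The helper train_from_files is hypothetical, as are any file names passed to it; $j is expected to be an object offering merge_stream_spam and merge_stream_nonspam as documented above.

```perl
use strict;
use warnings;
use IO::File;

# Hypothetical helper: feed one mbox file of spam and one of legitimate
# mail to an object providing the merge_stream_* methods.
sub train_from_files {
    my ($j, $spam_path, $nonspam_path) = @_;

    my $spam_fh = IO::File->new($spam_path, 'r')
        or die "Cannot open $spam_path: $!";
    $j->merge_stream_spam($spam_fh);
    $spam_fh->close;

    my $nonspam_fh = IO::File->new($nonspam_path, 'r')
        or die "Cannot open $nonspam_path: $!";
    $j->merge_stream_nonspam($nonspam_fh);
    $nonspam_fh->close;

    return;
}
```

With the module loaded, this could be called as train_from_files(Mail::SpamTest::Bayesian->new(dir => '.'), 'spam.mbox', 'nonspam.mbox'), where both file names are placeholders.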

merge_message_spam()

As merge_mbox_spam, but for a single message; pass in a scalar.

merge_message_nonspam()

As merge_mbox_nonspam, but for a single message; pass in a scalar.

markup_message()

Test a message for possible spammishness. Pass a scalar containing a single message. Returns the original message with the following headers inserted:

  X-Bayesian-Spam: (YES|NO) (probability%)
  X-Bayesian-Test: the significant tests and their weights
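
A downstream filter might only need the YES/NO verdict from the marked-up message. The sketch below is one way to read it back; the helper name marked_as_spam is hypothetical, and it assumes only that the X-Bayesian-Spam header sits in the header section and that its value begins with YES or NO, since the exact layout of the rest of the line is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a marked-up message carries an
# X-Bayesian-Spam header whose value begins with YES. Only the header
# section (everything before the first blank line) is examined, so body
# text that imitates the header is ignored.
sub marked_as_spam {
    my ($message) = @_;
    my ($headers) = split /\r?\n\r?\n/, $message, 2;
    return 0 unless defined $headers;
    return $headers =~ /^X-Bayesian-Spam:\s*YES\b/mi ? 1 : 0;
}
```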

test_message()

Pass a scalar containing a single message. Returns a list:

  0: spam status (1 for spam, 0 for non-spam)
  1: probability of spam
  2: listref of significant tests
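
The returned list can be unpacked in a single assignment. The sketch below wraps that in a hypothetical helper, describe_verdict, which makes no assumption about the scale of the probability or the structure of the individual tests; it only counts them.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise the verdict for one raw message. $j is
# any object providing test_message as documented above.
sub describe_verdict {
    my ($j, $message) = @_;
    my ($is_spam, $probability, $tests) = $j->test_message($message);
    return sprintf "%s (probability %s, %d significant tests)",
        ($is_spam ? 'spam' : 'non-spam'),
        $probability,
        scalar @{ $tests || [] };
}
```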

AUTHOR

Roger Burton West, <roger@firedrake.org>

ACKNOWLEDGEMENTS

Erwin Harte provided useful feedback and the de-MIMEing code.

SEE ALSO

perl, BerkeleyDB.