 NAME
 SYNOPSIS
 DESCRIPTION
 FORESEEN REUSE
 CLASS METHODS
 INSTANCE METHODS
 AUTHORS
 BIBLIOGRAPHY
 BUGS
 HISTORY
NAME
Text::Bloom  Evaluate Bloom signature of a set of terms
SYNOPSIS
my $b = Text::Bloom>new();
$b>Compute( qw( foo bar baz ) );
my $sig = $b>WriteToString();
$b>WriteToFile( 'afile.sig' );
my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
my $b3 = Text::Bloom>new();
$b3>Compute( qw( foo bar barbaz ) );
my $sim = $b>Similarity( $b2 );
my $b4 = Text::Bloom::NewFromString( $sig );
DESCRIPTION
Text::Bloom
applies the Bloom filtering technique to the statistical analysis of documents.
The terms in the document are quantized using a base36 radix representation; each term thus corresponds to an integer in the range 0..p1, where p is a prime, currently set to the greatest prime less than 2^32.
Each quantized value is mapped to d integers in the range 0..size1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the HashV
function.
Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a preset dictionary of terms.
The signature may be used for:
testing whether a given set of terms is present in the document,
computing which fraction of terms are common to two documents.
The bit representation may be written to and read from a file. Text::Bloom
prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib
is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /[09az]+/
.
There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV
.
FORESEEN REUSE
The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a V
suffix. Redefining other methods will require greater attention.
CLASS METHODS
new
The constructor. No arguments are required.
$b = Text::Bloom>new();
NewFromString
Take a string written by WriteToString
(see below) and create a new Text::Bloom
with the same contents; call die
whenever the restore is impossible or illadvised, for instance when the current version of the package is different from the original one, or the compression library in unavailable.
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way, this is an alternative contructor.
The string should have been written by WriteToString
; you may of course tweak the string contents, but at this point you're entirely on you own.
NewFromFile
Utility function that reads a binary file and performs a NewFromString
on its content; see its counterpart, WriteToFile
.
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
INSTANCE METHODS
Size
Set and get the size of the filter, in bits. The default size is currently 128K.
print 'size is ' . $b>Size() . "\n";
$b>Size( 65536 );
The Size
method must be called before the Compute
method in order to have effect.
Compute
Compute the Bloom signature from the given set of words and store it internally.
$b>Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the QuantizeV
method.
QuantizeV
Convert a term into an integer; must return an integer in the range 0 .. $Text::Bloom::p1
.
It is called as
my $hash = $b>QuantizeV( $term );
The current version is designed for strings matching /[az09]+/
. Other characters do not cause errors, but degrade the hash function performance.
This function is a likely candidate for redefinition.
HashV
Convert an integer to a (smaller) integer, according to one of a class of similar functions.
It is internally called as:
my $index = $b>HashV( $order, $value );
The $value
must belong to the interval 0..$Text::Bloom::p1
, while the index must lie in 0..size1. $order
is a small integer from 0 to d1.
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of m and q are taken from the array @Text::Bloom::hashParam
; the form of the function is taken from [2].
WriteToString
Convert the Bloom signature into a string which can be saved and later restored with NewFromString
. Compute
must have been called previously.
my $str = $b>WriteToString();
The string begins with a header which encodes the originating package, its version, the parameters of the current instance.
Whenever possible, Compress::Zlib
is used in order to compress the bit vector in the most efficient way. On systems without Compress::Zlib
, the bit string is saved uncompressed.
WriteToFile
These convenience functions just call their String counterparts and read/write the file specified in the argument.
$b>WriteToFile( 'foo.sig' );
AUTHORS
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
BIBLIOGRAPHY
 [1]

Burton H. Bloom, "Space/time tradeoffs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422426. (available electronically from ACM Digital Library).
 [2]

M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel FreeText Searching", Communications of the ACM, 32, 10, October 1989, pages 12371239. (available electronically from ACM Digital Library).
BUGS
On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.
HISTORY
20011102  initial revision