NAME

``  Text::Bloom - Evaluate Bloom signature of a set of terms``

SYNOPSIS

``````  my \$b = Text::Bloom->new();
\$b->Compute( qw( foo bar baz ) );
my \$sig = \$b->WriteToString();
\$b->WriteToFile( 'afile.sig' );
my \$b2 = Text::Bloom::NewFromFile( 'afile.sig' );
my \$b3 = Text::Bloom->new();
\$b3->Compute( qw( foo bar barbaz ) );
my \$sim = \$b->Similarity( \$b2 );
my \$b4 = Text::Bloom::NewFromString( \$sig );``````

DESCRIPTION

`Text::Bloom` applies the Bloom filtering technique to the statistical analysis of documents.

The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.

Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the `HashV` function.

Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.

Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.

The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.

The signature may be used for:

• testing whether a given set of terms is present in the document,

• computing which fraction of terms are common to two documents.

The bit representation may be written to and read from a file. `Text::Bloom` prepends a header to the bit stream proper; moreover, whenever the package `Compress::Zlib` is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.

The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression `/[0-9a-z]+/`.

There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method `QuantizeV`.

FORESEEN REUSE

The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a `V` suffix. Redefining other methods will require greater attention.

CLASS METHODS

new

The constructor. No arguments are required.

``  \$b = Text::Bloom->new();``

NewFromString

Take a string written by `WriteToString` (see below) and create a new `Text::Bloom` with the same contents; call `die` whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable.

``  my \$b = Text::Bloom::NewFromString( \$str );``

The return value is a blessed reference; put in another way, this is an alternative contructor.

The string should have been written by `WriteToString`; you may of course tweak the string contents, but at this point you're entirely on you own.

NewFromFile

Utility function that reads a binary file and performs a `NewFromString` on its content; see its counterpart, `WriteToFile`.

``  my \$b2 = Text::Document::NewFromFile( 'foo.sig' );``

INSTANCE METHODS

Size

Set and get the size of the filter, in bits. The default size is currently 128K.

``````  print 'size is ' . \$b->Size() . "\n";
\$b->Size( 65536 );``````

The `Size` method must be called before the `Compute` method in order to have effect.

Compute

Compute the Bloom signature from the given set of words and store it internally.

``  \$b->Compute( qw( foo bar baz foobar bazbaz ) );``

Makes use of the `QuantizeV` method.

QuantizeV

Convert a term into an integer; must return an integer in the range 0 .. `\$Text::Bloom::p-1`.

It is called as

``  my \$hash = \$b->QuantizeV( \$term );``

The current version is designed for strings matching `/[a-z0-9]+/`. Other characters do not cause errors, but degrade the hash function performance.

This function is a likely candidate for redefinition.

HashV

Convert an integer to a (smaller) integer, according to one of a class of similar functions.

It is internally called as:

``  my \$index = \$b->HashV( \$order, \$value );``

The `\$value` must belong to the interval 0..`\$Text::Bloom::p-1`, while the index must lie in 0..size-1. `\$order` is a small integer from 0 to d-1.

The default implementation is

``  index = m[order] * value + q[order]   (mod size) ``

the values of m and q are taken from the array `@Text::Bloom::hashParam`; the form of the function is taken from [2].

WriteToString

Convert the Bloom signature into a string which can be saved and later restored with `NewFromString`. `Compute` must have been called previously.

``  my \$str = \$b->WriteToString();``

The string begins with a header which encodes the originating package, its version, the parameters of the current instance.

Whenever possible, `Compress::Zlib` is used in order to compress the bit vector in the most efficient way. On systems without `Compress::Zlib`, the bit string is saved uncompressed.

WriteToFile

These convenience functions just call their String counterparts and read/write the file specified in the argument.

``  \$b->WriteToFile( 'foo.sig' );``

AUTHORS

``````  spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)``````

BIBLIOGRAPHY

[1]

Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library).

[2]

M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", Communications of the ACM, 32, 10, October 1989, pages 1237-1239. (available electronically from ACM Digital Library).

BUGS

On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.

HISTORY

``  2001-11-02 - initial revision``