Bloom16 - Perl extension for "threshold" Bloom filters
use Bloom16; $b = new Bloom16;
Efficiently recognize how many times an item has been seen.
Bloom filters are a very nifty way of determining set membership to a high degree of accuracy with minimal storage requirements. Use it if a small number of false positives are compatible with your requirements.
The idea is to create a large bit vector and generate several hash codes for each data item. Flip all corresponding bits to "1", so that if any bit is "0" when entering an item, it is garuanteed to be new. If all bits are "1" it has already been seen (to a high probability). False positives are very low until a certain capacity has been reached, namely ~0.15N where N is the number of bits in the vector.
In other words, you need less than one byte for recognizing each item. This implementation adds a counter of 4 bits (bringing requirements to ~4 bytes per item) which is used to give the number of times an item has been seen (up to 15). This is useful if you need to output items only once but you don't want to have to store all seen items, and further that you only want to output them if they occur more than say 10 times. Simply output only in the case that the filter has seen them exactly 10 times.
The particular constants used in this implementation are: K = number of hashes per item (4) N = number of bits in hash (I*8) rounded to the next-highest power of 2.
This yields 4-8 megs per million items to store.