Jan Pomikálek


Text::DeDuper - near duplicates detection module


    use Text::DeDuper;

    $deduper = new Text::DeDuper();
    $deduper->add_doc("doc1", $doc1text);
    $deduper->add_doc("doc2", $doc2text);
    @similar_docs = $deduper->find_similar($doc3text);


    # delete near duplicates from an array of texts
    $deduper = new Text::DeDuper();
    foreach $text (@texts) {
        next if $deduper->find_similar($text);
        $deduper->add_doc($i++, $text);
        push @no_near_duplicates, $text;
    }


This module uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.

Note of caution: The module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetic characters. It may therefore give poor results for languages such as Chinese.



    $deduper = new Text::DeDuper(<attribute-value-pairs>);

Create a new DeDuper instance. Supported attributes are described below, in the Attributes section.


    $deduper->add_doc($document_id, $document_text);

Add a new document to the DeDuper's database. The $document_id must be unique for each document.


    $deduper->find_similar($document_text);

Returns a (possibly empty) array of the IDs of documents in the DeDuper's database that are similar to $document_text. This makes it easy to test whether a near-duplicate document is already in the database:

    if ($deduper->find_similar($document_text)) {
        print "at least one near duplicate found";
    }



Removes all documents from DeDuper's database.


Attributes can be set using the constructor:

    $deduper = new Text::DeDuper(
        ngram_size => 4,
        encoding   => 'iso-8859-1'
    );

... or using the object methods:

    $deduper->ngram_size(4);
    $deduper->encoding('iso-8859-1');

The object methods can also be used for retrieving the values of the attributes:

    $ngram_size = $deduper->ngram_size();
    @stoplist   = $deduper->stoplist();

The character encoding of the processed texts. It must be set to the correct value so that alphabetic characters can be detected. Accepted values are those supported by the Encode module (see Encode::Supported).

default: 'utf8'


The similarity threshold defines how similar two documents must be to be considered near duplicates. The boundary values are 0 and 1. A similarity value of 1 indicates that the documents are exactly the same. A value of 0, on the other hand, means that the documents do not share any n-gram.

Any two documents will have a similarity value below the default threshold unless they share a significant part of their text.

default: 0.2
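
The following is a minimal sketch of how such a similarity value could be computed, assuming the usual definition of resemblance as the number of shared n-grams divided by the number of distinct n-grams in both documents. The function names (ngram_set, resemblance, is_near_duplicate), the lowercasing of words and the use of >= for the threshold test are assumptions made for illustration; they are not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: the set of word n-grams of a text, as a hash reference.
# Words are runs of alphabetic characters; lowercasing is an assumption.
sub ngram_set {
    my ($text, $n) = @_;
    my @words = map { lc $_ } ($text =~ /(\p{Alpha}+)/g);
    my %set;
    for my $i (0 .. $#words - $n + 1) {
        $set{ join ' ', @words[$i .. $i + $n - 1] } = 1;
    }
    return \%set;
}

# Resemblance: shared n-grams divided by all distinct n-grams of both texts.
sub resemblance {
    my ($text_a, $text_b, $n) = @_;
    my $set_a  = ngram_set($text_a, $n);
    my $set_b  = ngram_set($text_b, $n);
    my $common = grep { exists $set_b->{$_} } keys %$set_a;
    my $union  = scalar(keys %$set_a) + scalar(keys %$set_b) - $common;
    return $union ? $common / $union : 0;
}

# Assumption: a pair counts as near duplicates when the value reaches the threshold.
sub is_near_duplicate {
    my ($text_a, $text_b, $n, $threshold) = @_;
    return resemblance($text_a, $text_b, $n) >= $threshold;
}
```

With n = 5, the texts "she sells sea shells on the sea shore" and "she sells sea shells on the sea floor" share 3 of their 5 distinct 5-grams, so this sketch gives a value of 0.6, well above the default threshold of 0.2.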


The document similarity is based on how many n-grams the documents have in common. An n-gram is a sequence of n consecutive words. For example, the text

    she sells sea shells on the sea shore

contains the following 5-grams:

    she sells sea shells on
    sells sea shells on the
    sea shells on the sea
    shells on the sea shore

This attribute specifies the value of n (the size of n-gram).

default: 5
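
As an illustration of the definition above, here is a short sketch that lists the word n-grams of a text. The function name word_ngrams and the tokenisation rule (runs of alphabetic characters) are assumptions for this example and do not describe the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper: all word n-grams of a text, in order of appearance.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = ($text =~ /(\p{Alpha}+)/g);
    my @ngrams;
    for my $i (0 .. $#words - $n + 1) {
        push @ngrams, join ' ', @words[$i .. $i + $n - 1];
    }
    return @ngrams;
}

# Prints the four 5-grams of the example sentence, one per line.
print "$_\n" for word_ngrams('she sells sea shells on the sea shore', 5);
```

A text with fewer than n words yields no n-grams at all in this sketch, which suggests why very short texts are hard to compare with a large n.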


The stoplist is a list of very frequent words for a given language (for English e.g. a, the, is, ...). It is a good idea to remove the stoplist words from the texts before the similarity is computed, because two documents are quite likely to share n-grams of frequent words even if they are not similar at all.

The stoplist can be specified either as an array of words or as the name of a file in which the words are stored one per line:

    $deduper->stoplist('a', 'the', 'is', @next_stopwords);

Do not worry if you do not have a stoplist for your language. DeDuper will do a pretty good job even without one.

default: empty
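
The effect of a stoplist can be sketched as a simple filter applied to the words of a text before n-grams are built. The function name remove_stopwords and the case-insensitive matching are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: drop stoplist words from a list of words.
# Matching is case-insensitive here, which is an assumption.
sub remove_stopwords {
    my ($words, $stoplist) = @_;
    my %stop = map { lc($_) => 1 } @$stoplist;
    return grep { !$stop{ lc $_ } } @$words;
}

my @words    = qw(the cat is on the mat);
my @filtered = remove_stopwords(\@words, [qw(a the is on)]);
print "@filtered\n";    # prints "cat mat"
```

After filtering, only the less frequent words remain, so any n-grams two documents share are more likely to reflect real overlap in content.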



Encode: for decoding texts in various character encodings into Perl's internal form.


Digest::MD4: for n-gram hashing optimisation.


Please report any bugs or feature requests to bug-Text-DeDuper@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.


Encode, Encode::Supported, Digest::MD4

Andrei Z. Broder et al., Syntactic Clustering of the Web


Contains, among other things, the definition of the resemblance measure.


Jan Pomikalek, <xpomikal@fi.muni.cz>


Copyright 2006 Jan Pomikalek, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
