NAME

findsimilars - Fast similar-files finder

SYNOPSIS

  [perl -S] findsimilars [--level=1] [dirs...]

DESCRIPTION

findsimilars flags files with similar sizes and similar names as suspected duplicates.

Basic Usage

Nothing describes the tool better than its actual output. Here is an example listing of suspected duplicates:

  $ findsimilars -l 1 test
  ## =========
             3 'CardLayoutTest.java' 'test/'
             5 'TestLayout.java' 'test/'

  ## =========
             4 'Python Standard Library.chm'              'test/'
             4 'GNU - Python Standard Library (2001).chm' 'test/'
             5 'GNU - 2001 - Python Standard Library.rar' 'test/'

The first column is the file size, the second the name, and the third the path. findsimilars picks similar-sized and similar-named files as suspected duplicates and shows them in groups. The listing deliberately errs on the side of overkill: I would rather the program wrongly flag some files than miss a duplicate that would otherwise take me years to notice.

By default, findsimilars assumes that similar files within the same folder are OK. Hence you will not get duplicate warnings for generated files (such as .o, .class, .aux, and .dvi files) or other file series.

Once you are sure that there are no duplicates across different folders and want findsimilars to dig deeper within the same folder, specify the --level=1 command line switch (or -l 1). This is very useful for eliminating similar mp3 files within the same folder, or downloaded files from big sites where different packaging methods are used, e.g.:

  ## =========
         66138 jdc-src.tar.gz  .../ftp.ora.com/published/oreilly/java/javadc
        147904 jdc-src.zip     .../ftp.ora.com/published/oreilly/java/javadc

Advanced Usages

The command line parameters can be a list of dir names or file names.

The command line parameter can also be a '-', a special case in which file information is read from stdin:

  find /path/you/want \( -type f -o -type l \) -follow -printf "%p\t%s\n" | findsimilars -

You can change the find (or findsimilars) parameters, or cache or filter the find result, but the find output format must remain as shown: the path, a tab, then the size in bytes, one file per line.
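
As a rough illustration of caching and filtering, the following sketch wraps the find command above in two small shell helpers. The function names make_list and keep_ext are hypothetical and not part of findsimilars. The sketch assumes GNU find (for -printf) and a POSIX awk, and it assumes file names contain no tabs or newlines.

```shell
# make_list DIR: print "path<TAB>size" for every file under DIR,
# using the same find invocation shown above (requires GNU find).
make_list() {
    find "$1" \( -type f -o -type l \) -follow -printf "%p\t%s\n"
}

# keep_ext EXT: read a list on stdin and keep only the lines whose path
# ends in .EXT (case-insensitive). Lines pass through unmodified, so the
# "path<TAB>size" format is preserved.
keep_ext() {
    awk -F '\t' -v ext="$1" 'tolower($1) ~ ("\\." tolower(ext) "$")'
}
```

With these helpers, the slow directory scan runs once and the cached list can be filtered as often as needed (files.tsv is an arbitrary file name):

  make_list /path/you/want > files.tsv
  keep_ext mp3 < files.tsv | findsimilars -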

AUTHOR

 @Author:  SUN, Tong <suntong at cpan.org>
 @HomeURL: http://xpt.sourceforge.net/

SEE ALSO

File::Compare(3), File::Find::Duplicates(3)

perl(1).

COPYRIGHT

Copyright (c) 1997-2008 Tong SUN. All rights reserved.

TODO