find_duplicate_perl lib some_other_dir


Takes a list of directories and files as arguments and searches for duplicate Perl code in all files it finds there. For directories, we (currently) match files ending in .pm, .pl and .t. This limitation can be avoided by passing in the list of files directly ... or submitting a patch.
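
The file selection described above can be sketched roughly as follows. This is a minimal illustration using the core File::Find module, not the program's own code; the helper name perl_files_for is hypothetical, and the assumption is that explicitly named files are accepted whatever their extension.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper (not the tool's own code): expand a mixed list of
# files and directories into the files that would be searched.
sub perl_files_for {
    my @targets = @_;
    my @files;
    for my $target (@targets) {
        if ( -d $target ) {
            # Inside directories, keep only .pm, .pl and .t files.
            File::Find::find(
                {
                    no_chdir => 1,    # $_ holds the full path
                    wanted   => sub {
                        push @files, $_ if -f $_ && /\.(?:pm|pl|t)\z/;
                    },
                },
                $target
            );
        }
        elsif ( -f $target ) {
            # Files named explicitly bypass the extension filter.
            push @files, $target;
        }
    }
    my @sorted = sort @files;
    return @sorted;
}

print "$_\n" for perl_files_for(@ARGV);
```

Run with the same arguments as the program itself, this prints one candidate file per line.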

Because the program can take a long time to run, we use Term::ProgressBar to track the progress. Note that even though it shows an ETA (estimated time of arrival) for how long we'll take to run, this number is often wildly inaccurate and is there to make you feel better.


 --window=$window    Set minimum number of lines to look for duplicate code (default 5)
 --exact             If used, variable and sub names must match exactly
 --ignore            A regex of duplicate code snippets to ignore (may be repeated)
 --show_warnings     If for some reason a file cannot load, use this to show the reason why
 --jobs              Number of jobs to run (default 1)
 --threshold         Between 0 and 1. Fraction of lines which must match \w (default .75)

  • --window=5

    Takes an integer argument.

    By default, we compare five lines of code per file with five lines of code in other files. You can use this option to change the window size. For example, to be very aggressive:

     --window 3
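
    The windowed comparison described above can be sketched roughly as follows. This is a minimal illustration, not the program's own implementation: it compares raw lines as they stand, with no normalization of layout or names. The helper name matching_windows is hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return [left_index, right_index] pairs for every
    # position where $window consecutive lines of @$left are identical to
    # $window consecutive lines of @$right.
    sub matching_windows {
        my ( $left, $right, $window ) = @_;

        # Index each window of the left file by its joined text.
        my %seen;
        for my $i ( 0 .. @$left - $window ) {
            my $key = join "\n", @{$left}[ $i .. $i + $window - 1 ];
            push @{ $seen{$key} }, $i;
        }

        # Look up each window of the right file in that index.
        my @matches;
        for my $j ( 0 .. @$right - $window ) {
            my $key = join "\n", @{$right}[ $j .. $j + $window - 1 ];
            push @matches, map { [ $_, $j ] } @{ $seen{$key} || [] };
        }
        return @matches;
    }

    my @pairs = matching_windows( [qw(a b c d e)], [qw(x b c d y)], 3 );
    print "left line $_->[0] matches right line $_->[1]\n" for @pairs;
    ```

    A smaller window finds more (and shorter) matches, which is why a value like 3 is the aggressive setting.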

  • --exact

    Takes no arguments.

    By default, we ignore differences in variable names and subroutine names. The following will be considered duplicates:

         return $url;                  |     return $table;
     }                                 | }
     sub _build_external_urls {        | sub run {
         my($self) = @_;               |     my($self) = @_;
         my $request = $self->request; |     my $request = $self->request;

    You may pass the --exact flag to say that variable names and subroutine names must match exactly.

  • --ignore='Exception->throw\("Undefined url:'

    Takes a string argument. String will be interpreted as a regex.

    Used to pass regular expressions which, if matching a block of duplicate code, will cause that block to not be reported as duplicated. You may repeat this switch, if needed. This is very useful when you have large blocks of auto-generated code.
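
    The filtering described above can be sketched roughly as follows. This is only an illustration of the idea, not the program's own code; the helper name filter_ignored is hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: drop every duplicate block that matches at least
    # one of the --ignore patterns.
    sub filter_ignored {
        my ( $ignore, @blocks ) = @_;

        # Each --ignore string is compiled as a regex.
        my @patterns = map { qr/$_/ } @$ignore;

        my @kept;
        BLOCK: for my $block (@blocks) {
            for my $pattern (@patterns) {
                next BLOCK if $block =~ $pattern;
            }
            push @kept, $block;
        }
        return @kept;
    }

    my @kept = filter_ignored(
        ['Exception->throw\("Undefined url:'],
        'Exception->throw("Undefined url: $url");',
        'my $request = $self->request;',
    );
    print "$_\n" for @kept;
    ```

    Because the switch may be repeated, the sketch takes a list of patterns and a block is dropped as soon as any one of them matches.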

  • --show_warnings

    Takes no arguments.

    When we're looking for duplicates, we normalize the program layout via a customized version of B::Deparse. Sometimes we cannot load our target program (for example, if it does not compile). A brief warning will be emitted. You can pass --show_warnings to get the full warning, if needed.

  • --jobs=4

    Takes an integer argument.

    By default we only use one job. If you pass an integer to this, we will attempt to launch (fork) that many jobs. We use Parallel::ForkManager for this. This can dramatically speed up a search for duplicate code.

  • --threshold=.5

    Takes a floating point argument between 0 and 1, inclusive.

    The --threshold represents a fraction (.75 means 75%). When a duplicate section of code is found, the fraction of its lines containing "word" characters must exceed the threshold for the section to be reported. This is done to prevent spurious reporting of chunks of code like this:

             };          |         };
         }               |     }
         return \@data;  |     return \@attrs;
     }                   | }
     sub _confirm {      | sub _execute {

    Only 2 of the 5 lines above (40%) contain word (qr/\w/) characters. That is below the default threshold of .75, so the block will not be reported.
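
The threshold check can be sketched roughly as follows, using the left-hand block from the example above. This is a minimal illustration rather than the program's own code; the helper name word_line_ratio is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of lines containing at least one word
# (\w) character.
sub word_line_ratio {
    my @lines = @_;
    return 0 unless @lines;
    my $with_words = grep { /\w/ } @lines;
    return $with_words / @lines;
}

my @block = (
    '        };',
    '    }',
    '    return \@data;',
    '}',
    'sub _confirm {',
);

my $threshold = 0.75;                      # the documented default
my $ratio     = word_line_ratio(@block);   # 2 of 5 lines contain \w

# The ratio must exceed the threshold for the block to be reported.
printf "%.2f %s\n", $ratio, $ratio > $threshold ? 'reported' : 'skipped';
```

Lowering --threshold lets more punctuation-heavy blocks through, while raising it toward 1 restricts reports to blocks where nearly every line carries real code.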