dupfind - Detect and optionally remove duplicate files
Finds duplicate files in a directory tree. Options are explained in detail below. Options marked with an asterisk (*) are not yet implemented and are planned for a future release.
dupfind [ --options ] --dir ./path/to/search/ dupfind --threads 4 --maxdepth 100 --bytes 1099511627776 --dir /some/dir dupfind --ramcache 0 --format robot --dir ~/dedup/this
- -b, --bytes
Maximum file size in bytes that you are willing to compare. The current default maximum is 1 gigabyte.
Sizing guide: 1 kilobyte = 1024 1 megabyte = 1048576 or 1024 ** 2 1 gigabyte = 1073741824 or 1024 ** 3 1 terabyte = 1099511627776 or 1024 ** 4
Integer indicating the maximum file size to put into the cache of computed file digests. Note that this is NOT the max amount of RAM to consume for the cache. (see --ramcache) Default value: 1 megabyte
- -d, --dir
Name of the directory you want to search for duplicates
- -f, --format
Specify either "human" or "robot". Human-readable output is generated for easy viewing by default. If you want output that is machine-parsable, specify "robot"
- -g, --progress
Display a progress bar. Why "-g"? Because I ran out of "-p"s
*Follow symlinks (by default it does not). Because this has some safety implications and is a complex matter, it is not yet supported. Sorry, check back later.
- -m, --maxdepth
The maximum directory depth to which the comparison scan will recurse. Note that this does not mean the total number of directories to scan.
- -p, --prompt
Interactively prompt user to delete detected duplicates
- -q, --quiet
Provide no status messages prior to, and no summary messages after output.
Also turns off progress bar.
Integer indicating the number of bytes of RAM to consume for the cache of computed file digests. Note that dupfind will still use a substantial amount of memory for other internal purposes that don't have to do with the cache. Default: 100 megabytes. Set to 0 to disable ram cache.
- -x, --remove
CAUTION: Delete WITHOUT PROMPTING all but the first copy if duplicate files are found. This will leave you with no duplicate files when execution is finished.
- -t, --threads
Number of threads to use for file comparisons. Defaults to 0, or non-threaded.
You'll often get best performance on spindle devices by using a number of threads equal to the number of logical processors on your system, plus 1.
See CAVEATS for important info about threads.
- -v, --verbose
Gives you a progress bar (like --progress) and some extra, helpful output if you need more detail about file dupes detected by dupfind.
- -w, --weedout
Either yes or no. (Default yes). Tries to avoid unnecessary file hashing by weeding out potential duplicates with a simple, quick comparison of the last 1024 bytes of data in same-size files. This typically produces very significant performance gains, especially with large numbers of files.
One or more of the following "weeding" pass filters can be specified. Weed-out passes reduce the amount of cryptographic digest calculations that must happen by weeding out potential-duplicates:
first (checks first few bytes of each file) middle (checks the center-most single byte) last (checks the last few bytes of files) middle_last (checks middle byte and last few bytes) first_middle_last (first bytes, middle byte, last bytes)
The default is to only run the "first_middle_last" weed-out filter pass, which usually yields the best results in terms of speed.
Weed-out filters are executed in the order you specify.
Example usage: dupfind --wpass first --wpass last [...]
Integer indicating the number of bytes to read per file during a weed-out pass. Default: 32. If your weed-out pass type reads a file in two or more places, this value will be used for each read except "middle" reads, which are always 1 byte only.
Your Perl needs to be version 5.010 or better and has to support threading if you want to use the --threads option.
Deduplication is resource-intensive, and threading has a multiplier effect. The more threads you use, the more RAM will be consumed. If you are calling dupfind on a very large directory, you could risk running out of RAM.
An example of such a scenario would be to run dupfind with 8 threads against a 100 gigabyte directory containing 2 million files on a machine with only 2 GB of RAM. That will almost certainly run out of memory.
You should avoid attempting to run dupfind against very large volumes with several millions of files unless you have a lot of RAM, CPU, and time.
To contact the author, visit http://www.atrixnet.com/contact and leave a message
Copyright(C) 2013-2017, Tommy Butler. All rights reserved.
This library is free software, you may redistribute it and/or modify it under the same terms as Perl itself. For more details, see the full text of the LICENSE file that is included in this distribution.