NAME

App::dupfind::Common - Public methods for the App::dupfind deduplication engine

VERSION

version 0.172690

DESCRIPTION

The methods from this module, together with those from App::dupfind::Guts, are composed into the App::dupfind class in order to provide the user with the high-level methods that are directly callable from the user's application.

INTERNALS

There are some implementation details that don't really matter to the end user, but they are briefly discussed here because they are concepts used throughout the codebase and referred to by a large number of the documented class methods.

THE DUPLICATES MASTER HASH

Potential duplicate files are kept in groupings of same-size files, organized by file size. They are tracked in a hashref datastructure.

Specifically, the keys of the hashref are the integers indicating file sizes in bytes. The corresponding value for each of these "size" keys is a listref containing the group of filenames that are of that file size.

Random example:

   $dupes =
      {
         0 => # a zero-size file
            [
               '~/run/some_file.lock',
            ],

         1024 => # some files that are 1024 bytes
            [
               '~/Pictures/kitty.jpg',
               '~/Pictures/cat.jpg',
               '~/.cache/foo',
            ],

         4096 => # some files that are 4096 bytes
            [
               '~/Documents/notes.txt',
               '~/Downloads/bar.gif',
            ],
      };

METHODS

cache_stats

Retrieves information about cache hits/misses that occurred during the calculation of file digests in the digest_dups method. Used as part of the run summary that gets printed out at the end of execution of $bin/dupfind.

Returns $cache_hits, $cache_misses (both integers)
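
For illustration, a typical call might look like the following, where $app is assumed to be an already-constructed App::dupfind object (the variable name is illustrative):

   my ( $cache_hits, $cache_misses ) = $app->cache_stats;

   printf "Cache hits: %d | cache misses: %d\n", $cache_hits, $cache_misses;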

count_dups

Examines its argument, which is expected to be a datastructure in the form of the master dupes hashref, and sums up the number of files it contains across all of its same-size groupings.

Returns $dup_count (integer)
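
The counting idea can be pictured with a minimal standalone sketch (not the module's actual code), where $dupes is assumed to be the master dupes hashref:

   use List::Util 'sum0';

   # Sum the number of files across every same-size group
   my $dup_count = sum0 map { scalar @$_ } values %$dupes;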

delete_dups

Deletes duplicate files, optionally prompting the user to choose which files to delete and to confirm the deletion (if the command-line parameters supplied by the user indicate that interactive prompting is desired).

Returns nothing

digest_dups

Expects a datastructure in the form of the master dupes hashref.

Iterates over the datastructure and calculates digests for each of the files.

If ramcache is enabled (which is the default), a rudimentary caching mechanism is used in order to avoid calculating digests multiple times for files with the same content.

Returns a lexical copy of the duplicates hashref with non-dupes removed
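
The caching idea can be illustrated with a short sketch. Digest::MD5 is used here purely for illustration, and the cache structure is hypothetical; the module's actual digest algorithm and caching implementation may differ:

   use strict;
   use warnings;
   use Digest::MD5 'md5_hex';

   my %digest_cache;   # content-keyed cache: raw file bytes => digest

   # Return a digest for $file, reusing a cached digest whenever a file
   # with identical content has already been hashed (sketch only)
   sub digest_file
   {
      my $file = shift;

      open my $fh, '<', $file or die "open $file: $!";
      binmode $fh;
      my $data = do { local $/; <$fh> };
      close $fh;

      return $digest_cache{ $data } //= md5_hex( $data );
   }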

get_size_dups

Scans the directory specified by the user and assembles the master dupes hashref datastructure as described above. Files with no same-size counterparts are not included in the datastructure.

Returns $dupes_hashref, $scan_count, $size_dup_count, where $dupes_hashref is the master duplicates hashref, $scan_count is the number of files that were scanned, and $size_dup_count is the total number of files contained in the same-size groups.
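
The size-grouping concept can be sketched with core File::Find as shown below (the real implementation differs considerably and supports threading, filters, and so on); $dir is an assumed directory path:

   use strict;
   use warnings;
   use File::Find;

   my $dir = shift @ARGV // '.';

   my %dupes;
   my $scan_count = 0;

   # Group regular files under $dir by their size in bytes
   find sub
   {
      return unless -f $_;

      $scan_count++;

      push @{ $dupes{ -s _ } }, $File::Find::name;
   }, $dir;

   # Discard any size group with only one member -- it can't be a dup
   delete $dupes{ $_ } for grep { @{ $dupes{ $_ } } < 2 } keys %dupes;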

opts

A read-only accessor method that returns a hashref of options, as determined by the default settings and/or user input supplied at invocation time.

Examples:

   $self->opts->{threads}  # contains the number of threads the user wants
   $self->opts->{dir}      # name of the directory to scan for duplicates

say_stderr

The same as Perl's built-in say function, except that:

  • It is a class method

  • It outputs to STDERR instead of STDOUT
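
In other words, its behavior is roughly equivalent to the following sketch (the real method may differ in detail):

   sub say_stderr
   {
      my $self = shift;   # invocant is discarded; this is a class method

      print STDERR @_, "\n";

      return;
   }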

show_dups

Expects a datastructure in the form of the master dupes hashref.

Produces the formatted output for $bin/dupfind based on what duplicate files were found during execution. Currently two output formats are supported: "human" and "robot". The robot output is easily machine-parsable, while the human output is formatted to be more readable for human users.

Returns the number of duplicates shown.

sort_dups

Expects a datastructure in the form of the master dupes hashref.

Iterates through the hashref and examines the listrefs of file names that comprise its values. It then sorts the listrefs in place with the following sort:

   sort { $a cmp $b }

Returns a lexical copy of the newly-sorted master duplicates hashref
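
The in-place sort over all of the size groups can be pictured like this, where $dupes is assumed to be the master dupes hashref (a sketch of the concept, not the module's exact code):

   # Alphabetize every group of filenames in place
   @$_ = sort { $a cmp $b } @$_ for values %$dupes;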

toss_out_hardlinks

Expects a datastructure in the form of the master dupes hashref.

Iterates through the hashref and examines the listrefs of file names that comprise its values.

For each file in each group, it looks at the underlying storage for the file on the storage medium using a stat call. Any files that share both the same device number AND the same inode number are hard links to the same data.

After alphabetizing any hard links that are detected, it throws out all hard links but the first one. This simplifies the output; the rationale is that a hard link constitutes a file that has already been deduplicated, because it refers to the same underlying storage.

Returns a lexical copy of the master duplicates hashref
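
The detection step can be sketched as follows, keying each file on its device and inode numbers as returned by stat. The drop_hardlinks name is hypothetical, and the actual method does more than this:

   use strict;
   use warnings;

   # Keep only one filename per unique ( device, inode ) pair within a
   # same-size group -- the rest are hard links to the same storage
   sub drop_hardlinks
   {
      my @group = sort { $a cmp $b } @_;

      my ( %seen, @kept );

      for my $file ( @group )
      {
         my ( $dev, $ino ) = ( stat $file )[ 0, 1 ];

         push @kept, $file unless $seen{ "$dev:$ino" }++;
      }

      return @kept;
   }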

weed_dups

Expects a datastructure in the form of the master dupes hashref.

Runs the weed-out pass(es) on the datastructure in an attempt to eliminate as many non-duplicate files as possible from the same-size file groupings without having to resort to resource-intensive file hashing (i.e., the calculation of file digests).

If no potential duplicates remain after the weed-out pass(es), then the need for hashing is obviated and it is not performed. For any remaining potential duplicates, however, hashing is ultimately used to make the final decision on file uniqueness.

One or more passes may be performed, based on user input. Currently the default is to perform a single pass using the "first_middle_last" weed-out algorithm, which has so far proved to be the most efficient.

Returns a (hopefully reduced) lexical copy of the master duplicates hashref
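
The idea behind the "first_middle_last" pass is to read only a few bytes from the start, middle, and end of each file and to discard files whose sampled bytes are unique within their same-size group. The following is a simplified, hypothetical sketch; the chunk size and helper names are illustrative, and the real pass is more sophisticated:

   use strict;
   use warnings;

   # Sample a few bytes from the start, middle, and end of a file
   sub first_middle_last_key
   {
      my ( $file, $size ) = @_;
      my $len = 64;

      open my $fh, '<', $file or die "open $file: $!";
      binmode $fh;

      my $key = '';

      for my $offset ( 0, int( $size / 2 ), $size - $len )
      {
         my $chunk = '';
         seek $fh, ( $offset < 0 ? 0 : $offset ), 0;
         read $fh, $chunk, $len;
         $key .= $chunk;
      }

      close $fh;

      return $key;
   }

   # Within one same-size group, keep only the files whose sampled bytes
   # collide with at least one other file's sampled bytes
   sub weed_group
   {
      my ( $size, @files ) = @_;

      my %by_key;

      push @{ $by_key{ first_middle_last_key( $_, $size ) } }, $_ for @files;

      return map { @$_ } grep { @$_ > 1 } values %by_key;
   }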