NAME

File::ContentStore - A store for file content built with hard links

VERSION

version 1.004

SYNOPSIS

use File::ContentStore;

# the directory given as 'path' is expected to exist
my $store = File::ContentStore->new( path => "$ENV{HOME}/.photo_content" );
$store->link_dir( @collection_of_photo_directories );

DESCRIPTION

This module manages a content store as a collection of hard links to a set of files. The files in the content store are named after the digest of the content in the file.

When linking a new file to the content store, a hard link to the file is created in the store, named after the digest of the content. When a file whose content is already in the store is linked in, the file is replaced by a hard link to the existing content file in the store.
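The two cases above can be sketched with core Perl only. This is a minimal illustration, not the module's implementation: sketch_link_file is a hypothetical helper, SHA-1 with a single two-character sub-directory is assumed, and a careful implementation would link to a temporary name and rename it instead of unlinking first.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Basename qw( dirname );
use File::Path qw( make_path );
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper, not part of File::ContentStore.
sub sketch_link_file {
    my ( $store_dir, $file ) = @_;

    # the content file is named after the digest of the content
    my $hex = Digest::SHA->new(1)->addfile( $file, 'b' )->hexdigest;
    my $content = File::Spec->catfile( $store_dir, substr( $hex, 0, 2 ),
        substr( $hex, 2 ) );

    if ( -e $content ) {
        # content already in the store: the file becomes an alias of it
        unlink $file or die "unlink $file: $!";
        link $content, $file or die "link $content -> $file: $!";
    }
    else {
        # new content: the content file becomes a hard link to the file
        make_path( dirname($content) );
        link $file, $content or die "link $file -> $content: $!";
    }
    return $content;
}

my $dir   = tempdir( CLEANUP => 1 );
my $store = File::Spec->catdir( $dir, 'store' );
mkdir $store or die "mkdir $store: $!";

# file1 and file3 have identical content
my %content = ( file1 => "same\n", file2 => "other\n", file3 => "same\n" );
my %inode;
for my $name ( sort keys %content ) {
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} $content{$name};
    close $fh or die "close $path: $!";
    sketch_link_file( $store, $path );
}
for my $name ( sort keys %content ) {
    my $path = File::Spec->catfile( $dir, $name );
    $inode{$name} = ( stat $path )[1];
}

print "file1 and file3 share an inode\n" if $inode{file1} == $inode{file3};
```

The sketch requires a file system that supports hard links, and the store and the linked files must live on the same file system.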

Example and detailed operation

For a more complete definition of a hard link, see https://en.wikipedia.org/wiki/Hard_link.

Assume we have a directory containing the following files: file1 (inode 123456), file2 (inode 456789) and file3 (inode 789012, content identical to file1). In the examples below, files are sorted by inode.

After linking file1 into the content store, we have the following:

Directory                Content store
---------                -------------
[123456] file1           [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[456789] file2
[789012] file3

After linking file2:

Directory                Content store
---------                -------------
[123456] file1           [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[456789] file2           [456789] 8a/80/52e7a4f99c54b966a74144fe5761
[789012] file3

And finally, after linking file3, we have this:

Directory                Content store
---------                -------------
[123456] file1           [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[123456] file3
[456789] file2           [456789] 8a/80/52e7a4f99c54b966a74144fe5761

i.e. the inode that was holding the content of file3 is lost, and the name now points to the same inode as file1 and its content file.

file1 and file3 are now hard linked (or aliased) together, so any change made to one of them is in fact made to both. Note also that the disk space taken by duplicate files is reclaimed when they are linked through the content store.
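The aliasing effect can be observed directly with the core link and stat functions. The following is a small sketch that does not use File::ContentStore at all; the file names simply mirror the example above.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );

my $dir   = tempdir( CLEANUP => 1 );
my $file1 = File::Spec->catfile( $dir, 'file1' );
my $file3 = File::Spec->catfile( $dir, 'file3' );

open my $out, '>', $file1 or die "open $file1: $!";
print {$out} "original\n";
close $out or die "close $file1: $!";

# make file3 a second name for the inode of file1
link $file1, $file3 or die "link $file1 -> $file3: $!";

# both names report the same inode, whose link count is now 2
my ( $inode1, $nlink1 ) = ( stat $file1 )[ 1, 3 ];
my ( $inode3, $nlink3 ) = ( stat $file3 )[ 1, 3 ];

# appending through one name...
open my $append, '>>', $file3 or die "open $file3: $!";
print {$append} "changed\n";
close $append or die "close $file3: $!";

# ...is visible through the other
open my $in, '<', $file1 or die "open $file1: $!";
my $seen = do { local $/; <$in> };
close $in or die "close $file1: $!";

print $seen;
```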

If the goal is deduplication and hard-linking of identical files, once all the files have been linked through the content store, the content store is not needed any more, and can be deleted.

Note that permissions are attached to the inode (and not to the individual file names). As a consequence, a file linked into the content store sets the initial permissions of its content file if that content file does not exist yet; otherwise, the file inherits the permissions of the existing content file.

ATTRIBUTES

path

The location of the directory where the content files are stored. (Required.)

digest

The algorithm used to compute the content digest. (Default: SHA-1.)

Any string that is suitable for passing to the Digest module constructor is valid. The choice of a digest is a compromise between speed and risk of collisions.
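As an illustration of the digest names involved, the core Digest module can compute the digest of the empty string for a few common algorithms. The helper hex_digest_of is hypothetical and only wraps the Digest constructor.

```perl
use strict;
use warnings;
use Digest;

# Hypothetical helper: hex digest of a string for a given algorithm name.
sub hex_digest_of {
    my ( $algorithm, $data ) = @_;
    return Digest->new($algorithm)->add($data)->hexdigest;
}

# longer digests lower the risk of collisions, at some cost in speed
for my $algorithm (qw( MD5 SHA-1 SHA-256 )) {
    my $hex = hex_digest_of( $algorithm, '' );
    printf "%-7s %2d hex digits  %s\n", $algorithm, length($hex), $hex;
}
```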

parts

This internal attribute describes in how many parts (i.e. sub-directories) the content filename is split. It is computed automatically from digest.

For example, the empty file would be linked to:

# digest = MD4, parts = 1
31/d6cfe0d16ae931b73c59d7e0c089c0

# digest = MD5, parts = 1
d4/1d8cd98f00b204e9800998ecf8427e

# digest = SHA-1, parts = 1
da/39a3ee5e6b4b0d3255bfef95601890afd80709

# digest = SHA-256, parts = 2
e3/b0/c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
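The layouts above can be reproduced with a short function. This is only a sketch: content_path is a hypothetical helper, it assumes that each part consumes two hex digits (as in the examples), and it joins with "/" for readability instead of using platform-specific separators.

```perl
use strict;
use warnings;

# Hypothetical helper: split a hex digest into $parts sub-directories
# of two hex digits each, followed by the rest of the digest.
sub content_path {
    my ( $hex, $parts ) = @_;
    my @dirs;

    # four-argument substr removes the leading digits from the local copy
    push @dirs, substr( $hex, 0, 2, '' ) for 1 .. $parts;
    return join '/', @dirs, $hex;
}

my $sha1_path = content_path( 'da39a3ee5e6b4b0d3255bfef95601890afd80709', 1 );
my $sha256_path = content_path(
    'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 2 );

print "$sha1_path\n";
print "$sha256_path\n";
```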

check_for_collisions

When this boolean attribute is set to true, any time the content file for a file linked into the store already exists, the files will be compared for equality before linking them. This prevents data loss in case of collisions.

The default is true to avoid data loss.

If a collision is detected, the solution is to upgrade the digest to a stronger one.

# create an MD5 store
my $md5_store = File::ContentStore->new( path => $old, digest => 'MD5' );

# expose a collision
$md5_store->link_file($file);    # dies

# create a new SHA-1 store
my $sha1_store = File::ContentStore->new( path => $new, digest => 'SHA-1' );

# link the old content into the new store
# the files that were linked to the old store will be linked to the new one
$sha1_store->link_dir( $md5_store->path );
$sha1_store->link_file($file);            # success!

$md5_store->path->remove_tree;            # delete the old content store
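The equality check performed when a content file already exists can be pictured with the core File::Compare module. This is a sketch of the idea only; same_content and write_file are hypothetical helpers, and the module may compare files differently.

```perl
use strict;
use warnings;
use File::Compare qw( compare );
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper: true if both files hold the same bytes.
sub same_content {
    my ( $file, $content ) = @_;
    my $result = compare( $file, $content );    # 0 same, 1 different, -1 error
    die "cannot compare $file and $content: $!" if $result < 0;
    return $result == 0;
}

# Hypothetical helper: create a file with the given content.
sub write_file {
    my ( $path, $data ) = @_;
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} $data;
    close $fh or die "close $path: $!";
    return $path;
}

my $dir   = tempdir( CLEANUP => 1 );
my $first = write_file( File::Spec->catfile( $dir, 'first' ), "photo\n" );
my $copy  = write_file( File::Spec->catfile( $dir, 'copy' ),  "photo\n" );
my $other = write_file( File::Spec->catfile( $dir, 'other' ), "phot0\n" );

# identical files can be linked safely; if two different files were ever
# mapped to the same digest, linking them would destroy one of them
print "first/copy:  ", ( same_content( $first, $copy )  ? "same" : "differ" ), "\n";
print "first/other: ", ( same_content( $first, $other ) ? "same" : "differ" ), "\n";
```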

make_read_only

When this attribute is set to a true value, a chmod to remove the write permissions is performed on the content files (and therefore the linked files, since permissions are an attribute of the inode).

The default is true, to avoid unwittingly modifying linked files that were identical unbeknownst to the user.
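The effect of removing the write permissions can be sketched with core chmod and stat. The exact mode computation used by the module is not documented here, so the "clear all write bits" mask below is an assumption.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );

my $dir     = tempdir( CLEANUP => 1 );
my $content = File::Spec->catfile( $dir, 'content' );
my $alias   = File::Spec->catfile( $dir, 'alias' );

open my $fh, '>', $content or die "open $content: $!";
print {$fh} "data\n";
close $fh or die "close $content: $!";

# start from a known mode, then create a second name for the same inode
chmod 0644, $content or die "chmod $content: $!";
link $content, $alias or die "link $content -> $alias: $!";

# assumption: drop the write bits for user, group and others
my $mode = ( stat $content )[2] & 07777;
chmod $mode & ~0222, $content or die "chmod $content: $!";

# the alias shares the inode, and therefore the new permissions
my $alias_mode = ( stat $alias )[2] & 07777;
printf "alias mode is now %04o\n", $alias_mode;
```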

file_callback

This optional coderef is called by "link_file" when linking a file into the store. This is useful for providing user feedback when processing large directories. The callback receives three arguments: the file, its digest and the content file (files are passed as Path::Tiny objects). It is run right after obtaining the file digest, before doing anything else.

Usage example:

File::ContentStore->new(
    path          => $dir,
    file_callback => sub {
        my ( $file, $digest, $content ) = @_;
        print STDERR "Linking $file ($digest) to $content\n";
    }
);

METHODS

new

Constructor. See "ATTRIBUTES" for valid attributes.

link_file

$store->link_file($file);

Link a single file into the content store.

link_dir

$store->link_dir(@dirs);

Recursively link all the files under the given directories.
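To picture what a recursive traversal involves, here is a sketch based on the core File::Find module. The helper files_under is hypothetical, and skipping symbolic links is an assumption of this sketch, not documented behaviour of link_dir.

```perl
use strict;
use warnings;
use File::Find qw( find );
use File::Path qw( make_path );
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper: all plain files found under the given directories.
sub files_under {
    my (@dirs) = @_;
    my @files;
    find(
        {
            no_chdir => 1,    # $_ holds the full path of each entry
            wanted   => sub {
                push @files, $File::Find::name if -f $_ && !-l $_;
            },
        },
        @dirs
    );
    my @sorted = sort @files;
    return @sorted;
}

my $dir  = tempdir( CLEANUP => 1 );
my $deep = File::Spec->catdir( $dir, 'album', '2018' );
make_path($deep);

for my $path (
    File::Spec->catfile( $dir,  'top.jpg' ),
    File::Spec->catfile( $deep, 'a.jpg' ),
    File::Spec->catfile( $deep, 'b.jpg' ),
  )
{
    open my $fh, '>', $path or die "open $path: $!";
    close $fh or die "close $path: $!";
}

my @found = files_under($dir);
print "$_\n" for @found;
```

Each path returned by such a traversal would then be handed to link_file.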

fsck

Runs a consistency check on the content store (i.e. the files under path), and returns a hash reference containing all the errors found. If no error is found, the hash reference is empty.

The types of errors found are:

empty

An array reference containing all the empty directories under path.

orphan

An array reference containing Path::Tiny objects pointing to the content files with no alias (i.e. not linked to any file outside of the content store).

corrupted

An array reference of all content files for which the name does not match the digest of their content.

symlink

An array reference of all symbolic links under path.
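Since fsck returns a hash reference mapping error types to array references, a report can be produced by walking that structure. The sketch below uses a hand-built hash reference with made-up paths in place of a real fsck result; fsck_report is a hypothetical helper and makes no assumption about which keys are present.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an fsck-style hash reference into report lines.
sub fsck_report {
    my ($errors) = @_;
    my @lines;
    for my $type ( sort keys %$errors ) {
        my @entries = @{ $errors->{$type} };
        push @lines, sprintf '%s (%d):', $type, scalar @entries;
        push @lines, "  $_" for @entries;
    }
    return @lines;
}

# stand-in for the result of $store->fsck, with illustrative paths
my $errors = {
    empty  => ['store/ab'],
    orphan => [ 'store/da/39a3', 'store/d4/1d8c' ],
};

my @report = fsck_report($errors);
print "$_\n" for @report;
print "content store is consistent\n" if !%{ {} };    # an empty result means no errors
```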

SEE ALSO

Other modules suitable for finding duplicated files: File::Find::Duplicates, File::Same.

AUTHOR

Philippe Bruhat (BooK) <book@cpan.org>.

COPYRIGHT

Copyright 2018-2019 Philippe Bruhat (BooK), all rights reserved.

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.