NAME

DataStore::CAS - Abstract base class for Content Addressable Storage

VERSION

version 0.08

SYNOPSIS

  # Create a new CAS which stores everything in plain files.
  my $cas= DataStore::CAS::Simple->new(
    path   => './foo/bar',
    create => 1,
    digest => 'SHA-256',
  );
  
  # Store content, and get its hash code
  my $hash0= $cas->put($something_ambiguous);
  my $hash1= $cas->put_scalar($data_bytes);
  my $hash2= $cas->put_file($filename);
  my $hash3= $cas->put_handle(\*STDIN);
  
  my $writer= $cas->new_write_handle;
  for (1..10) {
    $writer->print($_);
  }
  my $hash4= $writer->commit;
  
  # Retrieve a reference to that content, or undef for unknown hash
  my $casfile= $cas->get($hash1);
  
  # Inspect the file's attributes
  say "File is " . $casfile->size . " bytes";
  
  # Open a handle to that file (possibly returning a virtual file handle)
  my $handle= $casfile->open;
  my @lines= <$handle>;

DESCRIPTION

This module lays out a very straightforward API for Content Addressable Storage.

Content Addressable Storage is a concept where a file is identified by a one-way message digest checksum of its content (usually called a "hash"). With a good message digest algorithm, one checksum will statistically only ever refer to one file, even though the number of possible checksums is tiny compared to the number of possible byte sequences they can represent.

In short, a CAS is a key/value mapping where the key is determined from the value, and thanks to astronomical probability, every value will get a distinct key. You can then use the key as a shorthand reference for the value. Most importantly, every CAS using the same digest algorithm will generate the same key for a value, without a central server coordinating it.

This is a Role, requiring the implementing class to provide attribute "digest", and methods "get", "commit_write_handle", "delete", "iterator", and "open_file".

Note: Perl uses the term 'hash' to refer to key/value hash tables, which creates a little confusion. In fact, the key-hashing part of hash tables is nearly the same concept as a CAS, except that hash tables use a tiny digest function that often does collide with other keys. The documentation of this and related modules tries to use the phrase "digest hash" to make clear when it is talking about the output of a digest function rather than a Perl hash table.

PURPOSE

One great use for CAS is finding and merging duplicated content. If you take two identical files (which you didn't know were identical) and put them both into a CAS, you will get back the same digest hash, telling you that they are the same. Also, the file will only be stored once, saving disk space.
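
The deduplication effect can be illustrated without this distribution at all. The following is a minimal sketch using only the core Digest::SHA module; the %store table and the toy_put helper are hypothetical stand-ins for a real CAS and are not part of the DataStore::CAS API.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Toy in-memory store: digest hash => content. Not the real module.
my %store;

# Hypothetical helper: the key is derived purely from the bytes stored.
sub toy_put {
    my ($bytes) = @_;
    my $hash = sha256_hex($bytes);
    $store{$hash} //= $bytes;    # identical content is kept only once
    return $hash;
}

my $h1 = toy_put("same content");
my $h2 = toy_put("same content");
my $h3 = toy_put("other content");

printf "duplicates detected: %s\n", ($h1 eq $h2 ? "yes" : "no");
printf "entries stored: %d\n", scalar keys %store;
```

Two of the three calls supply identical bytes, so they return the same digest hash and occupy a single entry.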

Another great use for CAS is for remote systems to compare an inventory of files and see which ones are absent on the other system. This has applications in backups and content distribution.
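
Once each system can list its digest hashes, the comparison is a plain set difference. A minimal sketch follows; the helper name missing_on_remote and the short placeholder hashes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given two lists of digest hashes, return the ones
# present locally but absent remotely (i.e. what still needs to be sent).
sub missing_on_remote {
    my ($local, $remote) = @_;
    my %have = map { $_ => 1 } @$remote;
    return [ grep { !$have{$_} } @$local ];
}

my @local  = qw(aaa111 bbb222 ccc333);   # placeholder digest hashes
my @remote = qw(bbb222);
my $todo   = missing_on_remote(\@local, \@remote);
print "need to send: @$todo\n";
```

Only the hashes cross the wire during the comparison; the content itself is transferred afterwards, and only for the entries reported missing.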

ATTRIBUTES

digest

Read-only. The name of the digest algorithm being used.

Implementors must provide this constant from the time they are constructed.

The algorithm should be available from the Digest module, or else the subclass will need to provide a few additional methods like "calculate_hash".

hash_of_null

The digest hash of the empty string. CAS instances should always have this file available, to be used as a test whether the CAS is functioning.
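
For reference, the digest of the empty string can be computed with the core Digest module. This is only a sketch: the helper name empty_digest is hypothetical, and it assumes the store represents digest hashes as lowercase hexadecimal strings.

```perl
use strict;
use warnings;
use Digest;

# Hypothetical helper: digest of the empty string for a named algorithm.
# Assumes lowercase hex is the representation of interest.
sub empty_digest {
    my ($algorithm) = @_;
    return Digest->new($algorithm)->add('')->hexdigest;
}

print empty_digest('SHA-1'),   "\n";
print empty_digest('SHA-256'), "\n";
```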

METHODS

get

  $cas->get( $digest_hash )

Returns a DataStore::CAS::File object for the given hash, if the hash exists in storage. Else, returns undef.

This method is pure-virtual and must be implemented in the subclass.

put

  $cas->put( $thing, \%optional_flags )

Convenience method. Inspects $thing and passes it off to a more specific method. If you want more control over which method is called, call it directly.

The %optional_flags can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will calculate the hash of the $thing, and go through the motions of writing it, but not store it.

known_hashes => \%digest_hashes

  { known_hashes => { SHA1 => '0123456789...' } }

Use this to skip calculation of the hash. The hashes are keyed by Digest name, so it is safe to use even when the store being written to might not use the same digest that was already calculated.

Of course, using this feature can corrupt your CAS if you don't ensure that the hash is correct.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation, such as number of bytes written, compression strategies used, etc. The statistics are returned within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

The return value is the hash checksum of the stored data, regardless of whether it was already present in the CAS.

Example:

  my $stats= {};
  $cas->put("abcdef", { stats => $stats });
  $cas->put(\$large_buffer, { stats => $stats });
  $cas->put(IO::File->new('~/file','r'), { stats => $stats });
  $cas->put(\*STDIN, { stats => $stats });
  $cas->put(Path::Class::file('~/file'), { stats => $stats });
  use Data::Printer;
  p $stats;

put_scalar

  $cas->put_scalar( $scalar, \%optional_flags )
  $cas->put_scalar( \$scalar, \%optional_flags )

Puts the literal string "$scalar" into the CAS, or the scalar pointed to by a scalar-ref. (A scalar-ref can help by avoiding a copy of a large scalar.) The scalar must be a string of bytes; you get an exception if any character has a codepoint above 255.

Returns the digest hash of the array of bytes.

See "put" for the discussion of %flags.
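
Text containing characters above codepoint 255 therefore has to be encoded to bytes before it is stored. The sketch below uses the core Encode module; the helper is_byte_string is hypothetical and only illustrates the "string of bytes" requirement, and UTF-8 is just one possible choice of encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if every character fits in one byte (0..255).
sub is_byte_string {
    my ($str) = @_;
    return $str !~ /[^\x00-\xFF]/;
}

my $text  = "snowman: \x{2603}";      # contains a codepoint above 255
my $bytes = encode('UTF-8', $text);   # now a plain string of bytes

print "text is bytes?  ", (is_byte_string($text)  ? "yes" : "no"), "\n";
print "bytes is bytes? ", (is_byte_string($bytes) ? "yes" : "no"), "\n";
# Per the description above, $bytes is acceptable to put_scalar,
# while $text would raise an exception.
```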

put_file

  $digest_hash= $cas->put_file( $filename, \%optional_flags );
  $digest_hash= $cas->put_file( $Path_Class_File, \%optional_flags );
  $digest_hash= $cas->put_file( $DataStore_CAS_File, \%optional_flags );

Insert a file from the filesystem, or from another CAS instance. The default implementation simply opens the named file and passes it to put_handle.

Returns the digest hash of the data stored.

See "put" for the discussion of standard %flags.

Additional flags:

move => $bool

If move is true, and the CAS is backed by plain files on the same filesystem, it will move the file into the CAS, possibly changing its owner and permissions. Even if the file can't be moved, put_file will attempt to unlink it, and die on failure. Note: If you use this option with a File::Temp object, this closes the file handle to ensure that no further writes or fd-operations are applied to the file which is now part of your read-only CAS.

hardlink => $bool

If hardlink is true, and the CAS is backed by plain files on the same filesystem, with the same owner and permissions as the destination CAS, it will hardlink the file directly into the CAS.

This reduces the integrity of your CAS; use with care. You can use the "validate" method later to check for corruption.

reuse_hash => $bool

This is a shortcut for known_hashes if you specify an instance of DataStore::CAS::File. It builds a known_hashes of one item using the source CAS's digest algorithm.

Note: A good use of these flags is to transfer files from one instance of DataStore::CAS::Simple to another.

  my $file= $cas1->get($hash);
  $cas2->put($file, { hardlink => 1, reuse_hash => 1 });

put_handle

  $digest_hash= $cas->put_handle( \*HANDLE | IO::Handle, \%optional_flags );

Reads from the given handle and stores the data into the CAS, calculating the digest hash as it goes. It does not seek on the handle, so if you supply a handle that is not at the start of the file, only the remainder of the file will be added and hashed. The handle is forced into binary mode. Dies on any I/O errors.

Returns the calculated digest hash when complete.

See "put" for the discussion of flags.
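
The consequence of not seeking can be demonstrated with core modules only. In this sketch, hash_remaining is a hypothetical helper that mimics a non-seeking consumer; it is not how put_handle is implemented. It shows that a handle which has already been read from yields a different digest unless it is rewound first.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: hash whatever remains readable on a handle, the way
# a non-seeking consumer would see it.
sub hash_remaining {
    my ($fh) = @_;
    binmode $fh;
    my $digest = Digest::SHA->new(256);
    while (1) {
        my $n = read($fh, my $buf, 65536);
        die "read failed: $!" unless defined $n;
        last if $n == 0;
        $digest->add($buf);
    }
    return $digest->hexdigest;
}

my $data = "header\nbody\n";
open my $fh, '<', \$data or die "open: $!";
my $first_line = <$fh>;                # handle is now past "header\n"
my $partial = hash_remaining($fh);     # covers only "body\n"

seek($fh, 0, 0) or die "seek: $!";     # rewind before hashing everything
my $full = hash_remaining($fh);
print "partial and full differ\n" if $partial ne $full;
```

If the whole file is wanted, rewinding the handle (where the handle supports it) before passing it on avoids storing a truncated copy.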

new_write_handle

  $handle= $cas->new_write_handle( %flags )

Get a new handle for writing to the Store. The data written to this handle will be saved to a temporary file as the digest hash is calculated.

When done writing, call either $cas->commit_write_handle( $handle ) (or the alias $handle->commit()), which returns the hash of all data written. The handle will no longer be valid.

If you free the handle without committing it, the data will not be added to the CAS.

The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will calculate the hash of the data written to the handle, but not store it.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation, such as number of bytes written, compression strategies used, etc. The statistics are returned within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

Write handles will probably be an instance of FileCreatorHandle.

commit_write_handle

  my $handle= $cas->new_write_handle();
  print $handle $data;
  $cas->commit_write_handle($handle);

This closes the given write-handle, finishes calculating its digest hash, and stores the data into the CAS (unless the handle was created with the dry_run flag). It returns the digest hash of the data.

calculate_hash

Return the hash of a scalar (or scalar ref) in memory.

calculate_file_hash

Return the hash of a file on disk.

validate

  $bool_valid= $cas->validate( $digest_hash, \%optional_flags )

Validate an entry of the CAS. This is used to detect whether the storage has become corrupt. Returns 1 if the hash checks out, 0 if it fails, and undef if the hash doesn't exist.

Like the "put" method, you can pass a hashref in $flags{stats} which will receive information about the file. This can be used to implement mark/sweep algorithms for cleaning out the CAS by asking the CAS for all other digest_hashes referenced by $digest_hash.

The default implementation simply reads the file and re-calculates its hash, which should be optimized by subclasses if possible.
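
The read-and-rehash approach just described might look roughly like the following. This is a sketch, not the module's actual code: naive_validate is a hypothetical name, it relies only on the documented "get" and "digest" members plus the File object's "open" method, and it assumes digest hashes are lowercase hex strings.

```perl
use strict;
use warnings;
use Digest;

# Sketch only: not the module's actual code. Relies on the documented
# "get", "digest" and File "open" methods, and assumes hex digest hashes.
sub naive_validate {
    my ($cas, $digest_hash) = @_;
    my $file = $cas->get($digest_hash);
    return undef unless defined $file;    # unknown hash

    my $fh     = $file->open;
    my $digest = Digest->new($cas->digest);
    while (1) {
        my $n = read($fh, my $buf, 65536);
        die "read failed: $!" unless defined $n;
        last if $n == 0;
        $digest->add($buf);
    }
    return $digest->hexdigest eq $digest_hash ? 1 : 0;
}
```

Reading every byte is the expensive part, which is why a storage engine that already keeps its own checksums may be able to answer faster.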

delete

  $bool_happened= $cas->delete( $digest_hash, %optional_flags )

DO NOT USE THIS METHOD UNLESS YOU UNDERSTAND THE CONSEQUENCES

This method is supplied for completeness... however it is not appropriate to use in many scenarios. Some storage engines may use referencing, where one file is stored as a diff against another file, or one file is composed of references to others. It can be difficult to determine whether a given digest_hash is truly no longer used.

The safest way to clean up a CAS is to create a second CAS and migrate the items you want to keep from the first to the second; then delete the original CAS. See the documentation on the storage engine you are using to see if it supports an efficient way to do this. For instance, DataStore::CAS::Simple can use hard-links on supporting filesystems, resulting in a very efficient copy operation.

If no efficient mechanisms are available, then you might need to write a mark/sweep algorithm and then make use of 'delete'.

Returns true if the item was actually deleted.
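
As a rough illustration of the sweep half of such an algorithm, the sketch below walks the store with "iterator" and calls "delete" on everything not in a caller-supplied set. The name sweep_unreferenced and the $reachable hashref are hypothetical; building that set correctly (the mark phase) depends entirely on your storage engine and is not shown. All of the warnings above still apply.

```perl
use strict;
use warnings;

# Sketch only. $cas is any object providing the documented "iterator" and
# "delete" methods; %$reachable is a caller-built set of hashes to keep.
sub sweep_unreferenced {
    my ($cas, $reachable) = @_;

    # Collect first, delete afterwards, so the store is not modified
    # while it is being iterated.
    my @doomed;
    my $iter = $cas->iterator;
    while (defined(my $hash = $iter->())) {
        push @doomed, $hash unless $reachable->{$hash};
    }

    my $deleted = 0;
    for my $hash (@doomed) {
        $deleted++ if $cas->delete($hash);
    }
    return $deleted;
}
```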

The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will run a simulation of the delete operation, without actually deleting anything.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

delete_count

The number of official entries deleted.

delete_missing

The number of entries that didn't exist.

iterator

  $iter= $cas->iterator( \%optional_flags )
  while (defined ($digest_hash= $iter->())) { ... }

Iterate the contents of the CAS. Returns a perl-style coderef iterator which returns the next digest_hash string each time you call it. Returns undef at end of the list.

Flags:

prefix

Specify a prefix for all the returned digest hashes. This acts as a filter. You can use this to imitate Git's feature of identifying an object by a portion of its hash instead of having to type the whole thing. You will probably need more digits though, because you're searching the whole CAS, and not just commit entries.
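
A Git-style abbreviation lookup built on this flag might look like the sketch below. The helper resolve_abbreviation is hypothetical; it uses only the documented "iterator" method and "prefix" flag, returns undef when nothing matches, and dies when the abbreviation matches more than one entry.

```perl
use strict;
use warnings;

# Sketch only: resolve an abbreviated digest hash to the single full hash
# it identifies, using the documented "prefix" flag of "iterator".
# Returns undef if nothing matches; dies if the abbreviation is ambiguous.
sub resolve_abbreviation {
    my ($cas, $abbrev) = @_;
    my $iter  = $cas->iterator({ prefix => $abbrev });
    my $first = $iter->();
    return undef unless defined $first;
    my $second = $iter->();
    die "'$abbrev' is ambiguous\n" if defined $second;
    return $first;
}
```

Pulling at most two entries from the iterator is enough to distinguish "none", "exactly one" and "more than one", so the lookup does not have to list every match.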

open_file

  $handle= $cas->open_file( $fileObject, \%optional_flags )

Open the File object (returned by "get") and return a readable and seekable filehandle to it. The filehandle might be a perl filehandle, or might be a tied object implementing the filehandle operations.

Flags:

layer (TODO)

When implemented, this will allow you to specify a Perl I/O layer, like 'raw' or 'utf8'. This is equivalent to calling 'binmode' with that argument on the filehandle. Note that returned handles are 'raw' by default.

THANKS

Portions of this software were funded by Clippard Instrument Laboratory. Thanks for supporting Open Source.

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2023 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.