NAME

DataStore::CAS::FS - Virtual Filesystem backed by Content-Addressable Storage

VERSION

version 0.011000

SYNOPSIS

  # Create a new empty filesystem
  my $casfs= DataStore::CAS::FS->new(
    store => DataStore::CAS::Simple->new(
      path => './foo/bar',
      create => 1,
      digest => 'SHA-256'
    )
  );
  
  # Open an existing root directory on an existing store
  $casfs= DataStore::CAS::FS->new( store => $cas, root => $digest_hash );
  
  # --- These pass through to the $cas module
  
  $hash= $casfs->put("Blah");
  $hash= $casfs->put_file("./foo/bar/baz");
  $file= $casfs->get($hash);
  
  # Open a path within the filesystem
  $handle= $casfs->path('1','2','3','myfile')->open;
  
  # Make some changes
  $casfs->apply_path(['1', '2', 'myfile'], { ref => $some_new_file });
  $casfs->apply_path(['1', '2', 'myfile_copy'], { ref => $some_new_file });
  # Commit them
  $casfs->commit();

DESCRIPTION

DataStore::CAS::FS extends the DataStore::CAS API to support directory objects, which let you store traditional file hierarchies in the CAS and look up files by path name (so long as you know the hash of the root).

The methods provided allow you to traverse the virtual directory hierarchy, make changes to it, and commit the changes to create a new filesystem snapshot. The DataStore::CAS backend provides readable and seekable file handles. There is *not* any support for access control, since those concepts are system dependent. The module DataStore::CAS::FS::Fuse (not yet written) will have an implementation of permission checking appropriate for Unix.

The directories can contain arbitrary metadata, making them suitable for backing up filesystems from Unix, Windows, or other environments. You can also pick directory encoding plugins to more efficiently encode just the metadata you care about.

Each directory is serialized into a file which is stored in the CAS like any other, resulting in a very clean implementation. You cannot determine whether a file is a directory or not without the context of the containing directory, and you need to know the digest hash of the root directory in order to browse the full filesystem. On the up side, you can store any number of filesystems in one CAS by maintaining a list of roots.

The root's digest hash is affected by all the content of the entire tree, so the root hash will change each time you alter any directory in the tree. But, any unchanged files in that tree will be re-used, since they still have the same digest hash. You can see great applications of this design in a number of version control systems, notably Git.
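The effect can be sketched with nothing but Digest::SHA. The toy store and the directory encoding below (one "name, tab, hash" line per entry) are assumptions made for illustration, not the module's real storage format.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Toy content-addressable store: digest hash => content.
my %store;

sub put {
    my ($content) = @_;
    my $hash = sha256_hex($content);
    $store{$hash} = $content;    # storing identical content twice is a no-op
    return $hash;
}

# Hypothetical directory encoding: one "name<TAB>hash" line per entry,
# sorted by name so the same entries always give the same digest.
sub put_dir {
    my (%entries) = @_;
    return put(join '', map { "$_\t$entries{$_}\n" } sort keys %entries);
}

my $readme = put("hello\n");
my $notes  = put("first draft\n");
my $root1  = put_dir(README => $readme, notes => $notes);

# Change one file: the root hash changes, but README is not stored again.
my $notes2 = put("second draft\n");
my $root2  = put_dir(README => $readme, notes => $notes2);

print "root changed: ", ($root1 ne $root2 ? "yes" : "no"), "\n";
print "objects in store: ", scalar(keys %store), "\n";
```

Both roots stay readable afterwards, which is what makes keeping a list of snapshots cheap.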

ATTRIBUTES

store

Read-only. An instance of a class implementing DataStore::CAS.

root_entry

A DataStore::CAS::FS::DirEnt object describing the root of the tree. It must be of type "dir". It should have a name of "", but this is not required. You can pick an arbitrary directory for a chroot-like effect, but beware of broken symlinks.

root_entry refers to an **immutable** directory. If you make in-memory overrides to the filesystem using apply_path or the various convenience methods, root_entry will continue to refer to the original static filesystem. If you then commit() those changes, root_entry will be updated to refer to the new filesystem.

You can create a list of filesystem snapshots by saving a copy of root_entry each time you call commit(). They will all continue to exist within the CAS. Cleaning up the CAS is left as an exercise for the reader (though utility methods to help with this are in the works).

case_insensitive

Read-only. Defaults to false. If set to true in the constructor, this causes all directory entries to be compared in a case-insensitive manner, and all directory objects to be loaded with case-insensitive lookup indexes.

hash_of_null

Read-only. Passes through to store->hash_of_null

hash_of_empty_dir

This returns the canonical digest hash for an empty directory. In other words, the return value of

  put_scalar( DataStore::CAS::FS::DirCodec::Minimal->encode([],{}) ).

This value is cached for performance.

It is possible to encode empty directories with any plugin, so not all empty directories will have this key, but any time the library knows it is writing an empty directory, it will use this value instead of recalculating the hash of an empty dir.

dir_cache

Read-only. A DataStore::CAS::FS::DirCache object which holds onto recently used directory objects. This object can be used in multiple CAS::FS objects to make the most of the cache.

METHODS

new

  $fs= $class->new( %args | \%args )

Parameters:

store - required

An instance of (a subclass of) DataStore::CAS

root_entry - required

An instance of DataStore::CAS::FS::DirEnt, or a hashref of DirEnt fields, or an empty hash if you want to start from an empty filesystem, or a DataStore::CAS::FS::Dir which you want to be the root directory (or a DataStore::CAS::File object that contains a serialized Dir), or a digest hash of that File within the store.

root - alias for root_entry

get

Alias for store->get

get_dir

  $dir= $fs->get_dir( $digest_hash_or_File, \%optional_flags );

This returns a de-serialized directory object found by its hash. It is a shorthand for 'get' on the Store, and deserializing enough of the result to create a usable Dir object (or subclass).

Also, this method caches recently used directory objects, since they are immutable. (but woe to those who break the API and modify their directory objects!)

Returns undef if the digest hash isn't in the store, but dies if an error occurs while decoding one that exists.

put

Alias for store->put

put_scalar

Alias for store->put_scalar

put_file

Alias for store->put_file

put_handle

Alias for store->put_handle

validate

Alias for store->validate

path

  $path= $fs->path( @path_names )

Returns a DataStore::CAS::FS::Path object which provides friendly object-oriented access to several other methods of CAS::FS. This object does *nothing* other than curry parameters, for your convenience. In particular, the path isn't resolved until you try to use it, and might not be valid.

See "resolve_path" for notes about @path_names. Especially note that your path needs to start with the volume name, which will usually be ''. Note that you get this already if you take an absolute path and pass it to File::Spec->splitdir.

path_if_exists

  $path= $fs->path_if_exists( @path_names )

This method is like "path", but it immediately resolves the path and returns undef if the path doesn't exist. It returns the path object if it does.

tree_iterator

  $iter= $fs->tree_iterator( %optional_flags )

With no flags, creates a tree-iterator which iterates the entire tree depth-first, listing directories before their contents.

When the path flag is given, iterates the named path and everything under it (if it is a directory), or croaks if it doesn't exist.

resolve_path

  $path_array= $fs->resolve_path( \@path_names, \%optional_flags )
  $path_array= $fs->resolve_path( $path_string, \%optional_flags )

Returns an arrayref of DirEnt objects corresponding to the canonical absolute specified path, starting with the root_entry.

@path_names may contain empty strings "", which are ignored. This provides compatibility with File::Spec->splitdir, and also with calling split /\// on strings with "//" in them.

If a $path_string is given instead of an arrayref, it will be split by File::Spec->splitdir, which may or may not be what you want.
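For reference, here is what File::Spec->splitdir produces for Unix-style paths. This sketch calls File::Spec::Unix directly so that the result is the same on every platform; the sample paths are made up.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# An absolute path yields a leading '' (the "volume name" mentioned under "path").
my @abs = File::Spec::Unix->splitdir('/foo/bar');

# A doubled slash yields an empty string in the middle, which is ignored
# according to the notes on @path_names above.
my @dbl = File::Spec::Unix->splitdir('/foo//bar');

print join(', ', map { "'$_'" } @abs), "\n";
print join(', ', map { "'$_'" } @dbl), "\n";
```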

When resolving symlinks, this function operates much the same way Linux does. If the path you specify ends with a symlink, the result will be a DirEnt describing the symlink. If the path you specify ends with a symlink and a "" (equivalent of ending with a '/' or '/.'), the symlink will be resolved to a DirEnt for the target file or directory. (If the target doesn't exist, it throws an exception, but see the no_die flag.)

Also, it's worth noting that the directory objects in DataStore::CAS::FS are strictly a tree, with no back-reference to the parent directory. So, ".." in the path will be resolved by removing one element from the path. HOWEVER, this still gives you a kernel-style resolve (rather than a shell-style resolve) because if you specify "/1/foo/.." and foo is a symlink to "/1/2/3", the ".." will back you up to "/1/2/" and not "/1/".

The tree-with-no-parent-reference design is also why we return an array of the entire path, since you can't take a final directory and trace it backwards.
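The kernel-style handling of ".." can be sketched on a toy tree of nested hashes. This is only an illustration of the idea, not the module's implementation: the node layout and the resolve() function are invented here, and for simplicity the sketch follows every symlink, including a trailing one.

```perl
use strict;
use warnings;

# Toy tree in which /1/foo is a symlink to /1/2/3.
my $root = { type => 'dir', entries => {
    1 => { type => 'dir', entries => {
        foo => { type => 'symlink', target => '/1/2/3' },
        2   => { type => 'dir', entries => {
            3 => { type => 'dir', entries => {} },
        } },
    } },
} };

# Returns the list of names along the resolved path (root first), or dies.
sub resolve {
    my ($path) = @_;
    my @todo  = split m{/}, $path;
    my @stack = ([ '', $root ]);          # [name, node] pairs, root first
    my $links = 0;
    while (@todo) {
        my $name = shift @todo;
        next if $name eq '' || $name eq '.';
        if ($name eq '..') {
            pop @stack if @stack > 1;     # no parent pointers: drop one element
            next;
        }
        my $dir = $stack[-1][1];
        die "not a directory: $stack[-1][0]\n" unless $dir->{type} eq 'dir';
        my $node = $dir->{entries}{$name} or die "no such entry: $name\n";
        if ($node->{type} eq 'symlink') {
            die "too many symlinks\n" if ++$links > 20;
            # Absolute targets restart at the root; the target's names are
            # then walked before the rest of the requested path.
            @stack = ($stack[0]) if $node->{target} =~ m{^/};
            unshift @todo, split m{/}, $node->{target};
            next;
        }
        push @stack, [ $name, $node ];
    }
    return map { $_->[0] } @stack;
}

print join('/', resolve('/1/foo/..')), "\n";
```

Because the symlink is expanded before ".." is applied, the stack holds the entries for "/1/2/3" when the ".." arrives, and popping one element leaves "/1/2". Keeping the whole stack is also what makes it possible to return every DirEnt along the path.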

If the path does not exist, or cannot be resolved for some reason, this method will either return undef or die, based on whether you provided the optional no_die flag.

Flags:

no_die => $bool

Return undef instead of dying

error_out => \$err_variable

If set to a scalar-ref, the scalar ref will receive the error message, if any. You probably want to set 'no_die' as well.

partial => $bool

If the path doesn't exist, any missing directories will be given placeholder DirEnt objects. You can test whether the path was resolved completely by checking whether $result->[-1]->type is defined.

mkdir => 1 || 2

If mkdir is 1, missing directories will be created on demand.

If mkdir is 2,

get_dir_entries

  $dirent_array= $fs->get_dir_entries( \@path )

Returns an array of directory entries for the specified path.

This differs from $fs->path(@path)->dir->iterator in that you see any changes that have been made via calls to set_path or apply_path. Calling iterator on the directory object will only return what was recorded in the CAS.

readdir

  $names= $fs->readdir( \@path ); # returns arrayref in scalar context
  @names= $fs->readdir( \@path ); # returns list in list context

Convenience method for "get_dir_entries", but returns an arrayref of names (rather than DataStore::CAS::FS::DirEnt objects) and returns a list when called in list context.

set_path

  $fs->set_path( \@path, $Dir_Entry, \%optional_flags )
  # returns 1, or dies

Temporarily override a directory entry at @path. If $Dir_Entry is false, this will cause @path to be unlinked. If the name of $Dir_Entry differs from the final component of @path, it will act like a rename (which is the same as just unlinking the old path and creating the new path). If $Dir_Entry is missing a name, it will default to the final element of @path.

path may be either an arrayref of names, or a string which will be split by File::Spec.

$Dir_Entry is either an instance of DataStore::CAS::FS::DirEnt, or a hashref of the fields to create one.

No fields of the old dir entry are used; if you want to preserve some of them, you need to do that yourself (see clone) or use the update_path() method.

If @path refers to nonexistent directories, they will be created as with a virtual "mkdir -p", and receive the default metadata of $flags{default_dir_fields} (by default, nothing). If @path travels through a non-directory (aside from symlinks, unless $flags{follow_symlinks} is set to 0) this will throw an exception, unless you specify $flags{force_create}, which causes an offending directory entry to be overwritten by a new subdirectory.

Note in particular that if you specify

  apply_path( "/a_symlink/foo", $Dir_Entry, { follow_symlinks => 0, force_create => 1 })

"a_symlink" will be deleted and replaced with an actual directory.

None of the changes from apply_path are committed to the CAS until you call commit(). Also, root_entry does not change until you call commit(), though the root entry shown by "resolve_path" does.

You can return to the last committed state by calling rollback(), which is conceptually equivalent to $fs= DataStore::CAS::FS->new( $fs->root_entry ).

update_path

  $fs->update_path( \@path, $changes, \%optional_flags )
  # returns 1, or dies

Like "set_path", but it applies a hashref (or arrayref) of $changes to the directory entry which exists at the named path. Use this to update a few attributes of a directory entry without overwriting the entire thing.

mkdir

  $fs->mkdir( \@path )

Convenience method to create an empty directory at path.

Croaks if the path already exists and is not a directory.

touch

  $fs->touch( \@path )

Convenience method to update the timestamp of the directory entry at path, possibly creating it (as an empty file).

unlink

  $fs->unlink( \@path )

Convenience method to remove the directory entry at path.

rmdir

Alias for unlink

rollback

  $fs->rollback();

Revert the FS to the state of the last commit, or the initial state.

This basically just discards all the in-memory overrides created with "apply_path" or its various convenience methods.

commit

  $fs->commit();

Merge all in-memory overrides from "apply_path" with the directories they override to create new directories, and store those new directories in the CAS.

After this operation, the root_entry will be changed to reflect the new tree.

PATH OBJECTS

Path objects are a simple wrapper around a path name array. Path objects are lazily resolved to Filesystem Nodes. This means the path object can exist even if the node does not, and a path object to "/foo" will continue to be usable even if "foo" is deleted and re-created.

Most methods of the path objects just pass-through to the Node object returned from "resolve".

path_names

Arrayref of path parts. If you created the path object from a path string like "/foo/bar", path_names will contain the result of File::Spec->splitdir: [ '', 'foo', 'bar' ].

filesystem

Reference to the DataStore::CAS::FS it was created by.

path_dirents

Returns an array of DirEnt objects resolved from this path. Throws an exception if the path does not currently exist in the filesystem.

path_name_list

Convenience list accessor for path_names arrayref

path_dirent_list

Convenience list accessor for path_dirents arrayref

dirent

Convenience accessor for final element of path_dirents

type

Convenience accessor for the type field of the final element of path_dirents

name

Convenience accessor for the name field of the final element of path_dirents

depth

Convenience accessor for the number of elements in path_dirents minus one. The root entry has a depth of 0, "/a" is a depth of one, and so on.

Note that this is counting the resolved path (after symlinks), not the logical requested path.

canonical_path

Returns a unix-notation absolute path, with extra '/' and '.' removed.

The path is not resolved, and may contain ".."

resolved_canonical_path

  $fs->path("foo","..","bar","")->resolved_canonical_path
  # where /bar is a symlink to /baz
  # returns "/baz"

Resolves the path, and then returns a canonical unix notation for it. The resolved path never ends with '/' because all symlinks have been resolved, and it would serve no purpose.

resolve

  $path->resolve()

Resolve the path, and cache the underlying nodes until the next time the Filesystem is modified.

path

  $path->path( \@sub_path )

Get a sub-path from this path. Returns another Path object.

path_if_exists

  $path->path_if_exists( \@sub_path )

Returns a path object if the subpath exists. Returns undef if not.

mkdir

Creates a directory at this path, possibly creating parent directories as well. Dies if the path passes through an existing DirEnt which is not a directory.

file

  $file= $path->file();

Returns the DataStore::CAS::File of the final element of path_dirents, or dies trying.

open

  $handle= $path->open

Alias for $path->file->open

dir

  $dir= $path->dir

Convenience method for calling "get_dir" on the file referred to by this path. Dies if this path does not reference any content, or if it is not a directory.

Note that this directory object is immutable, from the CAS, and will not reflect any changes to the filesystem until $fs->commit is called.

readdir

  $names= $path->readdir(); # returns arrayref in scalar context
  @names= $path->readdir(); # returns list in list context

Convenience method for $fs->readdir($path->path_names)

tree_iterator

  $iter= $path->tree_iterator( %optional_flags )

Convenience method for

  $path->filesystem->tree_iterator( path => $path->path_names, %flags ).

See "tree_iterator".

TREE ITERATORS

The tree iterators returned by $fs->tree_iterator and $path->tree_iterator run a depth-first alphabetical pre-order traversal of the tree. They act as coderefs (taking no arguments) and return a Path object, or undef at the end of the iteration.

  while (my $path= $iter->()) {
    print $path->resolved_canonical_path, "\n"
      if $path->type ne 'dir';
  }

The iterators are also blessed objects, and have a few useful methods:

reset

  $iter->reset();

Start iteration over from the beginning.

skip_dir

  $iter->skip_dir();

skip_dir immediately ends the current subdirectory, and the next call to the iterator will return the next item from the parent directory.

The iterator "begins" a directory right before it returns it to you. So, you can prevent entering a directory like this:

  my $path= $iter->();
  if ($path->type eq 'dir' && !we_want_to_enter_dir($path)) {
    $iter->skip_dir;
  }
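The "begin a directory before returning it" behaviour is easiest to see in a small stack-based iterator. The sketch below walks a toy tree of nested hashes and is not the module's code: make_iterator() is a made-up helper, and it returns two plain closures (the iterator and its skip_dir) rather than a blessed iterator with methods.

```perl
use strict;
use warnings;

# Toy tree: hashrefs are directories, anything else is a file.
my $tree = { a => { x => 1, y => { z => 1 } }, b => 1 };

sub make_iterator {
    my ($root) = @_;
    # Each stack frame: [ path prefix, directory hashref, remaining sorted names ]
    my @stack = ([ '', $root, [ sort keys %$root ] ]);
    my $iter = sub {
        while (@stack) {
            my $frame = $stack[-1];
            my $name  = shift @{ $frame->[2] };
            if (!defined $name) {            # directory exhausted
                pop @stack;
                next;
            }
            my $path = "$frame->[0]/$name";
            my $node = $frame->[1]{$name};
            # Push the directory's frame *before* returning it, so that
            # a following skip_dir call can throw that frame away.
            push @stack, [ $path, $node, [ sort keys %$node ] ]
                if ref $node eq 'HASH';
            return $path;
        }
        return undef;                        # end of iteration
    };
    my $skip_dir = sub { pop @stack if @stack > 1 };
    return ($iter, $skip_dir);
}

my ($iter, $skip_dir) = make_iterator($tree);
my @seen;
while (defined(my $path = $iter->())) {
    push @seen, $path;
    $skip_dir->() if $path eq '/a/y';        # do not descend into /a/y
}
print "$_\n" for @seen;
```

With the skip in place, "/a/y/z" is never visited and iteration continues with the remaining entries.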

DIRECTORY CACHE

Directories are uniquely identified by their hash, and directory objects are immutable. This creates a perfect opportunity for caching recent directories and reusing the objects.

When you call $fs->get_dir($hash), $fs keeps a weak reference to that directory which will persist until the directory object is garbage collected. It will ALSO hold a strong reference to that directory for the next N calls to $fs->get_dir($hash), where the default is 64. You can change how many references $fs holds by setting $fs->dir_cache->size(N).

The directory cache is *not* global, and a fresh one is created during the constructor of the FS, if needed. However, many FS instances can share the same dir_cache object, and FS methods that return a new FS instance will pass the old dir_cache object to the new instance.

If you want to implement your own dir_cache, don't bother subclassing the built-in one; just create an object that meets this API:

new

  $cache= $class->new( %fields )
  $cache= $class->new( \%fields )

Create a new cache object. The only public field is size.

size

Read/write accessor that returns the number of strong-references it will hold.

clear

Clear all strong references and clear the weak-reference index.

get

  $cached_dir= $cache->get( $digest_hash )

Return a cached directory, or undef if that directory has not been cached.

put

  $dir= $cache->put( $dir )

Cache the Dir object (and return it)
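A minimal sketch of an object meeting this API might look like the following. The class names are hypothetical, and the sketch assumes that a Dir object exposes its digest through a hash() method. It keeps a weak-reference index by digest plus a list of the most recent strong references, along the lines described above.

```perl
package My::DirCache;    # hypothetical class name
use strict;
use warnings;
use Scalar::Util qw(weaken);

sub new {
    my $class  = shift;
    my %fields = (@_ == 1 && ref $_[0] eq 'HASH') ? %{ $_[0] } : @_;
    return bless {
        size   => defined $fields{size} ? $fields{size} : 64,
        index  => {},    # digest hash => weak reference to the Dir
        recent => [],    # strong references, oldest first
    }, $class;
}

sub size {
    my $self = shift;
    if (@_) {
        $self->{size} = shift;
        $self->_trim;
    }
    return $self->{size};
}

sub clear {
    my $self = shift;
    $self->{index}  = {};
    $self->{recent} = [];
    return $self;
}

sub get {
    my ($self, $digest_hash) = @_;
    my $dir = $self->{index}{$digest_hash};
    unless (defined $dir) {
        delete $self->{index}{$digest_hash};    # drop a stale weak entry
        return undef;
    }
    $self->_remember($dir);
    return $dir;
}

sub put {
    my ($self, $dir) = @_;
    # Assumption: the Dir object reports its digest via ->hash.
    weaken($self->{index}{ $dir->hash } = $dir);
    $self->_remember($dir);
    return $dir;
}

sub _remember {
    my ($self, $dir) = @_;
    push @{ $self->{recent} }, $dir;
    $self->_trim;
}

sub _trim {
    my $self = shift;
    shift @{ $self->{recent} } while @{ $self->{recent} } > $self->{size};
}

package My::FakeDir;     # stand-in for a Dir object, for the demo only
sub new  { my ($class, $hash) = @_; return bless { hash => $hash }, $class }
sub hash { return $_[0]{hash} }

package main;

my $cache = My::DirCache->new(size => 1);
my $dir_a = $cache->put(My::FakeDir->new('aaa'));
my $dir_b = $cache->put(My::FakeDir->new('bbb'));   # pushes 'aaa' out of the strong list
undef $dir_a;                                       # last strong reference is gone
print defined($cache->get('aaa')) ? "aaa cached\n" : "aaa freed\n";
print defined($cache->get('bbb')) ? "bbb cached\n" : "bbb freed\n";
```

The weak index means a directory that some other part of the program still holds can always be found again, while the strong list only decides how long the cache itself keeps objects alive.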

UNICODE vs. FILENAMES

Background

Unix operates on the philosophy that filenames are just bytes. Much of Unix userspace operates on the philosophy that these bytes should probably be valid UTF-8 sequences (but of course, nothing enforces that). Other operating systems, like modern Windows, operate on the idea that everything is Unicode and some backward-compatible APIs exist which can represent the Unicode as Latin1 or whatnot on a best-effort basis. I think the "Unicode everywhere" philosophy is arguably a better way to go, but as this tool is primarily designed with Unix in mind, and since it is intended for saving backups of real filesystems, it needs to be able to accurately store exactly what it finds in the filesystem. Essentially this means it needs to be *able* to store invalid UTF-8 sequences, -or- encode the octets as unicode codepoints up to 0xFF, and later know to write them out to the filesystem as octets instead of UTF-8.

Use Cases

The primary concern is the user's experience when using this module. While Perl has decent support for Unicode, it requires all filenames to be strings of bytes. (i.e. strings with the unicode flag turned off) Any time you pass a unicode string to a Perl function like open() or rename(), perl converts it to a UTF-8 string of octets before performing the operation. This gives you the desired result in Unix. Unfortunately, Perl in Windows doesn't fare so well, because it uses Windows' non-unicode API. Reading filenames with non-latin1 characters returns garbage, and creating files with unicode strings containing non-latin1 characters creates garbled filenames. To properly handle unicode outside of latin1 on Windows, you must avoid the Perl built-ins and tap directly into the wide-character Windows API.

This creates a dilemma: Should filenames be passed around the DataStore::CAS::FS API as unicode, or octets, or some auto-detecting mix? This dilemma is further complicated because users of the library might not have read this section of documentation, and it would be nice if The Right Thing happened by default.

Imagine a scenario where a user has a directory named "\xDC" (U with an umlaut in latin-1) and another directory named "\xC3\x9C" (U with an umlaut in UTF-8). "readdir" will report these as the strings I've just written, with the unicode flag off. Modern Unix will render the first as a "?" and the other as the U with umlaut, because it expects UTF-8 in the filesystem.

If you have the perl string "\xDC" with the UTF-8 flag off, and you try creating that file, it will create a file named "\xDC". However, if you have that same logical string with the UTF-8 flag on, it will become the file name "\xC3\x9C"!

If a user is *unaware* of unicode issues, it might be better to pass around strings of octets. Example: the user is in "/home/\xC3\x9C", and calls "Cwd". They get the string of octets "/home/\xC3\x9C". They then concatenate this string with unicode "\x{1234}". Perl combines the two as "/home/\x{C3}\x{9C}/\x{1234}", however the C3 and 9C just silently went from octets to unicode codepoints. When the user tries opening the file, it surprises them with "No such file or directory", because it tried opening "/home/\xC3\x83\xC2\x9C/\xE1\x88\xB4".
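The byte sequences in that example can be reproduced with the core Encode module. In this sketch, encode('UTF-8', ...) stands in for the conversion described above when a unicode string is used as a file name; no file is actually opened.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $octets  = "/home/\xC3\x9C";   # bytes from the OS (UTF-8 for U+00DC)
my $unicode = "\x{1234}";         # a character string

# Concatenation treats each octet as a codepoint, so C3 and 9C silently
# become the characters U+00C3 and U+009C.
my $joined  = "$octets/$unicode";
my $mangled = encode('UTF-8', $joined);

# Decoding the octets first keeps the intended name.
my $fixed = encode('UTF-8', decode('UTF-8', $octets) . "/$unicode");

for my $name ($mangled, $fixed) {
    print join(' ', map { sprintf '%02X', ord } split //, $name), "\n";
}
```

The first line of output contains the doubled "C3 83 C2 9C" sequence, the second the original "C3 9C".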

On Windows, perl is just generally broken for high-unicode filenames. Pure-ascii works fine, but ascii is a non-issue either way. Those who need unicode support will have found it from other modules, and be looking for this section of documentation.

Interesting reading for Windows: http://www.perlmonks.org/?node_id=526169

However, all this conjecture assumes a person is trying to read and write virtual items out to their filesystem. Since this module also provides that, maybe people will use the ready-built implementation and this is a non-issue.

Storage Formats

The storage format is supposed to be platform-independent. JSON seems like a good default encoding, however it requires strings to be in Unicode. When you encode a mix of unicode and octet strings, Perl's unicode flag is lost and when reading them back out you can't tell which were which. This means that if you take a unicode-as-octets filename and encode it with JSON and decode it again, perl will mangle it when you attempt to open the file, and fail. It also means that unicode-as-octets filenames will take extra bytes to encode.

The other option is to use a plain unicode string where possible, but names which are not valid UTF-8 are encoded as structures which can be restored when decoding the JSON.

Conclusion

In the end, I came up with a module called DataStore::CAS::FS::InvalidUTF8. It takes a filename in native encoding, and tries to parse it as UTF-8. If it succeeds, it returns the string. If it fails, it returns the string wrapped by InvalidUTF8, with special concatenation and comparison operators.
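The "try to parse as UTF-8, otherwise wrap" step can be sketched with the core Encode module. The decode_filename() function and the hashref used as a wrapper are stand-ins invented for this example; the real InvalidUTF8 class is an object with overloaded operators and is not reproduced here.

```perl
use strict;
use warnings;
use Encode ();

# Returns a character string if $bytes is valid UTF-8, otherwise a marker
# hashref holding the original octets (hypothetical stand-in for the wrapper).
sub decode_filename {
    my ($bytes) = @_;
    my $chars = eval {
        # FB_CROAK makes invalid input die; LEAVE_SRC keeps $bytes intact.
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $chars ? $chars : { invalid_utf8 => $bytes };
}

my $ok  = decode_filename("\xC3\x9C");   # valid UTF-8 for U+00DC
my $bad = decode_filename("\xDC");       # lone latin-1 byte, not valid UTF-8

printf "valid:   U+%04X\n", ord $ok;
printf "invalid: kept %d raw octet(s)\n", length $bad->{invalid_utf8};
```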

The directory coders are written to properly save and restore these objects.

The scanner for Windows platforms will read the UTF-16 from the Windows API, and convert it to UTF-8 to match the behavior on Unix. The Extractor on Windows will reverse this process. Extracting files with invalid UTF-8 on Windows will fail.

The default storage format uses a Unicode-only format, and a special notation to represent strings which are not unicode. (See TO_JSON in InvalidUTF8.) Other formats (Minimal and Unix) always store octets, and then re-detect UTF-8 when decoding the directory.

SEE ALSO

Brackup - A similar-minded backup utility written in Perl, but without the separation between library and application and with limited FUSE performance. (and rather sparse documentation)

http://git-scm.com - The world-famous version control tool

http://www.fossil-scm.org - A similar but lesser known version control tool

https://github.com/apenwarr/bup - A fantastic idea for a backup tool, which operates on top of git packfiles, but has some glaring misfeatures that make it unsuitable for general purpose use. (doesn't save metadata? no way to purge old backups??)

http://rdiff-backup.nongnu.org/ - A popular incremental backup tool that works great on the small scale but fails badly at large-scale production usage. (exit 0 sometimes even when the backup fails? chance of leaving the backup in a permanently broken state if interrupted? record deleted files... with files, causing spool directory backups to contain 600,000 files in one directory? nothing to optimize the case where a user renames a dir with 20GB of data in it?)

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.