NAME

DataStore::CAS::FS::Importer - Copy files from filesystem into DataStore::CAS::FS.

VERSION

version 0.011000

SYNOPSIS

  my $cas_fs= DataStore::CAS::FS->new( ... );
  
  # Defaults are reasonable
  my $importer= DataStore::CAS::FS::Importer->new();
  $importer->import_tree( "/home/user", $cas_fs->path('/') );
  $cas_fs->commit();
  
  # Lots of customizability...
  $importer= DataStore::CAS::FS::Importer->new(
    dir_format => 'unix',   # optimized for storing unix-attrs
    filter => sub { return ($_[0] =~ /^\./)? 0 : 1 }, # exclude hidden files
    die_on_file_error => 0, # store placeholder for files that can't be read
  );

DESCRIPTION

The Importer is a utility class which performs the work of scanning directory entries of the real filesystem, storing new files in the CAS, and encoding new directories and storing those in the CAS as well. It has conditional support for the various Perl modules you need to collect all the metadata you care about, and can be subclassed if you need to collect additional metadata.

ATTRIBUTES

dir_format

  $class->new( dir_format => 'universal' );
  $importer->dir_format( 'unix' );

Read/write. Directory format to use when encoding directories. Defaults to 'universal'.

Directories can be recorded with varying levels of metadata and encoded in a variety of formats which are optimized for various uses. Set this to the format string of your preferred encoder.

The format strings are registered by DirCodec classes when loaded. Built-in formats are 'universal', 'minimal', and 'unix'; more are planned.

Calls to "import_tree" will encode directories in this format. If you wish to re-use the previously encoded directories during an incremental backup, you must use the same dir_format as before. This is because all directories get re-encoded every time, and the ones containing the same metadata will end up with the same digest-hash, and be re-used.

filter

Read/write. This optional coderef (which may also be an object with an overloaded function-call operator) filters out files that you wish to ignore when walking the physical filesystem.

It is passed 3 arguments: the name, the full path, and the results of 'stat' as a blessed arrayref. You are also guaranteed that stat was called on this file immediately beforehand, so you may also use code like "-d _".

Return 0 to exclude the file. Return 1 to store it. Return -1 to record its metadata (directory entry) but not its content.

  $importer->filter( sub {
    my ($name, $path, $stat)= @_;
    return 1 if -d _;                     # recurse into all directories
    return -1 if $stat->size > 1024*1024; # record large files, but don't store their content
    return 0 if substr($name,0,1) eq '.'; # exclude hidden files
    return 1;
  });

flags

Read/write. This is a hashref of parameters and options for how directories should be scanned and which information is collected. Each member of 'flags' has its own accessor method, but they may be accessed here for easy swapping of entire parameter sets. All flags are read/write, and most are simple booleans.
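
For example, swapping whole flag sets might look like the following sketch (it assumes the flags accessor accepts a replacement hashref, as implied by the read/write description above):

  # Save the current flag set, tweak a couple of flags, then restore everything.
  my $saved_flags= { %{ $importer->flags } };
  $importer->die_on_file_error(0);      # individual flag accessor
  $importer->flags->{collect_acl}= 1;   # or poke the hashref directly
  $importer->flags($saved_flags);       # swap the entire set back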

die_on_dir_error

true: Die if there is any problem reading the contents of a directory. false: Warn, and encode as a content-less directory.

Default: true

die_on_file_error

true: Die if there is any problem reading the contents of a file. false: Warn, and encode as a content-less file.

Default: true

die_on_hint_error

true: Die if there is an error looking up the "hint" for an incremental backup. false: Warn that the hint is unavailable, and just encode the file/directory as if no hint were being used.

Default: false

collect_metadata_ts

Default: true, if available and distinct from mtime.

If true, collect metadata_ts, which is the timestamp of the last change to the file's metadata. (ctime, on UNIX)

collect_access_ts

Default: false

If true, collects the attribute unix_atime.

This value is not collected by default because it changes frequently, many people don't use it anyway, and the Importer itself is likely to modify it.

collect_unix_perm

Default: true on unix

If true, collects attributes mode, unix_uid, unix_gid, unix_user, and unix_group.

collect_unix_misc

Default: false

If true, collects attributes unix_dev, unix_inode, unix_nlink, unix_blocksize, and unix_blockcount.

collect_acl

Default: false

If true, would collect the attribute unix_acl or windows_acl (neither of which is currently implemented, or has even been spec'd out).

collect_ext_attr

Default: false

If true, collects any "extended metadata" available for the file. This is unimplemented and attributes have not been spec'd out yet.

follow_symlink

Default: false

Use stat instead of lstat, i.e. follow symbolic links. Use this flag at your own risk. It might introduce recursion, and no code has been written yet to detect and prevent this. No symlinks will be recorded as symlinks if this is set.

The interaction of this flag with an incremental backup that contains symlinks (i.e. whether to follow symlinks within the "hint" directory) is unspecified. (I need to spend some time thinking about it before I can decide which makes the most sense)

cross_mountpoints

Default: false

Cross mount points. Leaving this as false will record mount points as a content-less directory. Mount points are detected by the device number changing in a call to stat. This is not robust protection against bind-mounts, however. Support for detecting bind-mounts might be added in the future.

reuse_digests

Default: 2

Options: false (off), 1 (size), 2 (size+mtime), 3 (size+ctime)

Many of the import methods accept a $hint parameter. Using digest hints greatly speeds up import operations, at the cost of the certainty of getting an exact copy.

The hint is a past result of importing a tree from the filesystem (a path object from DataStore::CAS::FS). If the size (and optionally metadata_ts / modify_ts) of the file has not changed, the digest_hash from the hint will be used instead of re-calculating it.

Make sure the metadata used by your chosen criterion is actually collected and stored in the encoded directories, or none of the hashes can be re-used. Specifically, you need collect_metadata_ts => 1 and dir_format => 'unix' or dir_format => 'universal' to make use of reuse_digests => 3.
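
As a concrete sketch of an incremental import relying on size+ctime matching (the paths are illustrative, and the path() call follows the SYNOPSIS):

  my $importer= DataStore::CAS::FS::Importer->new(
    dir_format          => 'universal',
    collect_metadata_ts => 1,  # ctime must be stored, or criterion 3 can never match
    reuse_digests       => 3,  # size + ctime
  );
  # import_tree uses the existing destination path as the hint, if one exists
  $importer->import_tree( "/home/user", $cas_fs->path('/home/user') );
  $cas_fs->commit();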

utf8_filenames

METHODS

new

  my $importer= $class->new( %attributes_and_flags )

The constructor accepts values for any of the official attributes. It also accepts all of the flag names, and will move them into the flags attribute for you.

No arguments are required, and the defaults should work for most people.
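
For instance, per the description above, flag names given at the top level end up inside the flags attribute (a sketch):

  my $importer= DataStore::CAS::FS::Importer->new( die_on_file_error => 0, collect_acl => 1 );
  # both values are now visible via the flags hashref:
  print $importer->flags->{die_on_file_error}, "\n";   # 0
  print $importer->flags->{collect_acl}, "\n";         # 1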

import_tree

  $self->import_tree( $path, $FS_Path_object )
  # returns true, or throws an exception

Recursively collect directory entries from the real filesystem at $path and store them at $FS_Path_object (which references an instance of FS, which in turn references an instance of CAS).

This will use the destination path for incremental-backup hints, if that feature is enabled on this Importer. If you want to make a clean import, you should first unlink the destination path, or turn off the "reuse_digests" flag.
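
For a clean, non-incremental run, a minimal sketch (the path is illustrative, and the flag accessor follows the flags description above) would be:

  # Disable digest reuse so every file is re-read and re-hashed:
  $importer->reuse_digests(0);
  $importer->import_tree( "/srv/data", $cas_fs->path('/srv/data') );
  $cas_fs->commit();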

import_directory

  $digest_hash= $importer->import_directory( $cas, $path, $hint );

Imports a directory from the real filesystem $path into the $cas, optionally using the virtual filesystem path $hint as a cache of previously-calculated digest hashes for files whose metadata matches.
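
A usage sketch, assuming the backing DataStore::CAS object is available as $cas_fs->store and that a previous import of the same tree exists to serve as the hint:

  my $cas= $cas_fs->store;                  # assumption: accessor for the backing CAS
  my $hint= $cas_fs->path('/home/user');    # previous import of this tree, or undef
  my $digest_hash= $importer->import_directory( $cas, "/home/user", $hint );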

import_directory_entry

  $dirEnt= $importer->import_directory_entry($cas, $path);
  # Or a little more optimized...
  $dirEnt= $importer->import_directory_entry($cas, $path, $ent_name, $stat, $hint);

This method scans a path on the real filesystem, and returns a *complete* DirEnt object, importing file contents and recursing into and encoding subdirectories as necessary.

collect_dirent_metadata

  $attrHash= $importer->collect_dirent_metadata( $path );
  # -or-
  $attrHash= $importer->collect_dirent_metadata( $path, $hint, $name, $stat );

This method returns a hashref of attributes about the named file. The only required parameter is $path; the others can be given to speed up execution. $path should be in platform-native form. $name will be calculated with File::Spec->splitpath if not provided. $stat should be an arrayref from stat() or lstat(), optionally blessed.

If $hint (a DirEnt) is given, and $path refers to a file with the same metadata (size, mtime) as the $hint, then $hint->ref will be used instead of re-calculating the digest of the file.
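
For a quick look at which attributes get collected under the current flags, a sketch using Data::Dumper for inspection (the path and the attribute names in the comment are illustrative):

  use Data::Dumper;
  my $attrHash= $importer->collect_dirent_metadata( "/home/user/file.txt" );
  print Dumper($attrHash);   # e.g. mode, unix_uid, unix_gid, modify_ts, ...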

STAT OBJECTS

The stat arrayrefs that the Importer passes to the filter are blessed to give you access to methods like '->mode' and '->mtime', but I'm not using File::stat. "Why??" you ask? Because blessing an arrayref from the regular stat is 3 times as fast, my accessors are twice as fast, and it requires a minuscule amount of code.
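
The idea is roughly the following sketch; the package name and accessor set shown here are illustrative, not the actual internals of Importer:

  package My::FastStat;            # hypothetical name, for illustration only
  sub mode  { $_[0][2] }           # indexes match the list returned by stat()
  sub size  { $_[0][7] }
  sub mtime { $_[0][9] }
  package main;

  my $stat= bless [ stat("/etc/hosts") ], 'My::FastStat';
  printf "mode=%o size=%d mtime=%d\n", $stat->mode, $stat->size, $stat->mtime;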

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.