OVERVIEW

File::Collector and its companion module File::Collector::Processor are base classes designed to make it easier to create custom modules for classifying and processing a collection of files as well as generating and processing data related to files in the collection.

For example, let's say you need to import raw files from one directory into some kind of repository. Let's say that files in the directory need to be filtered and the content of the files needs to be parsed, validated, rendered and/or changed before getting imported. Complicating things further, let's say that the name and location of the file in the target repository is dependent upon the content of the files in some way. Oh, and you also have to check to make sure the file hasn't already been imported.

This kind of task can be accomplished with a series of one-off scripts that process and import your files, with each script producing output suitable for the next script. But if such imports involve a high level of complexity, running separate scripts for each processing stage can be slow, tedious, error-prone and a headache to maintain and organize.

The File::Collector and File::Collector::Processor base modules can help you set up a chain of modules to combine a series of workflows into a single logical package that will make complicated file processing more robust, testable, and much simpler to code.

SYNOPSIS

There are three steps to using File::Collector. First, you create your Collector classes, one for each stage of your file processing. Next, you create Processor classes, one for each of your Collector classes. Finally, you write a simple script to actually do the processing.

Step 1: Create the Collector classes

  package File::Collector::YourCollector;
  use strict; use warnings;

  # Here we add in the package containing the processing methods associated
  # with the Collector (see below).
  use File::Collector::YourCollector::Processor;

  # Objects can store information about the files in the collection which
  # can be accessed by other Collector and Processor classes.
  use SomeObject;

  # Add categories for file collections with the _init_processors method. These
  # categories are used as labels for Processor objects which contain
  # information about the files and can run methods on them. In the example
  # below, we add two file collection categories, "good" and "bad."
  sub _init_processors {
    return qw ( good bad );
  }

  # Next we add a _classify_file method that is called once for each file
  # added when constructing our Collector object. The primary job of this
  # method is to add files and any associated objects to a Processor for
  # further processing.
  sub _classify_file {
    my $s = shift;

    # First, we create an object and associate it with our file using the
    # _add_obj method. There is no requirement that you create objects but they
    # will make data about your file easily available to other classes.
    # Offloading as much logic as possible to objects will keep classes simple.

    # Note how we pass the name of the current file being processed to the
    # object by using the "selected" method, which intelligently generates the
    # full path to the file currently being processed by _classify_file. Also
    # note that we don't have to bother passing the name of the file to the
    # _add_obj method, since this method can figure out which file is being
    # processed by calling the "selected" method as well.
    my $data = SomeObject->new( $s->selected );
    $s->_add_obj('data', $data);

    # Now that we know something about our file, we can classify it according
    # to any criteria of our choosing by assigning it to a processor category.
    if ( $data->{has_good_property} ) {
      $s->_classify('good');
    } else {
      $s->_classify('bad');
    }
  }

  # Finally, the _run_processes method contains method calls to your
  # Processor methods.
  sub _run_processes {
    my $s = shift;

    # Below are the methods we can run on the files in our collection. The
    # "good_files" method returns the collection of files classified as "good"
    # and the "do" method is a method that automatically iterates over the
    # files. The "modify" method is one of the methods in our Processor class
    # (see below).
    $s->good_files->do->modify;

    # Run methods on files classified as "bad"
    $s->bad_files->do->fix;
    $s->bad_files->do->modify;

    # You can call methods found in any of the earlier Processor classes you
    # run in your chain.
    $s->good_files->do->move;
    $s->bad_files->do->move;
  }

Step 2: Create your Processor classes.

  # Your Processor class must have the same package name as the Collector
  # class but with "::Processor" tacked on to the end.
  package File::Collector::YourCollector::Processor;

  # This line is required to get access to the methods from the base class.
  use parent 'File::Collector::Processor';

  # This custom method is run once for each file in a collection when we use
  # the "do" method.
  sub modify {
    my $s = shift;

    # Skip the file if it has already been processed.
    return if ($s->attr_defined ( 'data', 'processed' ));

    # Properties of objects added by Collector classes can be easily accessed.
    my @values = $s->get_obj_prop ( 'data', 'header_values' );

    # You can call methods found inside objects, too. Here we run the
    # add_header() method on the data object and pass on values to it.
    $s->obj_meth ( 'data', 'add_header', \@values );
  }

  # We can add as many additional custom methods as we need.
  sub fix {
    ...
  }

Step 3: Construct the Collector

Once your classes have been created, you can run all of your collectors and processors simply by constructing a Collector object.

The constructor takes three types of arguments: a list of the files and/or directories you want to collect; an array of the names of the Collector classes you wish to use, in the order you wish to employ them; and finally, an optional hash of options.

   my $collector = File::Collector::YourClassifier->new(
     # The first arguments are a list of resources to be added
     'my/dir', 'a_file.txt',

     # The second argument is an array of Collector class names listed in the
     # same order you want them to run
     [ 'File::Collector::First', 'File::Collector::YourCollector'],

     # Finally, an optional hash argument for options can be supplied
     { recurse => 0 });

   # The $collector object has some useful methods:
   $collector->get_count; # returns total number of files in the collection

   # Convenience methods with a little under-the-hood magic make it painless to
   # iterate over files and run methods on them.
   while ($collector->next_good_file) {
     $collector->print_short_name;
   }

   # Iterators can be easily created from Processor objects:
   my $iterator = $collector->get_good_files;
   while ( $iterator->next ) {
     # run Processor methods and do other stuff to "good" files
     $iterator->modify_file;
   }

DESCRIPTION

  my $collector = File::Collector->new( 'my/directory',
                                        [ 'Custom::Classifier' ],
                                        { recurse => 0 } );

Creates a Collector object to collect files from the directories and files in the argument list. Once collected, the files are processed by each of the Collector classes named in the array argument, in the order supplied. An option hash can be supplied to turn off directory recursion by setting recurse to false.

new returns an object which contains all the files, their processing classes, and any data you have associated with the files. This object has several methods that can be used to inspect it.

  $collector->add_resources( 'myfile1.txt', '/my/home/dir/files/', ... );

Adds additional file resources to an existing collection and processes them. This method accepts no option hash; the one supplied to the new constructor is reused.

  $collector->get_count;

Returns the total number of files in the collection.

  my @all_files = $collector->get_files;

Returns a list of the full path of each file in the collection.

  my $file = $collector->get_file( '/full/path/to/file.txt' );

Returns a reference to the data and objects associated with a file.

The list_files_long method prints the full path name of each file in the collection, sorted alphabetically, to STDOUT.

Same as list_files_long but prints the files' paths relative to the top level directory shared by all the files in the collections.
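To illustrate what "relative to the top level directory shared by all the files" means, here is a minimal sketch in plain Perl. It does not use File::Collector and is not the module's implementation. The helper names common_dir and short_names are hypothetical, and the sketch assumes absolute, forward-slash separated paths.

```perl
use strict;
use warnings;

# Hypothetical helper: return the deepest directory shared by every path.
sub common_dir {
    my @paths = @_;
    return '' unless @paths;

    # Directory components of each path, with the file name removed.
    my @dirs = map { my @p = split m{/}, $_; pop @p; [@p] } @paths;

    my @common = @{ shift @dirs };
    for my $d (@dirs) {
        my $i = 0;
        $i++ while $i < @common && $i < @$d && $common[$i] eq $d->[$i];
        splice @common, $i;    # keep only the matching leading components
    }
    return join '/', @common;
}

# Hypothetical helper: sorted paths with the shared directory stripped off.
sub short_names {
    my @paths = @_;
    my $top   = common_dir(@paths);
    return map { substr( $_, length($top) + 1 ) } sort @paths;
}

my @files = ( '/home/me/docs/sub/b.txt', '/home/me/docs/a.txt' );
print "$_\n" for short_names(@files);
```

With the two paths above, the shared directory is /home/me/docs, so the short names are a.txt and sub/b.txt.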

  while ($collector->next_good_file) {
    my $file = $collector->selected;
    ...
  }

The next_FILE_CATEGORY_file method retrieves the first file from the collection of files indicated by FILE_CATEGORY. Each subsequent call advances to the next file in the list. Returns a boolean false when the list of files is exhausted. This provides an easy way to iterate over files and perform operations on them.

FILE_CATEGORY must be a valid processor name as supplied by one of the _init_processors methods.
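The iteration behaviour described above (advance a cursor, return false when the list runs out, and expose the current file through "selected") can be pictured with a small stand-alone class. This is only a sketch in plain Perl: the My::FileList package is hypothetical, and the choice to rewind the cursor after the list is exhausted is an assumption of the sketch, not a documented behaviour of File::Collector.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one category of files with a built-in cursor.
package My::FileList;

sub new {
    my ( $class, @files ) = @_;
    return bless { files => [@files], pos => -1 }, $class;
}

# Advance the cursor. Returns true while a file is selected, and false
# (after rewinding the cursor) once the list is exhausted.
sub next {
    my $s = shift;
    if ( ++$s->{pos} < @{ $s->{files} } ) { return 1 }
    $s->{pos} = -1;
    return 0;
}

# Return the file the cursor currently points at, or undef if none.
sub selected {
    my $s = shift;
    return $s->{pos} < 0 ? undef : $s->{files}[ $s->{pos} ];
}

package main;

my $list = My::FileList->new( 'a.txt', 'b.txt' );
my @seen;
while ( $list->next ) {
    push @seen, $list->selected;
}
print "visited: @seen\n";
```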

  my $processor = $collector->good_files;

Returns the Processor object for the category indicated by FILE_CATEGORY.

Similar to FILE_CATEGORY_files() except a shallow clone of the Processor object is returned. Useful if you require separate iterators for files in the same category.

FILE_CATEGORY must be a valid processor name as supplied by one of the _init_processors methods.
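Why a shallow clone is enough to get an independent iterator can be shown with plain Perl data. In this sketch the hash layout (a "files" list plus a "pos" cursor) is hypothetical and only stands in for whatever a Processor object holds internally.

```perl
use strict;
use warnings;

# Hypothetical list object: shared file data plus a per-object cursor.
my $files    = [ 'a.txt', 'b.txt' ];
my $original = { files => $files, pos => 0 };

# Shallow clone: copy only the top-level hash. The cursor is copied by
# value, while the file list is still shared by reference.
my $clone = { %$original };

$clone->{pos}++;    # advance the clone's cursor only

print "original at $original->{pos}, clone at $clone->{pos}\n";
print "file list is shared\n" if $clone->{files} == $original->{files};
```

Because only the cursor is duplicated, two loops can walk the same category of files at the same time without disturbing each other.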

  sub _init_processors {
    return qw( category_1 category_2 );
  }

Creates new file categories. Internally, this method adds a new Processor object to the Collector for each category added so that Processor methods from custom Processor classes can be run on individual categories of files.
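One way to picture how a list of category names can turn into per-category containers and accessor methods such as "good_files" is sketched below. This is plain Perl and only an illustration: the My::Collector package is hypothetical, an array reference stands in for a Processor object, and nothing here is taken from the actual File::Collector source.

```perl
use strict;
use warnings;

package My::Collector;

sub _init_processors { return qw( good bad ) }

sub new {
    my $class = shift;
    my $s     = bless { processors => {} }, $class;

    for my $category ( $s->_init_processors ) {
        # One container per category (an array ref stands in for a
        # Processor object in this sketch).
        $s->{processors}{$category} = [];

        # Install a "<category>_files" accessor unless one already exists.
        my $name = "${category}_files";
        no strict 'refs';
        *{"${class}::$name"} = sub { $_[0]{processors}{$category} }
            unless $class->can($name);
    }
    return $s;
}

package main;

my $c = My::Collector->new;
push @{ $c->good_files }, 'a.txt';
print scalar @{ $c->good_files }, " good file(s)\n";
```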

  sub _classify_file {
    my $s = shift;

    # File classifying and analysis logic goes here
  }

Use this method to classify files and to associate objects with your files using the methods provided by the Collector class. This method is run once for each file in the collection.

  sub _run_processes {
    my $s = shift;

    # Processor method calls go here
  }

In this method, you should place various calls to Processor methods.

The _classify method is typically called from within the _classify_file method. It adds the file currently being processed to the collection of $category_name files contained within a Processor object which, in turn, belongs to the Collector object. The $category_name must match one of the processor names provided by the _init_processors methods.

Like the _classify method, the _add_obj method is typically called from within the _classify_file method. It associates the object specified by $object, under an arbitrary name specified by $object_name, with the file currently being processed.

Returns a boolean value reflecting whether the file being iterated over belongs to a category.

Returns the contents of an object's property.

Returns an object associated with a file.

Sets an object's property.

Runs the $method_name method on the object specified in $obj_name. Arguments are passed via $method_args.
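Calling a method whose name is held in a variable is ordinary Perl, and the sketch below shows the general technique. The My::Data class, the %objects registry and the call_on helper are all hypothetical names used for illustration; this is not how obj_meth itself is implemented.

```perl
use strict;
use warnings;

package My::Data;

sub new { return bless { header => [] }, shift }

# Append values to the header and return the new header length.
sub add_header {
    my ( $s, $values ) = @_;
    push @{ $s->{header} }, @$values;
    return scalar @{ $s->{header} };
}

package main;

# Hypothetical registry mapping object names to objects for one file.
my %objects = ( data => My::Data->new );

# Look up the object by name and call the named method on it.
sub call_on {
    my ( $obj_name, $method_name, @method_args ) = @_;
    my $obj = $objects{$obj_name} or die "No object named '$obj_name'";
    return $obj->$method_name(@method_args);
}

my $count = call_on( 'data', 'add_header', [ 'Title', 'Author' ] );
print "header now has $count value(s)\n";
```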

Retrieves the name of the file being processed, without the path.

Returns the full path and file name of the file being processed.

Returns a boolean value reflecting whether the object named by $obj_name exists.

Returns a boolean value reflecting whether the attribute specified by $attr_name is defined in the $obj_name object.

Returns a shortened path, relative to all the files in the entire collection, and the file name of the file being processed or in an iterator.

CONFIGURATION AND ENVIRONMENT

Requires no configuration files or environment variables.

SEE ALSO

File::Collector::Processor
