
NAME

    HPCI::File;

SYNOPSIS

An object that describes a file to be used by a stage. It records the path to the file and whether the file must be kept on separate long-term storage, distinct from the access path used by the stage program. If so, it also records how to copy the file between long-term storage and working storage, and whether the copying can be done by the parent program or must be done by the stage.

The stage attribute files contains descriptive info about file management for the stage; most of its component sections include a file. That file can be specified either as a string or as an object that is an HPCI::File or a subclass thereof. A bare string will normally be converted to an HPCI::File object, but the stage can contain a fileclass attribute string that over-rides that default. Additionally, the files attribute can contain a fileclass component that over-rides either of those defaults for the one stage.

ATTRIBUTES

file

The name of the file.

abs_file

The absolute pathname of the file. Not fully used yet.

sum

Boolean, indicates whether checksums are used for this file. If they are, the checksum is kept in a YAML file file.sum. This YAML file contains an array of hashes. Each hash has the keys 'type' and 'sum'. For each checksum type requested (default is 'md5' at present, will expand to include 'sha1' in the future) there is an entry in the array containing the checksum computed for the corresponding method.
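As an illustration of the structure described above, the following is a minimal sketch that computes an MD5 checksum and builds the array of hashes with the keys 'type' and 'sum'. The function names (md5_of_file, sum_entries, entries_to_yaml) are hypothetical and are not part of HPCI; the exact YAML layout that HPCI writes is also an assumption. Only the core module Digest::MD5 is used.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# Build the array-of-hashes structure, one entry per checksum type.
# Only 'md5' is handled in this sketch.
sub sum_entries {
    my ($path) = @_;
    return [ { type => 'md5', sum => md5_of_file($path) } ];
}

# Render the entries as a YAML sequence of mappings. Hand-rolled to avoid
# non-core modules; the layout is an assumption, not HPCI's actual output.
sub entries_to_yaml {
    my ($entries) = @_;
    my $yaml = "---\n";
    $yaml .= "- sum: '$_->{sum}'\n  type: $_->{type}\n" for @$entries;
    return $yaml;
}
```

Calling entries_to_yaml(sum_entries($path)) would give the text that such a sketch might save next to the data file as file.sum.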

sum_generate_in

A boolean value, default is false.

Specifies the action taken if the file is used for input and the file.sum checksum file is either not present or older than file.

When false, the stage fails.

When true, the checksum is computed and file.sum is saved, and then the stage is allowed to run normally.

The default is false to ensure that changes to input data files are made deliberately; an accidental edit should be considered an error.

You would set the value to true when first receiving a newly downloaded file from an outside source. When a file is created as an 'out' file, the sum is always created (if the sum attribute is true), so the default of false does not cause problems for later stages.
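The decision described in this section can be sketched as a small function. This is an illustration only: the name sum_file_action and its return values are hypothetical, and treating equal timestamps as "not older" is an assumption.

```perl
use strict;
use warnings;

# Decide what to do about the checksum file of an input file.
#   $file_mtime - modification time of the data file (epoch seconds)
#   $sum_mtime  - modification time of file.sum, or undef if it is missing
#   $generate   - the sum_generate_in setting (boolean)
# Returns 'ok', 'generate', or 'fail'.
sub sum_file_action {
    my ($file_mtime, $sum_mtime, $generate) = @_;
    # Assumption: a file.sum with the same timestamp as file is current.
    my $stale = !defined $sum_mtime || $sum_mtime < $file_mtime;
    return 'ok' unless $stale;
    return $generate ? 'generate' : 'fail';
}
```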

sum_validate_in

This can be given a string ('timestamp', 'once', 'always') to indicate how rigorously the checksum is validated.

The default is 'once'.

The setting 'timestamp' accepts the file as valid if the file.sum file exists and is newer than file. (If it is missing or older, then sum_generate_in controls how it is handled.)

The setting 'once' loads the file.sum data and verifies the checksum(s) explicitly the first time the file is used for input; after that point the file is accepted as valid if the timestamps have not changed. (This avoids recomputing the checksum(s) for every stage that reuses the same file.)

The setting 'always' validates the checksum(s) for every stage that uses the file.
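The three settings can be compared with a short sketch. It assumes that file.sum already exists and is current (the other case is handled by sum_generate_in), and that the caller verifies the checksum immediately and fails the stage on a mismatch. The name must_verify and the %verified cache are hypothetical, not HPCI internals.

```perl
use strict;
use warnings;

# Remembers, per file, the timestamps seen when its checksum was last verified.
my %verified;

# Returns true when the checksum(s) should be recomputed and compared
# before the file is used for input.
#   $mode - 'timestamp', 'once', or 'always'
sub must_verify {
    my ($mode, $file, $file_mtime, $sum_mtime) = @_;
    return 1 if $mode eq 'always';      # verify for every stage
    return 0 if $mode eq 'timestamp';   # a current file.sum is trusted as-is
    die "unknown sum_validate_in mode: $mode\n" unless $mode eq 'once';

    # 'once': verify the first time, then again only if a timestamp changed.
    my $stamp = "$file_mtime/$sum_mtime";
    return 0 if defined $verified{$file} && $verified{$file} eq $stamp;
    $verified{$file} = $stamp;
    return 1;
}
```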

_shared

This is an internal attribute that specifies that the file is located on a file system that is shared by all of the nodes in the cluster. This value is over-ridden by subclasses of HPCI::File which provide for files which are not on a shared file system. Such subclasses must provide get and put methods to copy files between the local file system and a repository that is accessible from all nodes (although not necessarily on the file system). They will also define their own additional attributes as needed to provide the details of accessing the repository.
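To show the general shape that a pair of get and put methods might take, here is a standalone sketch. It is not a real HPCI::File subclass: the package name My::CopiedFile, the constructor arguments local and repo, and the method signatures are all hypothetical, and a plain directory stands in for the repository.

```perl
package My::CopiedFile;   # hypothetical; not part of HPCI
use strict;
use warnings;
use File::Copy qw(copy);

sub new {
    my ($class, %arg) = @_;
    # local: path on the node's file system; repo: path in the repository
    return bless { local => $arg{local}, repo => $arg{repo} }, $class;
}

# Fetch the file from the repository into local storage.
sub get {
    my ($self) = @_;
    copy($self->{repo}, $self->{local})
        or die "get failed for $self->{repo}: $!";
    return $self->{local};
}

# Store the local file back into the repository.
sub put {
    my ($self) = @_;
    copy($self->{local}, $self->{repo})
        or die "put failed for $self->{local}: $!";
    return $self->{repo};
}

1;
```

A subclass for a repository that is not on a file system would replace the copy calls with whatever transfer mechanism the repository requires.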

Storage management

HPCI has two types of storage that it can deal with.

Long-term storage is storage that is reliably preserved over time. The data in long-term storage is accessible to all nodes on the cluster.

Working storage is storage that is directly accessible to a node. It can be a private disk that is not accessible to other nodes, or it can be a shared file system that is available to other nodes.

Some types of cluster can have their nodes rebuilt at intervals, losing the data on their local disks.

Some types of cluster have no shared file system, or have size limitations on their shared file system.

There are three scenarios:

fully-shared

When there is a reliable, fully-shared file system that has no limitations to prevent it being used for the data, then files can be on that file system. The same path will refer to the long-term and working locations.

partially-shared

When there is a fully-shared file system that is not reliable for long-term storage, it might be used as working storage. That allows the parent process to carry out operations on the files for the stage, and allows skipping the download of files that are still present when the storage has not been cleared.

node-private

When it is necessary to use storage that is only accessible by the individual node for working storage, the stage program must carry out all upload and download operations, as well as any validation checks that require data access to the file.

This class has two methods that specify the storage scenario that it provides.

method has_long_term

The method has_long_term specifies whether this file uses a separate long-term storage facility that is distinct from the working storage that will be used by the stage for direct reading and/or writing.

method has_shared_storage

The method has_shared_storage specifies whether the working storage location used by the stage is also accessible to the parent program and/or other stages.

Settings to indicate storage scenario

                         has_long_term   has_shared_storage
                       +---------------+--------------------+
    fully-shared       |     false     |       true         |
                       +---------------+--------------------+
    partially-shared   |     true      |       true         |
                       +---------------+--------------------+
    node-private       |     true      |       false        |
                       +---------------+--------------------+

Subclasses of this class will over-ride these methods to specify alternative values appropriate to the particular subclass.

Note that having both methods return false is not a workable configuration: that would imply that there is no storage that allows passing data between stages.
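The table above can be read as a simple mapping from the two return values to a scenario name. The following sketch expresses that mapping; the function storage_scenario is hypothetical and takes plain booleans rather than an HPCI::File object.

```perl
use strict;
use warnings;

# Map the results of has_long_term and has_shared_storage to the
# scenario names used in the table. Dies for the unworkable combination.
sub storage_scenario {
    my ($has_long_term, $has_shared_storage) = @_;
    return 'partially-shared' if  $has_long_term &&  $has_shared_storage;
    return 'node-private'     if  $has_long_term && !$has_shared_storage;
    return 'fully-shared'     if !$has_long_term &&  $has_shared_storage;
    die "unworkable: no long-term storage and no shared working storage\n";
}
```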