NAME

    HPCI::Group

SYNOPSIS

Role for building a cluster-specific driver for a group of stages. This should only be used internally to the HPCI module - code that uses this driver will not load this module (or the driver module) explicitly.

It describes the user interface for a generic group, hiding (as much as possible) the specifics of the actual cluster that is being used. The driver module that consumes this role will arrange to translate the generic interface into the particular interface conventions of the specific cluster that it accesses.

An (internally defined) cluster-specific group object is defined with:

    package HPCD::$cluster::Group;
    use Moose;

    ### required method definitions

    with 'HPCI::Group' => { StageClass => 'HPCD::$cluster::Stage' },
        # any other roles required ...
        ;

    ### cluster-specific method definition if any ...

DESCRIPTION

This role provides the generic interface for a group object which can configure and run a collection of stages (jobs) on machines in a cluster. It is written to be independent of the specifics of any particular cluster interface. The cluster-specific module that consumes this role is not accessed directly by the user program - instead, the user calls the "class method" HPCI->group (with an appropriate cluster argument), which builds and returns a group driver object of the appropriate cluster-specific type.
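For example, a user program might obtain and use a group object as follows. This is a sketch only; the cluster name 'MyCluster' and the stage parameters are illustrative placeholders, not values defined by HPCI:

    use strict;
    use warnings;
    use HPCI;

    # Request a group driver object for a particular cluster type.
    # HPCI selects and loads the matching HPCD::*::Group class.
    my $group = HPCI->group(
        cluster => 'MyCluster',      # placeholder cluster type
        name    => 'my_analysis',
    );

    # Stages are created through the group, then run together.
    $group->stage(
        name    => 'step1',
        command => 'echo hello',
    );

    my $status = $group->execute;
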

ATTRIBUTES

name (optional)

The name of this group of stages. Defaults to 'default_group_name'. Not actually required for a group.

cluster

The type of cluster that will be used to execute the group of stages. This value is passed on internally by the HPCI->stage method when it creates a new stage, or by the group->subgroup method when it creates a new subgroup. Since HPCI also uses this value to select the type of group object that is created, the attribute is somewhat redundant.

base_dir (optional)

The directory that will contain all generated output (unless that output is specifically directed to some other location). The default is the current directory.

storage_classes (optional) (Not currently used, needs work)

HPCI has two conceptual storage types that it expects to be available.

Long-term storage is storage that is reliably preserved over time. The data in long-term storage is accessible to all nodes on the cluster, but possibly only accessible through special commands.

Working storage is storage that is directly-accessible to a node through normal file system access methods. It can be a private disk that is not accessible to other nodes, or it can be a shared file system that is available to other nodes.

It is fairly common, and most convenient, if the working storage also qualifies as long-term storage. That is the default expectation if HPCI is not told otherwise.

However, some types of cluster can have their nodes rebuilt at intervals, losing the data on their local disks. Some types of cluster have a shared file system that is not strongly persistent, but which can be rebuilt at intervals. Some types of cluster have shared file systems that have size limitations that mean that some or all of the data sets for stage processes cannot be stored there.

In such cases, some or all of the files must have a long-term storage location that is different from the more convenient working storage location that will be used when stages are running. Depending upon the environment, the long-term storage will use something like:

network accessed storage

A storage facility that allows files to be uploaded and downloaded through a network connection (from any node in the cluster).

parent managed storage

The job controlling program may have long-term storage of it own that is not accessible to other nodes in the cluster. If there is a large enough shared file system (that for some reason cannot be used as long-term storage) the parent HPCI program can copy files between that storage and the shared storage as needed to make the files available and to preserve the results.

bundled storage

In a cloud layout there is often no file system shared amongst all of the nodes and the parent process. In this type of cluster, a submitted stage will include some sort of bundle containing a collection of data and control files (possibly even the entire operating system) to be used by the stage, and a similar sort of bundle is recovered to provide the results of running a stage. (This could be, for example, a docker image.)

The attribute storage_classes defines the available storage classes that can be used for stage files.

In most cases, all files for all stages will be of the same storage class, but some cluster configurations will have multiple storage choices and can have, for some stages, the need to use more than one of the storage classes for different files within the same job.

To cater to both of these, the storage_classes attribute can either be a single arrayref which will be used for all files, or it can be a hash of named arrayrefs, with the name being used to select the class for each individual file. A file (described in the files attribute of the stage class) can either be a scalar string specifying the pathname of the working storage location that will be used, or it can be a one element hashref, with the key selecting the storage class and the value providing the working storage pathname. If the hash of named arrayrefs is used, one of the elements of the hash should have the key default - that will be used for files which do not provide an explicit storage class key.
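As a sketch of the two shapes this attribute can take (the class name HPCD::MyCluster::NetFile and the storage-class key netstore are hypothetical, used purely to illustrate the structure):

    # A single arrayref - the same class is used for every file:
    storage_classes => [ 'HPCI::File' ],

    # Or a hash of named arrayrefs; the 'default' entry covers
    # files that do not name a storage class explicitly:
    storage_classes => {
        default  => [ 'HPCI::File' ],
        netstore => [ 'HPCD::MyCluster::NetFile' ],   # hypothetical
    },

    # In a stage's files attribute, a plain string uses the
    # default class; a one-element hashref selects a class:
    files => {
        in => [
            'input/data.txt',                      # default class
            { netstore => 'input/big_data.txt' },  # netstore class
        ],
    },
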

The default value for this attribute is:

    [ 'HPCI::File' ]

The HPCI::File class defines the usage for the common case in which there is no need for a long-term storage area that is different from the working storage area.

Classes that provide a separate long-term storage area will usually require additional arguments specifying access control information (such as a URL, username, or password) and how to map the working storage pathname into the corresponding location in long-term storage.

See the documentation for HPCI::File for details on writing new classes.

file_class

The default storage class attribute for files that do not have an explicit class given. This is the name of a class. The default is HPCI::File, but a sub-class of HPCI::File can be provided instead.
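A sketch of overriding this attribute (the subclass name My::HPCI::File is hypothetical; it stands in for any HPCI::File subclass you have written):

    my $group = HPCI->group(
        cluster    => 'MyCluster',       # placeholder cluster type
        name       => 'example',
        file_class => 'My::HPCI::File',  # hypothetical HPCI::File subclass
    );
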

_default_file_info (internal)

This over-rides the _default_file_info method from HPCI::Super, which sets the default contents of the file_params attribute. HPCI::Super normally copies from the parent group, but since this top-level group has no parent, it is set to an empty hash.

_unique_name (internal)

The name of the top-level group with a timestamp added to make it unique, so that the directory created to hold the files from the execution of the group will not conflict with other runs.

connect (optional)

This can contain a value to be used by the driver for types of cluster where it is necessary to connect to the cluster in some way. It can be omitted for local clusters that are directly accessible.

login, password (optional)

These can contain values to be used by drivers for clusters whose connection process requires a login and password or some other similar authorization data.

max_concurrent (optional)

The maximum number of stages to be running concurrently. The value 0 (which is the default) means that there is no limit applied directly by HPCI (although the underlying cluster-specific driver might apply limits of its own). This limit is for the group and there is no mechanism provided at present to manage the number of stages for a user across separately invoked programs. So, if your cluster requires the user to limit the number of separate jobs a user is running simultaneously then this can only be a partial solution for you.
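Putting the optional attributes above together, a group might be constructed as follows. This is a sketch; the cluster name, host, and credential values are placeholders:

    use strict;
    use warnings;
    use HPCI;

    my $group = HPCI->group(
        cluster        => 'MyCluster',               # placeholder cluster type
        name           => 'nightly_run',
        base_dir       => '/scratch/nightly',
        connect        => 'head-node.example.org',   # placeholder host
        login          => 'someuser',                # placeholder login
        password       => $ENV{CLUSTER_PASSWORD},    # avoid hard-coding secrets
        max_concurrent => 4,   # at most 4 stages running at once
    );
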

status (provided internally)

After the execute method has been called, this attribute contains the return result from the execution. This is a hash (indexed by stage name). The value for each stage is an array of the return status. (Usually, this array has only one element, but there will be more if the stage was retried. The final element of the array is almost always the one that you wish to look at.) The return status is a hash - it will always contain an element with the key 'exit_status' giving the exit status of the stage. Additional entries may be found in the hash for cluster-specific return results. Thus, to check the exit status of a particular stage you would code either:

    $result = $group->execute;
    if ($result->{SOMESTAGENAME}[-1]{exit_status}) {
        die "SOMESTAGENAME failed!";
    }

or:

    $group->execute;
    # ...
    if ($group->status->{SOMESTAGENAME}[-1]{exit_status}) {
        die "SOMESTAGENAME failed!";
    }

file_system_delay

Shared file systems can have a delay period during which an action on the file system is not yet visible to other nodes sharing that file system. This is common for NFS shared file systems, for example.

The file_system_delay attribute can be given a non-zero number of seconds to indicate the amount of time to wait to ensure that actions taken on another node are visible.

This is used for internal actions such as validating that required stage output files have been created.

METHODS

$group->execute

Execute the stages in the group. Does not return until all stages are complete (or have been skipped because of a failure of some other stage or the attempt is aborted).

AUTHOR

Christopher Lalansingh - Boutros Lab

John Macdonald - Boutros Lab

ACKNOWLEDGEMENTS

Paul Boutros, PhD, PI - Boutros Lab

The Ontario Institute for Cancer Research