
NAME

        HPCD::SLURM::Run

SYNOPSIS

        use HPCD::SLURM::Run;

DESCRIPTION

        This module executes the SLURM commands srun, sacct and scancel.
        srun: combines the user-supplied stage attributes into an srun command and
        submits it for the system to run.
        sacct: provides a method that parses the accounting information returned
        by the sacct command.
        scancel: runs the scancel command when a job needs to be killed.

ATTRIBUTE

stats_keys

stats_keys records the names of the accounting information keys. Defaults to @job_info_fields.

METHODS

after '_collect_job_stats'

        Calls the subroutine _collect_sacct_info and updates the hash in the stats attribute
        with the exit status and the accounting information.

_collect_sacct_info

        Executes the command 'sacct --name $id --format=$key' once for each single key $key
        listed by sacct -e, i.e. n times, where n is the number of keys sacct -e returns. Stores
        each $key with the accounting value returned for it in the hash %info and returns %info.

        sacct is called once per key rather than once for all keys (sacct --name $id
        --format=$key1,$key2,$key3,...) because some fields may be blank. For example, the
        output of sacct --name $id --format=Account,User,Comment,ReqMem might be

        Account User    Comment ReqMem
        ----------------------------------
                jdoe                    2Gn

        Because sacct pads its columns with spaces, a blank field cannot be told apart from the
        column spacing, so the values cannot reliably be matched to their keys. Querying a single
        key per call (--format=$key) makes every blank value explicit and keeps each key-value
        pair correctly matched.
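The per-key strategy can be sketched in Perl as follows. This is a minimal illustration, not the module's code: parse_single_field and collect_fields are hypothetical names, the command runner is passed in as a code reference so the parsing can be tried without a SLURM installation, and the sketch assumes the first row under the dashed separator belongs to the job itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value from the output of a single-field
# sacct query (a header line, a dashed separator, then data rows).
# A blank field comes back as an empty string instead of shifting columns.
sub parse_single_field {
    my ($output) = @_;
    my @lines = split /\n/, $output;
    splice @lines, 0, 2;                         # drop the header and the dashes
    return '' unless @lines;                     # no data row at all
    (my $value = $lines[0]) =~ s/^\s+|\s+$//g;   # trim the column padding
    return $value;
}

# Hypothetical helper: query sacct once per key and build the key => value map.
# $run is a code reference that executes a command and returns its output.
sub collect_fields {
    my ($id, $run, @keys) = @_;
    my %info;
    for my $key (@keys) {
        $info{$key} = parse_single_field($run->("sacct --name $id --format=$key"));
    }
    return %info;
}
```

On a cluster the runner could be sub { scalar qx{$_[0]} }. The cost of this design is one sacct process per key, traded for unambiguous parsing.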

around 'soft_timeout'

        Replaces the original HPCI 'soft_timeout' method and cancels the job directly.

around 'hard_timeout'

        Cancels the job before calling the original HPCI method 'hard_timeout'.

        The original hard_timeout sends a kill signal to the child process. Here, that
        process is the local "srun" program, not the actual job, which runs on another
        computer and so cannot be reached with kill. Continuing on to the original
        method's sleep and kill signal still cleans up the local process if the
        cancellation fails. Usually the cancellation succeeds, and the kill signal is
        sent to a process that has already terminated.

_delete_job

        Terminates the job by calling scancel -n $id.

_to_MB

        A subroutine which converts a memory value with a K, M, G or T unit suffix to a
        number of megabytes, so that the value passed to the srun --mem= option is a plain
        number, which srun interprets as megabytes by default.

        Example:
                $self->_to_MB('2G') would return 2048
                $self->_to_MB('100M') would return 100
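A stand-alone sketch of the conversion, written as a plain function (to_MB, a hypothetical name) rather than the module's method. The assumptions are: units are binary multiples of 1024 (consistent with '2G' giving 2048), a bare number is already in megabytes, and a fractional result is rounded up to a whole megabyte.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Megabytes per unit, using binary multiples (consistent with 2G => 2048).
my %TO_MB = ( K => 1 / 1024, M => 1, G => 1024, T => 1024 * 1024 );

sub to_MB {
    my ($mem) = @_;
    my ($num, $unit) = $mem =~ /^\s*(\d+(?:\.\d+)?)\s*([KMGT]?)\s*$/i
        or die "Unrecognized memory value: $mem\n";
    # Assumption: a value without a unit is already in megabytes.
    $unit = $unit eq '' ? 'M' : uc $unit;
    # Assumption: round fractional megabytes up so srun gets a whole number.
    return ceil($num * $TO_MB{$unit});
}
```

With these rules to_MB('2G') returns 2048 and to_MB('512K') rounds up to 1.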

_reformat_time

        A subroutine which converts $sec (a number of seconds) to minutes:seconds,
        hours:minutes:seconds, or days-hours:minutes:seconds, all of which are formats
        accepted by the srun --time= option.

        Example:
                $self->_reformat_time(1) would give '0:1'
                $self->_reformat_time(70) would give '1:10'
                $self->_reformat_time(3601) would give '1:0:1'
                $self->_reformat_time(86400) would give '1-0:0:0'
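A minimal sketch of the conversion as a plain function (reformat_time, a hypothetical name), assuming $sec is a non-negative whole number of seconds. It picks the shortest of the three forms and does not zero-pad the fields, matching the examples above.

```perl
use strict;
use warnings;

# Convert seconds to M:S, H:M:S or D-H:M:S, whichever is the shortest form
# that can hold the value; fields are not zero-padded.
sub reformat_time {
    my ($sec) = @_;
    my $days  = int($sec / 86400);
    my $hours = int(($sec % 86400) / 3600);
    my $mins  = int(($sec % 3600) / 60);
    my $secs  = $sec % 60;
    return sprintf('%d-%d:%d:%d', $days, $hours, $mins, $secs) if $days;
    return sprintf('%d:%d:%d', $hours, $mins, $secs)           if $hours;
    return sprintf('%d:%d', $mins, $secs);
}
```

The same rule gives '16:40' for 1000 seconds and '3:20' for 200 seconds.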

_res_value_map

        A subroutine which converts one key-value pair of the stage attribute
        resources_required into the corresponding srun option.

        Example:
                If the key and value in resources_required are 'mem' and '3G', then
                _res_value_map returns '--mem=3072'.
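A sketch of the key-to-option mapping as a dispatch table. It is hypothetical: it covers only the mem and h_time keys used in this document's examples, dies on any other key, and carries compact copies of the memory and time conversions (binary units, bare numbers in MB, h_time in whole seconds) so that it runs on its own.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Compact stand-ins for the conversions (assumed: binary units, bare
# numbers are MB, h_time is a whole number of seconds).
my %TO_MB = ( K => 1 / 1024, M => 1, G => 1024, T => 1024 * 1024 );

sub to_MB {
    my ($num, $unit) = $_[0] =~ /^(\d+(?:\.\d+)?)([KMGT]?)$/i
        or die "Unrecognized memory value: $_[0]\n";
    return ceil($num * $TO_MB{ uc($unit) || 'M' });
}

sub reformat_time {
    my $s = shift;
    my ($d, $h, $m, $sec) = (int($s / 86400), int($s % 86400 / 3600), int($s % 3600 / 60), $s % 60);
    return $d ? "$d-$h:$m:$sec" : $h ? "$h:$m:$sec" : "$m:$sec";
}

# Hypothetical dispatch table: resources_required key => srun option builder.
my %RES_MAP = (
    mem    => sub { '--mem=' . to_MB($_[0]) },
    h_time => sub { '--time=' . reformat_time($_[0]) },
);

sub res_value_map {
    my ($key, $value) = @_;
    my $builder = $RES_MAP{$key}
        or die "No srun mapping for resource '$key'\n";
    return $builder->($value);
}
```

A table keeps each resource's conversion in one place, so supporting another key means adding one entry.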

_get_mapped_resources_string

        A subroutine which maps parameters in resources_required to a string of srun
        options.

        Example:
                If resources_required is {"mem" => "100M", "h_time" => 1000}, then the
                output is '--mem=100 --time=16:40'.
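A minimal sketch of the string assembly, assuming a per-pair converter such as _res_value_map is passed in as a code reference (a hypothetical interface chosen to keep the example self-contained). The options are sorted so that the result does not depend on Perl's hash ordering; the module itself may order them differently.

```perl
use strict;
use warnings;

# Join the srun option produced for every resources_required entry.
# $map converts one (key, value) pair into an option string.
sub get_mapped_resources_string {
    my ($resources, $map) = @_;
    my @options = map { $map->($_, $resources->{$_}) } keys %$resources;
    return join ' ', sort @options;
}
```

For the example above, a converter returning '--mem=100' for mem and '--time=16:40' for h_time yields '--mem=100 --time=16:40'.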

_get_submit_command

        A subroutine which combines the attributes of a stage (i.e. shell_script,
        unique name, stdout, stderr, native_args_string and resources_required) into a single
        srun command for the system to execute.

        Example:
                If the stage's script_file is script.sh, its unique_id is NAME12345, its
                native_args_string is "-N 2 -n 4 --mail-type=ALL --mail-user=jdoe@xyz.com",
                and its resources_required is {"mem" => "5G", "h_time" => 200}, then this
                subroutine returns 'srun -N 2 -n 4 --mail-type=ALL --mail-user=jdoe@xyz.com
                --mem=5120 --time=3:20 -J NAME12345 -o someoutputpath -e someerrorpath script.sh'.
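A sketch of the assembly step, assuming the stage attributes arrive as a hash with hypothetical keys, and that resources_required has already been mapped to an option string (resources_string, also a hypothetical name). Given the values in the example above, it produces the same command line.

```perl
use strict;
use warnings;

# Assemble an srun command line from a stage's attributes
# (hypothetical hash keys; resources_string is the mapped resources_required).
sub get_submit_command {
    my (%stage) = @_;
    my @parts = ('srun');
    for my $opt (qw(native_args_string resources_string)) {
        push @parts, $stage{$opt} if defined $stage{$opt} && length $stage{$opt};
    }
    push @parts,
        '-J', $stage{unique_id},    # job name
        '-o', $stage{stdout},       # file for the job's standard output
        '-e', $stage{stderr},       # file for the job's standard error
        $stage{script_file};        # the script srun runs
    return join ' ', @parts;
}
```

Joining with spaces performs no shell quoting, so a path containing spaces or shell metacharacters would need quoting, or the parts could be passed as a list to system instead.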

AUTHOR

John Macdonald - Boutros Lab

Anqi (Joyce) Yang - Boutros Lab

ACKNOWLEDGEMENTS

Paul Boutros, PhD, PI - Boutros Lab

The Ontario Institute for Cancer Research