=head1 Name
HPC::Runner::Usage
=head1 HPC-Runner-Scheduler
=head2 Overview
The HPC::Runner modules are wrappers around GNU Parallel,
Parallel::ForkManager, and MCE/MCE::Queue, and around job submission to a
Slurm or PBS queue.
=head2 Submit your commands
##Submit a job to slurm to spread across nodes
slurmrunner.pl --infile /path/to/fileofcommands --outdir slurmoutput --jobname slurmjob
##Run in parallel using threads on a single node
parallelrunner.pl --infile /path/to/fileofcommands --outdir threadsoutput --procs 4
##Run in parallel using MCE on a single node
mcerunner.pl --infile /path/to/fileofcommands --outdir threadsoutput --procs 4
=head3 Run Your Command
The idea behind the HPC::Runner modules is to run arbitrary bash commands
with proper logging, capturing STDOUT/STDERR and exit status, and, where
possible, to run jobs in parallel with some simple job flow.
The modules are written with Moose, and can be overridden and
extended.
Logging is done with Log::Log4perl.
HPC::Runner is a base class that holds the attributes common to
HPC::Runner::Threads, HPC::Runner::MCE, and HPC::Runner::Slurm. All
three modules share the same philosophy but use different technologies
to implement it. For me this was a workaround so I didn't have to
learn to write MPI scripts, or have every tool rewritten into some
sort of workflow manager.
The different runners each come with an executable script that should
be installed in your path: mcerunner.pl, parallelrunner.pl, and
slurmrunner.pl.
=head1 An In-Depth Look
=head2 Single Node Execution
If you only have a single node to execute on, but still have many
threads/processes available, you can use HPC::Runner::MCE or
HPC::Runner::Threads. Both have job logging and workflow control.
=head2 Example for Runner::Threads and Runner::MCE
An example infile would contain any command that can be executed from
the command line. All the modules have a basic level of workflow
management, meaning you can use the command 'wait' to wait for all
other threads/processes to finish.
In the example directory there is a script called testioselect.pl. It is
taken directly from a PerlMonks thread discussing the proper use of
IPC::Open3, found at http://www.perlmonks.org/?node_id=151886. I based all
of the code for running bash commands on the post by the user abstract,
adding only the logging.
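The snippet below is a minimal sketch in the spirit of that post, not the
actual code of testioselect.pl or the modules: it runs a single command,
captures STDOUT and STDERR separately, and records the exit code, which is
roughly what happens for each logged command. The command string is only a
stand-in.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use IPC::Open3;
    use IO::Select;
    use Symbol 'gensym';

    my $cmd = 'echo hello; echo oops >&2';    # stand-in for a line of your infile
    my $err = gensym;                         # open3 needs a pre-made handle for STDERR
    my $pid = open3( my $in, my $out, $err, $cmd );
    close $in;                                # we send no input

    my $sel = IO::Select->new( $out, $err );
    while ( my @ready = $sel->can_read ) {
        for my $fh (@ready) {
            my $n = sysread( $fh, my $buf, 4096 );
            if ( !$n ) { $sel->remove($fh); next; }    # EOF on this handle
            print( ( $fh == $err ? 'STDERR: ' : 'STDOUT: ' ), $buf );
        }
    }
    waitpid( $pid, 0 );
    print 'exit code: ', $? >> 8, "\n";       # this is what ends up in the logs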
You could create a test_threads/mce.in containing something like the following.
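The exact contents are up to you; the commands below are only illustrative.
Any commands that run from your shell will do, and a lone 'wait' line marks a
synchronization point.

    echo "running job 1" && sleep 2
    echo "running job 2" && sleep 2
    echo "running job 3" && sleep 2
    echo "running job 4" && sleep 2
    wait
    echo "job 5 starts only after jobs 1-4 have finished"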
It is ALWAYS a good idea to use full paths when running arbitrary
scripts. Jobs will always be run from your current working directory, but it's still a good idea!
Then submit it to Runner::MCE or Runner::Threads with one of the following.
#Using MCE
mcerunner.pl --infile test_mce.in --outdir `pwd`/test --procs 4
# OR
# Using Parallel::ForkManager
parallelrunner.pl --infile test_mce.in --outdir `pwd`/test --procs 4
This will create the test directory, write logs for each command detailing
STDOUT/STDERR and the time and date, and run the commands
4 threads/processes at a time.
Each command gets its own log file, as well as a MAIN log file to
detail how the job is running overall.
=head3 Troubleshooting mcerunner and parallelrunner
First of all, make sure your jobs run without the wrapper script.
Runner::Threads/MCE only makes sure your threads/processes start. It
does not make sure your jobs exit successfully, but the exit code will
be in your log.
View your logs in
outdir/prunner_runner_datetime_randomstr/CMD#_datetime, where datetime and
randomstr are filled in per run. These files give you the STDOUT/STDERR.
=head3 Full Path Names
Please give the full path names for all your commands, infiles, and
output directories. If you are executing an arbitrary script, either give
the full path name or make sure the script is in your $ENV{PATH}.
HPC::Runner::Init will do some guessing on the infile and
outdir parameters using File::Spec, but this is no guarantee!
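For the curious, the guessing is along these lines (a hedged sketch, not the
module's actual code): relative paths are resolved against the current
working directory with File::Spec.

    use strict;
    use warnings;
    use File::Spec;

    # Resolve a relative infile/outdir against the current working directory.
    my $infile = File::Spec->rel2abs('test_mce.in');   # e.g. /home/you/test_mce.in
    my $outdir = File::Spec->rel2abs('test');
    print "$infile\n$outdir\n";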
If you are using Runner::Threads, your perl must be compiled with thread
support.
=head2 Multi-Node Job Submission
This documentation was initially written for Slurm, but for the most part the
concepts and requirements are the same across schedulers (Slurm, PBS, SGE, etc).
=head2 HPC::Runner::Slurm Example
HPC::Runner::Slurm adds another layer to HPC::Runner::MCE or
HPC::Runner::Threads by submitting jobs to the queuing system Slurm
(https://computing.llnl.gov/linux/slurm/). Slurm distributes jobs to
different machines, or nodes, across a cluster, where it is common for many
users to share the same space.
When I was first using Slurm I wanted something that would
automatically distribute my jobs across the cluster in a way that would
get them done reasonably quickly. Most of the jobs being submitted were
'embarrassingly parallel' and did not require much of the fine tuning
Slurm is capable of. For most jobs what we wanted was to
take a list of jobs, chop it into pieces, send each piece
to a node, and on that node run those jobs in parallel.
=head3 alljobs.in
job1
job2
job3
job4
#Let's tell mcerunner/parallelrunner/slurmrunner.pl to execute jobs 5-8 AFTER jobs 1-4 have completed
wait
job5
job6
job7
job8
What I want is to take 4 jobs at a time and submit each batch to a node
through Slurm, and I don't want to do all of this manually, for obvious reasons.
=head3 Slurm Template
#!/bin/bash
#SBATCH --share
#SBATCH --get-user-env
#SBATCH --job-name=alljobs_batch1
#SBATCH --output=batch1.out
#SBATCH --partition=bigpartition
#SBATCH --nodelist=node1onbigpartition
#Here are the jobs!
job1
job2
job3
job4
Ok, I don't really want that. I want all the logging, and since those
jobs don't depend on one another I want to run them all in parallel,
because that is what HPC is all about. ;) So instead I run this command,
which uses the script that comes with Runner::Slurm.

slurmrunner.pl --infile `pwd`/alljobs.in --jobname alljobs --outdir `pwd`/alljobs
And have the following template files created and submitted to the
queue.
Although it is not required to supply a jobname or an outdir, it is
strongly recommended, especially if you are submitting multiple jobs.
=head3 Slurm Template with Batched Job
#!/bin/bash
##alljobs_batch1.sh
#SBATCH --share
#SBATCH --get-user-env
#SBATCH --job-name=alljobs_batch1
#SBATCH --output=batch1.out
#SBATCH --partition=bigpartition
#SBATCH --nodelist=node1onbigpartition
#SBATCH --cpus-per-task=4
#Take our jobs, batch them out to a node, and run them in parallel
mcerunner.pl --infile batch1.in --procs 4 --outdir /outdir/we/set/in/slurmrunner.pl
Where batch1.in contains our jobs 1-4. The number given to
--cpus-per-task should be greater than or equal to the maximum number
of threads/processes that are run in parallel (procs). The default
values in HPC::Runner::Slurm are fine, but if you change them make sure
you stick to that rule.
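For example, to run 8 commands at a time on each node you would keep the two
numbers in step. The --procs and --cpus_per_task spellings below are
assumptions based on the modules' attribute names; check the script's help
output for the exact option names in your version.

    slurmrunner.pl --infile `pwd`/alljobs.in --jobname alljobs \
        --outdir `pwd`/alljobs --procs 8 --cpus_per_task 8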
This template and batch1.in are generated by the slurmrunner.pl command,
and the batch is submitted with Slurm job ID 123.
The next job batch is then generated as alljobs_batch2.sh, and we tell
Slurm to start it only after jobs 1-4 exit successfully.
=head3 Slurm Template with Dependency
#!/bin/bash
##alljobs_batch2.sh
#SBATCH --share
#SBATCH --get-user-env
#SBATCH --job-name=alljobs_batch2
#SBATCH --output=batch2.out
#SBATCH --partition=bigpartition
#SBATCH --nodelist=node2onbigpartition
#SBATCH --cpus-per-task=4
#Don't start this job until job 123 exits successfully
#SBATCH --dependency=afterok:123
mcerunner.pl --infile batch2.in --procs 4 --outdir /outdir/we/set/in/slurmrunner.pl
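You can watch the dependency chain resolve with plain squeue; while job 123
is running, the second batch typically sits pending with a Dependency reason
in the NODELIST(REASON) column:

    squeue -u $USER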
=head2 Customizing HPC::Runner::Slurm Input
Since the HPC::Runner modules are written in Moose, they can be
overridden and extended in the usual fashion. Logging is done with
Log::Log4perl, so any of its appenders can be used. The default is to log to
files, but what if you want to log to rsyslog or a database?
=head3 Extend slurmrunner.pl to add your own custom logging
#!/usr/bin/env perl
#slurmrunner_rsyslog.pl
package Main;
use Moose;
extends 'HPC::Runner::Slurm';
use Log::Log4perl;
#Let's override init_log with our own subroutine...
sub init_log {
    my $self = shift;
    my $filename = $self->logdir . "/" . $self->logfile;
    # Send everything to syslog (local1) as well as the usual log file
    my $log_conf = <<EOF;
############################################################
#              Log::Log4perl conf - Syslog                 #
############################################################
log4perl.rootLogger = DEBUG, SYSLOG, FILE
log4perl.appender.SYSLOG = Log::Dispatch::Syslog
log4perl.appender.SYSLOG.min_level = debug
log4perl.appender.SYSLOG.ident = slurmrunner
log4perl.appender.SYSLOG.facility = local1
log4perl.appender.SYSLOG.layout = Log::Log4perl::Layout::SimpleLayout
log4perl.appender.FILE = Log::Log4perl::Appender::File
log4perl.appender.FILE.filename = $filename
log4perl.appender.FILE.mode = append
log4perl.appender.FILE.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.FILE.layout.ConversionPattern = %d %p %m %n
EOF
    Log::Log4perl::init( \$log_conf );
    my $log = Log::Log4perl->get_logger();
    return $log;
}
Main->new_with_options->run;
1;
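Because Main simply extends HPC::Runner::Slurm, the wrapper is invoked the
same way as the stock slurmrunner.pl, for example:

    slurmrunner_rsyslog.pl --infile `pwd`/alljobs.in --jobname alljobs --outdir `pwd`/alljobs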
=head2 Troubleshooting Runner::Slurm
Make sure your paths are sourced correctly for Slurm. The easiest way
to do this is to add all your paths to your ~/.bashrc, source it, and add
the line
#SBATCH --get-user-env
to your submit script. By default this is already placed in the
template, but if you decide to supply your own template you may want to
add it.
If you are submitting a script that is not in your path, you probably
want to give the full pathname for it, especially if supplying the
outdir option. In general I think it's always best to give the full
pathname.
If you are in the directory already and submitting from bash, just use
backticks around pwd, as in the example below.
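For example, from the directory that holds the infile (myjobs.in here is only
a placeholder name):

    slurmrunner.pl --infile `pwd`/myjobs.in --outdir `pwd`/myjobs --jobname myjobs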
Another common error is 'This node configuration is not available'.
This could mean several things.
1. The node is down at the time of job submission
2. You are asking for more resources on a node than it has. If you ask for --cpus-per-task=32 and the node only has 16 CPUs, you will get this error.
3. You misspelled the partition or nodename.
Point 2 will be improved upon in the next release, which will query Slurm
for the number of CPUs available on a node at the time of submission.
For now it must be set manually with --cpus-per-task.
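Until then, you can check what a node actually reports before asking for it;
one way is sinfo's node-oriented long listing, which includes a CPUS column:

    sinfo --Node --long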
=head2 Authors and Contributors
Jillian Rowe in collaboration with the ITS Advanced Computing Team at
Weill Cornell Medical College in Qatar.
=head2 Acknowledgements
This module was originally developed at and for Weill Cornell Medical
College in Qatar. With approval from WCMC-Q, this information was
generalized and put on GitHub, for which the authors would like to
express their gratitude. Thanks also go to all the HPC users at WCMC-Q,
who gave their input.
The continued development of the HPC::Runner modules is supported by NYUAD, at
the Center for Genomics and Systems Biology.
=cut