Bio::Grid::Run::SGE - Distribute (biological) analyses on the local SGE grid
You want to distribute computational tasks on the cluster nodes. A simple example would be to calculate the reverse complement of 10,000,000,000,000,000,000 sequences in a FASTA file in a distributed fashion.
First, create a Perl script, cl_reverse_complement.pl, that executes the analysis in the Bio::Grid::Run::SGE environment.
use Bio::Grid::Run::SGE;
use Bio::Gonzales::Seq::IO qw/faslurp faspew/;

run_job(
  {
    task => sub {
      my ( $c, $result_file_name_prefix, $input_file_name ) = @_;

      # we are using the "General" index, so $input_file_name is a filename
      # containing some sequences

      # read in the sequences
      my @sequences = faslurp($input_file_name);

      # iterate over them and
      for my $seq (@sequences) {
        $seq->revcom;    # calculate the reverse complement
      }

      # finally write the sequences to a results file specific for the current job
      faspew( $result_file_name_prefix . ".fa", @sequences );

      # return 1 for success (0/undef for error)
      return 1;
    },
  }
);

exit;
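To make clear what the per-sequence step computes, here is a minimal core-Perl sketch of a reverse complement on a plain DNA string. The function name reverse_complement is hypothetical and is not part of Bio::Gonzales or Bio::Grid::Run::SGE; the sketch assumes unambiguous A/C/G/T input and leaves every other character untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Bio::Gonzales): reverse complement of a
# plain DNA string, assuming only A/C/G/T in upper or lower case.
sub reverse_complement {
    my ($dna) = @_;

    # scalar assignment puts reverse() in scalar context: string reversal
    my $rc = reverse $dna;

    # swap each base for its complement, preserving case
    $rc =~ tr/ACGTacgt/TGCAtgca/;

    return $rc;
}

print reverse_complement('ATGC'), "\n";    # prints GCAT
```

In the cluster script above, this work is done by the revcom method of the sequence objects, so the helper is only for illustration.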
Second, create a config file conf.yml (YAML format) to specify file names and pipeline parameters.
---
input:
  # use the Bio::Grid::Run::SGE::Index::General index
  # to index the sequence files
  - format: General
    # an array of one or more sequence files
    files: [ 'sequences.fa' ]
    # fasta headers start with '>'
    sep: '^>'
job_name: reverse_complement
# iterate consecutively through all sequences
# and call cl_reverse_complement.pl on it
mode: Consecutive
Third, with this basic configuration, you can run the reverse complement distributed on the cluster by invoking
perl cl_reverse_complement.pl conf.yml
Many more options, indices and modes are available; see DESCRIPTION for more information.
chmod 600 ~/.bio-grid-run-sge.conf
Example content looks like:
---
notify:
  mail:
    dest: person.in.charge@example.com
    smtp_server: smtp.example.com
  jabber:
    jid: grid-report@jabber.example.com/grid_report
    password: ...
    dest: person-in-charge@jabber.example.com
The general flow starts at running the cluster script. The script defines an index and an iterator. Indices describe how to split the data into chunks, whereas iterators describe in what order these chunks get fed to the cluster script.
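The split between index and iterator can be illustrated with a small core-Perl sketch. It is not the module's implementation: chunk_records is a hypothetical helper that plays the role of an index (records start at lines matching a separator such as '^>' and are grouped by a chunk size), and the loop at the end mimics what a consecutive iteration order would look like.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Grid::Run::SGE: split text into
# records that each start at a line matching $sep, then group the records
# into chunks of at most $chunk_size records.
sub chunk_records {
    my ( $text, $sep, $chunk_size ) = @_;

    my @records;
    for my $line ( split /^/m, $text ) {
        # a separator line (or the very first line) starts a new record
        push @records, '' if $line =~ $sep || !@records;
        $records[-1] .= $line;
    }

    my @chunks;
    while ( my @group = splice( @records, 0, $chunk_size ) ) {
        push @chunks, join( '', @group );
    }
    return @chunks;
}

my $fasta  = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n";
my @chunks = chunk_records( $fasta, qr/^>/, 2 );

# "iterator" part: feed the chunks to the task one after another
for my $i ( 0 .. $#chunks ) {
    printf "chunk %d:\n%s", $i + 1, $chunks[$i];
}
```

With three sequences and a chunk size of 2, the first chunk holds sequences a and b and the second holds c. Other iteration orders would visit the same chunks in a different order or in pairs.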
Once the script is started, pre-tasks are run and the index is set up. You have to confirm the setup to start the job on the cluster. Bio::Grid::Run::SGE then submits the cluster script to the cluster as an array job.
Output is stored in the result folder; intermediate files are stored in the temporary folder. The temporary folder contains scripts to rerun failed jobs and update the job status, standard error and output, files containing data chunks, and additional log information.
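As background on array jobs: SGE runs the same script once per task and exposes the task number in the SGE_TASK_ID environment variable. The sketch below shows one way a task could map that number to a data chunk. It is an assumption for illustration only and does not describe how Bio::Grid::Run::SGE does this internally; chunk_for_task and the chunk file names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based array task id to one chunk file.
# Returns undef (in scalar context) if the id is out of range.
sub chunk_for_task {
    my ( $task_id, @chunk_files ) = @_;
    return if $task_id < 1 || $task_id > @chunk_files;
    return $chunk_files[ $task_id - 1 ];
}

# hypothetical chunk files, as an index might have written them
my @chunk_files = ( 'chunk.1.fa', 'chunk.2.fa', 'chunk.3.fa' );

# SGE sets SGE_TASK_ID for each task of an array job; fall back to 1 so
# that the sketch also runs outside the grid
my $task_id = $ENV{SGE_TASK_ID} // '';
$task_id = 1 unless $task_id =~ /^\d+$/;

my $chunk = chunk_for_task( $task_id, @chunk_files );
print defined($chunk)
    ? "task $task_id processes $chunk\n"
    : "task $task_id has no chunk\n";
```

Submitting the script with a task range of 1 to 3 would then process each chunk file exactly once.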
---
input:
  - format: General
    #files, list and elements are synonyms
    files:
      - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1
  - format: List
    list: [ 'a', 'b', 'c' ]
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]
  - format: Range
    list: [ 'from', 'to' ]
job_name: NAME
mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep
args: [ '-a', 10, '-b','no' ]
test: 2
no_prompt: 1
parts: 3000
# or
combinations_per_job: 300
result_dir: result_gff
working_dir:
stderr_dir:
stdout_dir:
log_dir: dir
tmp_dir: dir
idx_dir: dir
prefix_output_dirs:
If the config file contains relative paths, the following policy is used:
working_dir
The working directory needs to exist.
To show the running time of jobs, the script "distribution" was used. The script is distributed under the GPL, so honor that if you use this package. I personally have to thank Tim Ellis for creating such a nice script.
Bio::Gonzales Bio::Grid::Run::SGE::Util
jw bargsten, <joachim.bargsten at wur.nl>
To install Bio::Grid::Run::SGE, copy and paste the appropriate command into your terminal.
cpanm
cpanm Bio::Grid::Run::SGE
CPAN shell
perl -MCPAN -e shell
install Bio::Grid::Run::SGE
For more information on module installation, please visit the detailed CPAN module installation guide.