
NAME

Bio::Grid::Run::SGE - Distribute (biological) analyses on the local SGE grid

SYNOPSIS

You want to distribute computational tasks on the cluster nodes. A simple example would be to calculate the reverse complement of 10,000,000,000,000,000,000 sequences in a FASTA file in a distributed fashion.

First, create a Perl script cl_reverse_complement.pl that executes the analysis in the Bio::Grid::Run::SGE environment.

  use Bio::Grid::Run::SGE;
  use Bio::Gonzales::Seq::IO qw/faslurp faspew/;

  run_job(
    {
      task => sub {
        my ( $c, $result_file_name_prefix, $input) = @_;

        # we are using the "General" index, so $input is a filename
        # containing some sequences

        # read in the sequences
        my @sequences = faslurp($input);

        # iterate over them and calculate the reverse complement
        for my $seq (@sequences) {
          $seq->revcom;
        }
        # finally write the sequences to a results file specific for the current job
        faspew( $result_file_name_prefix . ".fa", @sequences );

        # return 1 for success (0/undef for error)
        return 1;
      },
    }
  );

  exit;

Second, create a config file conf.yml (YAML format) to specify file names and pipeline parameters.

  ---
  input:
  # use the Bio::Grid::Run::SGE::Index::General index 
  # to index the sequence files
  - format: General
    # an array of one or more sequence files
    files: [ 'sequences.fa' ]
    # fasta headers start with '>'
    sep: '^>'
  job_name: reverse_complement
  # iterate consecutively through all sequences
  # and call cl_reverse_complement.pl on each of them
  mode: Consecutive

Third, with this basic configuration, you can run the reverse complement distributed on the cluster by invoking

  perl cl_reverse_complement.pl conf.yml

There are a lot more options, indices and modes available, see DESCRIPTION for more info.

INSTALLATION

1. Install Bio::Grid::Run::SGE from CPAN
2. Create a global config file $HOME/.bio-grid-run-sge.conf.yml. For now, you can leave it empty. It is wise to restrict read permissions, because the file may store account details (email, jabber, twitter, etc.) for job notifications.

  chmod 600 ~/.bio-grid-run-sge.conf.yml

Example content looks like:

  ---
  notify:
    mail:
      dest: person.in.charge@example.com
      smtp_server: smtp.example.com
    jabber:
      jid: grid-report@jabber.example.com/grid_report
      password: ...
      dest: person-in-charge@jabber.example.com

3. Follow the steps in "SYNOPSIS".

DESCRIPTION

The general flow starts with running the cluster script. The script defines an index and an iterator. Indices describe how to split the data into chunks, whereas iterators describe in what order these chunks are fed to the cluster script.
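
The division of labour between index and iterator can be pictured with a small, self-contained Perl sketch. It does not use Bio::Grid::Run::SGE at all: the helper split_into_chunks is hypothetical and only imitates what an index with a separator and a chunk size might do, and the final loop imitates a consecutive iteration order.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Grid::Run::SGE: split text into
# records at every line matching $sep, then group the records into chunks
# of at most $chunk_size records each.
sub split_into_chunks {
    my ( $text, $sep, $chunk_size ) = @_;

    my @records;
    for my $line ( split /^/m, $text ) {
        # a line matching the separator starts a new record
        push @records, '' if $line =~ $sep || !@records;
        $records[-1] .= $line;
    }

    my @chunks;
    while ( my @group = splice @records, 0, $chunk_size ) {
        push @chunks, join '', @group;
    }
    return @chunks;
}

my $fasta  = ">s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n";
my @chunks = split_into_chunks( $fasta, qr/^>/, 2 );

# consecutive order: chunk 0, then chunk 1, ... one chunk per job
for my $i ( 0 .. $#chunks ) {
    print "job $i gets:\n$chunks[$i]";
}
```

With three sequences and a chunk size of 2, the sketch produces two chunks: the first holds s1 and s2, the second holds s3.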

Once the script is started, pre tasks are run and the index is set up. You have to confirm the setup to start the job on the cluster. Bio::Grid::Run::SGE then submits the cluster script as an array job to the cluster.

Output is stored in the result folder; intermediate files are stored in the temporary folder. The temporary folder contains scripts to rerun failed jobs and to update the job status, standard error and standard output, files containing data chunks, and additional log information.

Bio::Grid::Run::SGE SCRIPT FILE STRUCTURE

Run stuff before the job is started (pre_task)

Run the job (task)

Input data

Run stuff after the job finished (post_task)

INPUT INDICES

ITERATION MODES

CONFIGURATION FILES

input section

  ---
  input:
  - format: General
    #files, list and elements are synonyms
    files:
    - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1

  - format: List
    list: [ 'a', 'b', 'c' ]
    
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]

  - format: Range
    list: [ 'from', 'to' ]

  job_name: NAME
  mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep

  args: [ '-a', 10, '-b','no' ]
  test: 2
  no_prompt: 1

  parts: 3000
  # or
  combinations_per_job: 300

  result_dir: result_gff
  working_dir:
  stderr_dir:
  stdout_dir:

  log_dir: dir
  tmp_dir: dir
  idx_dir: dir

  prefix_output_dirs: 
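
The four mode values listed above can be pictured as different ways of enumerating chunk combinations. The following Perl sketch is an illustration only: the subroutine names are hypothetical, and the semantics are an assumption read off the mode names, not taken from the module's implementation. In particular, whether AllvsAllNoRep keeps self-pairs such as (0,0) is a guess.

```perl
use strict;
use warnings;

# Illustration only: hypothetical subs encoding one plausible reading of
# the mode names. Chunks are represented by their 0-based position.

# Consecutive: every chunk of a single index, in order
sub consecutive {
    my ($n) = @_;
    return map { [$_] } 0 .. $n - 1;
}

# AvsB: every chunk of index A paired with every chunk of index B
sub a_vs_b {
    my ( $n_a, $n_b ) = @_;
    return map { my $i = $_; map { [ $i, $_ ] } 0 .. $n_b - 1 } 0 .. $n_a - 1;
}

# AllvsAll: every ordered pair within one index; (0,1) and (1,0) both occur
sub all_vs_all {
    my ($n) = @_;
    return a_vs_b( $n, $n );
}

# AllvsAllNoRep: each unordered pair once (ASSUMPTION: self-pairs are kept)
sub all_vs_all_no_rep {
    my ($n) = @_;
    return map { my $i = $_; map { [ $i, $_ ] } $i .. $n - 1 } 0 .. $n - 1;
}

my @consecutive = consecutive(3);
my @a_vs_b      = a_vs_b( 3, 2 );
my @all_vs_all  = all_vs_all(3);
my @no_rep      = all_vs_all_no_rep(3);

printf "%-14s %d combinations\n", 'Consecutive',   scalar @consecutive;
printf "%-14s %d combinations\n", 'AvsB',          scalar @a_vs_b;
printf "%-14s %d combinations\n", 'AllvsAll',      scalar @all_vs_all;
printf "%-14s %d combinations\n", 'AllvsAllNoRep', scalar @no_rep;
```

Under these assumptions, three chunks yield 3 combinations for Consecutive, 9 for AllvsAll and 6 for AllvsAllNoRep, which shows why the pairwise modes grow much faster than the consecutive one.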

path specification in the config file

If the config file contains relative paths, the following policy is used:

1. The working_dir config entry is used as "root".
2. If no working_dir config entry is specified, the directory of the config file is set to the working/root dir.
3. If no config file is specified (yes, this is possible, but not recommended), the current dir is used as working/root dir.

The working directory needs to exist.
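
The three rules above can be sketched in Perl using core modules only. The helpers resolve_root_dir and resolve_path are hypothetical names chosen for this illustration and are not part of the module. How a relative working_dir is itself anchored is not stated above, so the sketch simply assumes it is relative to the current directory.

```perl
use strict;
use warnings;
use Cwd qw/getcwd/;
use File::Basename qw/dirname/;
use File::Spec;

# Hypothetical helper mirroring the three rules; not the module's own code.
sub resolve_root_dir {
    my ( $config, $config_file ) = @_;

    # rule 1: an explicit, non-empty working_dir entry is the root
    my $root = $config->{working_dir};
    undef $root if defined $root && !length $root;

    # rule 2: otherwise use the directory of the config file
    $root = dirname( File::Spec->rel2abs($config_file) )
        if !defined $root && defined $config_file;

    # rule 3: without a config file, use the current directory
    $root = getcwd() unless defined $root;

    # ASSUMPTION: a relative working_dir is relative to the current dir
    return File::Spec->rel2abs($root);
}

# Absolute paths are returned unchanged, relative ones are put under $root.
sub resolve_path {
    my ( $path, $root ) = @_;
    return File::Spec->rel2abs( $path, $root );
}

my $root = resolve_root_dir( {}, '/data/project/conf.yml' );
print resolve_path( 'result_gff', $root ), "\n";
```

On a Unix-like system the example prints /data/project/result_gff, because no working_dir is set and the directory of the config file becomes the root. A real implementation would also have to check that the root directory exists.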

INCLUDED 3RD PARTY SOFTWARE

To show the running time of jobs, the script "distribution" is used. The script is distributed under the GPL, so honor that if you use this package. I personally have to thank Tim Ellis for creating such a nice script.

SEE ALSO

Bio::Gonzales, Bio::Grid::Run::SGE::Util

AUTHOR

jw bargsten, <joachim.bargsten at wur.nl>