BioX::Workflow::Plugin::Drake - A very opinionated template based bioinformatics workflow writer for Drake.
The main documentation for this module is at BioX::Workflow. This module extends Workflow in order to add functionality for outputing workflows in drake format.
biox-workflow.pl --workflow workflow.yml > workflow.drake drake --workflow workflow.drake #with other functionality such as --jobs for asynchronous output, etc. List your plugins in your workflow.yml file --- plugins: - Drake global: - indir: /home/user/gemini - outdir: /home/user/gemini/gemini-wrapper - file_rule: (.vcf)$|(.vcf.gz)$ - infile: - min: 1 ##IF USING MIN #So On and So Forth
More information about Drake can be found here https://github.com/Factual/drake.
BioX::Workflow::Plugin::Drake assumes your INPUT/OUTPUT and indir/outdirs are linked.
This means the output from step1 is the input for step2.
You can override this behavior by either declaring any of these values, or in the global variables set auto_input: 0, disable automatic indir/outdir naming with auto_name: 0, and disable automatically naming outdirectories by rule names with enforce_struct: 0.
--- plugins: - Drake global: - indir: /home/user/workflow - outdir: /home/user/workflow/output - file_rule: (.csv)$ rules: - backup: local: - INPUT: "{$self->indir}/{$sample}.csv" - OUTPUT: "{$self->outdir}/{$sample}.csv" - thing: "other thing" process: | cp $INPUT $OUTPUT - grep_VARA: local: - OUTPUT: "{$self->outdir}/{$sample}.grep_VARA.csv" process: | echo "Working on {$self->{indir}}/{$sample.csv}" grep -i "VARA" {$self->indir}/{$sample}.csv >> {$self->outdir}/{$sample}.grep_VARA.csv \ || touch {$self->OUTPUT} - grep_VARB: local: - OUTPUT: "{$self->outdir}/{$sample}.grep_VARA.grep_VARB.csv" process: | grep -i "VARB" {$self->indir}/{$sample}.grep_VARA.csv >> {$self->outdir}/{$sample}.grep_VARA.grep_VARB.csv || touch {$self->OUTPUT}
Drake will stop everything if you're job returns with an exit code of anything besides 0. For this reason we have the last command have a command1 || command2 syntax, so that even if we don't grep any "VARB" from the file the workflow could continue.
biox-workflow.pl --workflow workflow.yml > workflow.full.drake
I don't want to inlcude the whole file, but you get the idea
; ; Generated at: 2015-06-21T11:01:24 ; This file was generated with the following options ; --workflow drake.yml ; --min 1 ; ; ; Samples: test1, test2 ; ; ; Starting Workflow ; ; ; Starting backup ; ; ; Variables ; Indir: /home/guests/jir2004/workflow ; Outdir: /home/guests/jir2004/workflow/output/backup ; Local Variables: ; INPUT: {$self->indir}/{$sample}.csv ; OUTPUT: {$self->outdir}/{$sample}.csv ; thing: other thing ; /home/guests/jir2004/workflow/output/backup/$[SAMPLE].csv <- /home/guests/jir2004/workflow/$[SAMPLE].csv cp $INPUT $OUTPUT ; ; Ending backup ; ; ; Starting grep_VARA ;
Run drake
drake --workflow workflow.full.drake The following steps will be run, in order: 1: /home/user/workflow/output/backup/test1.csv <- /home/user/workflow/test1.csv [timestamped] 2: /home/user/workflow/output/backup/test2.csv <- /home/user/workflow/test2.csv [timestamped] 3: /home/user/workflow/output/grep_vara/test1.grep_VARA.csv <- /home/user/workflow/output/backup/test1.csv [projected timestamped] 4: /home/user/workflow/output/grep_vara/test2.grep_VARA.csv <- /home/user/workflow/output/backup/test2.csv [projected timestamped] 5: /home/user/workflow/output/grep_varb/test1.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test1.grep_VARA.csv [projected timestamped] 6: /home/user/workflow/output/grep_varb/test2.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test2.grep_VARA.csv [projected timestamped] Confirm? [y/n] y Running 6 steps with concurrence of 1... --- 0. Running (timestamped): /home/user/workflow/output/backup/test1.csv <- /home/user/workflow/test1.csv --- 0: /home/user/workflow/output/backup/test1.csv <- /home/user/workflow/test1.csv -> done in 0.02s --- 1. Running (timestamped): /home/user/workflow/output/backup/test2.csv <- /home/user/workflow/test2.csv --- 1: /home/user/workflow/output/backup/test2.csv <- /home/user/workflow/test2.csv -> done in 0.01s --- 2. Running (timestamped): /home/user/workflow/output/grep_vara/test1.grep_VARA.csv <- /home/user/workflow/output/backup/test1.csv Working on /home/user/workflow/output/backup/test1csv --- 2: /home/user/workflow/output/grep_vara/test1.grep_VARA.csv <- /home/user/workflow/output/backup/test1.csv -> done in 0.01s --- 3. Running (timestamped): /home/user/workflow/output/grep_vara/test2.grep_VARA.csv <- /home/user/workflow/output/backup/test2.csv Working on /home/user/workflow/output/backup/test2csv --- 3: /home/user/workflow/output/grep_vara/test2.grep_VARA.csv <- /home/user/workflow/output/backup/test2.csv -> done in 0.01s --- 4. Running (timestamped): /home/user/workflow/output/grep_varb/test1.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test1.grep_VARA.csv --- 4: /home/user/workflow/output/grep_varb/test1.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test1.grep_VARA.csv -> done in 0.01s --- 5. Running (timestamped): /home/user/workflow/output/grep_varb/test2.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test2.grep_VARA.csv --- 5: /home/user/workflow/output/grep_varb/test2.grep_VARA.grep_VARB.csv <- /home/user/workflow/output/grep_vara/test2.grep_VARA.csv -> done in 0.08s Done (6 steps run).
As an alternative you can run this with the --min option, which instead of printing out each workflow prints out only one, and creates a run-workflow.sh which has all of your environmental variables.
This option is preferable if running on an HPC cluster with many nodes.
This WILL break with use of --resample, either local or global. You need to split up your workflows as opposed to using the --resample option.
biox-workflow.pl --workflow workflow.yml --min 1 > workflow.drake #This also creates the run-workflow.sh in the same directory ./run-workflow.sh cat drake.log #Here is the log for the first run 2015-06-21 14:02:47,543 INFO Running 3 steps with concurrence of 1... 2015-06-21 14:02:47,568 INFO 2015-06-21 14:02:47,570 INFO --- 0. Running (timestamped): /home/user/workflow/output/backup/test1.csv <- /home/user/workflow/test1.csv 2015-06-21 14:02:47,592 INFO --- 0: /home/user/workflow/output/backup/test1.csv <- /home/user/workflow/test1.csv -> done in 0.02s #So on and so forth
If you look in the example directory you will see a few png files, these are outputs of the drake workflow.
=cut
Before version 0.03
This module was originally developed at and for Weill Cornell Medical College in Qatar within ITS Advanced Computing Team. With approval from WCMC-Q, this information was generalized and put on github, for which the authors would like to express their gratitude.
As of version 0.03:
This modules continuing development is supported by NYU Abu Dhabi in the Center for Genomics and Systems Biology. With approval from NYUAD, this information was generalized and put on bitbucket, for which the authors would like to express their gratitude.
You shouldn't need these, but if you do here they are.
Print the whole workflow hardcoded. This is the default
Print the workflow as 2 files.
Run the drake things
drake --vars "SAMPLE=$sample" --workflow/workflow.drake
workflow.drake
Our regular file
Subroutines
Must initialize some variables
Things to do if we decide to do a min version
Fill in the template with the process
Ensure INPUT/OUTPUT exist
Prettyify the output a bit
Jillian Rowe <jillian.e.rowe@gmail.com>
Copyright 2015- Jillian Rowe
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install BioX::Workflow::Plugin::Drake, copy and paste the appropriate command in to your terminal.
cpanm
cpanm BioX::Workflow::Plugin::Drake
CPAN shell
perl -MCPAN -e shell install BioX::Workflow::Plugin::Drake
For more information on module installation, please visit the detailed CPAN module installation guide.