Bio::Gonzales::Project::Functions - organize your computational experiments
Inspired by "A Quick Guide to Organizing Computational Biology Projects", this module makes it easy to organise computational biology projects.
    $ gonzp init human_genome
    $ cd human_genome/analysis
    $ gonzp analysis genome_assembly
    $ cd genome_assembly
    # set up scripts, Makefile, etc.
    # ...
    $ make human_genome_assembly
    $ gonzp analysis genome_annotation # finds the project directory automatically
    $ cd ../genome_annotation
    # set up scripts, Makefile, etc.
    # ...
    $ make human_genome_annotation
Create it with gonzp init <project_name>
A project consists of a root directory containing everything: the paper draft, analyses, 3rd-party documentation (and perhaps literature), scripts, etc. The whole system is based on Makefiles (to start the different analysis steps) and Perl modules (surprise, surprise!!).
The documentation goes into the README file, in whatever format (plain text, markdown, textile, ...) you prefer.
Thus, the basic layout of an example project is:

    example/Makefile    (a Makefile to start single analyses)
    example/README      (an overview documentation of the computational experiment)
    example/analysis/   (all analyses go in here)
    example/data/       (3rd-party data, such as the uniprot database or experimental
                         results, common to the whole computational experiment go in here)
    example/paper/      (the paper draft goes in here)
    example/doc/        (3rd-party documentation)
    example/lib/        (if some scripts or analyses have a lot in common, creating a
                         module/library might be helpful)
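To make the layout concrete, the following sketch recreates it by hand in a temporary directory. It only mirrors the directory and file names listed above; it is not what gonzp init does internally, and the files are created empty purely for illustration.

```shell
set -eu

# Work in a throwaway directory so nothing existing is touched.
root=$(mktemp -d)
project="$root/example"

# Directories named in the layout above.
for dir in analysis data paper doc lib; do
    mkdir -p "$project/$dir"
done

# Top-level files named in the layout above (created empty here).
touch "$project/Makefile" "$project/README"

# Show the result, sorted for a stable listing.
(cd "$project" && ls -1 | sort)
```

In practice you would let gonzp init create the project and only use something like this to inspect or compare the structure.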
Create it with gonzp analysis <analysis_name>
The analysis directory contains all analyses that have been done. One directory per analysis. The layout in example/analysis is therefore:
    ./important_computational_experiment/Makefile      (the Makefile to start single analysis steps)
    ./important_computational_experiment/av            (the analysis version)
    ./important_computational_experiment/README        (some analysis-specific documentation)
    ./important_computational_experiment/gonz.conf.yml (configuration stuff, e.g. file locations or parameters)
    ./important_computational_experiment/2014-01-28/   (the analysis directory derived from the version stored in "av")
    ./important_computational_experiment/data/         (analysis-specific data)
    ./important_computational_experiment/playground/   (here you can try stuff)
    ./important_computational_experiment/bin/          (a directory to store the scripts)
The analysis version is just a single string and defaults to the day the analysis was created. The contents of the av file are e.g.:
    $ cat important_computational_experiment/av
    2014-01-28
Change it to whatever you want. A common use case is to change input data or parameters without clobbering the previous results. To do so, change the analysis version to a different date and rerun the whole analysis.
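The effect of changing the version can be sketched with plain shell commands. This is a hand-made imitation under stated assumptions: the directory names, the second date and the file contents are hypothetical, and the version directories are created manually here rather than by the module.

```shell
set -eu

# Throwaway analysis directory (hypothetical, for illustration only).
analysis=$(mktemp -d)/important_computational_experiment
mkdir -p "$analysis"
cd "$analysis"

# First run: results land in a directory named after the version in "av".
echo "2014-01-28" > av
mkdir -p "$(cat av)"
echo "first run" > "$(cat av)/avg_num_leaves.tsv"

# Change the version (e.g. after changing input data or parameters) and rerun.
echo "2014-02-15" > av
mkdir -p "$(cat av)"
echo "second run" > "$(cat av)/avg_num_leaves.tsv"

# Both result sets now exist side by side; nothing was overwritten.
cat 2014-01-28/avg_num_leaves.tsv 2014-02-15/avg_num_leaves.tsv
```

Because each version writes into its own directory, the earlier results stay available for comparison.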
The analysis version is an integral part of Bio::Gonzales::Project::Functions and is therefore accessible in the Makefile as the $(AV) variable.
For example, suppose you want to calculate the average number of leaves for 4 plant accessions. You have 3 replicates each, so 12 records:
Input data data/leaves.txt:
    accession   num_leaves
    ACC_001     3
    ACC_001     4
    ACC_001     6
    ACC_002     8
    ACC_002     14
    ACC_002     12
    ACC_003     18
    ACC_003     10
    ACC_003     12
    ACC_004     10
    ACC_004     4
    ACC_004     7
Script bin/calc_number_of_avg_leaves.pl
    #!/usr/bin/env perl
    # created on 2014-01-28

    use warnings;
    use strict;
    use 5.010;

    use Bio::Gonzales::Project::Functions;
    use List::Util qw(sum);

    # read in some raw data
    open my $fh, '<', 'data/leaves.txt' or die "Can't open filehandle: $!";
    my %num_leaves;
    <$fh>;    # get rid of the header
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $acc, $num_leaves ) = split /\t/, $line;
        push @{ $num_leaves{$acc} }, $num_leaves;
    }
    close $fh;

    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");

    # open the result file
    open my $result_fh, '>', $result_file or die "Can't open filehandle: $!";

    # calculate the result and write it
    while ( my ( $acc, $leaves ) = each %num_leaves ) {
        my $sum   = sum @$leaves;
        my $count = scalar @$leaves;
        my $avg   = $sum / $count;
        say $result_fh join( "\t", $acc, $avg );
    }
    close $result_fh;
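As a quick cross-check of what the script computes, the same per-accession averages can be sketched with awk on the example input. This is an independent illustration, not part of the module: the temporary file location is hypothetical, and the rounding to two decimals is a choice made here (the Perl script prints the unrounded value).

```shell
set -eu
export LC_ALL=C

# Recreate the example input as a tab-separated file in a temporary directory.
tmp=$(mktemp -d)
printf 'accession\tnum_leaves\n' > "$tmp/leaves.txt"
printf '%s\t%s\n' \
    ACC_001 3 ACC_001 4 ACC_001 6 \
    ACC_002 8 ACC_002 14 ACC_002 12 \
    ACC_003 18 ACC_003 10 ACC_003 12 \
    ACC_004 10 ACC_004 4 ACC_004 7 >> "$tmp/leaves.txt"

# Skip the header, sum and count per accession, then print the averages.
averages=$(awk -F '\t' '
    NR > 1 { sum[$1] += $2; n[$1]++ }
    END    { for (acc in sum) printf "%s\t%.2f\n", acc, sum[acc] / n[acc] }
' "$tmp/leaves.txt" | sort)

printf '%s\n' "$averages"
```

For the data above this gives 4.33, 11.33, 13.33 and 7.00 for ACC_001 to ACC_004, which is what the result file should contain up to rounding.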
Alternatively, you can build the path yourself from the $ANALYSIS_VERSION variable. The script changes only slightly; the changed lines are shown here:
original:
    use Bio::Gonzales::Project::Functions;
    ...
    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");
changed:
    use Bio::Gonzales::Project::Functions qw(:DEFAULT $ANALYSIS_VERSION);    # CHANGED
    ...
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = "$ANALYSIS_VERSION/avg_num_leaves.tsv";    # CHANGED
The configuration is stored in gonz.conf.yml and is accessible from the command line and via Perl functions. The configuration format is YAML, so you can freely store any configuration in various data structures, such as lists or dictionaries.
Command-line access is intended for use in the Makefile. The command-line script is called gonzconf. See

    gonzconf --help

for help. gonzconf looks for gonz.conf.yml and extracts parts of the configuration. Example:
    ---
    genotypes:
      - genotype_1
      - genotype_2
      - genotype_3
Make target:
    GENOTYPES=$(shell gonzconf --flat genotypes)

    analysis:
            for g in $(GENOTYPES); do \
                    echo "analysing $$g"; \
            done
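The recipe in the make target is an ordinary shell loop. The sketch below runs the same loop outside of make, with the genotype list hardcoded in place of the gonzconf call; treating the flattened output as a space-separated list is an assumption made for this illustration.

```shell
set -eu

# Stand-in for $(shell gonzconf --flat genotypes); the space-separated form
# is an assumption about what the flattened output looks like.
GENOTYPES="genotype_1 genotype_2 genotype_3"

# In the Makefile the loop variable is written $$g; in plain shell it is $g.
output=$(for g in $GENOTYPES; do
    echo "analysing $g"
done)

printf '%s\n' "$output"
```

The doubled dollar sign in the Makefile is needed because make expands a single $ itself before the line reaches the shell.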
In Perl scripts, the configuration can be accessed via the gonzconf function.
Calling the function without arguments returns the complete configuration. It can be accessed as a normal Perl array or hash reference (depending on the configuration).
Example:
    #!/usr/bin/env perl

    use warnings;
    use strict;
    use 5.010;

    use Bio::Gonzales::Project::Functions;

    my $config    = gonzconf();
    my @genotypes = @{ $config->{genotypes} };

    for my $genotype (@genotypes) {
        say "analysing genotype $genotype";
    }
gonzconf can take one argument to access entries of the top layer directly. This assumes that the top layer of the configuration is organised as a hash/dictionary.
    #!/usr/bin/env perl

    use warnings;
    use strict;
    use 5.010;

    use Bio::Gonzales::Project::Functions;

    my @genotypes = @{ gonzconf("genotypes") };

    for my $genotype (@genotypes) {
        say "analysing genotype $genotype";
    }
Bio::Gonzales::Project::Functions comes with logging included. The logged info is stored in $ANALYSIS_VERSION/gonzlog. Therefore, every analysis has its own log file. Five log levels are available: debug, info, warn, error and fatal.
On the command line, run
gonzlog <namespace> <message>
to log something. The log level is hardcoded to "info".
Bio::Gonzales::Project::Functions exports the function gonzlog by default. To log stuff you run
    gonzlog->info("message");

    # or
    my $log = gonzlog();
    $log->info("message");
The namespace is the filename of the invoking script.
jw bargsten, <joachim.bargsten at wur.nl>