Aim and features

Palantir (Post-processing Analysis tooL for ANTIsmash Reports) is a toolbox for supporting genome mining analyses based on antiSMASH reports. antiSMASH is one of the most comprehensive and up-to-date pipelines available for the detection of secondary metabolism pathways. This package offers two different sets of functionalities. On the one hand, Palantir offers methods that help the user manipulate and analyze biosynthetic gene cluster (BGC) information in small and large-scale genome mining projects: (1) FASTA sequence extraction at any BGC level, (2) PDF/Word reporting and (3) SQL table generation for advanced data management.

On the other hand, Palantir aims to achieve a more complete and accurate in silico characterization of NRPS/PKS enzymatic systems with several methods: (4) module delineation, (5) gap-filling for completing the BGC annotation, and (6) the dynamic elongation of their core domain sequences. Moreover, (7) a visualization functionality allows the user to easily check the refinements applied to the BGC domain architecture and compare them with the antiSMASH version. Finally, (8) an "exploratory mode", devised to interpret the architecture from scratch (i.e., without any bias from a previously defined consensus), is also provided.

Usage

Installation

Palantir is written in Modern Perl but relies on several external dependencies (see below). You should download and install the corresponding binaries in whatever way is most appropriate for your system:

- HMMER3 (http://hmmer.org/download.html)

(On Debian/Ubuntu, this can be installed with sudo apt-get install hmmer.)

- sqlite3 (https://www.sqlite.org/download.html)

(Needed by export_bgc_sql_tables.pl; this can be installed with sudo apt-get install sqlite3.)

- Inkscape (https://inkscape.org/release/)

(Needed by generate_bgc_report.pl; this can be installed with sudo apt-get install inkscape.)

- pandoc (https://pandoc.org/installing.html)

(Needed by generate_bgc_report.pl; this can be installed with sudo apt-get install pandoc.)

The libgd-dev package must also be installed for the Perl module GD::Simple to work.

        sudo apt-get install libgd-dev
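Before going further, it can be handy to check that the external programs listed above are reachable from the PATH. The following Python sketch is only an illustration: the executable names are assumptions (hmmscan is one of the programs shipped with HMMER3, and the other names are the usual command names of each package), and Palantir itself may call different HMMER programs.

```python
import shutil

# Assumed executable names for the external dependencies listed above.
REQUIRED_TOOLS = ["hmmscan", "sqlite3", "inkscape", "pandoc"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools were found on the PATH.")
```

The which argument only exists to make the function easy to test; in normal use the default shutil.which is enough.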

Most other dependencies can be handled automatically by using cpanm in a Perlbrew environment (https://perlbrew.pl/). Below is a set of commands to set up such an environment on Ubuntu.

    # install development tools
    $ sudo apt-get update
    $ sudo apt-get install build-essential

    # download the perlbrew installer...
    $ wget -O - http://install.perlbrew.pl | bash

    # initialize perlbrew
    $ source ~/perl5/perlbrew/etc/bashrc
    $ perlbrew init

    # search for a recent stable version of the perl interpreter
    $ perlbrew available
    # install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
    # (this will take a while)
    $ perlbrew install perl-5.26.2
    # install cpanm (for Perl dependencies)
    $ perlbrew install-cpanm

    # enable the just-installed version
    $ perlbrew list
    $ perlbrew switch perl-5.26.2

    # make perlbrew always available
    # if using bash (be sure to use double >> to append)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
    # if using zsh (only the destination file changes)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc

Major Palantir dependencies are the Bio::MUST series of modules. Install them as follows.

    $ cpanm Bio::FastParsers
    $ cpanm Bio::MUST::Core

Since Bio::MUST modules rely on external bioinformatics programs and come with complex test suites, they sometimes raise errors during installation. If you encounter any such error, consider enabling the --force and/or --notest options of cpanm.

    $ cpanm --force Bio::MUST::Core

Install Palantir itself. All remaining dependencies can also be taken care of by cpanm.

    $ cpanm Bio::Palantir

Input

Palantir accepts report files from antiSMASH versions 3 and 4 (biosynML.xml), and from the newer version 5 (regions.js). The biosynML.xml reports are not generated by default in antiSMASH 4; the --enable-biosynml option is needed for them to be written to the results directory. The regions.js file can be obtained from the results downloaded from the antiSMASH web server (https://antismash.secondarymetabolites.org) or with the standalone version.
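When handling many result directories, the report file to pass to Palantir can be located programmatically. The Python sketch below simply applies the filename-to-version mapping described above. It assumes that the report sits directly inside the antiSMASH results directory, and the function name is hypothetical.

```python
from pathlib import Path

# Mapping taken from the text above: biosynML.xml for antiSMASH 3 and 4,
# regions.js for antiSMASH 5.
REPORT_FILES = {
    "biosynML.xml": "antiSMASH 3/4",
    "regions.js": "antiSMASH 5",
}


def find_report(results_dir):
    """Return (path, version label) of the first known report file, or None."""
    for filename, version in REPORT_FILES.items():
        candidate = Path(results_dir) / filename
        if candidate.is_file():
            return candidate, version
    return None
```

For instance, find_report("antismash5_report") would return the path to regions.js together with the label "antiSMASH 5" if that file exists.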

Also, FASTA files containing BGC sequences are used by explore_bgc_domains.pl.

Binaries for the management of BGC data

extract_bgc_sequences.pl - FASTA sequence extraction at any BGC level

(This script uses Palantir functionalities 1, 4, 5 and 6)

Protein sequences are useful in most downstream analyses performed on identified BGCs. extract_bgc_sequences.pl gives easy access to this information.

The most basic usage extracts the Palantir annotation for every BGC gene present in the report:

        extract_bgc_sequences.pl --report-file=antismash5_report/regions.js

The extracted gene sequences are stored by default in the bgc_sequences.fasta file. The output filename can be specified with the --outfile option.

If you want to extract specific BGC types, you can use the --types option.

You can find the list of types by using:

        extract_bgc_sequences.pl --help

Depending on the analysis you intend to do, it is also possible to specify which scale you would like to use for extracting sequences: cluster, gene, module or domain (by default: gene).

N.B.: module and domain scales are only allowed for NRPS and type 1 PKS based enzymes.

Here is an example of a more specific command line:

        extract_bgc_sequences.pl --report-file=antismash5_report/regions.js \
        --types=nrps --scale=domain --outfile=strain1_domains.fasta

Furthermore, if you are only interested in the antiSMASH annotation, you can use the --annotation=antismash option.

Finally, to add information to the sequence IDs of the FASTA file, you can use --prefix. For instance, --prefix=strain1 will give sequence IDs such as ">Strain1@Cluster...".
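Prefixed IDs make it easy to pool the sequences of several strains in one FASTA file and still tell them apart. As an illustration, the Python sketch below counts the records per prefix. It assumes that the prefix is everything before the first "@" in the ID, as suggested by the example above; the IDs used in the demonstration are made up.

```python
from collections import Counter


def count_sequences_by_prefix(fasta_lines, separator="@"):
    """Count FASTA records per ID prefix (the text before the separator).

    IDs without the separator are counted under their full ID.
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].strip()
            prefix, _, _ = seq_id.partition(separator)
            counts[prefix] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical IDs, for demonstration only.
    demo = [">Strain1@a", "MKVL", ">Strain1@b", "MTTA", ">Strain2@a", "MSSP"]
    print(count_sequences_by_prefix(demo))
```

Since an open file handle iterates over lines, the same function can be applied directly to a FASTA file opened with open().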

generate_bgc_report.pl - PDF/Word reporting

(This script uses Palantir functionality 2)

To present antiSMASH reports in a more readable format, generate_bgc_report.pl produces PDF or Word (docx) reports.

The PDF/Word report contains one BGC per page and summarizes basic information (type, coordinates, size and the BGC map). In the case of NRPS/PKS BGCs, the list of domains and product monomers is also given.

Here is an example of basic command line use:

        generate_bgc_report.pl --report-file=antismash4_report/biosynML.xml \
        --filetype=pdf

N.B.: this script does not work for antiSMASH 5.

The --filetype option allows the user to choose between PDF and Word docx output (values: pdf or docx, default: pdf).

The --types and --outfile options are available and work as explained in the extract_bgc_sequences.pl section.

export_bgc_sql_tables.pl - SQL tables generation

(This script uses Palantir functionalities 3, 4, 5 and 6)

Whether for data visualization or statistics, an SQL database is useful for the analysis of large-scale and hierarchically organized data. export_bgc_sql_tables.pl exports the BGC information into SQL tables (you can then choose the SQL database engine) and sets up a sqlite3 database.

Basic command line example:

        export_bgc_sql_tables.pl --infiles antismash_report1/regions.js \
        antismash_report2/regions.js antismash_report3/regions.js

The --infiles option allows the user to specify multiple reports at once.

Several options are available:

--db-name: database name (default: bgc_db).

--cpu: number of CPUs to use (default: 1).

To supply many reports at once, --file-table allows the user to provide a text file with the list of antiSMASH report paths.

For example:

reports.list:

        antismash5_reports/strain1/regions.js
        antismash5_reports/strain2/regions.js
        antismash4_reports/strain3/biosynML.xml
        antismash3_reports/strain4/biosynML.xml
        ...
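Such a list does not have to be written by hand. The Python sketch below walks a directory tree, collects every regions.js and biosynML.xml file, and writes their paths to a file. It assumes that the file table holds one report path per line; the function names are hypothetical.

```python
from pathlib import Path

# Report filenames mentioned in the Input section.
REPORT_NAMES = ("regions.js", "biosynML.xml")


def collect_reports(root):
    """Return the sorted paths of all antiSMASH report files found under root."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.name in REPORT_NAMES and path.is_file()
    )


def write_file_table(report_paths, outfile="reports.list"):
    """Write one report path per line (assumed layout of the file table)."""
    with open(outfile, "w") as handle:
        for path in report_paths:
            handle.write(path + "\n")
```

A typical call would be write_file_table(collect_reports("antismash5_reports")), followed by export_bgc_sql_tables.pl --file-table=reports.list.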

Additionally, --new-db can be used to erase a pre-existing results directory.

Here is an advanced command line example:

        export_bgc_sql_tables.pl --file-table=reports.list --types=nrps t1pks \
        --db-name=strain1_db --cpu=2

Also, some advanced options can tweak the way Palantir annotates NRPS/PKS BGCs:

--gap-filling: when enabled, searches for domains in the gaps (>= 250 aa) left in antiSMASH BGC annotations by performing a second detection run (default: 1).

--undef-recov: some domains from antiSMASH reports do not have a defined function value (such as 'C', ...) and are therefore uninformative. This option tries to recover this value to complete the antiSMASH annotation (this is done by default for the Palantir one) by performing a detection run on the domain sequences (default: 0).

--undef-cleaning: when the domain function value is left undefined by antiSMASH and cannot be retrieved, this option removes these domains from the Palantir BGC annotation (default: 1).

By default, we enable these three options as we think they help achieve a more complete BGC annotation.
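To make the notion of a gap more concrete, the Python sketch below lists the stretches of a protein that are not covered by any annotated domain and that are at least 250 residues long. It only illustrates the idea; it is not Palantir's implementation. It assumes 1-based inclusive domain coordinates, and the function name is hypothetical.

```python
def find_gaps(domains, protein_length, min_gap=250):
    """Return (start, end) stretches of at least min_gap residues without domain.

    domains: list of (start, end) tuples, 1-based inclusive coordinates.
    """
    gaps = []
    position = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - position >= min_gap:
            gaps.append((position, start - 1))
        # max() keeps the position correct when domains overlap
        position = max(position, end + 1)
    # Check the stretch between the last domain and the end of the protein
    if protein_length - position + 1 >= min_gap:
        gaps.append((position, protein_length))
    return gaps


if __name__ == "__main__":
    # Made-up coordinates: two domains on a protein of 1400 residues.
    print(find_gaps([(1, 300), (700, 1000)], 1400))
```

In this made-up example, the stretches 301-699 and 1001-1400 would be the candidates for a second detection run.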

generate_bgc_dnz_table.pl - generate a denormalized table of BGC data

(This script uses Palantir functionalities 5 and 6)

To support manual data extraction with Excel or downstream analyses with a programming language such as R or Python, generate_bgc_dnz_table.pl provides a denormalized TSV (Tab-Separated Values) file. Each row of this denormalized table combines the data from the different BGC scales (cluster, gene, domain).

Basic command line usage:

        generate_bgc_dnz_table.pl --report-file=antismash5_report/regions.js

Several options are available:

--types: BGC types to filter, as explained in the extract_bgc_sequences.pl section.

--outfile: output filename.

--id: ID to be used as the first column of the table (e.g., the organism name), which is useful for concatenating tables.

--annotation: annotation version to use (palantir or antismash).
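As an example of downstream use, the Python sketch below counts the rows of a tab-separated table per value of a chosen column. It assumes that the first line of the table holds the column names. The column names in the demonstration are made up, so check the header of your own table first.

```python
import csv
from collections import Counter


def count_by_column(tsv_lines, column):
    """Count the rows of a tab-separated table per value of one column."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    # Made-up table, for demonstration only.
    demo = ["id\ttype\n", "strain1\tnrps\n", "strain1\tt1pks\n", "strain2\tnrps\n"]
    print(count_by_column(demo, "type"))
```

With a real table, pass an open file handle instead of the list, e.g. open(path, newline="").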

Binaries for the refinement of the annotation of NRPS/PKS BGCs

draw_bgc_maps.pl - draw BGC maps

(This script uses Palantir functionalities 4, 5, 6 and 7)

As NRPS and PKS BGCs are composed of different layers (genes, modules and domains), visualizing the maps of these BGCs is an easy way to compare different annotations. draw_bgc_maps.pl can draw maps for three annotation versions: antiSMASH, Palantir and Palantir's exploratory mode.

Here is a basic usage example of this script:

        draw_bgc_maps.pl --report-file=examples/antismash5_report/regions.js \
        --mode=all --label=symbol

--mode: BGC annotation to draw, you can choose between: all, palantir, exploratory and antismash (default: all).

--label: domain label to use on the map; three are available: function, symbol or subtype. 'symbol' corresponds to the letter used to represent a domain function (e.g., 'C' for condensation or 'KS' for ketosynthase domain), while 'function' uses the complete domain name provided by the protein signature. Finally, 'subtype' adds the subtype information for domains responsible for substrate activation and condensation (e.g., LCL C domain or Val A domain).

(In Palantir annotations, the label contains the prediction E-value.)

More options:

--verbose: prints additional information concerning domains (function, coordinates and sequences).

--outdir: name of the directory where PNG files will be generated (default: ./png/).

--prefix: string to use for prefixing PNG files.

explore_bgc_domains.pl - report all detected NRPS/PKS domains without architecture consensus

(This script uses Palantir functionalities 5, 6, and 8)

The NRPS and PKS domains predicted with antiSMASH rely on detection rules (i.e., E-value and % of the domain signature covered during HMMER analyses) that have been improved over time by the antiSMASH authors to separate true from spurious predictions. However, true domains that have a divergent sequence or are truncated are sometimes discarded, even though they could add insight into the BGC annotation. Moreover, the architecture of NRPS/PKS BGCs given by antiSMASH is often the result of a consensus among overlapping detected domain signatures.

Although the antiSMASH rules for validating the presence of a domain or determining the BGC architecture are usually reliable, they may sometimes lead to incomplete or incoherent BGC architectures. To address this, explore_bgc_domains.pl provides a view of the BGC composition that is not biased by a pre-existing architecture consensus. To this end, the script exports the data from all detected domain signatures in TSV and JSON formats.
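The effect of such detection rules can be illustrated with a toy filter. The Python sketch below accepts a hit only if its E-value is low enough and if it covers a sufficient fraction of the domain signature. The thresholds, field names and hits are made up for the example and do not reflect the actual antiSMASH or Palantir rules.

```python
def signature_coverage(hmm_from, hmm_to, hmm_length):
    """Fraction of the domain signature (profile) covered by a hit."""
    return (hmm_to - hmm_from + 1) / hmm_length


def passes_rules(hit, max_evalue, min_coverage):
    """Apply a simple E-value and coverage rule to one hit (a dict)."""
    coverage = signature_coverage(hit["hmm_from"], hit["hmm_to"], hit["hmm_length"])
    return hit["evalue"] <= max_evalue and coverage >= min_coverage


if __name__ == "__main__":
    # Hypothetical hits: a complete domain and a truncated one.
    complete = {"evalue": 1e-30, "hmm_from": 1, "hmm_to": 300, "hmm_length": 300}
    truncated = {"evalue": 1e-12, "hmm_from": 1, "hmm_to": 100, "hmm_length": 300}
    for hit in (complete, truncated):
        print(passes_rules(hit, max_evalue=1e-5, min_coverage=0.5))
```

Here the truncated hit is rejected despite its significant E-value because it covers only a third of the signature; this is the kind of domain that the exploratory mode keeps visible.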

This script, unlike the others, takes a FASTA file of NRPS/PKS data as input (this file can be created from an antiSMASH report with extract_bgc_sequences.pl).

Here is a command line example:

        explore_bgc_domains.pl --fasta-file=strain1_bgc_sequences.fasta \
        --outfile=strain1_exploratory_domains

which will produce two output files: 'strain1_exploratory_domains.tsv' and 'strain1_exploratory_domains.json'.

Apart from the input file, this script takes no option other than --outfile.