NAME

export_bgc_sql_tables.pl - Exports SQL tables of BGC data (Palantir and antiSMASH annotations)

VERSION

version 0.211420

NAME

export_bgc_sql_tables.pl - This tool exports SQL tables that structure the BGC data taken from antiSMASH reports and annotated with Palantir.

USAGE

    $0 [options] --infiles [=] <report_paths>.../--file-table [=] <report.list>

REQUIRED ARGUMENTS

OPTIONAL ARGUMENTS

--infiles [=] <report_paths>...

Paths to biosynML.xml (antiSMASH 3-4) or regions.js (antiSMASH 5) files. This option can take multiple values.

--file-table [=] <tsv_file>

TSV (tab-separated values) file that unambiguously gives the paths of the XML reports, proteomes and Quast files. Column order: XML reports (1st column), proteomes (2nd column) and Quast files (3rd column). If you only want to parse XML and Quast reports, put undef in the proteome column, as in "biosynML.xml undef quast.tsv".
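
As a minimal sketch, a file table of this shape could be written from the shell with printf, which emits real tab characters between the columns. All paths and the file name report.list are hypothetical placeholders, and the second row assumes that undef stands in for a missing proteome, as in the example above.

```shell
# Hypothetical paths; replace them with your own files.
# Columns: report (1st), proteome (2nd), Quast file (3rd), separated by tabs.
table=report.list
printf '%s\t%s\t%s\n' reports/org_a.xml proteomes/org_a.faa quast/org_a.tsv > "$table"
# No proteome for this organism: "undef" fills the second column.
printf '%s\t%s\t%s\n' reports/org_b.xml undef quast/org_b.tsv >> "$table"
# Sanity check: every row should have exactly three tab-separated fields.
awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad }' "$table" && echo "file table OK"
```

The resulting file can then be passed with --file-table in place of --infiles.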

--types [=] <str>...

Filter clusters on one or several specific types.

Types allowed: acyl_amino_acids, amglyccycl, arylpolyene, bacteriocin, butyrolactone, cyanobactin, ectoine, hserlactone, indole, ladderane, lantipeptide, lassopeptide, microviridin, nrps, nucleoside, oligosaccharide, otherks, phenazine, phosphonate, proteusin, PUFA, resorcinol, siderophore, t1pks, t2pks, t3pks, terpene.

Any combination of these types, such as nrps-t1pks or t1pks-nrps, is also allowed. The argument is repeatable.

--taxdir [=] <dir>

Path to a local mirror of the NCBI Taxonomy database.

--idm[-file] [=] <file>

Path to an id mapper file to retrieve the assembly accession numbers. The file should be in tabular format with accession numbers in the second column.

--proteomes

Use the organism proteome to predict domains with external pHMMs and include them in the SQL database.

--quast

Create an additional table "Assemblies" with Quast statistics. For this option, you need to use the transposed_report.tsv output of Quast and name it with the basename of your report file. For example, if you use my_org.xml, name your Quast file my_org.tsv.
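
The naming rule can be sketched in shell as follows. The report name my_org.xml comes from the example above; the quast_results directory is a hypothetical location for the Quast output, and the first two commands only create a stand-in file so that the sketch runs on its own. Where the script expects the renamed file is not stated here, so the copy destination (the current directory) is also an assumption.

```shell
# Stand-in for a real Quast run (hypothetical directory and content).
mkdir -p quast_results
printf 'Assembly\t# contigs\n' > quast_results/transposed_report.tsv

report=my_org.xml
# Strip the .xml extension and append .tsv: my_org.xml -> my_org.tsv
quast_file="$(basename "$report" .xml).tsv"
cp quast_results/transposed_report.tsv "$quast_file"
echo "$quast_file"
```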

--contam-file [=] <file>

Add an SQL table for CheckM contamination results (tabular file). This option was devised for the interface database.

--new-db

Remove the existing SQL tables and start the database over.

--db-name [=] <str>

Name of your database [default: bgc_db].

--module-delineation [=] <str>

Method for delineating the modules. Modules can be cut either on condensation domains (C and KS) or on substrate-selection domains (A and AT) [default: 'substrate-selection'].

--discard-other-type

Discard clusters characterized as "other" by antiSMASH. This option can be useful because these clusters often include genes and domains that are redundant with other clusters of the genome, and thus might cause non-unique id issues when creating an SQL database.

--gap-filling [=] <bool>

Try to find domains when gaps are present in clusters [default: 1].

--undef-cleaning [=] <bool>

Eliminate undef domains from the antiSMASH output that cannot be recovered [default: 1].

--undef-recov [=] <bool>

Try to recover antiSMASH undef domain values [default: 0].

--evalue-threshold [=] <n>

E-value threshold to apply in HMMER searches [default: 1e-4].

--cpu [=] <n>

Number of threads/cpus to use [default: 1].

--version
--usage
--help
--man

Print the usual program information.

AUTHOR

Loic MEUNIER <lmeunier@uliege.be>

CONTRIBUTOR

Denis BAURAIN <denis.baurain@uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by University of Liege / Unit of Eukaryotic Phylogenomics / Loic MEUNIER and Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.