The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Physeter::Manual - User Guide for Physeter

VERSION

version 0.213470

Background

The aim of physeter.pl is to assess the level of contamination of any prokaryotic genome based on a BLASTX (Basic Local Alignment Search Tool) report and a taxonomic labeler (expressed in terms of NCBI Taxonomy).

Usage

Installation and dependencies

physeter.pl is written in Modern Perl but it relies on DIAMOND. See the link below to download and install it for your system.

http://www.diamondsearch.org/index.php

Most other dependencies can be handled automatically by using cpanm in a Perlbrew environment (https://perlbrew.pl/). Below are a set of commands to setup such an environment on Ubuntu.

    # install development tools
    $ sudo apt-get update
    $ sudo apt-get install build-essential

    # download the perlbrew installer...
    $ wget -O - http://install.perlbrew.pl | bash

    # initialize perlbrew
    $ source ~/perl5/perlbrew/etc/bashrc
    $ perlbrew init

    # search for a recent stable version of the perl interpreter
    $ perlbrew available
    # install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
    # (this will take a while)
    $ perlbrew install perl-5.26.2
    # install cpanm (for Perl dependencies)
    $ perlbrew install-cpanm

    # enable the just-installed version
    $ perlbrew list
    $ perlbrew switch perl-5.26.2

    # make perlbrew always available
    # if using bash (be sure to use double >> to append)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
    # if using zsh  (only the destination file changes)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc

Major physeter.pl dependencies are the Bio::MUST series of modules. Install them as follows.

    $ cpanm Bio::FastParsers
    $ cpanm Bio::MUST::Core
    $ cpanm Bio::MUST::Drivers

If errors occur during installation, use --force and/or --notest options of cpanm.

    $ cpanm --force Bio::MUST::Drivers

Install physeter.pl.

    $ cpanm Bio::MUST::Apps::Physeter

Finally install a local mirror of the NCBI Taxonomy. It will be used by physeter.pl to taxonomically affiliate the main organism and the possible contaminant organisms.

    $ setup-taxdir.pl --taxdir=taxdump/

Input files

Building the DIAMOND database

In order to generate BLASTX report file needed by physeter.pl, you have to set up a DIAMOND database. First, download prokaryotic protein files from NCBI RefSeq. As of October 2020, it requires at least 220 GB of storage.

    $ rsync -avz --files-from=genome.idl \
        --no-R rsync://ftp.ncbi.nlm.nih.gov/genomes/all/ database/
    $ gunzip database/*.faa.gz

Then, you have to rename sequence identifiers to hold NCBI GCA/GCF accessions followed by a separator (|) and by a protein accession number (e.g., GCF_000003135.1|WP_001260374.1). This can be done using inst-abbr-ids.pl from the Bio::MUST::Core distribution.

    $ ls -U database/*.faa | perl -nle 'm/(GCF\_\d{9}\.\d+)/; print "$_\t$1"' \
        > abbr.idm

Your abbr.idm should look like this:

    $ head -n5 file.idm
    
    GCF_003144445.1_ASM314444v1_protein.faa     GCF_003144445.1
    GCF_900129705.1_IMG-taxon_2695420922_annotated_assembly_protein.faa GCF_900129705.1
    GCF_002037065.1_ASM203706v1_protein.faa     GCF_002037065.1
    GCF_001941425.1_ASM194142v1_protein.faa     GCF_001941425.1
    GCF_900468105.1_AFS031577_protein.faa       GCF_900468105.1
    
    $ inst-abbr-ids.pl *.faa --id-regex=:DEF --id-prefix-mapper=abbr.idm \
        --outdir=./renamed-db

Finally, build the DIAMOND database.

    $ cat renamed-db/*.faa > database.faa
    $ diamond makedb --in database.faa -d database

Running DIAMOND BLASTX

Before running DIAMOND, you have to transform the prokaryotic genome files you want to assess into pseudo-read FASTA files. Use inst-split-fas.pl from the Bio::MUST::Core distribution to do so. In the example below, the genome will be split into 250-base long pseudo-read sequences without overlap. If your genome has a NCBI GCA/GCF accession, name your outfile assembly_accession.fasta (e.g., GCF_000006605.1.fasta).

    $ inst-split-fas.pl genome.fasta --outfile=split-genome.fasta \
        --chunk=250 --step=250

Then run DIAMOND as follows. Like the FASTA file, name your BLASTX report as assembly_accession.blastx (e.g., GCF_000006605.1.blastx). If your genome file does not have a NCBI GCA/GCF accession, both the FASTA file and the BLASTX report must have the same basename. The -f tab option of DIAMOND will generate a tab-separated file corresponding to the -outfmt 6 of regular NBCI-BLAST+. You can adapt the -p 10 option (number of CPU threads) to suit your system.

    $ diamond blastx -d database -q split-genome.fasta -o split-genome.blastx \
        -t ./temp -k 50 -e 1e-10 -f tab -p 10

Taxonomic labeler

A taxonomic labeler is used by physeter.pl to determine at which taxonomic level you consider a pseudo-read sequence as a contaminant. See examples below:

    $ head phylum-taxa.idl
    
    unclassified Bacteria
    unclassified Archaea
    Abditibacteriota
    Acidithiobacillia
    Acidobacteria
    Actinobacteria
    Alphaproteobacteria
    Aquificae
    Armatimonadetes
    Bacteroidetes

In this example, taxonomic levels are phyla and therefore physeter.pl will be able to detect inter-phylum contaminations.

Command-line options of physeter.pl

Classic mode

Once all input files are correctly prepared, you can simply run physeter.pl like this:

    $ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
        --taxon-list=phylum-taxa.idl

The standard output file of physeter.pl is a tab-separated file containing the following sections: (1) organism accession or file name, (2) assigned taxon, (3) % self sequences, (4) % contaminated sequences, (5) % unknown taxon sequences, (6) % unclassified sequences, (7) detail of contaminants, (8) mean number of hits used to classify the pseudo-read sequences.

In addition to the Physeter output file, you can generate for each assayed genome a Kraken-like file, an Anvio-like file, a Krona-compatible file or a LCA (Last Common Ancestor) file, the latter providing the taxonomic affiliation of each pseudo-read.

    $ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
        --taxon-list=phylum-taxa.idl --kraken --anvio --krona --lca

When your pseudo-read FASTA files are not in the working directory, you can specify their localization using the --fasta-dir option.

    $ physeter.pl *.blastx --outfile=contam.report --fasta-dir=split-fasta/ \
        --taxdir=taxdump/ --taxon-list=phylum-taxa.idl

If your organism does not have a NCBI GCA/GCF accession but you know approximately its taxonomy, you can specify it with the --exp-tax option. Note that the specified taxon must be listed in the file provided through the --taxon-list option.

    $ physeter.pl organism.blastx --exp-tax=Firmicutes --outfile=contam.report \
        --taxdir=taxdump/ --taxon-list=phylum-taxa.idl

Otherwise, use the --auto-detect option.

    $ physeter.pl organism.blastx --auto-detect --outfile=contam.report \
        --taxdir=taxdump/ --taxon-list=phylum-taxa.idl

In the basic configuration, physeter.pl will assess the contamination status of a pseudo-read sequence using only 1 hit (i.e., best-hit mode). If you want to use more than 1 hit (i.e., MEGAN-like mode), you can use the --tax-min-hits and --tax-max-hits options. In the MEGAN-like mode, a LCA will be inferred for each pseudo-read sequence.

    $ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
        --taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50

You can use --tax-score-mul and --tax-min-lca-freq options to fine tune LCA inference.

    $ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
        --taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50 \
        --tax-score-mul=0.7 --tax-min-lca-freq=0.85

Other options can be applied to filter the BLASTX hits used for contamination assessment. Those are --tax-min-ident, --tax-min-len and --tax-min-score.

k-fold mode

The last functionality of physeter.pl is the k-fold mode. In this mode, the DIAMOND database is randomly split into 10 subsets. Then, physeter.pl runs 10 times and, for each run, hits from one of the subsets are ignored. The results of the 10 analyses are written in the standard output file. None of the Kraken-like file, Anvio-like file, Krona-coompatible file and LCA file are available when running in k-fold mode.

    $ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
        --taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50 \
        --k-fold=database.gca

The database.gca file is the list of all NCBI GCA/GCF accessions of the genomes used to build the DIAMOND database.

    $ cut -f2 abbr.idm > database.gca
    $ head database.gca
    
    GCF_003144445.1
    GCF_900129705.1
    GCF_002037065.1
    GCF_001941425.1
    GCF_900468105.1
    GCF_005886055.1
    GCF_002043285.1
    GCF_004011355.1
    GCF_000267825.2
    GCF_001661025.1

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

CONTRIBUTORS

  • Valerian LUPO <valerian.lupo@doct.uliege.be>

  • Mick VAN VLIERBERGHE <mvanvlierberghe@doct.uliege.be>

  • Luc CORNET <luc.cornet@uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2020 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.