NAME

monophylizer.pl - assesses taxonomic monophyly on Barcode of Life trees

SYNOPSIS

    # to get help on the command line
    $ perl monophylizer.pl --help
    
    # example run
    $ perl monophylizer.pl \
        -infile tree.nwk \
        -format newick \
        -astsv \
        -verbose \
        > outfile.tsv
 

OPTIONS AND ARGUMENTS

-infile <file>

A tree file, usually in Newick format. Required.

-format <newick|nexus|nexml|phyloxml>

Optional argument to specify the tree file format. By default the Newick format is used.

-metadata <file>

Optional argument to provide the location of a tab-separated spreadsheet with per-taxon metadata.

-separator <character>

Optional argument to specify the character that separates the taxon name from any additional metadata (such as sequence IDs) in leaf labels. By default this is the pipe symbol: '|'.

-comments

Optional flag to treat square brackets as opaque strings, not comments, default: true.

-quotes

Optional flag to treat quotes as opaque strings, default: true.

-whitespace

Optional flag to treat whitespace as opaque strings, default: true.

-trinomials

Optional flag to include subspecific epithets in taxa, default: false.

-astsv

Optional flag to emit output as TSV regardless of whether the script is running as CGI, default: false. This is only available when running under CGI.

-verbose

Controls how verbose the script is. By default, only warning messages are emitted. When this flag is used once, informational messages are emitted as well. When it is used twice, debugging messages are emitted too.

-help

Prints a usage message and quits. Only available when running on the command line.
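
As an illustration of how options like the ones above can be parsed, here is a minimal sketch using the core Getopt::Long module. It is not taken from the script itself: the helper name parse_options and the option hash are assumptions, and only a subset of the documented options is shown. The 'verbose+' specification is what makes a repeated -verbose flag raise the verbosity level.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not the script's own code): parse a list of
# command-line words into a hash of option values.
sub parse_options {
    my @argv = @_;
    my %opt  = (
        format     => 'newick',    # documented default
        separator  => '|',         # documented default
        trinomials => 0,
        astsv      => 0,
        verbose    => 0,           # 0 = warnings only (assumed numeric encoding)
    );
    GetOptionsFromArray(
        \@argv,
        'infile=s'    => \$opt{infile},
        'format=s'    => \$opt{format},
        'metadata=s'  => \$opt{metadata},
        'separator=s' => \$opt{separator},
        'trinomials'  => \$opt{trinomials},
        'astsv'       => \$opt{astsv},
        'verbose+'    => \$opt{verbose},    # each occurrence adds 1
    ) or die "Could not parse options\n";
    return \%opt;
}

my $opt = parse_options(qw(-infile tree.nwk -astsv -verbose -verbose));
printf "infile=%s format=%s verbose=%d\n",
    $opt->{infile}, $opt->{format}, $opt->{verbose};
```

With the arguments shown, the verbosity counter ends up at 2, which corresponds to the debugging level described under -verbose.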

DESCRIPTION

This script assesses whether the species in a phylogenetic tree are monophyletic. It reports the output in a table that lists for each species its status (monophyletic, polyphyletic or paraphyletic), and if not monophyletic, which other species it is entangled with. In addition, sequence identifiers and any other input metadata are reported. How the assessment is made algorithmically is described below, in the section on the do_assessment subroutine.

SUBROUTINES

This section describes the various subroutines that implement the functionality of the script. This information is generally only of interest to developers.

main

The main subroutine takes the following steps:

- process_args:     get the input from CGI or command line
- make_taxa:        create a taxa block from the input tree's leaf labels
- read_spreadsheet: join the taxa with additional metadata, if any
- index_nodes:      apply pre- and post-order labels, aggregate "stop nodes"
- do_assessment:    assess whether stop node distribution implies mono/para/poly
- [print_html:      print output as HTML table (only under CGI without conneg)]
- [or print a tab-separated spreadsheet]

process_args

Processes command line or CGI arguments, returns a command object with argument fields. The script determines whether or not it is running under CGI by checking whether the hidden form field 'cgi' is set to a true value. The HTML page that composes the CGI request turns this flag on, so any other clients (e.g. cURL on the command line) need to ensure they do the same.

[Optional] arguments read from command line or CGI:

infile:      a tree file
[metadata:   a tab-separated spreadsheet with per-taxon metadata]
[format:     tree file format, default: 'newick']
[verbose:    verbosity level, default: WARN]
[separator:  symbol that separates leaf labels from other fields, default: '|']
[comments:   flag to treat square brackets as opaque strings, not comments, default: true]
[quotes:     flag to treat quotes as opaque strings, default: true]
[whitespace: flag to treat whitespace as opaque strings, default: true]
[trinomials: flag to include subspecific epithets in taxa, default: false]
[astsv:      emit output as TSV regardless of whether running as CGI, default: false]
[help:       prints usage message and quits, only available when running on command line]

Returned command object fields:

tree:       a Bio::Phylo::Forest::Tree object
log:        a Bio::Phylo::Util::Logger object
factory:    a Bio::Phylo::Factory object 
metafh:     a file handle for the metadata spreadsheet, if any
separator:  symbol that separates leaf labels from other fields
trinomials: flag to include subspecific epithets in taxa
cgi:        flag to indicate whether we are running as CGI
astsv:      emit output as TSV regardless of whether running as CGI

make_taxa

Given a Bio::Phylo::Forest::Tree object, makes a Bio::Phylo::Taxa object that contains all the distinct species names, with their ID annotations. The species names are extracted from the leaf labels by splitting these on a $separator token and taking the first segment, then splitting that on underscores or spaces to get the taxonomic name parts (e.g. genus, species, subspecies). If the $trinomials flag is set to true, three parts of the name are concatenated, otherwise two. For each distinct name that is created like this, a corresponding Bio::Phylo::Taxa::Taxon object is created, and each node whose label contains this name is linked to it. The taxon object aggregates all the sequence IDs (i.e. the part after $separator) into an array that it stores in the 'ids' field.
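
The label handling described above can be sketched in plain Perl as follows. This is an illustration only and does not use Bio::Phylo: the helper name parse_label, the sample labels, and the choice to join the name parts with a single space are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): derive a taxon name
# and a sequence ID from a single leaf label.
sub parse_label {
    my ( $label, $separator, $trinomials ) = @_;

    # \Q...\E quotes the separator, so '|' is not read as regex alternation
    my ( $name, $id ) = split /\Q$separator\E/, $label, 2;

    # name parts are delimited by underscores or spaces
    my @parts = split /[_ ]+/, $name;

    # keep three parts for trinomials, otherwise two (fewer if the label is short)
    my $n = $trinomials ? 3 : 2;
    $n = @parts if @parts < $n;

    # joining with a space is an assumption about the final name format
    return join( ' ', @parts[ 0 .. $n - 1 ] ), $id;
}

# aggregate sequence IDs per distinct taxon name
my %ids;
for my $label ( 'Homo_sapiens|AB001', 'Homo_sapiens|AB002', 'Pan_troglodytes_verus|XY9' ) {
    my ( $taxon, $id ) = parse_label( $label, '|', 0 );
    push @{ $ids{$taxon} }, $id;
}
print "$_: @{ $ids{$_} }\n" for sort keys %ids;

# prints:
# Homo sapiens: AB001 AB002
# Pan troglodytes: XY9
```

With $trinomials set to a true value, the third label would instead yield the name 'Pan troglodytes verus'.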

Arguments:

$tree:       input tree
$trinomials: if true, subspecific epithets are included in the taxon name
$separator:  the token on which to split the leaf label string
$fac:        factory object that creates taxa and taxon objects

read_spreadsheet

Given a Bio::Phylo::Taxa object and a file handle, reads the handle as a tab-separated spreadsheet (no header) whose first field is a name in the taxa object. Attaches all subsequent fields to the corresponding taxon.
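
A minimal sketch of this kind of join, using a plain hash in place of the Bio::Phylo::Taxa object. The helper name read_metadata, the in-memory sample data, and the decision to silently skip rows whose name is not a known taxon are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): read a headerless
# tab-separated handle and collect the fields after the first column
# for every row whose first column is a known taxon name.
sub read_metadata {
    my ( $fh, $known ) = @_;    # $known: hash ref keyed on taxon name
    my %meta;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;                  # skip empty lines
        my ( $name, @fields ) = split /\t/, $line;
        next unless exists $known->{$name};        # assumption: ignore unknown names
        push @{ $meta{$name} }, @fields;
    }
    return \%meta;
}

# read from an in-memory string so that the example is self-contained
my $tsv = "Homo sapiens\tPrimates\tHominidae\nFelis catus\tCarnivora\n";
open my $fh, '<', \$tsv or die "Cannot open in-memory handle: $!";
my $meta = read_metadata( $fh, { 'Homo sapiens' => 1 } );
close $fh;

print join( ', ', @{ $meta->{'Homo sapiens'} } ), "\n";    # prints: Primates, Hominidae
```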

Arguments:

$taxa:   a Bio::Phylo::Taxa object
$metafh: a file handle of a tab-separated spreadsheet

index_nodes

Given a tree, applies pre- and post-order indices and aggregates for each taxon all the nodes where it coalesces with at least one other species (so-called "stop nodes", cf. http://biophylo.blogspot.nl/2013/04/algorithm-for-distinguishing-polyphyly.html).
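
To illustrate what pre- and post-order indices look like, here is a sketch on a toy tree built from nested hashes rather than Bio::Phylo objects. The field names 'left' and 'right', the use of two separate counters, and the helper names are assumptions. The aggregation of stop nodes is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): number the nodes of
# a nested-hash toy tree in pre-order ('left') and post-order ('right').
sub index_tree {
    my ( $node, $state ) = @_;
    $node->{left} = ++$state->{pre};      # assigned when the node is first visited
    index_tree( $_, $state ) for @{ $node->{children} || [] };
    $node->{right} = ++$state->{post};    # assigned after all children are done
    return $node;
}

# print the indexed tree with indentation by depth
sub dump_tree {
    my ( $node, $depth ) = @_;
    printf "%s%s left=%d right=%d\n", '  ' x $depth,
        $node->{name}, $node->{left}, $node->{right};
    dump_tree( $_, $depth + 1 ) for @{ $node->{children} || [] };
}

# toy tree, in Newick notation: ((A,B)n1,C)root;
my $tree = {
    name     => 'root',
    children => [
        { name => 'n1', children => [ { name => 'A' }, { name => 'B' } ] },
        { name => 'C' },
    ],
};
index_tree( $tree, { pre => 0, post => 0 } );
dump_tree( $tree, 0 );
```

With indices assigned this way, a node descends from another node exactly when its left index is larger and its right index is smaller than those of the other node.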

do_assessment

Given a Bio::Phylo::Taxa object and a Bio::Phylo::Forest::Tree object, iterates over all taxa. For each taxon, all the leaf nodes that belong to that taxon are collected. The subroutine then assesses whether this set of leaf nodes forms a monophyletic group. This is done by finding the MRCA of these leaf nodes and then fetching all descendants of the MRCA. If the set of descendants is the same size as the set of leaf nodes, the assessment for this taxon is monophyletic.
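
The monophyly test can be sketched on a toy tree of nested hashes as follows. This is not the script's code and does not use Bio::Phylo: the helpers node, leaves, mrca and is_monophyletic are assumptions, and the sketch compares the number of leaves under the MRCA with the number of focal tips.

```perl
use strict;
use warnings;

# build a toy node; a node without children is a leaf
sub node {
    my ( $name, @kids ) = @_;
    return { name => $name, children => [@kids] };
}

# names of all leaves under a node
sub leaves {
    my ($node) = @_;
    my @kids = @{ $node->{children} };
    return $node->{name} unless @kids;
    return map { leaves($_) } @kids;
}

# MRCA of a set of tip names: keep descending for as long as a single
# child still subtends all of the focal tips
sub mrca {
    my ( $node, $tips ) = @_;
    for my $child ( @{ $node->{children} } ) {
        my %under = map { $_ => 1 } leaves($child);
        return mrca( $child, $tips ) if !grep { !$under{$_} } @$tips;
    }
    return $node;
}

# monophyletic if the MRCA subtends exactly as many leaves as there are tips
sub is_monophyletic {
    my ( $root, $tips ) = @_;
    my @clade = leaves( mrca( $root, $tips ) );
    return @clade == @$tips;
}

# toy tree, in Newick notation: (((A1,A2)n2,B1)n1,(C1,(C2,D1)n4)n3)root;
my $root = node( 'root',
    node( 'n1', node( 'n2', node('A1'), node('A2') ), node('B1') ),
    node( 'n3', node('C1'), node( 'n4', node('C2'), node('D1') ) ),
);

print 'A: ', ( is_monophyletic( $root, [ 'A1', 'A2' ] ) ? 'monophyletic' : 'not monophyletic' ), "\n";
print 'C: ', ( is_monophyletic( $root, [ 'C1', 'C2' ] ) ? 'monophyletic' : 'not monophyletic' ), "\n";

# prints:
# A: monophyletic
# C: not monophyletic
```

In the toy tree, the MRCA of C1 and C2 also subtends D1, so the clade has three leaves for two focal tips and the test fails.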

If the taxon is not monophyletic, the set of "stop nodes" is retrieved. These are all the internal nodes where the focal taxon coalesces with at least one other taxon, sorted in a post-order traversal. The subroutine then assesses for each stop node whether it descends from the next stop node in the sorted list. This is done by checking that the left (pre-order) index of the focal node is larger than that of the next node, and that the right (post-order) index of the focal node is smaller than that of the next node. During this testing iteration, the stop nodes are binned in distinct paths from the tips to the root. If there is more than one distinct bin/path, the taxon is considered polyphyletic, otherwise paraphyletic.
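
The index comparison and the binning into paths can be sketched as below. This is a simplified reading of the paragraph above, not the script's implementation: stop nodes are plain hashes with assumed 'left' and 'right' fields, the helper names are hypothetical, and the example index values are made up for two small toy trees.

```perl
use strict;
use warnings;

# true if $node lies inside the subtree of $anc, judged by index values
sub descends_from {
    my ( $node, $anc ) = @_;
    return $node->{left} > $anc->{left} && $node->{right} < $anc->{right};
}

# Hypothetical helper: sort stop nodes in post-order, then walk the list
# and start a new path whenever a node does not descend from its successor.
sub bin_paths {
    my @stops = sort { $a->{right} <=> $b->{right} } @_;
    my @paths;
    for my $i ( 0 .. $#stops ) {
        if ( $i == 0 || !descends_from( $stops[ $i - 1 ], $stops[$i] ) ) {
            push @paths, [];    # previous node is not below this one: new path
        }
        push @{ $paths[-1] }, $stops[$i];
    }
    return @paths;
}

# more than one path means polyphyletic, otherwise paraphyletic
sub status {
    my @paths = bin_paths(@_);
    return @paths > 1 ? 'polyphyletic' : 'paraphyletic';
}

# assumed stop nodes in (((A1,B1)n1,A2)n2,C1)root: n1 is nested in n2
my @nested = ( { name => 'n1', left => 3, right => 3 }, { name => 'n2', left => 2, right => 5 } );

# assumed stop nodes in ((A1,B1)n1,(A2,C1)n2)root: n1 and n2 are side by side
my @apart = ( { name => 'n1', left => 2, right => 3 }, { name => 'n2', left => 5, right => 6 } );

print status(@nested), "\n";    # prints: paraphyletic
print status(@apart),  "\n";    # prints: polyphyletic
```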

For non-monophyletic taxa, the final step is to then collect all the other taxa with which the focal taxon is entangled. It does this by fetching for each bin/path the first (most recent) node and taking its subtended taxa. The union of these sets of subtended taxa forms the set of 'tanglees'.
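
A minimal sketch of this union step, assuming each stop node is a hash that carries the names of its subtended taxa in a 'taxa' array. The helper name tanglees and the removal of the focal taxon from the result are assumptions, not confirmed behaviour of the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): take the first node
# of every path and return the union of their subtended taxon names.
sub tanglees {
    my ( $focal, @paths ) = @_;
    my %seen;
    for my $path (@paths) {
        my $first = $path->[0];    # most recent stop node on this path
        $seen{$_}++ for @{ $first->{taxa} };
    }
    delete $seen{$focal};          # assumption: a taxon is not its own tanglee
    return sort keys %seen;
}

# two paths whose first nodes subtend overlapping sets of taxa
my @paths = (
    [ { taxa => [ 'Aus bus', 'Aus cus' ] }, { taxa => [ 'Aus bus', 'Aus cus', 'Aus dus' ] } ],
    [ { taxa => [ 'Aus bus', 'Aus eus' ] } ],
);
print join( ',', tanglees( 'Aus bus', @paths ) ), "\n";    # prints: Aus cus,Aus eus
```

Note that 'Aus dus' is not reported in this example, because it is only subtended by a later node on the first path.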

Arguments:

$taxa: a Bio::Phylo::Taxa object
$tree: a Bio::Phylo::Forest::Tree object

Returns a two-dimensional array (i.e. a table) where each record consists of the fields:

- name of the focal taxon
- status, i.e. 'monophyletic', 'paraphyletic' or 'polyphyletic'
- comma separated string with 'tanglees', if any (otherwise empty string)
- comma separated string with sequence IDs
- comma separated string with additional metadata
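
For illustration, a table with these five fields can be serialised as tab-separated text along the following lines. The sample records and the helper name to_tsv are made up, and whether the script emits a header row is not specified here, so none is written.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): serialise a
# two-dimensional array as tab-separated text, one record per line.
sub to_tsv {
    my @rows = @_;
    return join '', map { join( "\t", @$_ ) . "\n" } @rows;
}

# made-up records: name, status, tanglees, sequence IDs, metadata
my @table = (
    [ 'Aus bus', 'monophyletic', '',        'ID1,ID2', 'Europe' ],
    [ 'Aus cus', 'paraphyletic', 'Aus dus', 'ID3,ID4', 'Asia' ],
);
print to_tsv(@table);
```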

print_html

Prints the result as an HTML page with a table. The page expects that the JavaScript library "sorttable.js", which allows sorting on table columns, is located in the document root of the web server. The table has the following columns:

- name of the focal taxon
- status, i.e. 'monophyletic', 'paraphyletic' or 'polyphyletic'
- comma separated string with 'tanglees', if any (otherwise empty string)
- comma separated string with sequence IDs
- comma separated string with additional metadata