
NAME

Bio::ObjectCompat - Object compatibility for phylogenetic software in OO perl

Object compatibility for phylogenetic software in OO perl

 Rutger A. Vos
 rvos@interchange.ubc.ca
 Department of Zoology, 6270 University Boulevard
 University of British Columbia
 Vancouver, BC, V6T 1Z4, Canada

The most recent version of this document can be found at (user=guest, pass=guest):

 $URL: http://nladr-cvs.sdsc.edu/svn/CIPRES/cipresdev/trunk/cipres/framework/perl/phylo/lib/Bio/ObjectCompat.pod $

The trunk version of this document is written in pod, a simple source code documentation format for perl5. To view it in an nroff-like formatter, use 'perldoc ObjectCompat.pod'. Pod can be converted to a number of different formats; by default the pod2text, pod2latex and pod2html utilities should be available for this purpose on systems with a recent perl installation.

The version you are reading now is: $Revision: 3409 $

Please help improve this document by making sure you are reading the most recent version, and sharing your feedback with the author.

Abstract

This document describes the steps required to obtain object compatibility between three software packages written in object-oriented perl5: Bio::Perl, Bio::NEXUS and Bio::Phylo. Of these three, BioPerl is by far the most commonly used, largest and oldest project. We therefore suggest an approach that requires minimal, optional changes on its part, playing to the strength of its design in using interfaces such as Bio::Tree::TreeI and Bio::Tree::NodeI. We are implementing several new such interfaces, in particular for characters or character sequences, character state matrices and a character-data-and-tree object that forms a container for comparative data and phylogenetic trees. Implementation of these interfaces is largely left to Bio::NEXUS and Bio::Phylo, which thereby become compatible, such that users can draw on the strengths of both packages more easily.

Introduction

Phylogenetic analysis is a field that, from a programmer's perspective, deals with a limited set of objects: trees, which are composed of nodes; matrices, which are composed of character sequences of some sort; and a containing context that describes the relationship between the two: a character-data-and-tree object.

Object-oriented perl5

Objects in perl5 are references to data structures 'blessed into' a package, which defines the methods implemented by the object. Perl5 allows for multiple inheritance either by using the base pragma or by manipulating the @ISA array. Runtime modification of the inheritance tree and the symbol table allows for optional implementation of java-like interfaces, so that classes from different packages can become loosely coupled through the interfaces they implement. These properties can be used to make different packages written in object-oriented perl5 object-compatible.
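As a minimal sketch of these mechanics, the following creates a blessed hash reference and then changes its class's inheritance at runtime. The class names My::Node and My::Labelled are hypothetical and exist only for illustration; they are not part of any of the packages discussed here.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; none of these names exist
# in BioPerl, Bio::NEXUS or Bio::Phylo.
package My::Node;

sub new {
    my ( $class, %args ) = @_;
    my $self = { name => $args{name} };    # a plain hash reference ...
    return bless $self, $class;            # ... blessed into the package
}

sub name { $_[0]{name} }

package My::Labelled;

# A second parent class that contributes one extra method.
sub label { 'node:' . $_[0]->name }

package main;

my $node = My::Node->new( name => 'Homo_sapiens' );

# Before @ISA is modified the object cannot resolve label().
my $before = $node->can('label') ? 1 : 0;

# Modify the inheritance tree at runtime.
push @My::Node::ISA, 'My::Labelled';

my $after = $node->can('label') ? 1 : 0;
print $node->label, "\n";
```

Because method resolution happens at call time, existing objects pick up the new parent class as soon as @ISA changes.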

Phylogenetic software packages

Several software libraries written in object-oriented perl5 now exist that all implement objects from the phylogenetic problem space, though each in a slightly different way. The largest among these packages is Bio::Perl, which is widely used by molecular biologists around the world. BioPerl's architecture is broad, with branches maintained by many different developers who stay compatible with each other by implementing interfaces such as Bio::Tree::TreeI and Bio::Tree::NodeI (see also: http://search.cpan.org/~birney/bioperl-1.4/biodesign.pod). Here we describe how two smaller packages, Bio::NEXUS and Bio::Phylo, can be modified to become compatible with BioPerl so that their respective strengths become more easily accessible to the BioPerl user community. The approach we suggest may be a model for other phylogenetic software written in OO perl5, with BioPerl taking on the role of defining the standard interfaces - a kind of W3C for phyloinformatics.

Interface conventions in BioPerl

The typical approach taken in BioPerl is that java-like interfaces are defined in classes whose names are suffixed with an 'I', e.g. Bio::Tree::TreeI. These classes inherit from Bio::Root::RootI, which defines exception handling methods.

The interfaces are never instantiated directly. Rather, the implementation class objects such as Bio::Tree::Tree are instantiated by the IO system, in this case Bio::TreeIO.

The interfaces define the names of the methods to be implemented; each method body simply calls throw_not_implemented, so an exception is raised if that body is ever executed. Classes in BioPerl such as Bio::Tree::Tree implement the actual subroutines defined in the interfaces listed in their @ISA arrays, in this case Bio::Tree::TreeI, so that these exceptions are never thrown.
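The pattern can be sketched without BioPerl installed. In the following, My::RootI is a stand-in for Bio::Root::RootI, and My::TreeI and My::Tree are hypothetical names; only the method name throw_not_implemented is taken from the text above, and its body here is an assumption.

```perl
use strict;
use warnings;

# Stand-in for Bio::Root::RootI so that the sketch runs without BioPerl.
package My::RootI;

sub throw_not_implemented {
    my $self   = shift;
    my $method = ( caller(1) )[3];    # name of the unimplemented method
    die "Abstract method $method is not implemented by " . ref($self) . "\n";
}

# Hypothetical interface: method names only, no behaviour.
package My::TreeI;
our @ISA = ('My::RootI');

sub get_nodes     { shift->throw_not_implemented }
sub get_root_node { shift->throw_not_implemented }

# Hypothetical implementation that overrides only one of the two methods.
package My::Tree;
our @ISA = ('My::TreeI');

sub new { my $class = shift; return bless { nodes => [@_] }, $class }
sub get_nodes { @{ $_[0]{nodes} } }

package main;

my $tree  = My::Tree->new(qw(A B C));
my @nodes = $tree->get_nodes;                    # implemented
my $ok    = eval { $tree->get_root_node; 1 };    # not implemented: dies
my $error = $@;
print "caught: $error" unless $ok;
```

The exception names the interface method that was left unimplemented, which makes an incomplete implementation easy to spot in testing.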

BioPerl's general design philosophy is that "complex" operations (generally, anything that is computationally intensive and/or requires external tools) are provided by separate factory classes that operate on the objects. The basic objects modelling biological data (trees, matrices) are therefore intentionally fairly concise.

Optional interface inheritance

Third-party packages can become compatible with BioPerl by declaring, with the base pragma, which BioPerl interfaces they implement (and then correctly implementing the methods defined in those interfaces). However, this creates a permanent compile-time dependency between the package and BioPerl. A more dynamic option is to test at runtime whether an interface is installed, and only then inherit from it by adding the class to the @ISA array.
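A minimal sketch of the runtime test might look as follows. The class My::Phylo::Tree and the flag $HAS_BIOPERL are hypothetical; the code is written to run whether or not BioPerl is installed.

```perl
use strict;
use warnings;

# Hypothetical third-party tree class.
package My::Phylo::Tree;

our @ISA;

# Try to load the BioPerl interface. The eval traps the failure when
# BioPerl is not installed, so there is no hard dependency.
our $HAS_BIOPERL = eval { require Bio::Tree::TreeI; 1 } ? 1 : 0;
push @ISA, 'Bio::Tree::TreeI' if $HAS_BIOPERL;

sub new { my $class = shift; return bless {}, $class }

package main;

my $tree = My::Phylo::Tree->new;
printf "BioPerl interface %s\n",
    $My::Phylo::Tree::HAS_BIOPERL ? 'inherited' : 'not available';
```

Callers that need BioPerl compatibility can then test $tree->isa('Bio::Tree::TreeI') before passing the object to BioPerl code.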

I (RAV) found that in many instances the interface-defined methods differ only slightly from those implemented natively by the Bio::Phylo classes (e.g. return values passed as a list versus an array reference), so implementing adaptor classes to create object compatibility with BioPerl was fairly straightforward, as shown in the Bio::Phylo::Adaptor architecture.

The Bio::NEXUS::Tree and Bio::NEXUS::Node objects could be modified in a similar way, such that tree objects and node objects from Bio::NEXUS can likewise masquerade as BioPerl objects.
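The adaptor idea can be illustrated with a small sketch. Both classes below are hypothetical and do not reproduce the actual Bio::Phylo::Adaptor code; the method names id and each_Descendent are modelled on methods of the same name in Bio::Tree::NodeI, while the native get_children accessor returning an array reference is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical native node class whose accessor returns an array reference.
package My::Native::Node;

sub new {
    my ( $class, %args ) = @_;
    return bless { name => $args{name}, children => [] }, $class;
}
sub get_name     { $_[0]{name} }
sub add_child    { push @{ $_[0]{children} }, $_[1]; return $_[0] }
sub get_children { $_[0]{children} }    # array reference

# Hypothetical adaptor exposing list-returning methods instead.
package My::Adaptor::Node;

sub new {
    my ( $class, $native ) = @_;
    return bless { native => $native }, $class;
}

sub id { $_[0]{native}->get_name }

# Returns a list; each child is wrapped so that it is adapted as well.
sub each_Descendent {
    my $self = shift;
    return map { My::Adaptor::Node->new($_) }
        @{ $self->{native}->get_children };
}

package main;

my $root = My::Native::Node->new( name => 'root' );
$root->add_child( My::Native::Node->new( name => $_ ) ) for qw(A B);

my $adapted = My::Adaptor::Node->new($root);
my @names   = map { $_->id } $adapted->each_Descendent;
print "@names\n";
```

The native object is left untouched; only the adaptor knows about both calling conventions.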

Further integration

Bio::NEXUS and Bio::Phylo can integrate further along three tracks:

1. Input and output

All three packages now have their own IO architecture. A wrapper class that functions like BioPerl's IO architecture should be written. The IO class/object sets up a character-data-and-tree architecture, where the actual data objects - trees, matrices, taxa - are instantiated by file parsers, database interfaces and cipres interfaces provided by the various toolkits. Work is now well under way to make Bio::AlignIO::nexus use Bio::NEXUS as its parser, and Bio::Phylo will probably do the same thing in the future.

2. Internal code reviews

The code bases of Bio::NEXUS and Bio::Phylo should be reviewed to minimize the number of locations where assumptions are made about the underlying data structure (or indeed any API idiosyncrasies), instead using the advertised interface accessors and mutators as much as possible. This will facilitate tighter integration of objects in the future. Ideally, objects themselves should be as abstract as possible in terms of their instance data, handing over much of their relational context to a mediator. For example, the different objects in Bio::Phylo that interact with OTUs manage these relationships through Bio::Phylo::Mediators::TaxaMediator, whose back end can now easily be replaced with a persistent data source or repository such as a webservice or a database. Ideally, this concept would be expanded to include relationships between nodes in a tree, other instance data, and higher order relationships between sets of OTUs, matrices and trees.

3. New interfaces

In addition to the node and tree interfaces currently defined in BioPerl, a number of new interfaces should be specified: an abstract character state matrix interface; a character, or character sequence, interface supporting various data types; and a character-data-and-tree interface linking tree objects with matrix objects. Work in this direction has been done by extending Bio::SimpleAlign as both a Bio::AnnotatableI and a Bio::FeatureHolderI, and by adding Bio::Annotations::TreeI (and a test script) to BioPerl, so that a Bio::NEXUS::Tree or Bio::Tree can be attached to an alignment as an "annotation".

The next section discusses these interfaces in more detail.

New Interfaces

The interfaces we propose are meant to be fairly minimal, providing mostly just accessors and mutators for the object's data. Substantial operations (e.g. calculations) will be provided by factory objects. For example, inferring a tree would be something like:

 my $inferrer = Bio::Tools::InferTree::FooBar->new;
 my $tree = $inferrer->inferTree( $matrix );

Rather than:

 my $tree = $matrix->inferTree;

Matrices

At present, no suitable interface for character state matrices has been defined in BioPerl. However, a Bio::Phylo::Matrices::Matrix can masquerade as a Bio::Align::AlignI instance well enough that it is written as proper #nexus without too much trouble, as shown in Bio::Phylo::Adaptor::Bioperl::Matrix. But a character state matrix object can be many other things besides an alignment. The #nexus format specifies many other data types (categorical, continuous values) which should also be validated.

1. Matrix type safety

A character state matrix has a pre-defined data type (dna/rna/nucleotide; amino acid; standard categorical; continuous) against which data inserted in the matrix must be validated. Once data has been inserted in the matrix there is little point in changing the datatype, so perhaps this should be a constant specified in the constructor, so that subsequently the interface only defines a readonly $matrix->datatype() method. Likewise, the number of taxa and characters in a matrix should be an emergent property of its contents so the $matrix->ntax() and $matrix->nchar() methods should be readonly.
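As a sketch of what such a read-only design could look like, the class below fixes the data type in the constructor and derives ntax and nchar from the stored rows. The method names datatype, ntax and nchar follow the text; the class name My::Matrix, the add_row method and the string-per-taxon storage are assumptions.

```perl
use strict;
use warnings;

# Hypothetical matrix class; only datatype, ntax and nchar come from the
# text above, everything else is an assumption.
package My::Matrix;

sub new {
    my ( $class, %args ) = @_;
    die "datatype is required\n" unless defined $args{datatype};
    return bless { datatype => lc $args{datatype}, rows => {} }, $class;
}

# Read-only: an argument is rejected rather than silently ignored.
sub datatype {
    my $self = shift;
    die "datatype is read-only\n" if @_;
    return $self->{datatype};
}

sub add_row {
    my ( $self, $taxon, $chars ) = @_;
    $self->{rows}{$taxon} = $chars;
    return $self;
}

# Emergent properties, computed from the contents on demand.
sub ntax { scalar keys %{ $_[0]{rows} } }

sub nchar {
    my $self = shift;
    my $max  = 0;
    for my $chars ( values %{ $self->{rows} } ) {
        $max = length($chars) if length($chars) > $max;
    }
    return $max;
}

package main;

my $matrix = My::Matrix->new( datatype => 'DNA' );
$matrix->add_row( taxon_a => 'ACGT' )->add_row( taxon_b => 'ACGTAC' );
printf "%s: %d taxa, %d characters\n",
    $matrix->datatype, $matrix->ntax, $matrix->nchar;
```

Since ntax and nchar are never stored, they cannot drift out of sync with the matrix contents.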

In a character state matrix, some symbols may be more ambiguous than others - most sequence alignments have gaps in them, and sometimes the sequences are just bad, with many N's or ?'s. Under the IUPAC single character ambiguity conventions, ambiguous symbols map to non-ambiguous ones as follows:

 my $IUPAC = {
    'A' => [ 'A'             ],
    'B' => [ 'C','G','T'     ],
    'C' => [ 'C'             ],
    'D' => [ 'A','G','T'     ],
    'G' => [ 'G'             ],
    'H' => [ 'A','C','T'     ],
    'K' => [ 'G','T'         ],
    'M' => [ 'A','C'         ],
    'N' => [ 'A','C','G','T' ],
    'R' => [ 'A','G'         ],
    'S' => [ 'C','G'         ],
    'T' => [ 'T'             ],
    'U' => [ 'U'             ],
    'V' => [ 'A','C','G'     ],
    'W' => [ 'A','T'         ],
    'X' => [ 'A','C','G','T' ],
    'Y' => [ 'C','T'         ],
    '-' => [                 ],
    '?' => [ 'A','C','G','T' ],
 };

The matrix interface should be able to take this ambiguity into account when parsing matrices, or when transforming them, for example for serialization to the CIPRES architecture.

To allow for this, validation of a character $c should perform a character state lookup, for example by checking the $IUPAC hash reference. If $matrix->datatype =~ /^dna$/i, the $IUPAC hash reference is the lookup table; if $IUPAC->{$c} does not exist, an exception is thrown.

For instances where none of the default lookup tables suffice (i.e. when handling a 'mixed' matrix) the matrix interface should allow a lookup table as an argument to the constructor.
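The lookup-based validation, including a caller-supplied table, could be sketched as follows. My::Validator, its validate method and the lookup constructor argument are hypothetical names, and the default table is abbreviated to a set of permitted symbols rather than the full ambiguity mapping.

```perl
use strict;
use warnings;

# Abbreviated default table: only the permitted symbols are recorded here.
# A fuller version would carry the complete IUPAC mapping.
my %DEFAULT_LOOKUP = (
    dna => { map { $_ => 1 } qw(A C G T B D H K M N R S V W X Y - ?) },
);

# Hypothetical validator; the class name and the 'lookup' argument are
# assumptions, not an existing API.
package My::Validator;

sub new {
    my ( $class, %args ) = @_;
    my $type   = lc( $args{datatype} // '' );
    my $lookup = $args{lookup} || $DEFAULT_LOOKUP{$type}
        or die "No lookup table for datatype '$type'\n";
    return bless { lookup => $lookup }, $class;
}

# Dies on the first symbol that is missing from the lookup table.
sub validate {
    my ( $self, $string ) = @_;
    for my $c ( split //, uc $string ) {
        die "Invalid character '$c'\n" unless exists $self->{lookup}{$c};
    }
    return 1;
}

package main;

my $dna = My::Validator->new( datatype => 'dna' );
my $ok  = eval { $dna->validate('ACGTN-?') };
my $bad = eval { $dna->validate('ACGZ') };    # 'Z' is not in the table
my $err = $@;

# A 'mixed' matrix supplies its own table to the constructor.
my $mixed = My::Validator->new(
    datatype => 'mixed',
    lookup   => { map { $_ => 1 } qw(A C G T 0 1) },
);
my $mixed_ok = eval { $mixed->validate('ACGT0110') };
print "rejected: $err" unless $bad;
```

Passing the table to the constructor keeps the validation code identical for built-in and custom data types.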

2. Matrices and the Character Data and Tree concept

A character matrix can become contained by a CDAT object, analogous to the way mesquite defines a project (using the title and link tokens, or possibly just by allowing only one taxa block, one tree block and one characters block to be in context at any one time). This facility may be defined as in Bio::Phylo::Matrices::Matrix, using $matrix->set_cdat($cdat) and $matrix->get_cdat() methods, or just from the perspective of the CDAT container, e.g. $cdat->add_matrix($matrix). Alternatively, a mediator architecture could manage the bi-directional relationships between the objects involved.
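The mediator option might be sketched as below. My::CdatMediator and its methods are hypothetical and are not modelled on any existing mediator API; plain hash references stand in for real matrix and CDAT objects.

```perl
use strict;
use warnings;

# Hypothetical mediator: it alone records which matrix belongs to which
# CDAT object, so neither object holds a reference to the other.
package My::CdatMediator;
use Scalar::Util qw(refaddr);

# Back end; in principle this could be a database or a web service.
my ( %cdat_of, %matrices_of );

sub set_link {
    my ( $class, $cdat, $matrix ) = @_;
    $cdat_of{ refaddr($matrix) } = $cdat;
    push @{ $matrices_of{ refaddr($cdat) } }, $matrix;
    return $class;
}

sub get_cdat {
    my ( $class, $matrix ) = @_;
    return $cdat_of{ refaddr($matrix) };
}

sub get_matrices {
    my ( $class, $cdat ) = @_;
    return @{ $matrices_of{ refaddr($cdat) } || [] };
}

package main;

# Plain hash references stand in for real CDAT and matrix objects.
my $cdat     = { title => 'primates' };
my @matrices = ( { type => 'dna' }, { type => 'continuous' } );
My::CdatMediator->set_link( $cdat, $_ ) for @matrices;

my $owner = My::CdatMediator->get_cdat( $matrices[1] );
my @found = My::CdatMediator->get_matrices($cdat);
print "$owner->{title} holds ", scalar @found, " matrices\n";
```

This sketch keeps strong references inside the mediator and never removes them; a real implementation would also need to unlink objects when they are destroyed.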

Character sequences

BioPerl does not define a suitable interface for character sequences. We propose a character sequence interface that meets the following requirements:

1. Range operations

Individual objects for each character in a matrix are not feasible from a performance and memory requirements point of view. Instead, character state data should be defined in ranges, perhaps inheriting from Bio::RangeI.

2. Character type safety

Like the character state matrix interface, the character sequence interface must be typed (e.g. dna/rna/nucleotide; protein; standard categorical or continuous), so that characters inserted in the character sequence object can be validated, and character sequence objects inserted in the matrix object can be checked for type identity with the matrix object. The data type may be defined using $char->set_type($type) and $char->get_type() methods.

3. Character-to-CDAT linkage

Character sequence objects are contained by matrix objects, which in turn can be contained / handled by CDAT objects.

4. Meta data

The character sequence object should allow for annotation of individual characters, for example as implemented in Bio::Phylo::Matrices::Datum.
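The range and type-safety requirements can be illustrated together in a short sketch. set_type and get_type follow the text; the class My::CharSeq, the start, end and get_range methods, the set_states method and the two regular-expression alphabets are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical character sequence class. States are stored as one string
# rather than as one object per character.
package My::CharSeq;

my %ALPHABET = (
    dna      => qr/^[ACGTBDHKMNRSVWXY?-]*$/i,
    standard => qr/^[0-9?-]*$/,
);

sub new {
    my ( $class, %args ) = @_;
    my $self = bless { states => '' }, $class;
    $self->set_type( $args{type} );
    $self->set_states( $args{states} ) if defined $args{states};
    return $self;
}

sub set_type {
    my ( $self, $type ) = @_;
    die "Unknown type\n" unless defined $type && exists $ALPHABET{ lc $type };
    $self->{type} = lc $type;
    return $self;
}
sub get_type { $_[0]{type} }

# Type safety: states must match the alphabet of the declared type.
sub set_states {
    my ( $self, $states ) = @_;
    die "States do not match type '$self->{type}'\n"
        unless $states =~ $ALPHABET{ $self->{type} };
    $self->{states} = $states;
    return $self;
}

sub start { 1 }
sub end   { length $_[0]{states} }

# States for the 1-based, inclusive range $start..$end, as one string.
sub get_range {
    my ( $self, $start, $end ) = @_;
    die "Range $start..$end out of bounds\n"
        if $start < 1 || $end > $self->end || $start > $end;
    return substr( $self->{states}, $start - 1, $end - $start + 1 );
}

package main;

my $seq   = My::CharSeq->new( type => 'dna', states => 'ACGTTGCA' );
my $slice = $seq->get_range( 4, 6 );
my $bad   = eval { My::CharSeq->new( type => 'standard', states => 'ACGT' ); 1 };
print "$slice\n";
```

A matrix could compare $seq->get_type with its own data type before accepting the sequence, which gives the type identity check described above.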

Character-data-and-tree

Conceptually, nodes in phylogenetic trees and character sequences in matrices both refer to biological entities (e.g. OTUs). We want to make this relationship explicit by creating an intersection object that links the two. The CDAT object would be a thin wrapper around the more fine grained BioPerl objects (Bio::Tree::TreeI and Bio::CDAT::CharMatrixI) it contains. This CDAT object must meet the following requirements:

1. CDAT-to-node linkage

The CDAT object must be able to contain one or more Bio::Tree::TreeI objects, e.g. using $cdat->set_tree($tree) and $cdat->get_trees() (and perhaps $cdat->remove_trees($tree)).

2. CDAT-to-character sequence linkage

The CDAT object must be able to contain one or more Bio::CDAT::CharMatrixI objects, e.g. using $cdat->set_matrices($matrix) and $cdat->get_matrices($matrix) (and perhaps $cdat->remove_matrices($matrix)) methods.

We suggest as a namespace Bio::CDAT.
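A thin container along these lines could be sketched as follows, shown here for trees only; the matrix accessors would be analogous. The method names set_tree, get_trees and remove_trees follow the text, while the class names My::CDAT and My::Tree are placeholders rather than the proposed namespace itself.

```perl
use strict;
use warnings;

# Hypothetical container; a placeholder, not the proposed Bio::CDAT.
package My::CDAT;
use Scalar::Util qw(blessed);

sub new { my $class = shift; return bless { trees => [] }, $class }

# A real implementation would test isa('Bio::Tree::TreeI'). Checking for
# a get_nodes method keeps this sketch runnable without BioPerl.
sub set_tree {
    my ( $self, $tree ) = @_;
    die "Not a tree object\n"
        unless blessed($tree) && $tree->can('get_nodes');
    push @{ $self->{trees} }, $tree;
    return $self;
}

sub get_trees { @{ $_[0]{trees} } }

# Removes every occurrence of the given tree object.
sub remove_trees {
    my ( $self, $tree ) = @_;
    $self->{trees} = [ grep { $_ != $tree } @{ $self->{trees} } ];
    return $self;
}

# Minimal stand-in for a tree implementation.
package My::Tree;
sub new       { my $class = shift; return bless { nodes => [] }, $class }
sub get_nodes { @{ $_[0]{nodes} } }

package main;

my $cdat = My::CDAT->new;
my ( $t1, $t2 ) = ( My::Tree->new, My::Tree->new );
$cdat->set_tree($t1)->set_tree($t2);
my @all = $cdat->get_trees;

$cdat->remove_trees($t1);
my @left = $cdat->get_trees;
printf "%d trees before removal, %d after\n", scalar @all, scalar @left;
```

The container only stores and returns the objects it is given, in keeping with the idea that substantial operations belong in separate factory classes.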

TODO list summary

IO

Bio::NEXUS, Bio::Phylo and BioPerl should become better integrated at the input/output level, for example by adopting the standard BioPerl architectures for parsers (e.g. Bio::TreeIO), and by making trees received from CIPRES conform to the BioPerl interfaces.

Test data

In order to ensure quality coding, we should adopt a set of test data files and a regression testing strategy. This is likely to develop out of the use cases.

CPAN release cycles

The intent is that the design phase takes place on CPAN releases of Bio::NEXUS and Bio::Phylo, and that changes to the BioPerl core will be proposed only once the API has stabilized.