Rutger Vos

NAME

Bio::Phylo::Forest::DBTree - Phylogenetic database as a tree object

SYNOPSIS

 use Bio::Phylo::Forest::DBTree;
 
 # connect to the Green Genes tree
 my $file = 'gg_13_5_otus_99_annotated.db';
 my $dbtree = Bio::Phylo::Forest::DBTree->connect($file);

 # $dbtree can be used as a Bio::Phylo::Forest::Tree object,
 # and the node objects that are returned can be used as
 # Bio::Phylo::Forest::Node objects
 my $root = $dbtree->get_root;

DESCRIPTION

This package lets you handle very large phylogenies (for example, the NCBI taxonomy or the Green Genes tree) as if they were Bio::Phylo tree objects, with all the usual possibilities for traversal, computation, serialization, and visualization, while the data itself is stored in a SQLite database. Each database is a single file, so it can be shared easily. Some useful database files are available here: https://figshare.com/account/home#/projects/18808

The distribution of this package includes several scripts for building new tree databases:

  • megatree-loader: loads a very large Newick tree into a database.

  • megatree-ncbi-loader: loads the NCBI taxonomy dump into a database.

  • megatree-phylotree-loader: loads a tree in the format of http://phylotree.org into a database.

As an example of interacting with a database tree, the script megatree-pruner can be used to extract subtrees from a database.

DATABASE METHODS

The following methods deal with the database as a whole: creating a new database, connecting to an existing one, persisting a tree in a database and extracting one as a mutable, in-memory object.

create()

Creates a SQLite database file in the provided location. Usage:

  use Bio::Phylo::Forest::DBTree;
  
  # second argument is optional
  Bio::Phylo::Forest::DBTree->create( $file, '/opt/local/bin/sqlite3' );

The first argument is the location where the database file will be created. The optional second argument gives the location of the sqlite3 executable that is used to create the database. By default, sqlite3 is looked up on the $PATH; if it is installed in a non-standard location, that location can be provided here. The database schema that is created corresponds to the following SQL statements:

 create table node(
   id int not null,
   parent int,
   left int,
   right int,
   name varchar(20),
   length float,
   height float,
   primary key(id)
 );
 create index parent_idx on node(parent);
 create index left_idx on node(left);
 create index right_idx on node(right);
 create index name_idx on node(name);
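
To make the role of the indexed columns concrete, here is a minimal sketch that builds the same table in an in-memory SQLite database through plain DBI and runs two typical lookups. It assumes that DBI and DBD::SQLite are installed and that left and right hold nested-set style indexes (a descendant's values fall inside its ancestor's); that reading of the columns, the hand-assigned values, and the NULL parent for the root are assumptions of this toy example, not something the schema itself states.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database, so nothing is written to disk.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 } );

# Same table layout as above; "left" and "right" are quoted here
# because they are also SQL keywords.
$dbh->do(q{
    create table node(
        id int not null,
        parent int,
        "left" int,
        "right" int,
        name varchar(20),
        length float,
        height float,
        primary key(id)
    )
});
$dbh->do('create index parent_idx on node(parent)');
$dbh->do('create index name_idx on node(name)');

# Toy tree ((A,B)AB,C)root; the left/right values are hand-assigned
# nested-set indexes (an assumption of this sketch).
my @rows = (
    # id, parent, left, right, name, length, height
    [ 1, undef, 1, 10, 'root', 0, 0 ],
    [ 2, 1,     2, 7,  'AB',   1, 1 ],
    [ 3, 2,     3, 4,  'A',    1, 2 ],
    [ 4, 2,     5, 6,  'B',    1, 2 ],
    [ 5, 1,     8, 9,  'C',    2, 2 ],
);
my $insert = $dbh->prepare('insert into node values (?,?,?,?,?,?,?)');
$insert->execute(@$_) for @rows;

# Children of a node: a single lookup on the parent column.
my $children = $dbh->selectcol_arrayref(
    'select name from node where parent = ? order by id', undef, 2 );

# Descendants of a node: a range query on the nested-set indexes.
my ( $l, $r ) = $dbh->selectrow_array(
    'select "left", "right" from node where name = ?', undef, 'root' );
my $descendants = $dbh->selectcol_arrayref(
    'select name from node where "left" > ? and "right" < ? order by id',
    undef, $l, $r );

print "children of AB: @$children\n";
print "descendants of root: @$descendants\n";
```

Under these assumptions, both lookups avoid walking the tree node by node, which is what makes this layout attractive for very large trees.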

connect()

Connects to a SQLite database file and returns the connection as a Bio::Phylo::Forest::DBTree object. Usage:

 use Bio::Phylo::Forest::DBTree;
 my $dbtree = Bio::Phylo::Forest::DBTree->connect($file);

The argument is a file name. If the file exists, a DBD::SQLite database handle to that file is returned. If the file does not exist, a new database is created in that location, and the handle to that newly created database is returned. The creation of the database is handled by the create() method (see above).

persist()

Persists a phylogenetic tree object (a subclass of Bio::Phylo::Forest::Tree) into a newly created database file. Usage:

  use Bio::Phylo::Forest::DBTree;  
  my $dbtree = Bio::Phylo::Forest::DBTree->persist(
      -file => $file,
      -tree => $tree,
  );

This method first creates a database at the location specified by $file by making a call to the create() method. Subsequently, the $tree object is traversed from root to tips and inserted in the newly created database. Finally, the handle to this database is returned, i.e. a Bio::Phylo::Forest::DBTree object.
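
As a fuller illustration of that round trip, the following sketch parses a small Newick string with Bio::Phylo::IO, persists it to a fresh file in a temporary directory, and reads the root back from the database. The toy tree, the temporary path, and the use of labelled internal nodes are choices made for this example only; it also assumes that a sqlite3 executable is available on the $PATH, since create() relies on it.

```perl
use strict;
use warnings;
use File::Temp 'tempdir';
use Bio::Phylo::IO 'parse';
use Bio::Phylo::Forest::DBTree;

# Parse a small in-memory tree from a Newick string.
my $tree = parse(
    -format => 'newick',
    -string => '((A:1,B:1)AB:1,C:2)root;',
)->first;

# persist() writes to a newly created file, so pick a fresh path.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/toy.db";

my $dbtree = Bio::Phylo::Forest::DBTree->persist(
    -file => $file,
    -tree => $tree,
);

# The returned object can be queried like any other database tree.
my $root = $dbtree->get_root;
print $root->name, "\n";
```

For a tree this small there is nothing to gain over the in-memory object; the same calls are meant for trees that are too large to traverse comfortably in memory.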

extract()

Extracts a tree from a database. The returned tree is a mutable, in-memory object, so for large trees this is an expensive operation that is best avoided where possible. Usage:

 my $tree = $dbtree->extract;

dbh()

Returns the underlying handle through which SQL statements can be executed directly on the database. This is a DBD::SQLite object. Usage:

 my $dbh = $dbtree->dbh;

TREE METHODS

The following methods override methods of the same name in the Bio::Phylo hierarchy so that the tree database is accessed more efficiently than would otherwise be the case.

get_root()

Returns the root of the tree, i.e. a Bio::Phylo::Forest::DBTree::Result::Node object, which is a subclass of Bio::Phylo::Forest::Node. Usage:

 my $root = $dbtree->get_root;

get_id()

Returns a dummy ID, an integer. Usage:

 my $id = $dbtree->get_id;

get_by_name()

Returns the first node object that has the provided name. Usage:

 my $node = $dbtree->get_by_name( 'Homo sapiens' );

visit()

Given a code reference, visits all the nodes in the tree and calls the code with the focal node as its first argument. Usage:

 $dbtree->visit(sub{
     my $node = shift;
     print $node->name, "\n"; 
 });
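
Because the callback is an ordinary Perl closure, it can accumulate results across nodes instead of printing them. The sketch below builds a throwaway five-node database with persist() and then tallies the nodes and collects their names; the toy tree and temporary file are assumptions of the example, and only the visit() and name calls shown above are used on the database side.

```perl
use strict;
use warnings;
use File::Temp 'tempdir';
use Bio::Phylo::IO 'parse';
use Bio::Phylo::Forest::DBTree;

# Build a throwaway five-node database to visit.
my $tree = parse(
    -format => 'newick',
    -string => '((A:1,B:1)AB:1,C:2)root;',
)->first;
my $dir    = tempdir( CLEANUP => 1 );
my $dbtree = Bio::Phylo::Forest::DBTree->persist(
    -file => "$dir/toy.db",
    -tree => $tree,
);

# The callback closes over these variables, so results accumulate
# across calls without any global state.
my $count = 0;
my @names;
$dbtree->visit(sub {
    my $node = shift;
    $count++;
    push @names, $node->name if defined $node->name;
});

print "visited $count nodes\n";
print join( ', ', sort @names ), "\n";
```

The names are sorted before printing because this sketch makes no assumption about the order in which visit() reaches the nodes.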