Ewan Birney

NAME

Bioperl databases 1.3 - how to use databases with Bioperl

SYNOPSIS

Not really appropiate for a synopsis. Read on...

DESCRIPTION

This document is designed to let you use Bioperl with databases. Bioperl can work with a number of sequence formats and allows users to interconvert easily amongst these formats. Bioperl can also index flat files in some of these standard formats, allowing very fast retrieval. Other layers above flat file indexing are provided, allowing users to retrieve sequences from remote databases via the Web or build and query in-house relational databases (RDBs).

Different scripts are provided to get you started with the Bioperl backend, for example:

scripts/bp_fetch.PLS

Fetches sequences from a Database

scripts/bp_index.PLS

Builds indexes for flat files databases which are easily accessible by bp_fetch.PLS. These scripts are discussed in more detail below.

The core of the backend system is found in following modules

Bio::DB::*

Generic access to databases, whether flat file, web or RDB. At the moment, this provides random access retrieval, on the basis of ids or accession numbers, but does not provide the ability to loop over the entire database, nor does it provide any complex querying ability. For these advanced features see USING BIOPERL WITH A RELATIONAL DATABASE (bioperl-db).

Bio::DB::BioSeqI is the abstract interface for the databases (hence the I). Bio::DB::GenBank and Bio::DB::GenPept are concrete implementations for network access to the GenBank and GenPept databases at NCBI, and others, using HTTP as a protocol. See Bio::DB::GenBank, Bio::DB::GenPept, and "REMOTE SEQUENCE RETRIEVAL (Bio::DB::*)" for more information.

Bio::Index::* and Bio::DB::Fasta

Flat file indexing system, for read-only, flat file distributions. These provide for format-specific, generic type access, but the underlying machinery can be customised for any number of different flat file systems. The indices allow extremely fast retrieval based on unique keys.

The Index modules EMBL and Fasta, as they are designed as sequence databases, conform to the Bio::DB::BioSeqI interface, meaning they can be used whereever the Bio::DB::BioSeqI is expected.

Flat file indexing of Fasta files is also provided by Bio::DB::Fasta, please see Bio::DB::Fasta for more information - this module has some useful features not contained in Bio::Index::Fasta.

Bio::SeqIO::*

Conversion systems for Bio::Seq objects, either to or from sequence streams. The movement of things into SeqIO prevents the Bio::Seq object bloating up with format code, and the SeqIO system has the benefit of being very easy to extend to new formats. Please see Bio::SeqIO for more information.

bioperl-db

The bioperl-db package is installed separately from Bioperl but is tightly integrated with Bioperl's objects. The bioperl-db package works with the BioSQL schema and maps to Bioperl's Seq and SeqFeature objects, and sequences in familiar formats can be loaded into the schema in relational form, and retrieved as Seq objects as well. Currently Mysql, Oracle, and Postgres are supported. See USING BIOPERL WITH A RELATIONAL DATABASE (bioperl-db) below.

REMOTE SEQUENCE RETRIEVAL (Bio::DB::*)

Bioperl offers a number of modules for retrieving sequences over the network. The available remote databases include GenBank, GenPept, GDB, EMBL, SwissProt, XEMBL, and remote Ace servers. A typical method is

   get_Seq_by_id($id)

which returns a Seq object, or

   get_Stream_by_id($ref_to_array_of_ids)

which returns a SeqIO object. See Bio::DB::GenBank, Bio::DB::GenPept, Bio::DB::GDB, Bio::DB::EMBL, Bio::DB::SwissProt, Bio::DB::XEMBL, and Bio::DB::Ace, and Bio::SeqIO for more information.

SETTING UP BIOPERL INDICES (Bio::Index::*)

If you want to use Bioperl indicies of Fasta, EMBL/SwissProt .dat files, SwissPfam, GenBank, or Blast files then the bp_fetch.PLS and bp_index.PLS scripts are great ways to start off (and also reading the scripts shows you how to use the Bioperl indexing stuff). bp_fetch.PLS and bp_index.PLS coordinate using two environment variables

  BIOPERL_INDEX - directory where the indices are kept

  BIOPERL_INDEX_TYPE - type of DBM file to use for the index

The basic way of indexing a database, once BIOPERL_INDEX has been set up, is to go

  bp_index.pl <index-name> <filenames as full path>

e.g., for Fasta files

  bp_index.pl est /nfs/somewhere/fastafiles/est*.fa

Or, for EMBL/Swissprot files

  bp_index.pl -fmt=EMBL swiss /nfs/somewhere/swiss/swissprot.dat

To retrieve sequences from the index go

  bp_fetch.pl <index-name>:<id>

eg,

  bp_fetch.pl est:AA01234

or

  bp_fetch.pl swiss:VAV_HUMAN

bp_fetch.pl also has other options to connect to Genbank across the network.

CHECKLIST

   mkdir /nfs/datadisk/bioperlindex/

or any other directory

   setenv BIOPERL_INDEX /nfs/datadisk/bioperlindex/
   setenv BIOPERL_INDEX_TYPE DB_File

in .cshrc or .tcshrc (or set and export in bash and its .bashrc). Another BIOPERL_INDEX_TYPE is SDBM_File, this one should come with a standard Perl package.

go

   bp_index.pl swissprot /nfs/datadisk/swiss/swissprot.dat

etc. You are now ready to use bp_fetch.pl. See Bio::Index::Fasta, Bio::Index::GenBank, Bio::Index::Blast, Bio::Index::EMBL, Bio::Index::SwissPfam, and Bio::Index::Swissprot for more.

Flat file indexing of Fasta files is also provided by Bio::DB::Fasta, please see Bio::DB::Fasta for more information - this module provides some functionality absent from Bio::Index::Fasta.

USING BIOPERL WITH A RELATIONAL DATABASE (bioperl-db)

The bioperl-db package works in conjunction with the BioSQL relational schema (http://obda.open-bio.org/). The most recent version of the bioperl-db package can be obtained at http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=bioperl.

The bioperl-db and BioSQL packages integrate neatly with Bioperl's objects and contain tables for sequences, entries, sequence features, taxonomic information, references, keywords, a wide variety of genetic markers, and more. With a Seq object and a built-in database "adaptor" one can load the object's data in one step, as in

  $adaptor->store($newid,$seqobj)

Similarly, one can query the database with id's or using query objects and retrieve arrays of Seq objects. For example

  @seqobjs = $adaptor->fetch_by_query( -constraints =>
            ["or", 
                  ["and", # find all human topoisomerases
                         "species=Human",
                         "keywords like *topoisomerase*"],
                  ["and", # and Escherichia gyrases
                         "species like Escherichia*",
                         "keywords like *gyrase*"],
            ]);

See Bio::DB::SQL::SeqAdaptor, Bio::DB::SQL::QueryConstraint, Bio::DB::SQL::CommentAdaptor, and Bio::DB::SQL::BioQuery for more examples.

With bioperl-db and BioSQL installed you will also be able to load sequence data, GO Ontology data, and NCBI taxonomic data into your own RDB.

BIOPERL AND BIOSQL

BioSQL is joint project of the OpenBio efforts: bioperl, biopython, biojava, and bioruby. BioSQL creates a single relational schema that's accessible using any of these packages. To use BioSQL with Perl you'll need Bioperl, version 1.2 or later, bioperl-db from www.bioperl.org, and BioSQL. The BioSQL package can be obtained at www.open-bio.org. BioSQL currently supports the Mysql, Postgres, and Oracle database servers. Consult the BioSQL package for more details on installing and using BioSQL.

BIO::DB::GFF: INDEXING AND RETRIEVAL OF SEQUENCE FEATURE DATA

An alternative view of sequence feature data is provided by the Bio::DB::GFF module, part of the core package. This module takes a set of sequence features stored in GFF (gene-finding format) format and loads them into a relational database optimized for positional queries. This allows a variety of data mining operations, such as finding sequence features that are within a certain distance of each other or which overlap. The module can also store large contiguous sequences and extract subsequences rapidly.

See Bio::DB::GFF, Bio::DB::GFF::RelSegment, Bio::DB::GFF::Feature, and Bio::DB::GFF::Adaptor::dbi.