NAME

Bio::ToolBox::db_helper - helper interface to various database formats

DESCRIPTION

These are helper subroutines to work with relevant databases that can be accessed through BioPerl modules. These include the Bio::DB::SeqFeature::Store relational database as well as Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::HTS, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed databases. The primary functions included opening connections to these databases, verifying features found within the database, generating lists of features or genomic intervals for subsequent analysis, and collecting features and/or scores within regions of interest. Collected scores may be summarized using a variety of statistical methods.

When collecting scores or features, the data may hosted in a variety of formats and locations. These include:

SeqFeature::Store Database

Full features may be stored, including genes, transcripts, exons, etc. Simple datasets, such as from microarray, may also be stored as the score value in the source GFF file.

References to local, binary, indexed files may also be included as attributes to features stored in the database. Supported files include binary wig files (.wib, see Bio::Graphics::Wiggle) using the 'wigfile' attribute, or bigWig files using the 'wigfile' or 'bigwigfile' attribute. The attribute values must be full paths.

SeqFeature::Store databases are usually hosted by a relational database server (MySQL or PostGreSQL), SQLite file, or an in-memory database (for small GFF3 files only).

BigWig files

BigWig files are compressed, binary, indexed versions of text wig files and may be accessed either locally or remotely. They support extremely fast score retrieval from any genomic location of any size without sacrificing resolution (spatial and numeric).

Directory of BigWig files

A directory containing one or more BigWig files is assembled into a BigWigSet, allowing for metadata, such as strand, to be associated with BigWig files.

BigBed files

BigBed files are compressed, binary, indexed versions of text BED files and may be accessed either locally or remotely. They support extremely fast score and feature retrieval from any genomic location.

Bam files

Bam files are compressed, binary, indexed versions of the text SAM file, or sequence alignment map. They are used with next generation sequencing technologies. They support individual alignment retrieval as well as read depth coverage.

USeq files

USeq files are compressed, binary, indexed files that support BED type annotations or wig type scores distributed across the genome. They support rapid, random access across the genome and are comparable to both BigWig and BigBed files.

While the functions within this library may appear to be simply a rehashing of the methods and functions in Bio::DB::SeqFeature::Store and other modules, they either provide a simpler function to often used database methodologies or are designed to work intimately with the BioToolBox data format file and data structures (see Bio::ToolBox::file_helper). One key advantage to these functions is the ability to work with datasets that are stranded (transcriptome data, for example).

Historically, this module was initially written to use Bio::DB::GFF for database usage. It has since been re-written to use Bio::DB::SeqFeature::Store database schema. Additional functionality has also been added for other databases and file formats.

Complete usage and examples for the functions are provided below.

USAGE

Call the module at the beginning of your perl script and include the module names to export. None are exported by default.

  use Bio::ToolBox::db_helper qw(
          get_new_feature_list 
          get_db_feature
  );

This will export the indicated subroutine names into the current namespace. Their usage is detailed below.

EXPORTED SUBROUTINES

open_db_connection($database, $no_cache)

This module will open a connection to a BioPerl style database. It returns an object that represents the connection. Several different types of databases are supported.

Bio::DB::SeqFeature::Store database

These may be represented by a relational database (e.g. MySQL database), a SQLite database file (file.sqlite or file.db), or a single GFF3 file (file.gff) that can be loaded into an in-memory database. In-memory databases should only be used with small files as they demand a lot of memory.

Parameters for connecting to a relational database are stored in the BioToolBox configuration file, .biotoolbox.cfg. These include database adaptors, user name, password, etc. Information regarding the configuration file may be found within the file itself.

Bio::DB::HTS database

A modern adapter interface to the Bam (or Cram) alignment file comprising of short sequence read alignments to a genome. This is based on the modern HTSlib library, successor to the original samtools library. See http://samtools.github.io for more information.

Bio::DB::Sam database

An older interface to the Bam alignment file comprising of short sequence read alignments to a genome. This is based on the samtools C library version <= 0.1.19. This is still supported although the above should be used for new installs.

Bio::DB::BigWig database

A self-contained database of scores represented by a BigWig (file.bw). See http://genome.ucsc.edu/goldenPath/help/bigWig.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).

Bio::DB::BigWigSet database

A local or remote directory of one or more BigWig files that can treated collectively as a database. A special text file may be included in the directory to assign metadata to each BigWig file, including attributes such as type, source, display name, strand, etc. See Bio::DB::BigWigSet for more information on the formatting of the metadata file.

Bio::DB::BigBed database

A self-contained database of regions represented by a BigBed (file.bb). See http://genome.ucsc.edu/goldenPath/help/bigBed.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).

Bio::DB::USeq database

A self-contained database file of indexed regions or scores represented by a useq archive file (file.useq). See http://useq.sourceforge.net/useqArchiveFormat.html for more information. Files may only be local.

Bio::DB::HTS::Faidx database

An indexed genomic multi-fasta adapter for rapidly fetching genomic sequence using the HTSlib library. If the fasta is not indexed with a .fai file, it will be automatically indexed upon opening.

Bio::DB::Sam::Fai database

An indexed genomic multi-fasta adapter for rapidly fetching genomic sequence using the samtools C library. If the fasta is not indexed with a .fai file, it will be automatically indexed upon opening.

Bio::DB::Fasta Database

A database of fasta sequences. A single multi-fasta or a directory of fasta files may be specified. The directory or parent directory must be writeable by the user to write a small index file.

Pass the name of a relational database or the path of the database file to the subroutine. The opened database object is returned. If it fails, then an error message should be generated and nothing is returned.

Important! If you are forking your perl process, always re-open your database objects in the child process, and pass a second true value to avoid using the cached database object. By default, opened databases are cached to improve efficiency, but this will be disastrous when crossing forks.

Example:

        my $db_name = 'cerevisiae';
        my $db = open_db_connection($db_name);
        
        my $file = '/path/to/file.bam';
        my $db = open_db_connection($file);
        
        # within a forked child process
        # reusing the same variable to simplify code
        # pass second true value to avoid cache
        $db = open_db_connection($file, 1); 
get_db_name($db)

This subroutine will attempt to get the name of an opened Database object if for some reason it's unknown, i.e. user only provided an opened db object. Only works for some databases, at least those I've bothered to track down and find a usable API call to use.

get_dataset_list($db)

This subroutine will retrieve a list of the available features stored in the database and returns an array of the features' types.

For Bio::DB::SeqFeature::Store databases, the type is represented as "type:source", corresponding to the third and second GFF columns, respectively. The types are sorted alphabetically first by source, then by method.

For Bio::DB::BigWigSet databases, the type, primary_tag, method, or display_name attribute may be used, in that respective order of availability. The list is sorted alphabetically.

Pass either the name of the database or an established database object. Supported databases include both Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases.

Example:

        my $db_name = 'cerevisiae';
        my @types = get_dataset_list($db_name);
verify_or_request_feature_types(%options)

This subroutine will process a list of feature types or data sources to be used for data or feature collection. There are two modes of action. If a list was provided, it will process and verify the list. If no list was provided, then the user will be prompted to select from a list of available feature types in the provided database.

The provided list may be a list of feature types or a list of single file data sources, including Bam, bigWig, or bigBed files. If the list includes feature types, they are verified as existing in the provided database. If the list includes file names, the local files are checked and the names prefixed with "file:" for use downstream.

If no list was provided, then a list of available feature types in the provided database will be presented to the user, and the user prompted to make a selection. One or more types may be selected, and a single item may be enforced if requested. The response is filtered through the parse_list() method from Bio::ToolBox::utility, so a mix of single numbers or a range of numbers may be accepted. The responses are then validated.

Two types of databases are supported: Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases. For SeqFeature Store databases, the type is comprised of "method:source", corresponding to the third and second GFF columns. For BigWigSet databases, types correspond to either the type, method, primary_tag, or display_name attributes, in that order of availability.

For feature types or files that pass validation, they are returned as a list. Types of files that do not pass validation are printed to STDOUT.

To use this subroutine, pass an array with the following keys and values. Not every key is required.

db

The name of the database or a reference to an established BioPerl database object. A Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database are typically used. Only required when no features are provided.

feature

Pass either a single dataset name as a scalar or an anonymous array reference of a list of dataset names. These may have been provided as a command line option and need to be verified. If nothing is passed, then a list of possible datasets will be presented to the user to be chosen.

prompt

Provide a phrase to be prompted to the user to help in selecting datasets from the list. If none is provided, a generic prompt will be used.

single

A Boolean value (1 or 0) indicating whether only a single dataset is allowed when selecting datasets from a presented list. If true, only one dataset choice is accepted. If false, one or more dataset choices are allowed. Default is false.

limit

Optionally provide a word or regular expression that matches the feature type (primary_tag only; source_tag, if present, is ignored). For example, provide "gene|mrna" to only present gene and mRNA features to the user. This is only applicable when a user must select from a database list. The default is to list all available feature types.

The subroutine will return a list of the accepted datasets. It will print bad dataset names to standard out.

Example:

        # verify a dataset, either file or database type
        my $db_name = 'cerevisiae';
        my $dataset = 'microaray_data';
        my @features = verify_or_request_feature_types(
                db      => $db_name,
                feature => $dataset,
        );
        unless (@features) {
                die " no valid feature provided!\n";
        }
        
        # select a list of features
        my $db_name = 'cerevisiae';
        my @types = (); # user not provided
        @types = verify_or_request_feature_types(
                db      => $db_name,
                feature => $types,
        );
        # user will be promoted to select from database list
        if (@types) {
                print "using types " . join(", ", @types);
        }
        else {
                die " no valid features selected!\n";
        }
        
check_dataset_for_rpm_support($dataset, [$cpu])

This subroutine will check a dataset for RPM, or Reads Per Million mapped, support. Only two types of database files support this, Bam files and BigBed files. If the dataset is either one of these, or the name of a database feature which points to one of these files, then it will calculate the total number of mapped alignments (Bam file) or features (BigBed file). It will return this total number. If the dataset does not support RPM (because it is not a Bam or BigBed file, for example), then it will return undefined.

Pass this subroutine one or two values. The first is the name of the dataset. Ideally it should be validated using verify_or_request_feature_types() and have an appropriate prefix (file, http, or ftp).

For multi-threaded execution pass a second value, the number of CPU cores available to count Bam files. This will speed up counting bam files considerably. The default is 2 for environments where Parallel::ForkManager is installed, or 1 where it is not.

get_new_feature_list(%options)

This subroutine will generate a new feature list collected from the database. Once the list of genomic features is generated, then data may be collected for each item in the list.

The subroutine will generate and return a data hash as described in Bio::ToolBox::file_helper. The data table will have two or three columns. The feature name and type:source are listed in columns one and two, respectively. If the features have an Alias tag, then a third column is included with a comma delimited list of the feature aliases.

The subroutine is passed an array containing the arguments. The keys include

db

The name of the database or a reference to an established database object.

features

A scalar value containing a name representing the type(s) of feature(s) to collect. This name will be parsed into an actual list with the internal subroutine _features_to_classes(). Refer to that documentation for a list of appropriate features.

The subroutine will return a reference to the data hash. It will print status messages to STDOUT.

Example

        my $db_name = 'cerevisiae';
        my %data = get_new_feature_list(
                'db'        => $db_name,
                'features'  => 'genes',
        );
get_new_genome_list(%options)

This subroutine will generate a new list of genomic windows. The genome is split into intervals of a specified size that is moved along the genome in specified step sizes.

The subroutine will generate and return a data hash as described in Bio::ToolBox::file_helper. The data table will have 3 columns, including Chromosome, Start, and Stop.

The subroutine is passed an array containing the arguments. The keys include

db

The name of the database or a reference to an established database object. Required.

win

A scalar value containing an integer representing the size of the window in basepairs. The default value is defined in biotoolbox.cfg file.

step

A scalar value containing an integer representing the step size for advancing the window across the genome. The default is the window size.

The subroutine will return a reference to the data hash. It will print status messages to STDOUT.

Example

        my $db_name = 'cerevisiae';
        my $window_size = 500;
        my $step_size = 250;
        my %data = get_new_genome_list(
                'db'        => $db_name,
                'win'       => $window_size,
                'step'      => $step_size,
        );
validate_included_feature($seqfeature)

This subroutine will validate a database feature to make sure it is useable. It will check feature attributes and compare them against a list of attributes and values to be avoided. The list of unwanted attributes and values is stored in the BioToolBox configuration file biotoolbox.cfg.

Pass the subroutine a Bio::DB::SeqFeature::Store database feature. It will return true (1) if the feature passes validation and false (undefined) if it contains an excluded attribute and value.

get_db_feature(%options)

This subroutine will retrieve a specific feature from a Bio::DB::SeqFeature::Store database for subsequent analysis, manipulation, and/or score retrieval using the get_chromo_region_score() or get_region_dataset_hash() methods. It relies upon unique information to pull out a single, unique feature.

Several attributes may be used to pull out the feature, including the feature's unique database primary ID, name and/or aliases, and GFF type (primary_tag and/or source). The get_new_feature_list() subroutine will generate a list of features with their unique identifiers.

The primary_id attribute is preferentially used as it provides the best performance. However, it is not portable between databases or even re-loading. In that case, the display_name and type are used to identify potential features. Note that the display_name may not be unique in the database. In this case, the addition of aliases may help. If all else fails, a new feature list should be generated.

To get a feature, pass an array of arguments. The keys include

db

The name of the Bio::DB::SeqFeature::Store database or a reference to an established database object.

id

Provide the primary_id tag. In the Bio::DB::SeqFeature::Store database schema this is a (usually) non-portable, unique identifier specific to a database. It provides the fastest lookup.

name

A scalar value representing the feature display_name. Aliases may be appended with semicolon delimiters.

type

Provide the feature type, which is typically expressed as primary_tag:source. Alternatively, provide just the primary_tag only.

While it is possible to identify features with any two attributes (or possibly just name or ID), the best performance is obtained with all three together. The first SeqFeature object is returned if found.

get_segment_score(@parameters)

Generic method for collecting score(s) from a database. This is the primary interface to all of the database handling packages, including Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases. By passing common parameters here, the appropriate database or file handling packages are loaded during run time and the appropriate methods are called. Essentially, this is where the magic happens.

Pass the method an array of parameters. The parameters are not keyed as a hash, and the order is explicit. The order is as follows.

0 Chromosome or Seq-ID
1 Start coordinate (integer, 1-based)
2 Stop coordinate (integer, 1-based)
3 Strand [-1,0,1]
4 Strandedness [sense, antisense, all]
5 Method (string)

Current acceptable methods include mean, median, sum, min, max, stddev, count, ncount, pcount.

6 Return type [0, 1, 2]

A return type of 0 indicates a single score should be returned, calculated using the specified method. A return type of 1 is an array or array reference of all scores found. A return type of 2 is a hash or hash reference of coordinate => score.

7 Database (string or object)

Either a name of database (which will be opened) or an already opened database object. The value can be undefined if no database is required.

8 Dataset (string)

This may be either a database type (primary_tag or primary_tag:source) or the path to a data file, e.g. Bam, bigWig, bigBed, or USeq file. If multiple datasets are to be combined, they may be appended to the array

The returned item is dependent on the value of the return type code.

calculate_score($method, $values)

This subroutine will perform the math to calculate a single score from an array of values. Current acceptable methods include mean, median, sum, min, max, stddev, count, ncount, and pcount.

Pass the subroutine two items: the name of the mathematical method enumerated above, and an array reference of the values to work on. A single score will be returned.

get_chromosome_list($db)

This subroutine will collect a list of chromosomes or reference sequences in a Bio::DB database and return the list along with their sizes in bp. Many BioPerl-based databases are supported, including Bio::DB::SeqFeature::Store, Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::HTS, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed, or any others that support the "seq_ids" method. See the open_db_connection() subroutine for more information.

Pass the subroutine either the name of the database, the path to the database file, or an opened database object.

Optionally pass a second value, a boolean argument to limit and exclude unwanted chromosomes as defined by the "chromosome_exclude" option in the BioToolBox configuration file, biotoolbox.cfg. A true value limits chromosomes, and false includes all chromosomes. The default is to return all chromosomes. Sometimes some sequences are simply not wanted in analysis, like the mitochondrial chromosome or unmapped contigs.

The subroutine will return an array, with each element representing each chromosome or reference sequence in the database. Each element is an anonymous array of two elements, the chromosome name and length in bp.

Example

        my $db = open_db_connection('cerevisiae');
        # get all chromosomes in the database
        my @chromosomes = get_chromosome_list($db);
        foreach (@chromosomes) {
                my $name = $_->[0];
                my $length = $_->[1];
                print "chromosome $name is $length bp\n";
        }
low_level_bam_coverage($sam, $tid, $start, $stop)

This is a convenience method for running the low level bam coverage method. Since both Bio::DB::Sam and Bio::DB::HTS bam file adapters are supported, and each has slight variation in the API syntax, this method helps to abstract the actual method and use the appropriate syntax depending on which adapter is loaded. It is best if the $sam object was opened using the open_db_connection() method, or that $BAM_ADAPTER is set.

NOTE that this is the LOW level coverage method based on the index object, and not the similarly named high level API method. Read the adapter documentation for proper usage.

low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data)

This is a convenience method for running the low level bam fetch method. Since both Bio::DB::Sam and Bio::DB::HTS bam file adapters are supported, and each has slight variation in the API syntax, this method helps to abstract the actual method and use the appropriate syntax depending on which adapter is loaded. It is best if the $sam object was opened using the open_db_connection() method, or that $BAM_ADAPTER is set.

NOTE that this is the LOW level fetch method based on the index object, and not the similarly named high level API method. Read the adapter documentation for proper usage.

get_genomic_sequence($db, $chrom, $start, $stop)

This is a wrapper function for fetching genomic sequence from three different possible fasta database adapters (which all have different APIs), including the Bio::DB::Fasta, Bio::DB::HTS::Faidx, and Bio::DB::Sam::Fai adapters, as well as Bio::DB::SeqFeature::Store and possibly other BioPerl databases. Pass the opened database object, chromosome, start, and stop coordinates. This assumes BioPerl standard 1-base coordinates. Only the forward sequence is retrieved. The sequence is returned as a simple string.

INTERNAL SUBROUTINES

These are not intended for normal consumption but are documented here so that we know what is going on.

_features_to_classes($string)

This internal subroutine provides a conveniant look up and conversion of a single-word description of a category of features into a list of actual feature classes in the database. For example, the word 'gene' may include all ORFs, snRNAs, snoRNAs, and ncRNAs.

Pass the subroutine the feature category name as a scalar value. The actual list of feature types will be collected and returned as an array. Multiple values may be passed as a comma-delimited string (no spaces).

The aliases and feature lists are specified in the Bio::ToolBox::db_helper configuration file, biotoolbox.cfg. Additional lists and aliases may be placed there. The lists are database specific, or they can be added to the default database.

If the passed category name does not match an alias in the config file, then it is assumed to be a feature in the database. No further validation will be done (if it's not valid, simply no features would be returned from the database).

Also, feature types may be passed as the GFF's method:source, in which case they are assumed to be valid and not checked.

_lookup_db_method(\@parameters)

Internal method for determining which database adapter and method call should be used given a set of parameters. This result is cached for future data queries (likely hundreds or thousands of repeated calls). The method is stored in the global hash %DB_METHODS;

Pass the parameter array reference from get_segment_score(). This should load the appropriate db_helper sub module during run time.

_load_helper_module($string)

Subroutine to load and import a db_helper module during run time. Sets the appropriate global variable if successful.

_load_bam_helper_module()

Subroutine to determine which Bam adapter db_helper module to load: the older samtools-based adapter or the newer HTSlib-based adapter. Uses the global and exportable variable $BAM_ADAPTER to set a preference for which adapter to use. Use 'sam' or 'hts' or some string containing these names. Do NOT change this variable after loading the helper module; it will not do what you think it will do. If both are available and a preference is not set then the hts helper is the default.

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.