NAME

Bio::ToolBox - Tools for querying and analysis of genomic data

DESCRIPTION

The Bio::ToolBox libraries provide a useful interface for working with bioinformatic data. Much bioinformatic data analysis revolves around tables of information, including lists of genomic annotation (genes, promoters, etc.) or defined regions of interest (epigenetic enrichment, transcription factor binding sites, etc.). This library works with these tables and provides a set of common tools for working with them.

Opening and saving common tab-delimited text formats, with support for BED, GFF, VCF, and narrowPeak files
Scoring intervals and annotation with datasets from microarray or sequencing experiments (ChIPSeq, RNASeq, microarray expression), with support for Bam, bigWig, bigBed, wig, and USeq data formats
Works with any genomic annotation in GTF, GFF3, and UCSC formats

The libraries provide a unified and integrated approach to analyses. In many cases, they provide an abstraction layer over a variety of different specialized BioPerl and related modules. Instead of writing numerous scripts specialized for each data format (wig, bigWig, Bam), one script can now work with any data format.

See online documentation at https://tjparnell.github.io/biotoolbox/ for more information.

LIBRARIES

The libraries and modules are available to extend existing scripts or to write your own.

Bio::ToolBox::Data

This is the primary library module for working with a table of data, either generated as a new list from a database of annotation, or opened from a tab-delimited text file, for example a BED file of regions. Columns and rows of data may be added, deleted, or manipulated with ease.

Additionally, genomic data may be collected from a wide variety of sources using the information in the data table. For example, microarray or sequencing data can be scored for each interval listed in the data table.

This module uses an object-oriented interface. Many of the methods and much of the API will be familiar to users of BioPerl.

Bio::ToolBox::Data::Feature

This is the object class for working with individual rows in a table of data. It provides a number of conventions for working with rows in a standard fashion, for example returning the start value regardless of which column holds it or whether the table is BED, GFF, or an arbitrary text file. A number of convenience methods are present for collecting data from data files. This module is not used directly by the user; its objects are returned when using Bio::ToolBox::Data iterators.

Bio::ToolBox::Parser

This is the working base class for parsing annotation files, including BED and related formats, GFF, GTF, GFF3, and UCSC-derived refFlat, genePred, and genePredExt tables. This is designed to slurp an entire genome-worth of annotation into memory within a reasonably short amount of time. Sub-classes include the following.

Bio::ToolBox::Parser::bed: This parses simple BED formats (3-6 columns), gene-based BED files (12 columns), ENCODE-style peak formats (narrowPeak, broadPeak, and gappedPeak), and other BED-related derivatives. Gene-based BED12 files are parsed into hierarchical parent and child subfeatures.
Bio::ToolBox::Parser::gff: This parses both GTF and GFF3 file formats. Unlike many other GFF parsers that work line-by-line only, this maintains parent and child hierarchical relationships as parent feature and child subfeatures. To further maintain control and reduce unnecessary parsing, unwanted feature types can be selectively skipped.
Bio::ToolBox::Parser::ucsc: This parses various UCSC file formats, including different refFlat, genePred, and knownGene flavors. Genes, transcripts, and exons are assembled into hierarchical parent-child relationships as desired.

Bio::ToolBox::SeqFeature

This is a fast, lean, simple object class for representing genomic features. It supports, for the most part, the Bio::SeqFeatureI and Bio::RangeI interfaces without the dependencies. It uses an unorthodox blessed-array object structure, which provides measurable improvements in memory consumption and speed when loading thousands of annotated SeqFeature objects (think hg19 or hg38 annotation).
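The blessed-array idea mentioned above can be illustrated with a short, self-contained sketch. This is not the Bio::ToolBox::SeqFeature implementation: the class name My::LeanFeature, the slot layout, and the accessor set are all hypothetical, chosen only to show how an object can be stored in an array reference instead of the more common hash reference.

```perl
use strict;
use warnings;

package My::LeanFeature;    # hypothetical class, not part of Bio::ToolBox

# Fixed slot positions in the underlying array (assumed layout)
use constant {
    SEQID  => 0,
    START  => 1,
    STOP   => 2,
    STRAND => 3,
    NAME   => 4,
};

# Constructor: store attributes in array slots and bless the array reference
sub new {
    my ( $class, %args ) = @_;
    my @self;
    $self[SEQID]  = $args{seq_id};
    $self[START]  = $args{start};
    $self[STOP]   = $args{stop};
    $self[STRAND] = $args{strand} || 0;
    $self[NAME]   = $args{name};
    return bless \@self, $class;
}

# Read-only accessors index directly into the array
sub seq_id { return $_[0]->[SEQID] }
sub start  { return $_[0]->[START] }
sub stop   { return $_[0]->[STOP] }
sub strand { return $_[0]->[STRAND] }
sub name   { return $_[0]->[NAME] }

# Length of a 1-based, inclusive interval
sub length {
    my $self = shift;
    return $self->[STOP] - $self->[START] + 1;
}

package main;

my $feature = My::LeanFeature->new(
    seq_id => 'chr1',
    start  => 1001,
    stop   => 2000,
    strand => 1,
    name   => 'geneA',
);
printf "%s %s:%d-%d (%d bp)\n",
    $feature->name, $feature->seq_id, $feature->start, $feature->stop,
    $feature->length;
```

An array slot avoids storing a hash key per attribute per object, which is where the savings come from when many thousands of features are held in memory at once.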

Bio::ToolBox::GeneTools

This is a collection of exportable functions for working with Bio::SeqFeatureI compliant objects representing genes and transcripts. It works with objects derived from one of the annotation parsers (the Bio::ToolBox::Parser subclasses) or a Bio::DB::SeqFeature::Store database. The functions make hard things easy, such as identifying whether a transcript is coding or not (is it encoded in the primary_tag or source_tag or GFF attribute, or does it have CDS subfeatures?), identifying the alternative exons or introns of a multi-transcript gene, or pulling out the 5' UTR (which is likely not explicitly defined in the table).

SCRIPTS

The BioToolBox package comes complete with a suite of production-ready scripts for a variety of analyses. Look in the scripts folder for details. A sampling of what can be done includes the following:

Annotated feature collection and selection
Data collection and scoring for features
Data file format manipulation and conversion
Low-level processing of sequencing data into customizable wig representation

Scripts have built-in documentation. Execute the script without any options to print a synopsis of available options, or add --help to print the full documentation.

Data conversion

Convert from generic tables to specific bioinformatic file types.

bam2wig.pl: Generate read or fragment coverage or point data representations of alignments.
data2bed.pl: Convert a table containing coordinates into a properly formatted BED file.
data2wig.pl: Convert a table of coordinates and values into a properly formatted WIG file, including bigWig.
data2fasta.pl: Convert a data table of coordinates and/or sequences into a multi-FASTA file.
data2gff.pl: Convert a table of coordinates into a properly formatted GFF file.

Feature annotation

Work with large genomic annotation feature files.

get_features.pl: Collect, filter, and/or convert features from a genomic feature annotation file into another (simpler) file for use.
get_gene_regions.pl: Collect specific gene regions that may not be explicitly annotated but inferred from an annotation file, including introns, UTRs, alternate or common exons, etc.
get_feature_info.pl: Collect additional information from a genomic feature annotation file for a list of features, such as items embedded as key=value attributes in a GFF file.

Data collection

Collect data, usually some sort of scores, from genomic data, including bigWig and Bam data files among others, for a list of genomic intervals or annotation features.

get_datasets.pl: General purpose single data collection of scores in a variety of methods.
get_binned_data.pl: Collect data in a subset of bins across genomic intervals or features in a variety of methods.
get_relative_data.pl: Collect data in bins flanking a specific reference point, such as the 5-prime end or middle point of a genomic feature.
correlate_position_data.pl: Calculates a correlation between two datasets along the length of a genomic feature to determine a shift of position for two signal tracks.

Data manipulation

Work with data columns and/or rows in data tables.

manipulate_datasets.pl: An interactive, menu-driven application for quickly and easily performing all sorts of common functions on columns, rows, and values.
manipulate_wig.pl: Performs various numeric transformations on scores of text WIG, bedGraph, and bigWig files.

File manipulation

Work on columns or rows of one or more data tables.

merge_datasets.pl: Join columns from two or more data files into one file, with or without using a lookup value.
split_data_file.pl: Split a data file by rows into multiple files.
join_data_file.pl: Join two or more data files by rows into one file.
pull_features.pl: Take a list of identifiers and pull the corresponding rows from a source file into a separate table of wanted features.

USAGE

This module provides a handful of commonly used convenience methods as entry points to working with data files. Most of them use or return a Bio::ToolBox::Data object.

Methods

load_file

Open a tab-delimited text file as a Bio::ToolBox::Data object. Simply pass the file path as a single argument. It assumes the first row is the column headers, and comment lines begin with #. Compressed files are transparently handled. See the Bio::ToolBox::Data new method for more details or options.

$Data = Bio::ToolBox->load_file('myfile.txt');

For advanced options, pass key => value pairs as arguments as defined for Bio::ToolBox::Data new().
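As a hedged sketch of typical use, the following writes a small tab-delimited table to a temporary file, loads it with load_file, and walks its rows. The table contents and column names are invented for illustration. The iterate, find_column, last_row, and value methods are assumed to behave as described in the Bio::ToolBox::Data and Bio::ToolBox::Data::Feature documentation; check those pages for the installed version.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Bio::ToolBox;

# Write a tiny example table (hypothetical data) to a temporary file
my ( $out, $path ) = tempfile( SUFFIX => '.txt', UNLINK => 1 );
print {$out} join( "\t", qw(Name ID Score) ), "\n";
print {$out} join( "\t", 'geneA', 'g001', 12.5 ), "\n";
print {$out} join( "\t", 'geneB', 'g002', 3.0 ), "\n";
close $out;

# The first row is taken as the column headers
my $Data = Bio::ToolBox->load_file($path)
    or die "unable to load $path\n";

# Look up the column by name rather than hard-coding an index
my $score_i = $Data->find_column('Score');

# Visit every row; each row arrives as a Bio::ToolBox::Data::Feature object
my $total = 0;
$Data->iterate(
    sub {
        my $row = shift;
        $total += $row->value($score_i);
    }
);
printf "%d rows, total score %.1f\n", $Data->last_row, $total;
```

Looking up the column by name keeps the example independent of the column numbering convention used by a particular release.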

parse_file

Parse an annotation file, such as BED, GTF, GFF3, UCSC genePred or refFlat file, into a Bio::ToolBox::Data table with two columns: PrimaryID (geneID, transcriptID, or coordinate string) and Name. Each row in the resulting table is linked to a parsed, top-level SeqFeature object. See the Bio::ToolBox::Data new method for more details or options. Default options include parsing subfeatures (exon, cds, and utr) and simple GFF attributes.

$Data = Bio::ToolBox->parse_file('genes.gtf.gz');

new_data

Generate a new, empty Bio::ToolBox::Data table with the given column names. Pass an array of names of the columns for the new table.

$Data = Bio::ToolBox->new_data( qw(Name ID Score) );

Alternatively, you can pass an array of key => value arguments, which are passed on to the new() function for explicit control.

new_bed

Generate a new, empty Bio::ToolBox::Data table in BED format. Pass the number of columns desired (integer in range 3..12 inclusive). Default is 6 (standard BED format).

$Data = Bio::ToolBox->new_bed(4);
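A minimal sketch of filling such a table follows. The interval values are invented, and add_row (taking an array reference of values), number_columns, and last_row are assumed to work as documented in Bio::ToolBox::Data.

```perl
use strict;
use warnings;
use Bio::ToolBox;

# A four-column BED table: chromosome, start, end, name
my $Data = Bio::ToolBox->new_bed(4);

# Example intervals (hypothetical); BED start coordinates are 0-based
my @regions = (
    [ 'chr1', 999,  2000, 'peak1' ],
    [ 'chr2', 4999, 5500, 'peak2' ],
);
$Data->add_row($_) foreach @regions;

printf "%d columns, %d rows\n", $Data->number_columns, $Data->last_row;
```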

read_file

Open a generic file handle for reading. It transparently handles compression as necessary. Returns an IO::File object. Pass the file path as an argument.

$fh = Bio::ToolBox->read_file('mydata.txt.gz');

write_file

Open a generic file handle for writing. It transparently handles compression as necessary based on filename extension or passed options. It will use pigz, an external multi-threaded compression utility, if available. See the open_to_write_fh method in Bio::ToolBox::Data::file for more information.

$fh = Bio::ToolBox->write_file('mynewdata.txt.gz');
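The two file handle methods, read_file and write_file, can be combined in a simple round trip. This is a sketch under stated assumptions: the file name and line contents are hypothetical, and the returned handles are treated as IO::File-compatible objects, so only the standard print, getline, and close methods are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Bio::ToolBox;

my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/example.txt.gz";    # hypothetical file name

# Write two lines; the .gz extension requests compression
my $out = Bio::ToolBox->write_file($path)
    or die "unable to open $path for writing\n";
$out->print("chr1\t100\t200\n");
$out->print("chr2\t300\t400\n");
$out->close;

# Read the lines back; decompression is handled by read_file
my $in = Bio::ToolBox->read_file($path)
    or die "unable to open $path for reading\n";
my @lines;
while ( my $line = $in->getline ) {
    chomp $line;
    push @lines, $line;
}
$in->close;
print scalar(@lines), " lines read\n";
```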

open_database

Open a binary database file, including Bam, bigWig, bigBed, Fasta, Bio::DB::SeqFeature::Store SQLite file or named MySQL connection, USeq file, or any other supported binary or indexed file format. The database type is determined automatically by looking for common file extensions, if present. See the open_db_connection function in Bio::ToolBox::db_helper for more information.

$db = Bio::ToolBox->open_database($database);

REPOSITORY

Source code for the Bio::ToolBox package is maintained at https://github.com/tjparnell/biotoolbox/.

Bugs and issues should be submitted at https://github.com/tjparnell/biotoolbox/issues.

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

LICENSE

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.

