NAME

data2bed.pl

A program to convert a data file to a bed file.

SYNOPSIS

data2bed.pl [--options...] <filename>

  File Options:
  -i --in <filename>                    input file: txt, gff, vcf, etc
  -o --out <filename>                   output file name
  -H --noheader                         input file has no header row
  -0 --zero                             file is in 0-based coordinate system
  
  Column indices:
  --bed [3|4|5|6]                       type of bed to write
  -a --ask                              interactive selection of columns
  -c --chr <index>                      chromosome column
  -b --begin --start <index>            start coordinate column
  -e --end --stop <index>               stop coordinate column
  -n --name <text | index>              name column or base name text
  -s --score <index>                    score column
  -t --strand <index>                   strand column
  
  BigBed options:
  -B --bb --bigbed                      generate a bigBed file
  -d --db <database>                    database to collect chromosome lengths
  --chromf <filename>                   specify a chromosome file
  --bbapp </path/to/bedToBigBed>        specify path to bedToBigBed
  
  General Options:
  --sort                                sort output by genomic coordinates
  -z --gz                               compress output file
  -Z --bgz                              bgzip compress output file
  -v --version                          print version and exit
  -h --help                             show extended documentation

OPTIONS

The command line flags and descriptions:

File Options

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.

--out <filename>

Specify the output filename. By default it uses the basename of the input file.

--noheader

Indicate that the input file has no column header row, as is often the case with UCSC-derived annotation data tables.

--zero

Indicate that the source data is already in interbase (0-based) coordinates and does not need to be converted. By convention, all BioPerl (and, by extension, all biotoolbox) scripts use base (1-based) coordinates. The default behavior is to convert.

Column indices

--bed [3|4|5|6]

Explicitly set the number of bed columns in the output file. Otherwise, it will attempt to write as many columns as available, filling in mock data as needed.

--ask

Indicate that the program should interactively ask for the indices for feature data. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.

--chr <column_index>

The index of the dataset in the data table to be used as the chromosome or sequence ID column in the BED data.

--start <column_index>
--begin <column_index>

The index of the dataset in the data table to be used as the start position column in the BED data.

--stop <column_index>
--end <column_index>

The index of the dataset in the data table to be used as the stop or end position column in the BED data.

--name <column_index | base_text>

Supply either the index of the column in the data table to be used as the name column in the BED data, or the base text to be used when auto-generating unique feature names. The auto-generated names are in the format 'text_00000001'. If the source file is GFF3, it will automatically extract the Name attribute.
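
The auto-generated name format can be illustrated with a minimal Python sketch. It is not the script's own code: the function name make_namer is hypothetical, and starting the counter at 1 is an assumption inferred from the 'text_00000001' example.

```python
from itertools import count


def make_namer(base_text, width=8):
    """Return a function producing base_text_00000001, base_text_00000002, ..."""
    counter = count(1)  # assumption: numbering starts at 1

    def next_name():
        # Zero-pad the running number to `width` digits (8 in the documented format).
        return f"{base_text}_{next(counter):0{width}d}"

    return next_name


namer = make_namer("text")
print(namer())  # text_00000001
print(namer())  # text_00000002
```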

--score <column_index>

The index of the dataset in the data table to be used as the score column in the BED data.

--strand <column_index>

The index of the dataset in the data table to be used for strand information. Accepted values might include any of the following: +, -, 1, -1, 0, .
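
A BED file records strand as +, -, or a period. One plausible way to map the accepted input values onto those symbols is sketched below in Python. The mapping table and the decision to raise an error on unrecognized values are assumptions made for illustration, not documented behavior of the script.

```python
# Assumed mapping from accepted input values to BED strand symbols.
STRAND_MAP = {
    "+": "+",
    "1": "+",
    "-": "-",
    "-1": "-",
    "0": ".",
    ".": ".",
}


def normalize_strand(value):
    """Convert a strand value from the input table to +, -, or '.'."""
    key = str(value).strip()
    if key not in STRAND_MAP:
        # Assumption: reject anything outside the documented list.
        raise ValueError(f"unrecognized strand value: {value!r}")
    return STRAND_MAP[key]


for raw in ["+", "-", 1, -1, 0, "."]:
    print(raw, "->", normalize_strand(raw))
```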

BigBed options

--bigbed
--bb

Indicate that a binary BigBed file should be generated instead of a text BED file. A .bed file is first generated, then converted to a .bb file, and then the .bed file is removed.

--db <database>

Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, such as a Bam or bigWig file, from which chromosome length information may be obtained. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. The database name may also be taken from the input file metadata.

--chromf <filename>

When converting to a BigBed file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide the name of a database with the --db option.
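
As an illustration of the chromosome file layout, the following Python sketch parses a two-column, tab-delimited listing into a dictionary. The function name read_chrom_sizes is hypothetical, and the chromosome names and lengths are made-up values, not a real genome.

```python
import io


def read_chrom_sizes(handle):
    """Parse lines of 'name<TAB>length' into a {name: length} dictionary."""
    sizes = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# Illustrative content only; a real file lists every chromosome in the genome.
example = io.StringIO("chr1\t1000\nchr2\t800\n")
print(read_chrom_sizes(example))
```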

--bbapp </path/to/bedToBigBed>

Specify the path to the UCSC bedToBigBed conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the bed file.
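
The search of the environment path and the conversion call can be sketched in Python as follows. This is only an outline under stated assumptions: the function names are hypothetical, the biotoolbox.cfg lookup is omitted because its format is not described here, and the argument order follows the UCSC utility's usage of input bed file, chromosome sizes file, then output file.

```python
import shutil
import subprocess


def build_bigbed_command(bed_path, chrom_file, bb_path, app="bedToBigBed"):
    """Assemble the argument list: bedToBigBed in.bed chrom.sizes out.bb"""
    return [app, bed_path, chrom_file, bb_path]


def convert_to_bigbed(bed_path, chrom_file, bb_path, app_path=None):
    """Run the converter, using an explicit path or searching PATH."""
    app = app_path or shutil.which("bedToBigBed")
    if app is None:
        raise FileNotFoundError("bedToBigBed was not found on the environment path")
    command = build_bigbed_command(bed_path, chrom_file, bb_path, app)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure


print(build_bigbed_command("data.bed", "chrom.sizes", "data.bb"))
```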

General options

--sort

Sort the output file by genomic coordinates. Automatically enabled when compressing with bgzip or saving to bigBed.
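
Sorting by genomic coordinates can be pictured as ordering records by chromosome, then start, then end. The Python sketch below assumes plain string ordering of chromosome names (so chr10 sorts before chr2); the ordering the script actually applies is not stated here.

```python
def sort_bed(records):
    """Sort (chrom, start, end, ...) tuples by chromosome name, start, then end.

    Assumption: chromosome names are compared as plain strings.
    """
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))


records = [
    ("chr2", 500, 600),
    ("chr1", 900, 950),
    ("chr1", 100, 200),
]
for record in sort_bed(records):
    print(*record, sep="\t")
```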

--gz

Specify whether the output file should be compressed with gzip.

--bgz

Specify whether the output file should be compressed with block gzip (bgzip) for tabix compatibility.

--version

Print the version number.

--help

Display this POD documentation.

DESCRIPTION

This program will convert a tab-delimited data file into a BED file, according to the UCSC specification at http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED. A minimum of three and a maximum of six columns may be generated. Thin and thick block data (columns greater than 6) are not written.

Column identification may be specified on the command line, chosen interactively, or automatically determined from the column headers. GFF source files should have columns automatically identified.

All lower-numbered columns must be defined before writing higher-numbered columns, as per the specification. Dummy data may be filled in for Name and/or Score if a higher column is requested.
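
The column rule can be sketched in Python: to write column N, columns 1 through N-1 must all be present, so missing Name or Score values need a placeholder. The function name and the placeholder values used here ("feature", 0, and a period for strand) are assumptions for illustration only; the script's actual mock values are not documented here.

```python
def bed_fields(chrom, start, end, name=None, score=None, strand=None, columns=3):
    """Build one BED line with 3 to 6 columns, filling gaps with placeholders."""
    if columns not in (3, 4, 5, 6):
        raise ValueError("columns must be 3, 4, 5, or 6")
    fields = [chrom, str(start), str(end)]
    if columns >= 4:
        # Hypothetical placeholder name when none is supplied.
        fields.append(name if name is not None else "feature")
    if columns >= 5:
        # Hypothetical placeholder score when none is supplied.
        fields.append(str(score) if score is not None else "0")
    if columns >= 6:
        fields.append(strand if strand is not None else ".")
    return "\t".join(fields)


# Strand is known but name and score are not, so columns 4 and 5 are filled in.
print(bed_fields("chr1", 99, 200, strand="+", columns=6))
```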

Browser and Track lines are not written.

Following the specification, all coordinates are written in interbase (0-based) coordinates. Base (1-based) coordinates (the BioPerl standard) will be converted.
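
The coordinate conversion amounts to subtracting one from the start and leaving the end unchanged: a 1-based inclusive interval [start, end] covers the same bases as the 0-based half-open interval [start - 1, end). A minimal Python sketch, with a hypothetical function name and a flag standing in for input that is already 0-based:

```python
def to_bed_interval(start, end, zero_based=False):
    """Return (start, end) in 0-based, half-open BED coordinates.

    With zero_based=True the input is assumed to be interbase already
    and is passed through unchanged.
    """
    if zero_based:
        return start, end
    return start - 1, end


# A feature spanning bases 100 through 200 in 1-based coordinates.
print(to_bed_interval(100, 200))
```

The interval length is preserved either way: bases 100 through 200 are 101 bases, and 200 minus 99 is also 101.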

Score values should be integers within the range 0..1000, as defined by the BED specification. Score values are not converted in this script. However, the biotoolbox script manipulate_datasets.pl has tools to do this if required.

An option exists to further convert the BED file to an indexed, binary BigBed format. Jim Kent's bedToBigBed conversion utility must be available, and either a chromosome definition file or access to a Bio::DB database is required.

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.