NAME

data2gff.pl

A script to convert a generic data file to GFF format.

SYNOPSIS

data2gff.pl [--options...] <filename>

  Options:
  --in <filename>
  --ask
  --chr <column_index>
  --start <column_index>
  --stop | --end <column_index>
  --score <column_index>
  --strand <column_index>
  --name <text | column_index>
  --id <column_index>
  --tags <column_index,column_index,...>
  --source <text | column_index>
  --type <text | column_index>
  --zero
  --format [0,1,2,3]
  --midpoint
  --unique
  --out <filename> 
  --version [2,3]
  --gz
  --version
  --help

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip-compressed.

--ask

Indicate that the program should interactively ask for column indices or text strings for the GFF attributes, including coordinates, source, type, etc. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.

--chr <column_index>

The index of the dataset in the data table to be used as the chromosome or sequence ID column in the gff data.

--start <column_index>

The index of the dataset in the data table to be used as the start position column in the gff data.

--stop | --end <column_index>

The index of the dataset in the data table to be used as the stop or end position column in the gff data.

--score <column_index>

The index of the dataset in the data table to be used as the score column in the gff data.

--name <text | column_index>

Enter either a text string to be used as the shared name for all features, or the index of the dataset in the data table to be used as the name of each GFF feature. This information will be used in the 'group' column.

--id <column_index>

The index of the dataset in the data table to be used as the unique ID of each gff feature. This information will be used in the 'group' column of GFF v.3 files only. The default is to automatically generate a unique identifier.

--strand <column_index>

The index of the dataset in the data table to be used for strand information. Accepted values include any of the following: f(orward), r(everse), w(atson), c(rick), +, -, 1, -1, 0, or a period (.).
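As an illustration of how such mixed strand notations could be reduced to the three GFF strand symbols, here is a minimal Python sketch. It is not the script's own code: the lookup table, the function name normalize_strand, and the choice to map unrecognized values to '.' are all assumptions made for this example.

```python
# Hypothetical lookup table; the script's real mapping may differ.
_STRAND_MAP = {
    "f": "+", "forward": "+", "w": "+", "watson": "+", "+": "+", "1": "+",
    "r": "-", "reverse": "-", "c": "-", "crick": "-", "-": "-", "-1": "-",
    "0": ".", ".": ".",
}


def normalize_strand(value):
    """Return '+', '-', or '.' for a raw strand value.

    Matching is case-insensitive; unknown values fall back to '.'
    (an assumption of this sketch).
    """
    key = str(value).strip().lower()
    return _STRAND_MAP.get(key, ".")
```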

--tags <column_indices>

Provide a comma delimited list of column indices that contain values to be included as group tags in the GFF features. The key will be the column name.
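To make the key/value pairing concrete, the following Python sketch builds a tag string from selected columns, using each column's header as the key. The function name build_tags is hypothetical, the 'key=value' pairs joined by semicolons follow GFF version 3 attribute syntax, and no escaping of reserved characters is attempted.

```python
def build_tags(header, row, tag_indices):
    """Build 'key=value' pairs for the chosen columns, joined by ';'.

    header      -- list of column names (first row of the input file)
    row         -- list of values for one feature
    tag_indices -- column indices to include, e.g. parsed from '2,3'
    """
    return ";".join(f"{header[i]}={row[i]}" for i in tag_indices)
```

For example, with header columns Chr, Start, GC, Length and tag indices 2 and 3, a row holding chr1, 100, 0.41, 250 would yield GC=0.41;Length=250.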

--source <text | column_index>

Enter either a text string or a column index representing the GFF source that should be used for the features. The default is 'data'.

--type <text | column_index>

Enter either a text string or a column index representing the GFF 'type' or 'method' that should be used for the features. If not defined, it will use the column name for either the 'score' or 'name' column, if defined. As a last resort, it will use the most creative method of 'Experiment'.
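The fallback chain described above can be sketched in a few lines of Python. The function name resolve_type is hypothetical, and trying the score column name before the name column name is an assumption; the text does not state which of the two takes precedence.

```python
def resolve_type(explicit_type=None, score_column=None, name_column=None):
    """Pick the GFF type: explicit value, then a column name, then 'Experiment'.

    The order score_column -> name_column is an assumption of this sketch.
    """
    for candidate in (explicit_type, score_column, name_column):
        if candidate:
            return candidate
    return "Experiment"
```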

--zero

Indicate whether the source data is in interbase or 0-based coordinates, as is used with UCSC source data or USeq data packages. The coordinates will then be converted to 1-based coordinates, consistent with BioPerl conventions. The default is false (will not convert).
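The usual arithmetic behind this kind of conversion is small enough to show directly. The Python sketch below assumes the input interval is 0-based and half-open (the interbase convention used by UCSC BED files), so only the start needs shifting; the function name to_one_based is hypothetical and the script may handle edge cases differently.

```python
def to_one_based(start, stop):
    """Convert a 0-based, half-open interval to 1-based, inclusive.

    In the half-open convention the stop already equals the 1-based
    inclusive end, so only the start is incremented.
    """
    return start + 1, stop
```

Under this assumption, the first 100 bases of a chromosome, written as 0 to 100 in interbase form, become 1 to 100.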

--format [0,1,2,3]

Indicate the number of decimal places to which the score value should be formatted. Acceptable values are 0, 1, 2, or 3 places. Anything else is ignored.
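A rough Python sketch of this behavior follows. The function name format_score is hypothetical, and passing non-numeric scores (such as a '.' placeholder) through unchanged is an assumption of the example, not documented behavior.

```python
def format_score(score, places=None):
    """Format a score to 0-3 decimal places; any other 'places' is ignored."""
    if places not in (0, 1, 2, 3):
        return str(score)
    try:
        return f"{float(score):.{places}f}"
    except (TypeError, ValueError):
        # Non-numeric score: leave it untouched (assumption of this sketch).
        return str(score)
```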

--midpoint

A boolean (1 or 0) value to indicate whether the midpoint between the actual 'start' and 'stop' values should be used instead of the actual values. Default is false.
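As a small illustration, the midpoint substitution might look like the Python sketch below. The function name use_midpoint is hypothetical, and rounding down to an integer position is an assumption; the script may round differently.

```python
def use_midpoint(start, stop):
    """Replace both coordinates with the integer midpoint of the interval.

    Floor division is used here, which rounds down (an assumption).
    """
    mid = (start + stop) // 2
    return mid, mid
```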

--unique

Indicate whether the feature names should be made unique. A count number is appended to the name of subsequent features to make them unique. This should only be applied to genomic features, and not to genomic data values (microarray data, sequencing data, etc). The default behavior is false (not unique).
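One way to picture the renaming is the Python sketch below, which leaves the first occurrence of a name alone and appends a running count to each later occurrence. The function name make_unique and the '.' separator are assumptions; the actual suffix format used by the script is not stated here.

```python
from collections import defaultdict


def make_unique(names, sep="."):
    """Append a count to repeated names so every name is distinct.

    The first occurrence is unchanged; later ones get sep + count.
    The separator is a hypothetical choice for this sketch.
    """
    seen = defaultdict(int)
    result = []
    for name in names:
        count = seen[name]
        result.append(name if count == 0 else f"{name}{sep}{count}")
        seen[name] += 1
    return result
```

With these assumptions, the names a, b, a, a would become a, b, a.1, a.2.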

--out <filename>

Optionally specify the name of the output file. The default is to use the assigned type value. The '.gff' extension is automatically added if required.

--version [2,3]

Specify the GFF version. The default is version 3.

--gz

Indicate whether the output file should be compressed with gzip.

--version

Print the version number.

--help

Display the POD documentation.

DESCRIPTION

This program will convert a data file into a GFF formatted text file. Only simple conversions are performed, where each data line is converted to a single feature. Complex features with parent-child relationships (such as genes) should be converted with something more advanced.

The input file should have chromosomal coordinates, i.e. chromosome, start, and (optionally) stop or end coordinates. They may be specified upon execution or identified automatically. If they are not found, the GFF conversion will fail.
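To illustrate the one-row-to-one-feature conversion, here is a minimal Python sketch that turns a single tab-delimited data row into a nine-column GFF line. It is a simplification under stated assumptions, not the script's implementation: the function name row_to_gff3 and its parameters are hypothetical, a missing stop column is assumed to mean a single-base feature, the phase column is always '.', and no ID generation or attribute escaping is done.

```python
def row_to_gff3(row, chr_i, start_i, stop_i=None, score_i=None,
                strand_i=None, name_i=None,
                source="data", gff_type="Experiment"):
    """Build one GFF line from a list of column values.

    Arguments ending in _i are 0-based column indices into 'row'.
    Optional columns that are not supplied are written as '.'.
    """
    start = int(row[start_i])
    # Assumption: without a stop column the feature covers one position.
    stop = int(row[stop_i]) if stop_i is not None else start
    score = row[score_i] if score_i is not None else "."
    strand = row[strand_i] if strand_i is not None else "."
    attributes = f"Name={row[name_i]}" if name_i is not None else "."
    fields = [row[chr_i], source, gff_type, start, stop,
              score, strand, ".", attributes]
    return "\t".join(str(field) for field in fields)
```

Called on the row chr1, 100, 200, gene1, 3.5 with the chromosome, start, and stop in columns 0 to 2, the name in column 3, and the score in column 4, this produces a line with the fields chr1, data, Experiment, 100, 200, 3.5, ., ., Name=gene1.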

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.