NAME

get_relative_data.pl

A script to collect data in bins around a relative position.

SYNOPSIS

get_relative_data.pl --in <in_filename> --out <out_filename> [--options]

  Options for existing files:
  --in <filename> 
  
  Options for new files:
  --db <name|file>
  --feature <type | type:source | alias>, ...
  
  Options for data collection:
  --ddb <name|file>
  --data <dataset_name | filename>
  --method [mean|median|min|max|stddev|range|sum|rpm]       (mean)
  --value [score|count|pcount|length]                       (score)
  --strand [all|sense|antisense]                            (all)
  --force_strand
  --avoid
  --long
  --log
  
  Bin specification:
  --win <integer>                                           (50)
  --num <integer>                                           (20)
  --pos [5|m|3]                                             (5)
  
  Post-processing:
  --(no)sum                                                 (true)
  --smooth
  
  General Options:
  --out <filename>
  --gz
  --cpu <integer>                                           (2)
  --version
  --help

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Bed files are acceptable, as are text files generated by other BioToolBox scripts. Files may be gzip compressed.

--out <filename>

Specify the output file name. This is required when generating new files. When an input file is provided and no output file name is given, the input file will be overwritten.

--db <name | filename>

Specify the name of a Bio::DB::SeqFeature::Store annotation database from which gene or feature annotation may be derived. A database is required for generating new data files with features. This option may be skipped when using coordinate information from an input file (e.g. BED file), or when using an existing input file with the database indicated in the metadata. For more information about using annotation databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases.

--feature [type, type:source]

Specify the type of feature to map data around. The feature may be listed either as GFF type or GFF type:source. The list of features will be automatically generated from the database. This is only required when an input file is not specified.

--ddb <name | filename>

If the data to be collected is from a second database that is separate from the annotation database, provide the name of the data database here. Typically, a second Bio::DB::SeqFeature::Store or BigWigSet database is provided here.

--data <dataset_name | filename>

Specify the name of the data set from which you wish to collect data. If not specified, the data set may be chosen interactively from a presented list. Other features may be collected, and should be specified using the type (GFF type:source), especially when collecting alternative data values. Alternatively, the name of a data file may be provided. Supported file types include BigWig (.bw), BigBed (.bb), or single-end Bam (.bam). The file may be local or remote.

--method [mean|median|stddev|min|max|range|sum|rpm]

Specify the method for combining all of the dataset values within the genomic region of the feature. Accepted values include:

  • mean (default)

  • median

  • sum

  • stddev: standard deviation of the population (within the region)

  • min

  • max

  • range: returns the difference of max and min

  • rpm: Reads Per Million mapped, Bam/BigBed only

--value [score|count|pcount|length]

Optionally specify the type of data value to collect from the dataset or data file. Four values are accepted: score, count, pcount, or length. The default value type is score. Note that some data sources only support certain types of data values. The types are detailed below.

  • score

    The default value. Supported by wig, bigWig, USeq, bigBed (if the features include the score column), GFF features, and Bam (returns non-transformed base pair coverage).

  • count

    Counts the number of features that overlap the search region. For long features (> 1 bp), these may include features that overlap or span beyond the search region. Supported by all databases.

  • pcount (precise count)

    Counts only those features that are contained within the search region, not overlapping. Supported by Bam, bigBed, USeq, and GFF features.

  • length

    Returns the length of long features. Supported by Bam, bigBed, USeq, and GFF features.
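
The difference between count and pcount can be illustrated with a short Python sketch. This is not the script's own code; the function names and the 1-based inclusive coordinates are assumptions made for illustration only.

```python
def count_overlapping(features, start, end):
    """Count features that overlap the region at all (the 'count' idea)."""
    return sum(1 for f_start, f_end in features
               if f_start <= end and f_end >= start)


def count_contained(features, start, end):
    """Count features lying entirely inside the region (the 'pcount' idea)."""
    return sum(1 for f_start, f_end in features
               if f_start >= start and f_end <= end)


# Hypothetical long features as (start, end) pairs, 1-based inclusive
features = [(90, 120), (100, 150), (140, 210), (300, 350)]

overlapping = count_overlapping(features, 100, 200)
contained = count_contained(features, 100, 200)
print(overlapping, contained)
```

In this made-up example, three features touch the region 100..200 but only one lies completely inside it, which is why the two value types can differ for long features.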

--strand [sense|antisense|all]

Specify whether stranded data should be collected for each of the datasets. Either sense or antisense (relative to the feature) data may be collected. The default value is 'all', indicating all data will be collected.

--force_strand

For features that are not inherently stranded (strand value of 0), or for which you want to impose a different strand, set this option when collecting stranded data. This will reassign the specified strand for each feature regardless of its original orientation. This requires the presence of a "strand" column in the input data file. This option only works with input file lists of database features, not defined genomic regions (e.g. BED files). Default is false.

--avoid

Indicate whether search features of the same type should be avoided when calculating values in a window. Each window is checked for overlapping features of the same type; if the window does overlap another feature of the same type, no value is reported for the window. This option requires using named database features and must include a feature GFF type column. This is useful to avoid scoring windows that overlap a neighboring gene, for example. The default is false (return all values regardless of overlap).

--long

Indicate that the dataset from which scores are collected consists of long features (counting genomic annotation, for example) and not point data (microarray data or sequence coverage). Normally long features are only recorded at their midpoint, leading to inaccurate representation at some windows. This option forces the program to collect data separately at each window, rather than once for each file feature or region and subsequently assigning scores to windows. This may result in counting a feature more than once if it overlaps more than one window, a result that may or may not be desired. Execution time will likely increase. Default is false.

--log

Indicate that the dataset values are in log2 space and should be treated accordingly. Output values will be in the same space. The default is false (nolog).
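
Why log2 values need special handling can be seen in a small Python sketch. It assumes one plausible treatment (convert to linear space, combine, convert back to log2); whether the script does exactly this is an assumption, and the function name is hypothetical.

```python
import math


def mean_log2(values):
    """Mean of log2 values: de-log, average in linear space, re-log."""
    linear = [2 ** v for v in values]
    return math.log2(sum(linear) / len(linear))


log_values = [1.0, 3.0]                      # linear equivalents: 2 and 8

naive = sum(log_values) / len(log_values)    # averages the exponents directly
proper = mean_log2(log_values)               # log2 of the linear mean (5)
print(naive, proper)
```

Averaging the exponents directly gives a geometric rather than arithmetic mean of the underlying values, so the two results differ.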

--win <integer>

Specify the window size. The default is 50 bp.

--num <integer>

Specify the number of windows on either side of the feature position (total number will be 2 x [num]). The default is 20, or 1 kb on either side of the reference position if the default window size is used.

--pos [5|m|3]

Indicate the relative position of the feature around which the data is mapped. Three values are accepted: "5" indicates the 5' end is used, "3" indicates the 3' end is used, and "m" indicates the middle of the feature is used. The default is to use the 5' end, or the start position of unstranded features.
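
The interplay of --win, --num, and --pos can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the exact window boundaries used by the script (for example, whether the reference base itself belongs to the first downstream window) are an assumption here.

```python
def reference_point(start, end, strand, pos="5"):
    """Pick the reference coordinate of a feature for pos '5', 'm', or '3'."""
    if pos == "m":
        return (start + end) // 2
    five_prime = (pos == "5")
    if strand == "-":
        # On the reverse strand the 5' end is the larger coordinate
        return end if five_prime else start
    # Forward and unstranded features: 5' end is the start coordinate
    return start if five_prime else end


def window_offsets(win=50, num=20):
    """Return (start, end) offsets relative to the reference for 2*num windows."""
    return [(i * win, (i + 1) * win - 1) for i in range(-num, num)]


windows = window_offsets()           # defaults: 40 windows of 50 bp
print(len(windows), windows[0], windows[-1])
print(reference_point(1000, 2000, "-", "5"))
```

With the defaults, the offsets run from 1 kb upstream to 1 kb downstream of the reference point. For reverse-strand features the offsets would be applied in the opposite genomic direction.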

--(no)sum

Indicate that the data should be averaged across all features at each position, suitable for graphing. A separate text file will be written with the suffix '_summed' with the averaged data. Default is true (sum).
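
The summary row can be thought of as a column-wise mean over all features. The Python sketch below assumes that windows without a value are skipped when averaging; that handling and the function name are assumptions, not taken from the script.

```python
def summarize(rows):
    """Column-wise mean across features, skipping missing (None) values."""
    summary = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        summary.append(sum(present) / len(present) if present else None)
    return summary


# Hypothetical table: one row per feature, one column per window
rows = [
    [1.0, 2.0, None],
    [3.0, None, None],
    [5.0, 4.0, None],
]
summary = summarize(rows)
print(summary)
```

Each entry in the result corresponds to one window position, which is the shape needed for plotting an average profile.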

--smooth

Indicate that windows without values should be interpolated from neighboring values. The default is false (nosmooth).
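
One way to picture interpolation of empty windows is linear interpolation between the nearest windows that do have values. The Python sketch below assumes that approach; the script's actual interpolation rule may differ, and the function name is hypothetical.

```python
def interpolate_gaps(values):
    """Fill None entries linearly between the nearest defined neighbors."""
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is None:
            # Find the end of this run of missing values
            j = i
            while j < n and result[j] is None:
                j += 1
            # Interpolate only when a defined value flanks the run on both sides
            if i > 0 and j < n:
                left, right = result[i - 1], result[j]
                gap = j - i + 1
                for k in range(i, j):
                    result[k] = left + (right - left) * (k - i + 1) / gap
            i = j
        else:
            i += 1
    return result


filled = interpolate_gaps([None, 2.0, None, None, 5.0, None])
print(filled)
```

Gaps at either end of the row are left empty in this sketch because they have a neighbor on one side only.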

--gz

Specify whether (or not) the output file should be compressed with gzip.

--cpu <integer>

Specify the number of CPU cores to use for parallel execution. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable parallel execution by setting to 1.

--version

Print the version number.

--help

Display this help.

DESCRIPTION

This program will collect data around a relative coordinate of a genomic feature or region. The data is collected in a series of windows flanking the feature start (5' position for stranded features), end (3' position), or the midpoint position. The number and size of windows are specified via command line arguments, or the program will default to 20 windows on both sides of the relative position (40 total) of 50 bp size, corresponding to 2 kb total (+/- 1 kb). Windows without a value may be interpolated (smoothed) from neighboring values, if available.

The default value that is collected is a dataset score (e.g. microarray values). However, other values may be collected, including 'count' or 'length'. Use the --value argument to collect alternative values.

Stranded data may be collected. If the feature does not have an inherent strand, one may be specified to enforce stranded collection or a particular orientation.

When features overlap, or the collection windows of one feature overlap with another feature, then data may be ignored and not collected (--avoid).

The program writes out a tim data formatted text file. It will also generate a '*_summed.txt' file, in which the mean value of all features for each window is generated and written as a data row. This summed data may be graphed using the biotoolbox script graph_profile.pl or merged with other summed data sets for comparison.

EXAMPLES

These are examples of some common scenarios for collecting data.

Collect scores in intervals around start

You want to collect the mean score from a bigWig file in twenty 50 bp intervals flanking the start position of each feature in a Bed file.

  get_relative_data.pl --data scores.bw --in input.bed

Collect scores in intervals around middle

You want to collect median scores in 20 bp intervals extending 500 bp from the midpoint of each feature.

  get_relative_data.pl --win 20 --num 25 --pos m --method median --data \
  scores.bw --in input.txt

Collect scores in intervals from annotation database

You want to collect scores in intervals around the transcription start site of genes in an annotation database, but also avoid intervals that may overlap neighboring genes. You want to collect alignment counts from a Bam file in a stranded fashion. You also want to plot the profile.

  get_relative_data.pl --db annotation --feature gene --avoid --strand \
  sense --value count --method sum --data alignments.bam --out gene_tss
  
  graph_profile.pl --in gene_tss_summed.txt --min 0 --max 100
  

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.