NAME

graph_data.pl

A script to graph XY line or dot plots between data sets.

SYNOPSIS

graph_data.pl [--options] <filename>

  Options:
  --in <filename>
  --type [scatter | line | smooth]     scatter
  --pair <X_index>,<Y_index>
  --index <X_index&Y_index,...>
  --all
  --ma <window>,<step>
  --norm                               (disabled)
  --min=<value>                        (estimated)
  --xmin=<value>                       (estimated)
  --ymin=<value>                       (estimated)
  --max=<value>                        (estimated)
  --xmax=<value>                       (estimated)
  --ymax=<value>                       (estimated)
  --ticks <integer>                    4
  --xticks <integer>                   4
  --yticks <integer>                   4
  --format <integer>                   (none)
  --xformat <integer>                  (none)
  --yformat <integer>                  (none)
  --dim <integer>                      600 pixels
  --xdim <integer>                     600 pixels
  --ydim <integer>                     600 pixels
  --regression                         (disabled)
  --out <base_filename>                (none)
  --dir <foldername>                   (input basename)
  --cpu <integer>                      2
  --version
  --help                               extended documentation

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.

--type [scatter | line | smooth]

Indicate the type of graph to plot. Three values are accepted. Scatter graphs plot all pairwise X,Y values as points; a linear regression line is drawn through them when the --regression option is enabled. Line graphs plot a continuous line connecting all pairwise X,Y data values. With noisy data, the plot will look best if the values are first smoothed using a moving average (--ma option). Finally, a smooth graph is the same as a line graph, except the data values are smoothed using a Bezier curve function. Note that the Bezier smoothing function is neither equivalent to nor as effective as a moving average. The default value is a scatter plot.

--pair <X_index>,<Y_index>

Specify the two datasets to plot together. Use the datasets' indices (0-based), expressed as 'X,Y' with no spaces. Use the option repeatedly to plot multiple graphs. If no datasets are set, then the datasets may be selected interactively from a list.

--index <X_index&Y_index,...>

An alternative method of specifying the datasets as a comma-delimited list of X&Y pairs, where the X and Y indices are separated by an ampersand. If no datasets are set, then the datasets may be selected interactively from a list.
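The two command line formats for selecting datasets can be illustrated with a minimal Python sketch. This is not the script's own Perl code; the function names parse_pair_option and parse_index_option are hypothetical and exist only for this illustration.

```python
def parse_pair_option(value):
    """Parse a --pair style string such as '4,5' into an (x, y) tuple.

    Hypothetical helper for illustration only.
    """
    x_text, y_text = value.split(",")
    return (int(x_text), int(y_text))


def parse_index_option(value):
    """Parse an --index style string such as '1&2,1&3' into (x, y) tuples.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for item in value.split(","):
        # Each item holds one X index and one Y index joined by '&'.
        x_text, y_text = item.split("&")
        pairs.append((int(x_text), int(y_text)))
    return pairs
```

Under these assumptions, '--pair 1,2 --pair 1,3' and '--index 1&2,1&3' would request the same two graphs. On most shells the ampersand must be quoted or escaped.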

--all

Indicate that every available dataset in the input file should be plotted against every other dataset. Redundant graphs are skipped, e.g. Y,X versus X,Y. If you wish to graph only a subset of datasets, provide a list and/or range using the --index option.
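Skipping redundant graphs amounts to enumerating unordered pairs of dataset indices. A minimal Python sketch follows; the function name all_pairs is hypothetical, and the ordering of each pair (lower index as X) is an assumption rather than the script's documented behavior.

```python
from itertools import combinations


def all_pairs(indices):
    """Return every unordered pair of dataset indices exactly once.

    Assumption: the lower index is used as X and the higher as Y.
    """
    return list(combinations(sorted(indices), 2))
```

For n datasets this yields n*(n-1)/2 graphs, so three datasets produce three graphs rather than six.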

--norm

Normalize the datasets by converting the values to percentile ranks (0..1). This is helpful when the two datasets are not on similar scales. Default is false.
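As an illustration of percentile rank conversion, the following Python sketch maps each value to the fraction of values less than or equal to it. The function name percentile_rank and this particular rank definition (including how ties are handled) are assumptions; the script's exact formula may differ.

```python
from bisect import bisect_right


def percentile_rank(values):
    """Convert values to percentile ranks in the range (0, 1].

    Assumed definition: the fraction of values less than or equal to
    each value, so tied values receive the same rank.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]
```

Because only the rank order matters, an outlier such as 200 in the list 10, 200, 30, 40 no longer stretches the axis.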

--ma <window>,<step>

Specify the window and step sizes used to smooth the data by moving average. Express the values as 'window,step', with no spaces. All Y values within the window are averaged together and plotted against the window midpoint X value. The window position is then incremented by the step size. The step size should be less than or equal to the window size. Both values must be integers. This data manipulation is extremely useful and recommended for noisy datasets.
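The windowing procedure can be sketched in Python as follows. This is a sketch under stated assumptions, not the script's implementation: the function name moving_average is hypothetical, the window and step are assumed to count data points after sorting by X, and the midpoint X is assumed to be the mean of the first and last X values in the window.

```python
def moving_average(points, window, step):
    """Smooth (x, y) points by a moving average.

    Assumptions: window and step count data points, points are sorted
    by X first, and partial windows at the end are dropped.
    """
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive integers")
    ordered = sorted(points)
    smoothed = []
    for start in range(0, len(ordered) - window + 1, step):
        chunk = ordered[start:start + window]
        # Midpoint of the X values covered by this window.
        mid_x = (chunk[0][0] + chunk[-1][0]) / 2
        # Mean of all Y values in the window.
        mean_y = sum(y for _, y in chunk) / window
        smoothed.append((mid_x, mean_y))
    return smoothed
```

A step smaller than the window gives overlapping windows and a smoother curve; a step equal to the window gives non-overlapping bins.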

--min=<value>
--xmin=<value>
--ymin=<value>

Specify explicitly the minimum values for the X and Y axes. Each may be set independently, or both may be set to the same value with the --min option. The default is automatically calculated.

--max=<value>
--xmax=<value>
--ymax=<value>

Specify explicitly the maximum values for the X and Y axes. Each may be set independently, or both may be set to the same value with the --max option. The default is automatically calculated.

--ticks <integer>
--xticks <integer>
--yticks <integer>

Specify explicitly the number of major ticks for the X and Y axes. Each may be set independently, or both may be set to the same value with the --ticks option. The default is 4.

--format <integer>
--xformat <integer>
--yformat <integer>

Specify explicitly the number of decimal places used to format the major tick labels on the X and Y axes. Each may be set independently, or both may be set to the same value with the --format option. The default is 0.

--dim <integer>
--xdim <integer>
--ydim <integer>

Specify the dimensions of the graph in pixels. Default is 600 pixels square.

--regression

Plot the linear regression line for the data in the scatter plot. Line and smooth type plots do not get a regression line.
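For readers unfamiliar with the calculation, a regression line of this kind is commonly obtained by ordinary least squares. The Python sketch below shows that standard calculation; the function name linear_regression is hypothetical, and the source does not state which routine the script actually uses.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line.

    Standard textbook formula, shown for illustration only.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and of cross deviations of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The fitted line is y = slope * x + intercept, drawn across the X range of the plotted points.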

--out <base_filename>

Optionally specify the output filename prefix.

--dir <foldername>

Optionally specify the name of the output subdirectory. Default is the input filename base with '_graphs' appended.

--cpu <integer>

Specify the number of CPU cores to use for parallel execution. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable parallel execution by setting the value to 1. Parallel execution is only applicable when a list of datasets is provided or the --all option is enabled; interactive execution is performed serially.

--version

Print the version number.

--help

Display this POD documentation.

DESCRIPTION

This program will graph pairwise data sets against each other. This is useful for determining correlations between data sets. The graphs are generated as PNG images and written to a subdirectory.

Three types of graphs may be generated, specified by the --type argument.

A scatter plot will plot each pair of values from two datasets as a point on an X,Y graph. With the --regression option, a line representing the linear regression of the two datasets is drawn over the points.

A line plot will plot the pairwise values between two datasets as a continuous line.

A smooth plot is a line plot of pairwise values between two datasets, smoothed by a Bezier curve.

The datasets to be plotted may be specified on the command line using either the --pair or --index arguments. If not specified, the program defaults to an interactive mode, in which the user may repeatedly select datasets from a list of those available in the data file.

The data in the datasets may be manipulated in several ways prior to plotting. The data may be converted to a percentile rank, smoothed by a moving average, constrained to minimum and maximum values, etc.

If the graph doesn't look like you expect, and you are not normalizing by converting to a percentile rank (--norm), try explicitly setting the --min and --max values. The GD::Graph module tries its best to set these automatically, but sometimes chooses unexpected axis ranges.

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.