NAME

CpG_calculator.pl

A script to calculate observed vs expected CpG dinucleotides

SYNOPSIS

CpG_calculator.pl --fasta <directory|filename> [--options...]

CpG_calculator.pl --db <name|file|directory> [--options...]

  Options:
  --db <name|file|directory>
  --fasta <file|directory>
  --in <filename>
  --win <integer>
  --out <filename>
  --gz
  --cpu <integer>
  --version
  --help

OPTIONS

The command line flags and descriptions:

--db <name|file|directory>
--fasta <file|directory>

Provide the name of a Bio::DB::SeqFeature::Store database from which to collect the genomic sequence. Alternatively, provide the name of an uncompressed Fasta file (multi-fasta is ok) or a directory containing multiple Fasta files representing the genomic sequence. The directory must be writeable so that a small index file can be written. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. The database may also be specified in the metadata of an input file.

--in <filename>

Optionally specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.

--win <integer>

Optionally provide the window size in bp with which to scan the genome. This option is ignored if an input file is provided. The default is 1000 bp.

--out <filename>

Specify the output filename. By default, the base name of the input file is used, if one is provided. This option is required if no input file is provided.

--gz

Specify whether (or not) the output file should be compressed with gzip.

--cpu <integer>

Specify the number of CPU cores to execute in parallel. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable parallel execution by setting this to 1.

--version

Print the version number.

--help

Display this POD documentation.

DESCRIPTION

This program will calculate percent GC composition, number of CpG dinucleotide pairs, number of expected CpG dinucleotide pairs based on GC content, and the ratio of observed / expected CpG pairs. Calculations are performed on either windows across the entire genome (default behavior using 1000 bp windows) or user-provided regions in an input file (BED, GFF, or custom text files are supported).
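
The four statistics can be illustrated with a short Python sketch. This is not the script's own code: the function name cpg_stats is hypothetical, and the expected count is assumed here to be (number of C x number of G) / sequence length, a commonly used convention. The formula the script actually uses is not stated in this document and may differ.

```python
def cpg_stats(seq):
    """Return (percent_gc, observed_cpg, expected_cpg, obs_exp_ratio) for one sequence.

    Illustrative only: the expected-count formula below is an assumption,
    not necessarily the one used by CpG_calculator.pl.
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0, 0.0, 0.0

    num_c = seq.count("C")
    num_g = seq.count("G")

    # Observed CpG pairs: a C immediately followed by a G.
    # "CG" cannot overlap itself, so a plain substring count is exact.
    observed = seq.count("CG")

    # Assumed expectation: (C count * G count) / sequence length.
    expected = num_c * num_g / length

    # Guard against division by zero when the sequence lacks C or G.
    ratio = observed / expected if expected > 0 else 0.0

    percent_gc = 100.0 * (num_c + num_g) / length
    return percent_gc, observed, expected, ratio


if __name__ == "__main__":
    print(cpg_stats("ACGTCGGA"))
```

Under this assumed formula, a ratio near 1 means CpG pairs occur about as often as the base composition alone would predict, while a lower ratio indicates CpG depletion.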

Genomic sequence may be provided in two ways. First, a Fasta file or directory of Fasta files may be provided. A small index file will be written to assist in random access using the Bio::DB::Fasta module. Alternatively, a Bio::DB::SeqFeature::Store database with sequence may be provided. Depending on the database driver and implementation, the fasta option is usually faster.

Four additional columns, one for each of these statistics, are appended to the input file or to the generated window file.
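
When no input file is given, the genome is scanned in fixed-size windows. The Python sketch below shows one plausible way to tile a sequence into such windows. The function name tile_windows, the 1-based inclusive coordinates, and the retention of a shorter final window are all assumptions made for illustration; the script's actual behavior at sequence ends is not described here.

```python
def tile_windows(seq_id, seq_length, win=1000):
    """Yield (seq_id, start, stop) rows tiling one sequence.

    Assumptions (not confirmed by the documentation): coordinates are
    1-based and inclusive, and a final window shorter than `win` is kept.
    """
    if win < 1:
        raise ValueError("window size must be a positive integer")
    for start in range(1, seq_length + 1, win):
        stop = min(start + win - 1, seq_length)
        yield seq_id, start, stop


if __name__ == "__main__":
    # "chr1" and its length are made-up example values.
    for row in tile_windows("chr1", 2500, win=1000):
        print("\t".join(str(field) for field in row))
```

Each generated row would then receive the four statistic columns, in the same way as a row read from a user-supplied input file.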

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.