CpG_calculator.pl
A script to calculate observed vs expected CpG dinucleotides
CpG_calculator.pl --fasta <directory|filename> [--options...]
CpG_calculator.pl --db <text> [--options...]
Options:
  --db <name|file|directory>
  --fasta <file|directory>
  --in <filename>
  --win <integer>
  --out <filename>
  --gz
  --cpu <integer>
  --version
  --help
The command line flags and descriptions:
--db <name|file|directory>
--fasta <file|directory>
Provide the name of a Bio::DB::SeqFeature::Store database from which to collect the genomic sequence. Alternatively, provide the name of an uncompressed Fasta file (multi-fasta is ok) or a directory containing multiple Fasta files representing the genomic sequence. The directory must be writeable so that a small index file can be written. For more information about using databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases. The database may be provided in the metadata of an input file.
--in <filename>
Optionally specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.
--win <integer>
Optionally provide the window size in bp with which to scan the genome. This option is ignored if an input file is provided. Default is 1000 bp.
--out <filename>
Specify the output filename. By default, the base name of the input file is used if one is provided. Required if no input file is provided.
--gz
Specify whether (or not) the output file should be compressed with gzip.
--cpu <integer>
Specify the number of CPU cores to execute in parallel. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable parallel execution by setting this to 1.
--version
Print the version number.
--help
Display this POD documentation.
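As an illustration of what scanning the genome with a fixed window size might look like, here is a minimal Python sketch. It is not taken from the script. It assumes consecutive, non-overlapping windows with 1-based inclusive coordinates and a truncated final window, none of which is stated in this documentation. The function name `genome_windows` is hypothetical.

```python
def genome_windows(chrom_length, win=1000):
    """Yield (start, stop) windows covering one chromosome.

    ASSUMPTION: windows are consecutive, non-overlapping, 1-based and
    inclusive, and the last window is shortened to fit the chromosome end.
    The real script may tile the genome differently.
    """
    if win < 1:
        raise ValueError("window size must be a positive integer")
    for start in range(1, chrom_length + 1, win):
        yield start, min(start + win - 1, chrom_length)


if __name__ == "__main__":
    # A hypothetical 2500 bp chromosome scanned with the default 1000 bp window.
    for start, stop in genome_windows(2500):
        print(start, stop)
```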
This program will calculate percent GC composition, number of CpG dinucleotide pairs, number of expected CpG dinucleotide pairs based on GC content, and the ratio of observed / expected CpG pairs. Calculations are performed on either windows across the entire genome (default behavior using 1000 bp windows) or user-provided regions in an input file (BED, GFF, or custom text file are supported).
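The four reported values can be sketched for a single sequence as follows. This Python example is illustrative only and is not the script's code. In particular, the expected CpG count here assumes that C and G are equally frequent, so that expected = (GC fraction / 2)^2 x length. The documentation says only that the expectation is "based on GC content", so the exact formula used by the script may differ. The function name `cpg_stats` is hypothetical.

```python
def cpg_stats(seq):
    """Return (percent GC, observed CpG, expected CpG, observed/expected)."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0, 0.0, 0.0

    # Fraction of bases that are G or C.
    gc_fraction = (seq.count("C") + seq.count("G")) / length

    # Observed CpG dinucleotides. "CG" cannot overlap itself, so the
    # non-overlapping count from str.count is the full count.
    observed = seq.count("CG")

    # ASSUMPTION: C and G each make up half of the GC fraction, giving
    # an expected count of (GC/2)^2 * length. Other definitions exist.
    expected = (gc_fraction / 2) ** 2 * length

    ratio = observed / expected if expected else 0.0
    return 100 * gc_fraction, observed, expected, ratio


if __name__ == "__main__":
    print(cpg_stats("ACGTCGGG"))
```

A widely used alternative computes the expectation from the separate C and G counts, as (number of C x number of G) / length. The two agree only when C and G occur equally often.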
Genomic sequence may be provided in two ways. First, a Fasta file or directory of Fasta files may be provided. A small index file will be written to assist in random access using the Bio::DB::Fasta module. Alternatively, a Bio::DB::SeqFeature::Store database with sequence may be provided. Depending on the database driver and implementation, the fasta option is usually faster.
The four calculated values are appended as additional columns to the input file or to the generated window file.
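A rough Python sketch of appending four columns to a tab-delimited table with a header row is shown below. The column names in `NEW_COLUMNS` and the function `append_stats` are hypothetical. The headers and number formatting that the script actually writes are not given in this documentation.

```python
# Hypothetical column names; the script's real headers may differ.
NEW_COLUMNS = ["GC_percent", "CpG_observed", "CpG_expected", "CpG_obs_exp_ratio"]


def append_stats(lines, stats):
    """Append four values to every data row of a tab-delimited table.

    lines: iterable of text lines, the first being the column header row.
    stats: one 4-tuple of numbers per data row, in the same order.
    """
    header, *rows = [line.rstrip("\n") for line in lines]
    out = ["\t".join([header] + NEW_COLUMNS)]
    for row, values in zip(rows, stats):
        # General number formatting keeps integers short and floats compact.
        out.append("\t".join([row] + [f"{v:g}" for v in values]))
    return out


if __name__ == "__main__":
    table = ["Chromosome\tStart\tStop\n", "chr1\t1\t8\n"]
    for line in append_stats(table, [(75.0, 2, 1.125, 1.5)]):
        print(line)
```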
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.