
NAME

calculate_CUB.pl - a program to calculate sequence codon usage bias indices and other sequence parameters.

VERSION

VERSION: 0.01

SYNOPSIS

This program computes CUB indices for each sequence; the types of computed CUB indices depend on the provided options (see below).

In addition to CUB indices, the program also computes other sequence features, such as amino acid counts and the GC content of the whole sequence and of the 3rd codon positions.

  # compute ENC, ENC_r, CAI, and tAI for each sequence in file cds.fa
  calculate_CUB.pl --cai CAI_param.top_200 --tai tAI_param \
  --enc enc,enc_r --seq cds.fa -o CUB_indice.tsv
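
The GC-content features mentioned above can be illustrated with a short Python sketch. This is not the program's own code: the function names are hypothetical, and ignoring ambiguous bases such as N is an assumption made here, not documented behaviour of calculate_CUB.pl.

```python
from collections import Counter

def gc_fraction(bases):
    """Fraction of G and C among unambiguous bases; None if there are none."""
    # Assumption: characters other than A, C, G, T are ignored.
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["G"] + counts["C"]) / total

def gc3_fraction(cds):
    """GC fraction at 3rd codon positions of an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3   # drop a trailing partial codon
    return gc_fraction(cds[2:usable:3])

cds = "ATGGCCAAGTAA"   # codons: ATG GCC AAG TAA
print(gc_fraction(cds), gc3_fraction(cds))
```

Slicing with a step of 3 starting at index 2 picks exactly the third base of each complete codon, so the same helper serves both the whole-sequence and the 3rd-position value.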

OPTIONS

Mandatory options

-s/--seq-file

file containing sequences in FASTA format; CUB indices are computed for each sequence in this file.

Auxiliary options

-g/--gc-id

ID of the genetic code table used to identify the amino acid encoded by each codon. Default is 1, i.e., the standard code. See NCBI Genetic Code for valid IDs.

-t/--tai-param

file containing the tAI value for each codon in the format 'codon<tab>tAI_value', which can be produced by build_tai_param.pl. If not given, tAI values are not computed.

-c/--cai-param

similar to --tai-param, except that the file provides CAI values, in the same format. This file may be produced by build_cai_param.pl. If not given, CAI values are not computed.
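
As a rough illustration of how a 'codon<tab>value' parameter file is used, the Python sketch below reads such a file and scores a sequence as the geometric mean of its codon values, which is how both CAI and tAI are conventionally defined. The function names are hypothetical, and skipping codons that are absent from the file or have a non-positive value is an assumption; the program itself may treat them differently.

```python
import io
import math

def read_codon_weights(handle):
    """Parse 'codon<tab>value' lines into a dict keyed by upper-case codon."""
    weights = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        codon, value = line.split("\t")[:2]
        weights[codon.upper()] = float(value)
    return weights

def geometric_mean_index(cds, weights):
    """Geometric mean of per-codon values over the codons of a sequence."""
    cds = cds.upper()
    logs = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        w = weights.get(cds[i:i + 3])
        # Assumption: codons without a usable value are skipped.
        if w is not None and w > 0:
            logs.append(math.log(w))
    if not logs:
        return None
    return math.exp(sum(logs) / len(logs))

# Toy parameter file with two codons (values are made up for illustration).
param_file = io.StringIO("AAA\t1.0\nAAG\t0.25\n")
weights = read_codon_weights(param_file)
print(geometric_mean_index("AAAAAG", weights))
```

Working in log space avoids numerical underflow for long sequences, where the direct product of many values below 1 would approach zero.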

-e/--enc-methods

methods for ENC calculations. Available values are enc, enc_r, encp, and encp_r. The encp* versions correct for background GC content in the calculations. The *_r versions use a new method to estimate missing F values. See module Bio::CUA::CUB::Calculator for details of these methods. Default is enc. Multiple methods can be specified as a comma-separated string such as 'enc,encp,enc_r'.
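
For orientation, the Python sketch below follows Wright's original (1990) formulation of ENC for the standard genetic code: a homozygosity F is computed per amino acid, averaged within each degeneracy class, and combined as 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. It is only assumed to correspond to the plain enc method; the encp* background correction and the *_r estimation of missing F values are not reproduced here, and Bio::CUA::CUB::Calculator remains the authority on what the program actually does.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TO_AA.values() if aa != "*")

def wright_enc(cds):
    """Wright's ENC for one in-frame sequence; None if it cannot be computed."""
    counts = defaultdict(Counter)
    cds = cds.upper()
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        if aa is not None and aa != "*":
            counts[aa][codon] += 1

    f_values = defaultdict(list)          # degeneracy class -> list of F
    for aa, codon_counts in counts.items():
        n = sum(codon_counts.values())
        if DEGENERACY[aa] < 2 or n < 2:
            continue                      # Met, Trp, or too few codons
        sum_p2 = sum((c / n) ** 2 for c in codon_counts.values())
        f_values[DEGENERACY[aa]].append((n * sum_p2 - 1) / (n - 1))

    mean_f = {k: sum(v) / len(v) for k, v in f_values.items()}
    # Wright's fallback when isoleucine (the only 3-fold class) is missing.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    # Assumption of this sketch: give up if a class is missing or averages 0.
    if any(mean_f.get(k, 0) <= 0 for k in (2, 3, 4, 6)):
        return None
    enc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(enc, 61.0)                 # ENC is capped at 61

# One codon per amino acid, each used twice: maximal bias.
print(wright_enc("TTT" * 2 + "ATT" * 2 + "GCT" * 2 + "CTG" * 2))
```

The sketch returns None where a degeneracy class has no usable F value, which is the situation the *_r methods are described as addressing.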

-b/--base-comp

background base compositions used for correcting GC content in ENC calculations. This option has no effect unless encp* version methods are specified in --enc-methods.

The format is as follows:

        seq_id1 #A      #T      #C      #G
        seq_id2 #A      #T      #C      #G
        ...   ...

where #A/#T/#C/#G are counts or fractions of each base type in background data (e.g., introns) for each sequence. For sequences without background base composition information, 'NA' will be returned for encp* methods.
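
Because the columns may hold either counts or fractions, a reader of this file would typically normalise each row so that the four values sum to 1. The Python sketch below shows one way to do that. It assumes whitespace-separated columns in the order A, T, C, G as in the layout above; the function name is hypothetical and this is not the program's own parser.

```python
import io

def read_base_composition(handle):
    """Map seq_id -> base fractions from 'seq_id A T C G' rows."""
    composition = {}
    for line in handle:
        fields = line.split()             # assumption: whitespace-separated
        if len(fields) < 5:
            continue                      # skip blank or incomplete rows
        seq_id = fields[0]
        a, t, c, g = (float(x) for x in fields[1:5])
        total = a + t + c + g
        if total <= 0:
            continue                      # no usable background data
        # Dividing by the row total handles counts and fractions alike.
        composition[seq_id] = {
            "A": a / total, "T": t / total, "C": c / total, "G": g / total,
        }
    return composition

example = io.StringIO("gene1\t30\t30\t20\t20\ngene2\t0.25\t0.25\t0.25\t0.25\n")
background = read_base_composition(example)
for seq_id, comp in background.items():
    print(seq_id, comp["G"] + comp["C"])  # background GC fraction
```

A sequence whose ID is absent from the resulting dictionary has no background data, which matches the case where the encp* methods return 'NA'.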

-o/--out-file

the file in which to store the results. Default is standard output, usually the screen.

-h/--help

show the brief help message.

AUTHOR

Zhenguo Zhang, <zhangz.sci at gmail.com>

BUGS

Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this class with the perldoc command.

        perldoc Bio::CUA

You can also look for information at:

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2015 Zhenguo Zhang.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.