NAME

build_cai_param.pl - a program to calculate CAI for codons

VERSION

VERSION = 0.01

SYNOPSIS

This program computes CAI (codon adaptation index) at the codon level using different methods. It is part of the Bio-CUA distribution: http://search.cpan.org/dist/Bio-CUA/

        # calculate codon CAI by choosing the top 200 highly expressed genes
        build_cai_param.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200

        # the same as above but normalize RSCUs with expected RSCUs under even
        # codon usage
        build_cai_param.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200.by_mean -m mean

        # normalize RSCUs by RSCUs derived from the bottom 1000 lowly expressed genes
        build_cai_param.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200.b1000 -b 1000

OPTIONS

All options have a short and a long form, e.g., -i and --seq-file for the first option.

In the following text, RSCU stands for relative synonymous codon usage.

Mandatory options

-i/--seq-file

a file containing protein-coding sequences in fasta format.

Auxiliary options

-e/--exp-file

a file containing sequence IDs and their expression levels in the following format:

        seq-id1<tab>0.67
        seq-id2<tab>2.57
        ... ...

Each line contains one sequence ID and that sequence's gene expression level (RNA, protein, or other), separated by a tab. The sequence IDs must match the IDs in the sequence file given by --seq-file.

From this file, highly expressed genes are selected according to their expression rank. See the options below.

If this option is omitted, all sequences in the sequence file are used for calculating CAIs.
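As an illustration of the expression file format, the following Python sketch reads such a file and ranks the IDs from highest to lowest expression. It is not the script's own code; the function name read_expression and the choice to skip blank lines are assumptions made for the example.

```python
import csv
import io


def read_expression(handle):
    """Return {seq_id: expression} from a two-column, tab-separated file."""
    levels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # assumption: blank lines are ignored
        levels[row[0]] = float(row[1])
    return levels


# Two records in the format shown above (hypothetical data).
demo = io.StringIO("seq-id1\t0.67\nseq-id2\t2.57\n")
levels = read_expression(demo)

# Rank IDs so that the most highly expressed sequence comes first.
ranked = sorted(levels, key=levels.get, reverse=True)
```

A real file would be opened with open(path, newline="") in place of the in-memory demo handle.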

-s/--select

determines how many sequences are chosen from the expression file given by --exp-file. Available formats are:

all, all IDs in the expression file are chosen.

0.##, a fraction of the most highly expressed genes; for example, 0.30 chooses the top 30% of genes ranked by expression.

###, an integer; for example, 200 chooses the top 200 highly expressed genes.

Default is all. If the option --exp-file is omitted, this option has no effect.
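The three formats can be read as in the Python sketch below. This is only an interpretation of the description above, not the script's implementation: the name choose_top is hypothetical, and rounding a fractional count down is an assumption, since the rounding rule is not documented here.

```python
def choose_top(ranked_ids, spec="all"):
    """Pick IDs from a list sorted from highest to lowest expression.

    spec is 'all', a fraction such as '0.30', or an integer such as '200'.
    """
    if spec == "all":
        return list(ranked_ids)
    if "." in spec and 0 < float(spec) < 1:
        # assumption: a fractional count is rounded down
        count = int(float(spec) * len(ranked_ids))
    else:
        count = int(spec)
    return list(ranked_ids[:count])


ranked_ids = ["g1", "g2", "g3", "g4"]  # hypothetical IDs, best first
top_half = choose_top(ranked_ids, "0.50")
top_three = choose_top(ranked_ids, "3")
everything = choose_top(ranked_ids)
```

Applying the same function to the list in reverse order would give the most lowly expressed genes, which is how the 0.## and ### formats of --background are described.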

-b/--background

specify background data (e.g., lowly expressed genes) from which the background codon usage is derived. Then each codon's RSCU from highly expressed genes is divided by the codon's RSCU from the background data; these normalized RSCUs are used for CAI calculation. This method is termed 'background-normalization'.

Background data can be specified as 0.##, ###, or a filename. The first two formats choose a fraction or a number of the most lowly expressed genes in the expression file given by --exp-file; see option --select for details of these two formats. The last format names a fasta-formatted sequence file from which the background codon usage is calculated.
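A minimal Python sketch of background-normalization follows, assuming RSCU values have already been computed for both gene sets. The function name background_normalize, the synonyms mapping, and the numbers are hypothetical. Scaling each family by its maximum follows the note under --method that background-normalization uses the max method, and the sketch assumes every background RSCU is non-zero.

```python
def background_normalize(high_rscu, bg_rscu, synonyms):
    """Divide each codon's RSCU by its background RSCU, then scale every
    synonymous family so that its largest ratio becomes 1.

    Assumption: all background RSCUs are non-zero.
    """
    weights = {}
    for codons in synonyms.values():
        ratios = {c: high_rscu[c] / bg_rscu[c] for c in codons}
        top = max(ratios.values())
        for codon, ratio in ratios.items():
            weights[codon] = ratio / top
    return weights


# One amino acid (alanine) with hypothetical RSCU values.
synonyms = {"Ala": ["GCT", "GCC", "GCA", "GCG"]}
high = {"GCT": 0.4, "GCC": 0.3, "GCA": 0.2, "GCG": 0.1}
background = {"GCT": 0.2, "GCC": 0.3, "GCA": 0.25, "GCG": 0.25}

normalized = background_normalize(high, background, synonyms)
```

In this example GCT is enriched two-fold relative to the background, so it becomes the reference codon with weight 1, while GCC drops to about 0.5 even though it is the second most used codon in the highly expressed set.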

-g/--gc-id

ID of genetic code table. See NCBI genetic code for valid IDs. Default is 1, i.e., standard genetic code.

-m/--method

method to calculate CAI: max or mean. The former is used by Sharp and Li (1987, NAR), in which each codon's RSCU is divided by the maximum RSCU among its synonymous codons to derive CAI. The 'mean' method divides each codon's RSCU by the expected RSCU under even codon usage to get CAI. For example, for an amino acid with four synonymous codons, the expected RSCU is 0.25 for each codon, so all observed RSCUs of this amino acid's codons are divided by 0.25.

If option --background is activated, the 'background-normalization' method always uses the max method to get final CAIs.
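The difference between the two methods can be sketched in Python as below. This is an illustration under stated assumptions, not the script's code: the name codon_weights and the codon counts are hypothetical, and RSCU is taken to be each codon's share among its synonymous codons, which matches the expected value of 0.25 quoted above for a four-codon amino acid.

```python
def codon_weights(counts, synonyms, method="max"):
    """Compute codon-level CAI values from codon counts.

    Assumption: every synonymous family is observed at least once.
    """
    weights = {}
    for codons in synonyms.values():
        total = sum(counts[c] for c in codons)
        usage = {c: counts[c] / total for c in codons}
        if method == "max":
            denom = max(usage.values())   # most used synonymous codon
        elif method == "mean":
            denom = 1 / len(codons)       # expected share under even usage
        else:
            raise ValueError("method must be 'max' or 'mean'")
        for codon in codons:
            weights[codon] = usage[codon] / denom
    return weights


# Hypothetical counts for the four alanine codons.
synonyms = {"Ala": ["GCT", "GCC", "GCA", "GCG"]}
counts = {"GCT": 40, "GCC": 30, "GCA": 20, "GCG": 10}

by_max = codon_weights(counts, synonyms, "max")
by_mean = codon_weights(counts, synonyms, "mean")
```

With the max method the values never exceed 1, whereas the mean method yields values above 1 for codons used more often than expected under even usage.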

-o/--out-file

file to store the result. Default is standard output, usually the screen.

AUTHOR

Zhenguo Zhang, <zhangz.sci at gmail.com>

BUGS

Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this class with the perldoc command.

        perldoc Bio::CUA

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2015 Zhenguo Zhang.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.