
NAME

split-ppred-ali.pl - Split ALI files into subsets of sites based on ppred data

VERSION

version 0.180190

SYNOPSIS

    $ split-ppred-ali.pl cpVITRELLA-80x8363.puz --phylip \
        --sim-files=`ls cpNOVITRELLA-79x8363-CATGTRG-PP_sample_*.ali` \
        --sim-seq-list=sim.idl --obs-seq-list=obs.idl \
        --bin-number=10 --percentile --out=-ppred

    $ cat sim.idl
    Karlodiniu

    $ cat obs.idl
    Vitrella_b

    # for testing
    $ perl -Ilib bin/split-ppred-ali.pl test/for-ppred.phy --phylip \
        --sim-files=`ls test/ppred-*.phy` \
        --sim-seq-list=test/sim.idl --obs-seq-list=test/obs.idl \
        --bin-number=10 --percentile --out=-ppred --only-mask

    $ perl -Ilib bin/split-ppred-ali.pl test/for-ppred.phy --phylip \
        --sim-files=`ls test/ppred-*.phy` --by-seq --only-dump-freqs

At each site, a profile is computed from the simulated primary sequences for the ids listed in --sim-seq-list and compared to the character state observed in the sequences listed in --obs-seq-list. A mask corresponding to the simulated frequencies of the observed states is built, and sites are ranked by descending frequency, which means that the highest bins include sites where the observed state is rarely (or never) found in simulations. Sites where the state is a gap or is missing always get the maximum frequency and thus fall in the lowest bins.
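The comparison described above can be illustrated with a minimal Python sketch. It is not the script's own (Perl) code: the function names, the toy sequences, the set of characters treated as gap or missing, and the use of 1.0 as the maximum frequency are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: characters treated as gap or missing; the script's own set may differ.
GAP_OR_MISSING = set("-?* ")


def site_profile(sim_states):
    """Turn the simulated states seen at one site into relative frequencies."""
    counts = Counter(sim_states)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


def observed_freq(profile, obs_state):
    """Simulated frequency of the observed state (1.0 for gaps or missing data)."""
    if obs_state in GAP_OR_MISSING:
        return 1.0
    return profile.get(obs_state, 0.0)


# Toy data: the same simulated sequence taken from four replicates, three sites each.
sims = ["ACG", "ACT", "ACG", "TCG"]
obs = "AG-"  # observed sequence over the same three sites

freqs = [
    observed_freq(site_profile([s[i] for s in sims]), state)
    for i, state in enumerate(obs)
]
# Sites ranked by descending frequency: lowest bins first, highest bins last.
ranked = sorted(range(len(obs)), key=lambda i: freqs[i], reverse=True)
print(freqs, ranked)
```

With this toy input, the gapped third site gets the maximum frequency and ranks first, while the second site, whose observed state never occurs in the simulations, ranks last.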

USAGE

    split-ppred-ali.pl <infiles> --sim-files=<files>... [optional arguments]

REQUIRED ARGUMENTS

<infiles>

Path to input ALI files [repeatable argument]. If infiles are not in ALI but in PHYLIP format, use the --phylip option below.

--sim-files=<files>...

List of paths to simulated input files. These files are assumed to be in PHYLIP format as they result from PhyloBayes' ppred.

OPTIONAL ARGUMENTS

--out[-suffix]=<suffix>

Suffix to append to (possibly stripped) infile basenames for deriving outfile names [default: none]. When not specified, outfile names are taken from infiles, and the original infiles are preserved by appending a .bak suffix to their names.

--sim-seq-list=<file>

Path to IDL file listing the ids of the sequences from which site profiles will be computed after acquiring simulated input files [default: all seqs].

--obs-seq-list=<file>

Path to IDL file listing the ids of the sequences that will give observed frequencies and thus govern site trimming in infiles [default: all seqs].

--by-seq

Enable seq-specific simulated site profiles [default: no]. When not specified, average site profiles are computed from simulated input files.

--from-scafos

Consider the input ALI file as generated by SCaFoS [default: no]. Currently, specifying this option results in turning all ambiguous and missing character states to gaps.

--del-const

Delete constant sites just as the -dc option of PhyloBayes [default: no].
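As a rough illustration of what deleting constant sites means, here is a hedged Python sketch. It assumes that a site is constant when all its non-gap, non-missing states are identical; how the script and PhyloBayes treat gaps at such sites is not stated here, so the `ignore` set and both function names are hypothetical.

```python
def is_constant(column, ignore="-?* "):
    """True if at most one distinct state remains once ignored characters are dropped."""
    states = {state for state in column if state not in ignore}
    return len(states) <= 1


def delete_constant_sites(seqs):
    """Return a copy of the alignment (id -> sequence) without its constant sites."""
    n_sites = len(next(iter(seqs.values())))
    keep = [
        i for i in range(n_sites)
        if not is_constant([seq[i] for seq in seqs.values()])
    ]
    return {sid: "".join(seq[i] for i in keep) for sid, seq in seqs.items()}


# Toy alignment: only the third site varies among the sequences.
alignment = {"a": "ACGT", "b": "ACTT", "c": "A-GT"}
trimmed = delete_constant_sites(alignment)
print(trimmed)
```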

--phylip

Assume infiles and outfiles are in PHYLIP format (instead of ALI format) [default: no].

--bin-number=<n>

Number of bins to define [default: 10].

--percentile

Define bins containing an equal number of sites rather than bins of equal width in terms of observed state frequencies [default: no].
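The difference between the two binning schemes can be sketched in Python as follows. This is an illustration only: the exact bin boundaries and tie handling used by the script are assumptions, as are the function names. Bin 0 is taken to hold the highest frequencies, in line with the ranking by descending frequency described in the SYNOPSIS.

```python
def equal_width_bins(freqs, n_bins):
    """Bins of equal width over the frequency range [0, 1]."""
    return [min(int((1.0 - f) * n_bins), n_bins - 1) for f in freqs]


def percentile_bins(freqs, n_bins):
    """Bins holding (nearly) equal numbers of sites, following the frequency ranking."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    bins = [0] * len(freqs)
    for rank, site in enumerate(order):
        bins[site] = rank * n_bins // len(freqs)
    return bins


# Toy observed-state frequencies for six sites, split into two bins.
freqs = [1.0, 1.0, 1.0, 0.75, 0.5, 0.0]
width_bins = equal_width_bins(freqs, 2)
pct_bins = percentile_bins(freqs, 2)
print(width_bins, pct_bins)
```

In this toy case the equal-width scheme puts four sites in the first bin and two in the second, whereas the percentile scheme puts three sites in each.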

--cumulative

Define bins including all previous bins [default: no]. This leads to ALI outfiles of increasing width where the sequences listed in --obs-seq-list include ever more character states rarely observed in simulated primary sequences for ids listed in --sim-seq-list.
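A small Python sketch of how cumulative bins differ from regular ones, assuming each site has already been assigned a bin index (0 being the lowest bin). The function name and the toy assignment are hypothetical, not taken from the script.

```python
def sites_per_bin(bins, n_bins, cumulative=False):
    """List the site indices belonging to each bin, optionally including all previous bins."""
    result = []
    for k in range(n_bins):
        if cumulative:
            result.append([i for i, b in enumerate(bins) if b <= k])
        else:
            result.append([i for i, b in enumerate(bins) if b == k])
    return result


# Toy bin assignment for five sites and three bins.
bins = [0, 0, 1, 2, 1]
regular = sites_per_bin(bins, 3)
cumul = sites_per_bin(bins, 3, cumulative=True)
print(regular)
print(cumul)
```

Under this reading, each cumulative bin is at least as wide as the previous one, and the last bin contains every site.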

--only-mask

Mask rarely observed states in sequences listed in --obs-seq-list instead of removing the corresponding sites from the alignment [default: no].
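The contrast between removing sites and masking them can be sketched in Python as below. The masking character (`?`), the function names, and the toy alignment are assumptions for illustration; the script may use a different symbol.

```python
def remove_sites(seqs, sites):
    """Drop the given site indices from every sequence of the alignment."""
    drop = set(sites)
    return {
        sid: "".join(c for i, c in enumerate(seq) if i not in drop)
        for sid, seq in seqs.items()
    }


def mask_sites(seqs, sites, obs_ids, mask_char="?"):
    """Keep all sites but hide the given ones in the observed sequences only."""
    hide = set(sites)
    masked = {}
    for sid, seq in seqs.items():
        if sid in obs_ids:
            seq = "".join(mask_char if i in hide else c for i, c in enumerate(seq))
        masked[sid] = seq
    return masked


# Toy alignment: sites 1 and 3 carry rarely simulated states in the observed sequence.
alignment = {"obs": "ACGT", "other": "ACGA"}
removed = remove_sites(alignment, [1, 3])
masked = mask_sites(alignment, [1, 3], obs_ids={"obs"})
print(removed, masked)
```

Masking keeps the alignment width unchanged and leaves the other sequences untouched, whereas removal shortens every sequence.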

--only-dump-freqs

Output simulated and observed state frequencies instead of producing regular output files [default: no]. When specified, this option supersedes all those pertaining to site binning.

--reorder

Reorder sequences by descending observed frequencies [default: no]. This option only applies when --only-dump-freqs is specified.

--version
--usage
--help
--man

Print the usual program information.

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.