NAME

alncut - filter sites in alignments based on variation and gap-content

SYNOPSIS

alncut [options] [MULTIFASTA-FILE...]

DESCRIPTION

alncut takes multifasta format alignment data as input and returns that data filtered for sites with various properties. By default, only invariant sites (sites with no variation) are returned. When the -f option is used, sites will be returned that are invariant up to a specified cut-off. More precisely, a site will be returned if the complement of the largest frequency component of that site is less than or equal to the cut-off.
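The cut-off rule can be illustrated with a short Python sketch. This is not alncut's implementation (alncut is a Perl tool based on Bioperl); the function names `minor_fraction` and `select_sites` are hypothetical, the alignment is assumed to be a list of equal-length strings, and gap characters are assumed here to count like any other character.

```python
from collections import Counter

def minor_fraction(column):
    """Complement of the largest frequency component of one site."""
    counts = Counter(column)
    largest = max(counts.values())
    return 1 - largest / len(column)

def select_sites(seqs, cutoff=0.0):
    """Return 0-based indices of sites whose minor fraction is <= cutoff."""
    columns = zip(*seqs)  # transpose sequences (rows) into sites (columns)
    return [i for i, col in enumerate(columns) if minor_fraction(col) <= cutoff]

# Toy alignment (hypothetical data): four sequences, four sites.
alignment = ["ACGT", "ACGA", "ACCA", "ACGA"]
invariant = select_sites(alignment)        # default: invariant sites only
relaxed = select_sites(alignment, 0.25)    # allow up to 25% minor variants
print(invariant, relaxed)
```

With a cut-off of 0 only the first two sites qualify; raising it to 0.25 also admits the two sites where a single sequence out of four differs.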

alncut may also be used to degap alignments. Gap-free sites may be selected with the -g option. When combined with the -f option, sites will be returned that are gap-free up to a cut-off, i.e. in which the gap-frequency is less than or equal to the cut-off.
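As a rough illustration of degapping by site, the following Python sketch selects sites by gap frequency and then extracts them from every sequence. It is an assumption-laden sketch rather than alncut's own code: the names `gap_fraction`, `gapfree_sites` and `extract` are hypothetical, and "-" is assumed to be the only gap character.

```python
def gap_fraction(column, gap_chars="-"):
    """Fraction of sequences that carry a gap at this site."""
    return sum(ch in gap_chars for ch in column) / len(column)

def gapfree_sites(seqs, cutoff=0.0):
    """0-based indices of sites whose gap frequency is <= cutoff."""
    return [i for i, col in enumerate(zip(*seqs)) if gap_fraction(col) <= cutoff]

def extract(seqs, sites):
    """Keep only the chosen sites in every sequence."""
    return ["".join(s[i] for i in sites) for s in seqs]

# Toy alignment (hypothetical data).
alignment = ["AC-T", "A--T", "ACGT", "AC-T"]
strict = gapfree_sites(alignment)          # sites with no gaps at all
lenient = gapfree_sites(alignment, 0.25)   # tolerate gaps in up to 25% of sequences
print(extract(alignment, strict))
print(extract(alignment, lenient))
```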

With the --allgap or -a option, alncut returns sites that contain only gaps; the -f option is then ignored. In all of its uses, the -v option causes alncut to output the set-complement of the sites it has selected. Therefore, to print all sites that are not all-gap, combine the -a and -v options.
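The interplay of selection and negation can be sketched in Python as follows. The helper names `allgap_sites` and `negate` are hypothetical and "-" is assumed to be the gap character; the sketch only mirrors the behaviour described above and is not taken from the tool's source.

```python
def allgap_sites(seqs, gap="-"):
    """0-based indices of sites that consist exclusively of gaps."""
    return [i for i, col in enumerate(zip(*seqs)) if all(ch == gap for ch in col)]

def negate(selected, n_sites):
    """Set-complement of the selected site indices, in alignment order."""
    chosen = set(selected)
    return [i for i in range(n_sites) if i not in chosen]

# Toy alignment (hypothetical data): only the second site is all-gap.
alignment = ["A--T", "A-GT", "C--T"]
selected = allgap_sites(alignment)
kept = negate(selected, len(alignment[0]))
print(selected, kept)
```

Negating the all-gap selection keeps every site that has a non-gap character in at least one sequence, including sites that are mostly gaps.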

Parsimoniously informative sites are variable sites in which at least two different site-characters or states are each represented in at least two different sequences. alncut will return parsimoniously informative sites with the --parsinf or -p option.
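The definition translates directly into a count of states that occur at least twice. The Python sketch below assumes that every character, gaps included, counts as a state; how alncut itself treats gaps for this test is not stated here. The function names are hypothetical.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2

def parsinf_sites(seqs):
    """0-based indices of parsimoniously informative sites."""
    return [i for i, col in enumerate(zip(*seqs)) if is_parsimony_informative(col)]

# Toy alignment (hypothetical data): five sequences, four sites.
alignment = ["AAGT", "AAGT", "ACTT", "ACTC", "AGTT"]
informative = parsinf_sites(alignment)
print(informative)
```

In this toy alignment the first site is invariant and the last site is variable but has only a singleton variant, so neither is informative.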

Options specific to alncut:

    -g, --gapfree             print gap-free sites
    -a, --allgap              print all-gap sites
    -p, --parsinf             print parsimoniously informative sites
    -v, --negate              print set-complement of selected sites
    -f, --frequency=<int>     print sites with max <int> minor variants or gaps
    -f, --frequency=<float>   print sites with max <float> minor variants or gaps
    -V, --verbose             report number and indices of selected sites to STDERR

Options general to FAST:

    -h, --help                        print a brief help message
    --man                             print full documentation
    --version                         print version
    -l, --log                         create/append to logfile
    -L, --logname=<string>            use logfile name <string>
    -C, --comment=<string>            save comment <string> to log
    --format=<format>                 use alternative format for input
    --moltype=<[dna|rna|protein]>     specify input sequence type

INPUT AND OUTPUT

alncut is part of FAST, the FAST Analysis of Sequences Toolbox, based on Bioperl. Most core FAST utilities expect input and return output in multifasta format. Input can be read from one or more files or from STDIN. Output is written to STDOUT. The FAST utility fasconvert can reformat other formats to and from multifasta.

OPTIONS

-g, --gapfree

Print only sites that contain no gaps

-a, --allgap

Print only sites that contain exclusively gaps

-p, --parsinf

Print only sites that are parsimoniously informative. Parsimoniously informative sites are variable sites in which at least two different site-characters or states are each represented in at least two different sequences.

-v, --negate

Print set-complement of sites otherwise selected; as a sole option, will print only variable sites

-f [int], --frequency=[int]

Print sites that contain gaps or minor variants up to a maximum of [int] sequences

-f [float], --frequency=[float]

Print sites that contain gaps or minor variants up to a maximum of [float] relative frequency
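The two forms of the -f argument can be pictured with the Python sketch below. How alncut actually tells the two forms apart is not documented here; this sketch assumes that a value that parses as an integer is a number of sequences and anything else is a relative frequency. The names `parse_cutoff` and `within_cutoff` are hypothetical.

```python
from collections import Counter

def parse_cutoff(text):
    """Assumed rule: integer text is a sequence count, otherwise a frequency."""
    try:
        return int(text)
    except ValueError:
        return float(text)

def within_cutoff(column, cutoff):
    """Compare the minor-variant count (int cutoff) or fraction (float cutoff)."""
    minor = len(column) - max(Counter(column).values())
    if isinstance(cutoff, int):
        return minor <= cutoff            # at most <int> sequences differ
    return minor / len(column) <= cutoff  # at most <float> relative frequency

# One site across ten sequences (hypothetical data): two minor variants.
site = "AAAAAAAACG"
print(within_cutoff(site, parse_cutoff("2")))     # count-based cut-off
print(within_cutoff(site, parse_cutoff("0.15")))  # frequency-based cut-off
```

Under this assumption "1" and "1.0" mean different things: the first allows one deviating sequence, the second allows every sequence to deviate.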

-V, --verbose

Print the number and indices of the sites selected by the criteria to STDERR

-h, --help

Print a brief help message and exit.

--man

Print the manual page and exit.

--version

Print version information and exit.

-l, --log

Creates, or appends to, a generic FAST logfile in the current working directory. The logfile records date/time of execution, full command with options and arguments, and an optional comment.

-L [string], --logname=[string]

Use [string] as the name of the logfile. Default is "FAST.log.txt".

-C [string], --comment=[string]

Include comment [string] in logfile. No comment is saved by default.

--format=[format]

Use alternative format for input. See man page for "fasconvert" for allowed formats. This is for convenience; the FAST tools are designed to exchange data in Fasta format, and "fasta" is the default format for this tool.

-m [dna|rna|protein], --moltype=[dna|rna|protein]

Specify the type of sequence on input (should not be needed in most cases, but sometimes Bioperl cannot guess and complains when processing data).

EXAMPLES

Print sites that are not all gap:

    alncut -av data.fas

Print sites with gaps in at most 2 sequences:

    alncut -gf 2 data.fas

Print sites in which the frequency of minor variants is at most 15 percent:

    alncut -f 0.15 data.fas

Print variable sites:

    alncut -v data.fas

SEE ALSO

To degap each sequence on input individually, see

    fastr --degap

man perlre
perldoc perlre

Documentation on Perl regular expressions.

man FAST
perldoc FAST

Introduction and cookbook for FAST

The FAST Home Page

CITING

If you use FAST, please cite Ardell (2013), "FAST: FAST Analysis of Sequences Toolbox", Bioinformatics, and Bioperl, Stajich et al.