Example:
representative_sequences [arguments] < input > output
This is a pipe command: input is read from standard input and output is written to standard output.
This script is a wrapper for the CDMI-API call representative_sequences. It is documented as follows:
The call returns two values. The first is the list of representative triples, and the second is the list of sets (the first entry of each set always being the representative sequence).
These are sequences that are currently representatives; the command extends this set.
order sequences using the designated option (note that -b is another way to get a long-to-short ordering). Supported options are:
    long-to-short
    default (as is)
order input sequences by size (long to short)
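The long-to-short ordering can be pictured with a small Python sketch. It is only an illustration of the idea, not the command's own code: the function name order_long_to_short and the sample triples are hypothetical, and the handling of equal-length sequences (input order is kept) is an assumption, since the documentation does not say how ties are resolved.

```python
# Each entry is an (id, comment, sequence) triple, matching the seq_triple
# description in this document. The sample data is made up for illustration.
def order_long_to_short(seq_set):
    """Return the triples sorted by sequence length, longest first.

    sorted() is stable, so equal-length sequences keep their input order.
    That tie handling is an assumption, not documented behavior.
    """
    return sorted(seq_set, key=lambda triple: len(triple[2]), reverse=True)


seqs = [
    ("s1", "short", "MKT"),
    ("s2", "long", "MKTAYIAKQR"),
    ("s3", "medium", "MKTAYI"),
]
ordered_ids = [triple[0] for triple in order_long_to_short(seqs)]
print(ordered_ids)  # ['s2', 's3', 's1']
```

Ordering longest first means a long sequence is considered as a representative before the shorter sequences that might be represented by it.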
behavior of clustering algorithm (0 or 1, default 1)
cluster_type 0 is the original method, which has only the representative for each group in the blast database. This can randomly segregate distant members of groups, regardless of the placement of other very similar sequences.
cluster_type 1 adds more diverse representatives of a group in the blast database. This is slightly more expensive, but is much less likely to split close relatives into different groups.
With the -d option, each cluster of sequences is written to a distinct file in the specified directory.
With the -f option, for each cluster, a tab-separated list of ids is written to the specified file.
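As a rough picture of the ID-only cluster listing, the following Python sketch writes each cluster as tab-separated ids. The helper name write_cluster_ids and the sample ids are hypothetical, and putting one cluster per line is an assumption; the documentation only states that a tab-separated list of ids is written for each cluster.

```python
import io


def write_cluster_ids(clusters, handle):
    """Write each cluster as tab-separated ids, representative first.

    One cluster per line is an assumed layout, not confirmed by the docs.
    """
    for cluster in clusters:
        handle.write("\t".join(cluster) + "\n")


# Hypothetical clusters: the first id in each list is the representative.
clusters = [["rep1", "a", "b"], ["rep2"]]
buf = io.StringIO()
write_cluster_ids(clusters, buf)
print(buf.getvalue(), end="")
```

A file in this shape can be read back by splitting each line on tabs and taking the first field as the representative.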
Sequences are removed if their similarity to a "kept" sequence exceeds a specified threshold (see -similarity below).
The possible measures of similarity that you can specify are as follows:
identity_fraction (default), positive_fraction (proteins only), or score_per_position (0-2 bits)
The similarity threshold used to determine when sequences are deleted (but represented by a kept sequence).
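The interplay between a similarity measure and the threshold can be sketched in Python as a simple greedy pass. This is a deliberately simplified illustration, not the command's algorithm: the real tool uses a BLAST database, whereas the hypothetical identity_fraction below compares sequences position by position with no alignment, and pick_representatives is an invented name.

```python
def identity_fraction(a, b):
    """Fraction of identical positions over the shorter sequence.

    Simplification: positions are compared directly, with no alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def pick_representatives(seq_set, cutoff):
    """Greedy pass: keep a sequence unless it is too similar to a kept one."""
    reps = []      # kept (id, comment, sequence) triples
    clusters = []  # lists of ids, representative first
    for seq_id, comment, seq in seq_set:
        for i, (_, _, rep_seq) in enumerate(reps):
            if identity_fraction(seq, rep_seq) > cutoff:
                clusters[i].append(seq_id)  # removed, represented by rep i
                break
        else:
            reps.append((seq_id, comment, seq))
            clusters.append([seq_id])
    return reps, clusters


seqs = [
    ("s1", "", "MKTAYIAKQR"),
    ("s2", "", "MKTAYIAKQK"),  # 9 of 10 positions match s1
    ("s3", "", "GGSSLLPPWW"),  # no positions match s1
]
reps, clusters = pick_representatives(seqs, cutoff=0.8)
print([r[0] for r in reps])  # ['s1', 's3']
print(clusters)              # [['s1', 's2'], ['s3']]
```

With a cutoff of 0.8, s2 is dropped because its identity to s1 (0.9) exceeds the threshold, while s3 becomes a new representative.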
$seq_set is a seq_set
$rep_seq_parms is a rep_seq_parms
$return_1 is an id_set
$return_2 is a reference to a list where each element is an id_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
    0: an id
    1: a comment
    2: a sequence
id is a string
comment is a string
sequence is a string
rep_seq_parms is a reference to a hash where the following keys are defined:
    existing_reps has a value which is a seq_set
    order has a value which is an int
    alg has a value which is an int
    type_sim has a value which is an int
    cutoff has a value which is a float
id_set is a reference to a list where each element is an id
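To make the type description concrete, here is how the two inputs might look if mirrored in Python, with a small shape check. All values are placeholders: the documentation does not say which integer codes order and type_sim accept, so the zeros below are assumptions, as is the cutoff of 0.8. The names EXPECTED_KEYS and check_rep_seq_parms are hypothetical.

```python
# seq_set: a list of (id, comment, sequence) triples.
seq_set = [
    ("seq1", "example protein", "MKTAYIAKQR"),
    ("seq2", "", "MKTAYI"),
]

# rep_seq_parms: a hash with the documented keys. Values are placeholders.
rep_seq_parms = {
    "existing_reps": [],  # a seq_set of sequences that are already representatives
    "order": 0,           # int; meaning of the codes is not documented here
    "alg": 1,             # int; 0 or 1, default 1
    "type_sim": 0,        # int; meaning of the codes is not documented here
    "cutoff": 0.8,        # float; illustrative value only
}

EXPECTED_KEYS = {
    "existing_reps": list,
    "order": int,
    "alg": int,
    "type_sim": int,
    "cutoff": float,
}


def check_rep_seq_parms(parms):
    """Return a list of problems; empty if parms matches the documented shape."""
    problems = []
    for key, value in parms.items():
        if key not in EXPECTED_KEYS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, EXPECTED_KEYS[key]):
            problems.append(f"{key} should be {EXPECTED_KEYS[key].__name__}")
    return problems


print(check_rep_seq_parms(rep_seq_parms))  # []
```

The check only inspects keys that are present, since the type description lists which keys are defined without saying that all of them are required.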
The input is a FASTA-formatted set of sequences. These sequences should not contain indels.
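The relationship between FASTA input and the (id, comment, sequence) triples can be illustrated with a minimal Python parser. This is a sketch under common FASTA conventions, not the command's own reader: it assumes the id is the first whitespace-delimited word after ">" and that the rest of the header line is the comment. The function name parse_fasta is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (id, comment, sequence) triples."""
    triples = []
    seq_id, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                triples.append((seq_id, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)  # sequence data may span several lines
    if seq_id is not None:
        triples.append((seq_id, comment, "".join(chunks)))
    return triples


fasta_text = ">s1 first protein\nMKTAY\nIAKQR\n>s2\nGGSS\n"
parsed = parse_fasta(fasta_text)
print(parsed)
# [('s1', 'first protein', 'MKTAYIAKQR'), ('s2', '', 'GGSS')]
```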
FASTA output of the representatives is always written to STDOUT. The -d option will cause a directory to be built containing the clusters. The -f option will cause an abbreviated format of the clusters (just IDs) to be written to the specified file.
To install Bio::KBase, copy and paste the appropriate command into your terminal.
cpanm:
cpanm Bio::KBase
CPAN shell:
perl -MCPAN -e shell
install Bio::KBase