proteins_to_roles

The routine proteins_to_roles allows a user to gather the set of functional roles that are associated with specific protein sequences. A single protein sequence (designated by an MD5 value) may have numerous associated functions, since functions are treated as an attribute of the feature, and multiple features may have precisely the same translation. In our experience, it is not uncommon, even for the best annotation teams, to assign distinct functions (and, hence, functional roles) to identical protein sequences.

For each input MD5 value, this routine gathers the set of features (fids) that share the same sequence, collects the associated functions, expands these into functional roles (for multi-functional proteins), and returns the set of roles that results.
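The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual CDMI implementation (which is a Perl API backed by a database): the two lookup tables, the function names, and the role delimiters (" / ", " @ ", "; ") are all hypothetical stand-ins, and dropping proteins with no roles from the result is a choice made for this sketch.

```python
import re

# Hypothetical toy data standing in for the database:
# MD5 -> features (fids) sharing that sequence, and fid -> assigned function.
FIDS_BY_MD5 = {
    "md5_a": ["fid.1", "fid.2"],
    "md5_b": ["fid.3"],
}
FUNCTION_BY_FID = {
    "fid.1": "Role A / Role B",  # a multi-functional assignment
    "fid.2": "Role A",
    "fid.3": "Role C",
}

# Assumed delimiters between the roles of a multi-functional protein.
ROLE_DELIMITERS = re.compile(r"\s+/\s+|\s+@\s+|\s*;\s+")


def function_to_roles(function):
    """Expand one function string into its individual roles."""
    return [r.strip() for r in ROLE_DELIMITERS.split(function) if r.strip()]


def proteins_to_roles_sketch(md5s):
    """Map each MD5 to the de-duplicated roles of all features sharing it."""
    result = {}
    for md5 in md5s:
        roles = []
        for fid in FIDS_BY_MD5.get(md5, []):
            function = FUNCTION_BY_FID.get(fid)
            if not function:
                continue
            for role in function_to_roles(function):
                if role not in roles:  # keep first-seen order, no duplicates
                    roles.append(role)
        if roles:  # assumption: proteins with no roles are left out
            result[md5] = roles
    return result


print(proteins_to_roles_sketch(["md5_a", "md5_b", "md5_x"]))
```

Here md5_a collects roles from two features with different assignments, which is the situation described above for identical sequences annotated differently.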

Note that, if the user wishes to see the specific features that have the assigned functional roles, they should use proteins_to_functions instead (it returns the fids associated with each assigned function).

Example:

    proteins_to_roles [arguments] < input > output

The standard input should be a tab-separated table (i.e., each line is a tab-separated set of fields). Normally, the last field in each line contains the identifier. If another column contains the identifier, use

    -c N

where N is the column (from 1) that contains the protein identifier.

This is a pipe command. The input is taken from the standard input, and the output is to the standard output. For each input line, there can be multiple output lines, one for each role the protein can map to. The role is added to the end of each line.
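The pipe behaviour can be illustrated with a small Python sketch. It is not the script's own code: the function name extend_lines and the roles_by_protein dictionary are hypothetical, and the lookup is a plain in-memory mapping rather than a call to the real service. The column argument follows the -c convention (counted from 1, defaulting to the last field), and lines that cannot be extended are collected separately, mirroring the documented behaviour of writing them to stderr.

```python
def extend_lines(lines, roles_by_protein, column=None):
    """Append one role per output line; column is 1-based, None means last field.

    Returns (extended, rejected): extended lines would go to stdout,
    rejected lines (no role found) would go to stderr.
    """
    extended, rejected = [], []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        index = len(fields) - 1 if column is None else column - 1
        protein = fields[index] if 0 <= index < len(fields) else None
        roles = roles_by_protein.get(protein, [])
        if roles:
            # One output line per role, with the role added as the last field.
            extended.extend(f"{line}\t{role}" for role in roles)
        else:
            rejected.append(line)
    return extended, rejected


# Hypothetical lookup table and input rows for illustration.
lookup = {"md5_a": ["Role A", "Role B"]}
rows = ["gene1\tmd5_a", "gene2\tmd5_x"]

out, err = extend_lines(rows, lookup)
for row in out:
    print(row)
```

With this input, the first row yields two output rows (one per role) and the second row is rejected because its identifier has no roles in the lookup table.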

Documentation for underlying call

This script is a wrapper for the CDMI-API call proteins_to_roles. It is documented as follows:

    $return = $obj->proteins_to_roles($proteins)

Parameter and return types

    $proteins is a proteins
    $return is a reference to a hash where the key is a protein and the value is a roles
    proteins is a reference to a list where each element is a protein
    protein is a string
    roles is a reference to a list where each element is a role
    role is a string
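For readers more used to typed notation, the shapes above can be written out as Python type aliases. This is only a restatement of the type description, not part of the API: the alias names, the sample values, and the is_valid_return helper are hypothetical.

```python
from typing import Dict, List

Protein = str                 # "protein is a string" (an MD5 value)
Role = str                    # "role is a string"
Proteins = List[Protein]      # list where each element is a protein
Roles = List[Role]            # list where each element is a role
ProteinsToRoles = Dict[Protein, Roles]  # hash: protein -> roles

# Hypothetical values showing the shape of the argument and the return value.
example_input: Proteins = ["md5_a", "md5_b"]
example_return: ProteinsToRoles = {
    "md5_a": ["Role A", "Role B"],
    "md5_b": ["Role C"],
}


def is_valid_return(value) -> bool:
    """Check that a value has the protein -> list-of-roles shape."""
    return isinstance(value, dict) and all(
        isinstance(protein, str)
        and isinstance(roles, list)
        and all(isinstance(role, str) for role in roles)
        for protein, roles in value.items()
    )


print(is_valid_return(example_return))
```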

Command-Line Options

-c Column

This is used only if the column containing the protein identifier is not the last column.

-i InputFile

Use InputFile rather than stdin.

Output Format

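The standard output is a tab-delimited file. Each output line is an input line with a role appended as an extra final column, and an input line produces one output line for each role its protein maps to.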

Input lines that cannot be extended are written to stderr.