#!perl
use strict;
use Data::Dumper;
use Carp;

#
# This is a SAS Component
#

=head1 proteins_to_functions

The routine proteins_to_functions allows users to access the functions associated with
specific protein sequences. The input proteins are given as a list of MD5 values
(each MD5 value corresponds to a specific protein sequence). For each input
MD5 value, a list of [feature-id,function] pairs is constructed and returned.
Note that there are many cases in which a single protein sequence is
the translation of multiple protein-encoding genes, and each gene may
have a distinct function (an undesirable situation, we grant).

This function allows you to access all of the functions assigned (by all annotation
groups represented in KBase) to each of a set of sequences.

Example:

    proteins_to_functions [arguments] < input > output

The standard input should be a tab-separated table (i.e., each line
is a tab-separated set of fields). Normally, the last field in each
line contains the identifier. If another column contains the identifier,
use

    -c N

where N is the column (numbered from 1) that contains the identifier.

This is a pipe command. The input is taken from the standard input, and the
output is written to the standard output. For each input line there may be multiple output lines, one for each fid the protein maps to. Two columns are added to each output line: a fid and a function.

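
As a rough illustration of the column convention described above, the following sketch picks the identifier out of one tab-separated line. The helper name pick_identifier is hypothetical and is not part of the KBase libraries; the script itself delegates this work to GetBatch, whose implementation may differ.

    use strict;
    use warnings;

    # Hypothetical helper, for illustration only.
    # $column is 1-based; when it is undefined the last field is used.
    sub pick_identifier {
        my ($line, $column) = @_;
        chomp $line;
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        return defined $column ? $fields[$column - 1] : $fields[-1];
    }

    # Placeholder values, not real MD5s or genome ids.
    print pick_identifier("genome1\tmd5-placeholder\n"), "\n";       # last field
    print pick_identifier("md5-placeholder\tgenome1\n", 1), "\n";    # column 1
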
=head2 Documentation for underlying call

This script is a wrapper for the CDMI-API call proteins_to_functions. It is documented as follows:

  $return = $obj->proteins_to_functions($proteins)

=over 4

=item Parameter and return types

=begin html

<pre>
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fid_function_pairs
proteins is a reference to a list where each element is a protein
protein is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
	0: a fid
	1: a function
fid is a string
function is a string
</pre>

=end html

=begin text

$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fid_function_pairs
proteins is a reference to a list where each element is a protein
protein is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
	0: a fid
	1: a function
fid is a string
function is a string

=end text

=back

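
To make the return type above more concrete, here is a minimal sketch of walking such a structure. The hash below is a hand-written stand-in for the value returned by the call; the MD5, fid, and function strings are placeholders and do not come from a real query.

    use strict;
    use warnings;

    # Stand-in for $return: protein => list of [fid, function] pairs.
    # All values are made up for illustration.
    my $return = {
        'md5-placeholder-1' => [
            [ 'fid-placeholder-1', 'hypothetical function A' ],
            [ 'fid-placeholder-2', 'hypothetical function B' ],
        ],
    };

    # Flatten the structure into one tab-separated row per pair.
    my @rows;
    for my $protein (sort keys %$return) {
        for my $pair (@{ $return->{$protein} }) {
            my ($fid, $function) = @$pair;
            push @rows, join("\t", $protein, $fid, $function);
        }
    }
    print "$_\n" for @rows;

A protein with several [fid, function] pairs yields several rows, which mirrors how this script emits one output line per pair.
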
=head2 Command-Line Options

=over 4

=item -c Column

This is used only if the column containing the identifier is not the last column.

=item -i InputFile

Read input from InputFile rather than from stdin.

=back

=head2 Output Format

The standard output is a tab-delimited file. It consists of the input
file with extra columns added.

Input lines that cannot be extended are written to stderr.

=cut

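use Bio::KBase::Utilities::ScriptThing;
use Bio::KBase::CDMI::CDMIClient;

my $usage = "usage: proteins_to_functions [-c column] < input > output";

my $column;
my $input_file;

# The option specs for -c and -i are handed to new_for_script along with
# references to the variables that receive their values.
my $kbO = Bio::KBase::CDMI::CDMIClient->new_for_script('c=i' => \$column,
                                                       'i=s' => \$input_file);
if (! $kbO) { print STDERR "$usage\n"; exit }
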
# Read from the file given with -i, or from standard input by default.
my $ih;
if ($input_file)
{
    open($ih, "<", $input_file) or die "Cannot open input file $input_file: $!";
}
else
{
    $ih = \*STDIN;
}

# Process the input in batches; each tuple is [identifier, original line].
while (my @tuples = Bio::KBase::Utilities::ScriptThing::GetBatch($ih, undef, $column)) {
    my @h = map { $_->[0] } @tuples;
    my $h = $kbO->proteins_to_functions(\@h);
    for my $tuple (@tuples) {
        #
        # Process output here and print.
        #
        my ($id, $line) = @$tuple;
        my $v = $h->{$id};

        if (! defined($v))
        {
            # Nothing was returned for this identifier: write the
            # unextended line to stderr.
            print STDERR $line, "\n";
        }
        elsif (ref($v) eq 'ARRAY')
        {
            # One output line per [fid, function] pair.
            foreach my $pair (@$v)
            {
                my $extra = join("\t", @$pair);
                print "$line\t$extra\n";
            }
        }
        else
        {
            print "$line\t$v\n";
        }
    }
}