NAME

Bio::Graph::ProteinGraph - a representation of a protein interaction graph.

SYNOPSIS

  # Read in from file
  my $graphio = Bio::Graph::IO->new(-file   => 'myfile.dat',
                                    -format => 'dip');
  my $graph   = $graphio->next_network();

Using ProteinGraph

  # Remove duplicate interactions from within a dataset
  $graph->remove_dup_edges();

  # Get a node (represented by a sequence object) from the graph.
  my $seqobj = $graph->nodes_by_id('P12345');

  # Get clustering coefficient of a given node.
  my $cc = $graph->clustering_coefficient($graph->nodes_by_id('NP_023232'));
  if ($cc != -1) {  ## result is -1 if cannot be calculated
    print "CC for NP_023232 is $cc\n";
  }

  # Get graph density
  my $density = $graph->density();

  # Get connected subgraphs
  my @graphs = $graph->components();

  # Remove a node
  $graph->remove_nodes($graph->nodes_by_id('P12345'));

  # How many interactions are there?
  my $count = $graph->edge_count;

  # How many nodes are there?
  my $ncount = $graph->node_count();

  # Let's get interactions above a threshold confidence score.
  my $edges = $graph->edges;
  for my $edge (keys %$edges) {
    if (defined($edges->{$edge}->weight()) &&
        $edges->{$edge}->weight() > 0.6) {
      print $edges->{$edge}->object_id(), "\t",
            $edges->{$edge}->weight(), "\n";
    }
  }

  # Get interactors of your favourite protein
  my $node      = $graph->nodes_by_id('NP_023232');
  my @neighbors = $graph->neighbors($node);
  print "NP_023232 interacts with ";
  print join ", ", map { $_->object_id() } @neighbors;
  print "\n";

  # Annotate your sequences with interaction info
  my @seqs; ## array of sequence objects
  for my $seq (@seqs) {
    if ($graph->has_node($seq->accession_number)) {
      my $node      = $graph->nodes_by_id($seq->accession_number);
      my @neighbors = $graph->neighbors($node);
      for my $n (@neighbors) {
        my $ft = Bio::SeqFeature::Generic->new(
                     -primary_tag => 'Interactor',
                     -tag         => { id => $n->accession_number }
                     );
        $seq->add_SeqFeature($ft);
      }
    }
  }

  # Get proteins with > 10 interactors
  my @nodes = $graph->nodes();
  my @hubs;
  for my $node (@nodes) {
    if ($graph->neighbor_count($node) > 10) {
       push @hubs, $node;
    }
  }
  print "the following proteins have > 10 interactors:\n";
  print join "\n", map{$_->object_id()} @hubs;

  # Merge graphs 1 and 2 and flag duplicate edges
  $g1->union($g2);
  my @duplicates = $g1->dup_edges();
  print "these interactions exist in both graphs:\n";
  print join "\n", map{$_->object_id} @duplicates;

Creating networks from your own data

If you have interaction data in your own format, e.g.

  edgeid  node1  node2  score

  my $io = Bio::Root::IO->new(-file => 'mydata');
  my $gr = Bio::Graph::ProteinGraph->new();
  my %seen = (); # to record seen nodes
  while (my $l = $io->_readline() ) {

    # Parse out your data...
    my ($e_id, $n1, $n2, $sc) = split /\s+/, $l;

    # ...then make nodes if they don't already exist in the graph...
    my @nodes = ();
    for my $n ($n1, $n2) {
      if (!exists($seen{$n})) {
        push @nodes, Bio::Seq->new(-accession_number => $n);
        $seen{$n} = $nodes[$#nodes];
      } else {
        push @nodes, $seen{$n};
      }
    }

    # ...and add a new edge to the graph
    my $edge = Bio::Graph::Edge->new(-nodes  => \@nodes,
                                     -id     => $e_id,
                                     -weight => $sc);
    $gr->add_edge($edge);
  }

DESCRIPTION

A ProteinGraph is a representation of a protein interaction network. It derives most of its functionality from the Bio::Graph::SimpleGraph module, but is adapted to be able to use protein identifiers to identify the nodes.

This graph can use any objects that implement Bio::AnnotatableI and Bio::IdentifiableI interfaces. Bio::Seq (but not Bio::PrimarySeqI) objects can therefore be used for the nodes but any object that supports annotation objects and the object_id() method should work fine.

At present it is fairly 'lightweight' in that it represents nodes and edges but does not contain all the data about experiment ids etc. found in the Proteomics Standards Initiative schema. Hopefully that will be available soon.

A dataset may contain duplicate or redundant interactions. Duplicate interactions are interactions that occur twice in the dataset but with a different interaction ID, perhaps from a different experiment. The dup_edges method will retrieve these.

Redundant interactions are interactions that occur twice or more in a dataset with the same interaction id. These are more likely to be due to database errors. These methods are useful when merging 2 datasets using the union() method. Interactions present in both datasets, with different IDs, will be duplicate edges.

For Developers

In this module, nodes are represented by Bio::Seq::RichSeq objects containing all possible database identifiers but no sequence, as parsed from the interaction files. However, a node represented by a Bio::PrimarySeq object should work fine too.

Edges are represented by Bio::Graph::Edge objects, which implement Bio::IdentifiableI. In order to work with SimpleGraph these objects must be array references, with the first 2 elements being references to the 2 nodes. More data can be added in $e[2] etc.

At present edges only have an identifier and a weight() method, to hold confidence data, but subclasses of this could hold all the interaction data held in an XML document.

So, a graph has the following data:

1. A hash of nodes ('_nodes'), where keys are the text representation of a node's memory address and values are the sequence object references.

2. A hash of neighbors ('_neighbors'), where keys are the text representation of a node's memory address and a value is a reference to a list of neighboring node references.

3. A hash of edges ('_edges'), where a key is a text representation of the 2 nodes, e.g. "address1,address2" as a string, and values are Bio::Graph::Edge objects.

4. Look up hash ('_id_map') for finding a node by any of its ids.

5. Look up hash for edges ('_edge_id_map') for retrieving an edge object from its identifier.

6. Hash ('_components').

7. An array of duplicate edges ('_dup_edges').

8. Hash ('_is_connected').
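
As a rough picture of that layout, here is a sketch using plain Perl hashes rather than the module itself. Only the quoted field names come from the list above; the stand-in node and edge values, and the use of a stringified reference as the "memory address" key, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Two stand-in node objects (plain hash refs here, not real Bio::Seq objects).
my $n1 = { id => 'P1' };
my $n2 = { id => 'P2' };

# A stand-in edge: the first two array elements are the node references.
my $edge = [ $n1, $n2, { id => 'E1', weight => 0.8 } ];

my %graph = (
    # "$n1" stringifies a reference to text such as HASH(0x55d0c8a3e2a0)
    _nodes       => { "$n1" => $n1, "$n2" => $n2 },
    _neighbors   => { "$n1" => [$n2], "$n2" => [$n1] },
    _edges       => { join(',', "$n1", "$n2") => $edge },
    _id_map      => { P1 => $n1, P2 => $n2 },
    _edge_id_map => { E1 => $edge },
    _dup_edges   => [],
);

# Look up a node by id, then find its neighbors via its address key.
my $node      = $graph{_id_map}{P1};
my @neighbors = @{ $graph{_neighbors}{"$node"} };
print join(',', map { $_->{id} } @neighbors), "\n";
```

Keying by address means two distinct objects that describe the same protein are distinct nodes, which is why the separate id look-up hash is needed.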

REQUIREMENTS

To use this code you will need the Clone.pm module, available from CPAN. You also need Class::AutoClass, also available from CPAN. To read in XML data you will need XML::Twig, available from CPAN.

SEE ALSO

Bio::Graph::SimpleGraph, Bio::Graph::IO, Bio::Graph::Edge, Bio::Graph::IO::dip, Bio::Graph::IO::psi_xml

FEEDBACK

Mailing Lists

User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web:

  http://bugzilla.open-bio.org/

AUTHORS

 Richard Adams - this module, Graph::IO modules.

 Email richard.adams@ed.ac.uk

AUTHOR2

 Nat Goodman - SimpleGraph.pm, and all underlying graph algorithms.

has_node

 Name      : has_node
 Purpose   : Is a protein in the graph?
 Usage     : if ($g->has_node('NP_23456')) {....}
 Returns   : 1 if true, 0 if false
 Arguments : A sequence identifier.

nodes_by_id

 Name      : nodes_by_id
 Purpose   : get node memory address from an id
 Usage     : my @neighbors= $self->neighbors($self->nodes_by_id('O232322'))
 Returns   : a SimpleGraph node representation (a text representation
             of a node needed for other graph methods, e.g.,
             neighbors(), edges())
 Arguments : a protein identifier, e.g., its accession number.

union

 Name        : union
 Purpose     : To merge two graphs together, flagging interactions as 
               duplicate.
 Usage       : $g1->union($g2), where g1 and g2 are 2 graph objects. 
 Returns     : void, $g1 is modified
 Arguments   : A Graph object of the same class as the calling object. 
 Description : This method merges 2 graphs. The calling graph is modified, 
               the parameter graph ($g2 in usage) is unchanged. To take 
               account of differing IDs identifying the same protein, all 
               ids are compared. The following rules are used to modify $g1.

               First of all both graphs are scanned for nodes that share 
               an id in common. 

         1. If 2 nodes (proteins) share an interaction in both graphs,
            the edge in graph 2 is copied to graph 1 and added as a
            duplicate edge to graph 1.

         2. If 2 nodes interact in $g2 but not $g1, but both nodes exist
            in $g1, the attributes of the interaction in $g2 are
            used to make a new edge in $g1.

         3. If 2 nodes interact in $g2 but not $g1, and 1 of them is a new
            protein, that protein is put in $g1 and a new edge made to
            it.

         4. At present, if there is an interaction in $g2 composed of a
            pair of interactors that are not present in $g1, they are 
            not copied to $g1. This is rather conservative but prevents
            the problem of having redundant nodes in $g1 due to the same
            protein being identified by different ids in the same graph.

         So, for example 

              Edge   N1  N2 Comment

    Graph 1:  E1     P1  P2
              E2     P3  P4
              E3     P1  P4

    Graph 2:  X1     P1  P2 - will be added as duplicate to Graph1
              X2     P1  X4 - X4 added to Graph 1 and new edge made
              X3     P2  P3 - new edge links existing proteins in G1
              X4     Z4  Z5 - not added to Graph1. Are these different
                              proteins or synonyms for proteins in G1?
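
The four rules can be sketched in plain Perl on the example above. This is an illustration only, not the module's code: each protein is assumed to have a single id (the real method compares all ids of each node), the helper pair_key and the %dups and %skipped hashes are hypothetical names, and graph 1's node set is indexed once before merging.

```perl
use strict;
use warnings;

# Each graph: edge id => [node1, node2], taken from the example table.
my %g1 = (E1 => ['P1', 'P2'], E2 => ['P3', 'P4'], E3 => ['P1', 'P4']);
my %g2 = (X1 => ['P1', 'P2'], X2 => ['P1', 'X4'],
          X3 => ['P2', 'P3'], X4 => ['Z4', 'Z5']);

# Order-independent key for an undirected node pair.
sub pair_key { return join ',', sort @_ }

# Index graph 1 by node and by interacting pair.
my (%nodes1, %pairs1);
for my $e (values %g1) {
    $nodes1{$_} = 1 for @$e;
    $pairs1{ pair_key(@$e) } = 1;
}

my (%dups, %skipped);
for my $id (sort keys %g2) {
    my ($n1, $n2) = @{ $g2{$id} };
    my $known = grep { $nodes1{$_} } $n1, $n2;   # how many nodes are in graph 1

    if ($pairs1{ pair_key($n1, $n2) }) {
        $dups{$id} = [$n1, $n2];        # rule 1: flagged as duplicate
    }
    elsif ($known >= 1) {
        $g1{$id} = [$n1, $n2];          # rules 2 and 3: new edge in graph 1
    }
    else {
        $skipped{$id} = [$n1, $n2];     # rule 4: neither node known, not copied
    }
}

print "graph 1 edges: ", join(' ', sort keys %g1), "\n";
print "duplicates:    ", join(' ', sort keys %dups), "\n";
print "not copied:    ", join(' ', sort keys %skipped), "\n";
```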

edge_count

 Name     : edge_count
 Purpose  : returns number of unique interactions, excluding 
            redundancies/duplicates
 Arguments: void
 Returns  : An integer
 Usage    : my $count  = $graph->edge_count;

node_count

 Name     : node_count
 Purpose  : returns number of nodes.
 Arguments: void
 Returns  : An integer
 Usage    : my $count = $graph->node_count;

neighbor_count

 Name      : neighbor_count
 Purpose   : returns number of neighbors of a given node
 Usage     : my $count = $gr->neighbor_count($node)
 Arguments : a node object
 Returns   : an integer

_get_ids_by_db

 Name     : _get_ids_by_db
 Purpose  : gets all ids for a node, assuming its Bio::Seq object
 Arguments: A Bio::SeqI object
 Returns  : A hash: Keys are db ids, values are accessions
 Usage    : my %ids = $gr->_get_ids_by_db($seqobj);

add_edge

 Name        : add_edge
 Purpose     : adds an interaction to a graph.
 Usage       : $gr->add_edge($edge)
 Arguments   : a Bio::Graph::Edge object, or a reference to a 2 element list. 
 Returns     : void
 Description : This is the method to use to add an interaction to a graph. 
               It contains the logic used to determine if an edge is a 
               new edge, a duplicate (an existing interaction with a 
               different edge id) or a redundant edge (same interaction, 
               same edge id).
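
A minimal sketch of that three-way decision, in plain Perl and independent of the module. The function classify_edge and the two look-up hashes are hypothetical names for illustration; the real method works on Bio::Graph::Edge objects and node references rather than id strings.

```perl
use strict;
use warnings;

my (%edge_by_id, %edge_by_pair);

# Returns 'new', 'duplicate' or 'redundant' for an incoming edge.
sub classify_edge {
    my ($id, $n1, $n2) = @_;
    my $pair = join ',', sort $n1, $n2;   # undirected: order does not matter

    # Same interaction, same edge id.
    return 'redundant'
        if exists $edge_by_id{$id} && $edge_by_id{$id} eq $pair;

    # Existing interaction, different edge id.
    return 'duplicate' if exists $edge_by_pair{$pair};

    $edge_by_id{$id}     = $pair;
    $edge_by_pair{$pair} = $id;
    return 'new';
}

my @results;
for my $e ([ 'E1', 'P1', 'P2' ], [ 'E2', 'P2', 'P1' ], [ 'E1', 'P1', 'P2' ]) {
    push @results, classify_edge(@$e);
}
print "@results\n";
```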

subgraph

 Name      : subgraph
 Purpose   : To construct a subgraph of nodes from the main network. This
             method overrides that of Bio::Graph::SimpleGraph in its
             dealings with Edge objects.
 Usage     : my $sg = $gr->subgraph(@nodes).
 Returns   : A subgraph of the same class as the original graph. Edge
             objects are cloned from the original graph but node objects
             are shared, so beware if you start deleting nodes from the
             parent graph whilst operating on subgraph nodes.
 Arguments : A list of node objects.

add_dup_edge

 Name       : add_dup_edge
 Purpose    : to flag an interaction as a duplicate, taking advantage of
              edge ids. The idea is that interactions from 2 sources with
              different interaction ids can be used to provide more
              evidence for an interaction being true, while preventing
              redundancy of the same interaction being present more than
              once in the same dataset.
 Returns    : 1 on successful addition, 0 on there being an existing
              duplicate.
 Usage      : $gr->add_dup_edge(Bio::Graph::Edge->new(
                                    -nodes  => [$n1, $n2],
                                    -weight => $score,
                                    -id     => $id));
 Arguments  : an EdgeI implementing object.
 Description:

edge_by_id

 Name        : edge_by_id
 Purpose     : retrieve data about an edge from its id
 Arguments   : a text identifier
 Returns     : a Bio::Graph::Edge object or undef
 Usage       : my $edge = $gr->edge_by_id('1000E');

remove_dup_edges

 Name        : remove_dup_edges
 Purpose     : removes duplicate edges from graph
 Arguments   : none         - removes all duplicate edges
               edge id list - removes specified edges
 Returns     : void
 Usage       :    $gr->remove_dup_edges()
               or $gr->remove_dup_edges($edgeid1, $edgeid2);

redundant_edge

 Name        : redundant_edge
 Purpose     : adds/retrieves redundant edges to graph
 Usage       : $gr->redundant_edge($edge)
 Arguments   : none (getter) or a Bio::Graph::Edge object (setter). 
 Description : redundant edges are edges in a graph that have the 
               same edge id, i.e. are 2 identical interactions. 
               With edge arg adds it to list, else returns list as reference. 

redundant_edges

 Name         : redundant_edges
 Purpose      : alias for redundant_edge

remove_redundant_edges

 Name        : remove_redundant_edges
 Purpose     : removes redundant_edges from graph, used by remove_node(),
               may be better as an internal method??
 Arguments   : none         - removes all redundant edges
               edge id list - removes specified edges
 Returns     : void
 Usage       :    $gr->remove_redundant_edges()
               or $gr->remove_redundant_edges($edgeid1, $edgeid2);

clustering_coefficient

 Name      : clustering_coefficient
 Purpose   : determines the clustering coefficient of a node, a number
             in range 0-1 indicating the extent to which the neighbors of
             a node are interconnected.
 Arguments : A sequence object (preferred) or a text identifier
 Returns   : The clustering coefficient. 0 is a valid result.
             If the CC is not calculable (if the node has <2 neighbors),
             returns -1.
 Usage     : my $node = $gr->nodes_by_id('P12345');
             my $cc   = $gr->clustering_coefficient($node);

remove_nodes

 Name      : remove_nodes
 Purpose   : to delete a node from a graph, e.g., to simulate effect 
             of mutation
 Usage     : $gr->remove_nodes($seqobj);
 Arguments : a single $seqobj or list of seq objects (nodes)
 Returns   : 1 on success

unconnected_nodes

 Name      : unconnected_nodes
 Purpose   : return a list of nodes with no connections. 
 Arguments : none
 Returns   : an array or array reference of unconnected nodes
 Usage     : my @ucnodes = $gr->unconnected_nodes();

articulation_points

 Name      : articulation_points
 Purpose   : to find nodes in a graph that if removed will fragment
             the graph into islands.
 Usage     : my $edgeref = $gr->articulation_points();
             for my $e (keys %$edgeref) {
               print $e->[0]->accession_number. "-".
                     $e->[1]->accession_number ."\n";
             }
 Arguments : none
 Returns   : a list references to nodes that will fragment the graph 
             if deleted. 
 Notes     : This is a "slow but sure" method that works with graphs
               up to a few hundred nodes reasonably fast.
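
One simple way to find articulation points is brute force: remove each node in turn and check whether the number of connected components goes up. The sketch below does this in plain Perl on a small adjacency hash. It illustrates the idea only and is not claimed to be the algorithm this module uses; component_count is a hypothetical helper.

```perl
use strict;
use warnings;

# Undirected adjacency: a triangle A-B-C with a tail C-D-E.
my %adj;
for my $e (['A', 'B'], ['B', 'C'], ['C', 'A'], ['C', 'D'], ['D', 'E']) {
    my ($x, $y) = @$e;
    $adj{$x}{$y} = 1;
    $adj{$y}{$x} = 1;
}

# Count connected components, ignoring the node in $skip (if any).
sub component_count {
    my ($skip) = @_;
    my %seen;
    my $count = 0;
    $seen{$skip} = 1 if defined $skip;
    for my $start (sort keys %adj) {
        next if $seen{$start};
        $count++;
        $seen{$start} = 1;
        my @queue = ($start);
        # Breadth-first walk over everything reachable from $start.
        while (defined(my $n = shift @queue)) {
            for my $nb (keys %{ $adj{$n} }) {
                next if $seen{$nb}++;
                push @queue, $nb;
            }
        }
    }
    return $count;
}

my $base   = component_count();
my @points = grep { component_count($_) > $base } sort keys %adj;
print "articulation points: @points\n";
```

Each check walks the whole graph, so the cost grows roughly with nodes times edges, which fits the note above about graphs of a few hundred nodes.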

is_articulation_point

 Name      : is_articulation_point
 Purpose   : to determine if a given node is an articulation point or not. 
 Usage     : if ($gr->is_articulation_point($node)) {.... 
 Arguments : a text identifier for the protein or the node itself
 Returns   : 1 if node is an articulation point, 0 if it is not