Spreading activation search is a useful way of getting something called "expanded recall", where you can type in a keyword query and get back relevant  results that may contain an exact keyword match, as if by magic.    The magic in this case is very straightforward, but you don't have to tell your boss that.

Most keyword search engines work by building some kind of a lookup table, in many cases a reverse index of words to documents.  This module behaves a little differently, cr

This module is an implementation of a search technique called 'contextual
network search', or 'spreading activation search'.

The idea is to represent a document collection as a set of document and term
nodes in a bipartite graph.  How you generate the term list is up to you - our
own approach is to extract all nouns and noun phrases using a part-of-speech
tagger (see L<Lingua::EN::Tagger>).

Documents and terms are connected by weighted edges.  Weights on the edges are a
function of your choice of weighting algorithm.   The only restriction is that
weights must not exceed 1.

We search the graph by energizing a query node with an arbitrary starting energy
E.  We then distribute that energy among the neighbor nodes, according to the
following formula.  First, divide the energy by the number of neighbor nodes -
call this new value S.   For example, if the starting energy is 10,000, and our
node has five neighbors, S = 2000 units.   Next, determine whether S exceeds our
arbitrary threshold.    If S is less than the threshold, we stop propagating.
If S exceeds the threshold, we assign energies to all the neighbor nodes, and
recurse down.

The energy assigned to each neighbor node will depend on the weight of the edge
connecting it to the starting node.  Since this weight is guaranteed to fall
between 0 and 1, the maximum energy a neighbor node can receive is S.

The module uses a pure Perl implementation, or a compiled C implementation.  The
principal difference between the two is memory requirements - the C version is about half 
the size of the Perl one for a given document collection.   You can select the
C version by passing a parameter to the constructor:

	my $cg = Search::ContextGraph->new( 'xs' => 1 );

Further details are available in the module documentation.


The module can take either a hash of document titles and term lists, or a term-
document matrix (TDM) file.  The first format looks like this

		TITLE => { 
					WORD => COUNT,
			   		WORD => COUNT,
The TDM input format is useful for very large collections.  The TDM file is a plain text file with the following format:

Arbitrary text
Arbitrary text
Arbitrary text

The first two lines are filler, the third line contains the number of unique terms and unique docs, the fourth line is filler.

The format for subsequent lines is as follows:

A is the number of keywords in the document
B is a unique identifier for the keyword
C is the weight to assign the edge between keyword and document

So a line like:

2 12 0.233 23 0.91

Would mean the document contains two keywords, with identifiers 12 and 23, and 
that the edges in the graph should be set to 0.233 and 0.91 respectively.

You may be wondering where the unique identifier for the document is.  The answer is - the program determines it based on the position in the file.  
So the first data line corresponds to document 0, the second line is document 1, etc.


To install this module type the following:

   perl Makefile.PL
   make test
   make install


Copyright (C) 2003 Maciej Ceglowski, Schuyler Erle

This is free software.  See LICENSE for details.