NAME
Text::Document - a text document subject to statistical analysis
SYNOPSIS
my $t = Text::Document->new();
$t->AddContent( 'foo bar baz' );
$t->AddContent( 'foo barbaz; ' );
my $freqList = $t->KeywordFrequency();
my $u = Text::Document->new();
...
my $sj = $t->JaccardSimilarity( $u );
my $sc = $t->CosineSimilarity( $u );
my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );
DESCRIPTION
Text::Document allows you to perform simple Information-Retrieval-oriented statistics on pure-text documents.
Text can be added in chunks, so that the document may be built incrementally, for instance by a class like HTML::Parser.
A simple algorithm splits the text into terms; the algorithm may be changed by subclassing and redefining ScanV.
The KeywordFrequency function computes term frequency over the whole document.
FORESEEN REUSE
The package may be (re)used either by simple instantiation or by subclassing (defining a descendant package). In the latter case, the methods intended to be redefined are those ending with a V suffix. Redefining other methods requires greater care.
CLASS METHODS
new
The constructor. The optional arguments are given as (key, value) pairs and specify whether all keywords are transformed to lowercase (default) and whether the string representation (WriteToString) will be compressed (default).
my $d = Text::Document->new();
my $dNotCompressed = Text::Document->new( compressed => 0 );
my $dPreserveCase = Text::Document->new( lowercase => 0 );
NewFromString
Take a string written by WriteToString (see below) and create a new Text::Document with the same contents; call die whenever the restore is impossible or ill-advised, for instance when the current version of the package differs from the original one, or the compression library is unavailable.
my $b = Text::Document::NewFromString( $str );
The return value is a blessed reference; put another way, this is an alternative constructor.
The string should have been written by WriteToString; you may of course tweak the string contents, but at that point you're entirely on your own.
INSTANCE METHODS
AddContent
Used as
$d->AddContent( 'foo bar baz foo9' );
$d->AddContent( 'mary had a little lamb' );
Successive calls accumulate content; there is currently no way of resetting the content to zero.
Terms
Returns a list of all distinct terms in the document, in no particular order.
Occurrences
Returns the number of occurrences of a given term.
$d->AddContent( 'foo baz bar foo foo');
my $n = $d->Occurrences( 'foo' ); # now $n is 3
ScanV
Scan a string and return a list of terms.
Called internally as:
my @terms = $self->ScanV( $text );
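The exact splitting rule is not spelled out here. As a minimal sketch, assuming that terms are maximal runs of ASCII letters and digits and that lowercasing is enabled, a scanner in the spirit of ScanV might look like the code below. The names scan_terms and count_terms are hypothetical and are not part of the package; the real algorithm may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ScanV: split a string into lowercase
# alphanumeric terms. The rule used by the package may differ.
sub scan_terms {
    my ($text) = @_;
    my @raw = $text =~ /([A-Za-z0-9]+)/g;
    return map { lc $_ } @raw;
}

# Accumulate term counts into a hash, the way successive
# AddContent calls accumulate content.
sub count_terms {
    my ( $counts, $text ) = @_;
    $counts->{$_}++ for scan_terms($text);
    return $counts;
}

my %counts;
count_terms( \%counts, 'foo bar baz' );
count_terms( \%counts, 'foo barbaz; ' );
print "$_: $counts{$_}\n" for sort keys %counts;
```

Under these assumptions 'barbaz' is a single term distinct from 'bar' and 'baz', and the trailing semicolon is discarded.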
KeywordFrequency
Returns a reference to a list of pairs [term,frequency], sorted by ascending frequency.
my $listRef = $d->KeywordFrequency();
foreach my $pair (@{$listRef}){
my ($term,$frequency) = @{$pair};
...
}
Terms in the document are sampled and sorted by ascending frequency of occurrence; the sorted list is then returned to the caller.
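The shape of the returned structure can be sketched in plain Perl. This is an illustration only: keyword_frequency is a hypothetical helper, the term counts are kept in an ordinary hash, and the frequency is taken to be the raw occurrence count, which is an assumption (the package may normalise it differently).

```perl
use strict;
use warnings;

# Hypothetical helper: build [term, frequency] pairs from a hash of
# counts and sort them by ascending frequency (ties broken by term).
sub keyword_frequency {
    my ($counts) = @_;
    my @pairs = map { [ $_, $counts->{$_} ] } keys %{$counts};
    return [ sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] } @pairs ];
}

my $list_ref = keyword_frequency( { foo => 3, bar => 1, baz => 2 } );
foreach my $pair ( @{$list_ref} ) {
    my ( $term, $frequency ) = @{$pair};
    print "$term $frequency\n";
}
```

The least frequent term comes first, so the most characteristic keywords of a document sit at the end of the list.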
WriteToString
Convert the document (actually, some parameters and the term counters) into a string which can be saved and later restored with NewFromString.
my $str = $d->WriteToString();
The string begins with a header which encodes the originating package, its version, and the parameters of the current instance.
Whenever possible, Compress::Zlib is used to compress the bit vector in the most efficient way. On systems without Compress::Zlib, the bit string is saved uncompressed.
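The optional use of Compress::Zlib can be sketched as follows. This is not the format that WriteToString emits: the one-character flag prefix and the helper names pack_payload and unpack_payload are assumptions made for illustration. Only the run-time detection of the library and its compress and uncompress functions are taken from Compress::Zlib itself.

```perl
use strict;
use warnings;

# Detect Compress::Zlib at run time; without it, store data as is.
my $have_zlib = eval { require Compress::Zlib; 1 } ? 1 : 0;

# Hypothetical format: 'Z' marks a compressed body, 'P' a plain one.
sub pack_payload {
    my ($data) = @_;
    return 'Z' . Compress::Zlib::compress($data) if $have_zlib;
    return 'P' . $data;
}

sub unpack_payload {
    my ($str) = @_;
    my $flag = substr( $str, 0, 1 );
    my $body = substr( $str, 1 );
    return $body if $flag eq 'P';
    die "unknown payload flag\n" unless $flag eq 'Z';
    # A compressed string cannot be restored without the library.
    die "compression library is unavailable\n" unless $have_zlib;
    return Compress::Zlib::uncompress($body);
}

my $saved = pack_payload('foo 3 bar 1 baz 2');
print unpack_payload($saved), "\n";
```

Recording how the body was stored lets the restoring side fail early, in the same spirit as NewFromString calling die when the compression library is missing.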
JaccardSimilarity
Compute the Jaccard measure of document similarity, which is defined as follows: given two documents D and E, let Ds and Es be the sets of terms occurring in D and E, respectively. Define S as the intersection of Ds and Es, and T as their union. Then the Jaccard similarity is the number of elements of S divided by the number of elements of T.
It is called as follows:
my $sim = $d->JaccardSimilarity( $e );
If neither document has any terms, the result is undef (a rare occurrence). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common).
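The definition can be written out directly on two lists of terms. The helper jaccard below is a hypothetical, self-contained sketch of the formula and does not use the package.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity of two term lists.
# Returns undef when both lists are empty (the union T is empty).
sub jaccard {
    my ( $d_terms, $e_terms ) = @_;
    my %ds = map { $_ => 1 } @{$d_terms};    # set Ds
    my %es = map { $_ => 1 } @{$e_terms};    # set Es
    my %union = ( %ds, %es );                # set T
    return undef unless %union;
    my $common = grep { exists $es{$_} } keys %ds;    # size of S
    return $common / scalar( keys %union );
}

my $sim = jaccard( [qw(foo bar baz)], [qw(foo bar qux)] );
print "$sim\n";    # 2 shared terms out of 4 distinct terms: 0.5
```

Only the presence of a term matters here; the number of times it occurs plays no role.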
CosineSimilarity
Compute the cosine similarity between two documents D and E.
Let Ds and Es be the set of terms occurring in D and E, respectively. Define T as the union of Ds and Es, and let ti be the i-th element of T.
Then the term vectors of D and E are
Dv = (nD(t1), nD(t2), ..., nD(tN))
Ev = (nE(t1), nE(t2), ..., nE(tN))
where nD(ti) is the number of occurrences of term ti in D, and nE(ti) the same for E.
Now we are at last ready to define the cosine similarity CS:
CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))
Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares).
CosineSimilarity is called as
$sim = $d->CosineSimilarity( $e );
It is undef if either D or E has no occurrence of any term. Otherwise, it is a number between 0.0 and 1.0. Since term occurrences are never negative, the cosine is never negative either.
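The formula for CS can be sketched on two hashes that map each term to its number of occurrences. The helper cosine is hypothetical and independent of the package; it assumes that every stored count is positive.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity of two hashes mapping
# term => number of occurrences. Returns undef if either is empty.
# Assumes all stored counts are positive.
sub cosine {
    my ( $nd, $ne ) = @_;
    return undef unless %{$nd} && %{$ne};
    my ( $dot, $norm_d, $norm_e ) = ( 0, 0, 0 );
    # Terms missing from E contribute zero to the scalar product.
    $dot += $nd->{$_} * ( $ne->{$_} // 0 ) for keys %{$nd};
    $norm_d += $_**2 for values %{$nd};    # squared norm of Dv
    $norm_e += $_**2 for values %{$ne};    # squared norm of Ev
    return $dot / sqrt( $norm_d * $norm_e );
}

my $cs = cosine( { foo => 2, bar => 1 }, { foo => 1, baz => 1 } );
printf "%.3f\n", $cs;    # 2 / sqrt(5 * 2), printed as 0.632
```

Iterating over the terms of D alone is enough for the scalar product, because any term absent from either document multiplies by zero.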
WeightedCosineSimilarity
Compute the weighted cosine similarity between two documents D and E.
In the setting of CosineSimilarity, the term vectors of D and E are
Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)
The weights are nonnegative real values; each term has an associated weight. To achieve generality, weights may be defined using a function, like:
my $wcs = $d->WeightedCosineSimilarity(
$e,
\&function,
$rock
);
The function will be called as follows:
my $weight = function( $rock, 'foo' );
$rock is a 'constant' object used for passing a context to the function.
For instance, a common way of defining weights is the IDF (inverse document frequency), which is defined in Text::DocumentCollection. In this context, you can weigh terms with their IDF as follows:
$sim = $c->WeightedCosineSimilarity(
$d,
\&Text::DocumentCollection::IDF,
$collection
);
WeightedCosineSimilarity will call
$collection->IDF( 'foo' );
which is what we expect.
Actually, we should return the square root of IDF, but this detail is not necessary here.
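The calling convention for the weight function can be illustrated without the package. In the sketch below, weighted_cosine and table_weight are hypothetical names, the rock is a plain hash of per-term weights, and returning undef when every weighted vector entry is zero is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: weighted cosine similarity of two hashes mapping
# term => occurrences. $weight is called as $weight->( $rock, $term ).
sub weighted_cosine {
    my ( $nd, $ne, $weight, $rock ) = @_;
    return undef unless %{$nd} && %{$ne};
    my ( $dot, $norm_d, $norm_e ) = ( 0, 0, 0 );
    my %all = ( %{$nd}, %{$ne} );    # union of the two term sets
    for my $term ( keys %all ) {
        my $w = $weight->( $rock, $term );
        my $d = ( $nd->{$term} // 0 ) * $w;
        my $e = ( $ne->{$term} // 0 ) * $w;
        $dot    += $d * $e;
        $norm_d += $d * $d;
        $norm_e += $e * $e;
    }
    # Assumption: undefined when the weights zero out a whole vector.
    return undef unless $norm_d && $norm_e;
    return $dot / sqrt( $norm_d * $norm_e );
}

# Illustrative weight function: the rock is a hash of weights,
# and unlisted terms get weight 1.
sub table_weight {
    my ( $rock, $term ) = @_;
    return $rock->{$term} // 1;
}

my %d = ( foo => 1, bar => 1 );
my %e = ( foo => 1, baz => 1 );
print weighted_cosine( \%d, \%e, \&table_weight, { foo => 2 } ), "\n";    # 0.8
```

With all weights equal to 1 the same two documents score 0.5, so doubling the weight of the shared term foo raises the similarity.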
AUTHORS
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
HISTORY
2001-11-02 - initial revision
2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>
We did not use Storable, because we wanted to fine-tune compression and version compatibility. However, this choice may be easily reversed by redefining WriteToString and NewFromString.