# NAME

`Text::Document` - a text document subject to statistical analysis

# SYNOPSIS

```
my $t = Text::Document->new();
$t->AddContent( 'foo bar baz' );
$t->AddContent( 'foo barbaz; ' );
my @freqList = $t->KeywordFrequency();
my $u = Text::Document->new();
...
my $sj = $t->JaccardSimilarity( $u );
my $sc = $t->CosineSimilarity( $u );
my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );
```

# DESCRIPTION

`Text::Document` allows you to perform simple Information-Retrieval-oriented statistics on plain-text documents.

Text can be added in chunks, so that the document may be built incrementally, for instance by a class like `HTML::Parser`.

A simple algorithm splits the text into terms; the algorithm may be redefined by subclassing and redefining `ScanV`.

The `KeywordFrequency` method computes term frequency over the whole document.

# FORESEEN REUSE

The package may be reused either by simple instantiation or by subclassing (defining a descendant package). In the latter case, the methods foreseen for redefinition are those ending with a `V` suffix. Redefining other methods will require greater attention.

# CLASS METHODS

## new

The constructor. The optional arguments are in *(key, value)* form and specify whether all keywords are transformed to lowercase (the default) and whether the string representation (`WriteToString`) is compressed (the default).

```
my $d = Text::Document->new();
my $dNotCompressed = Text::Document->new( compressed => 0 );
my $dPreserveCase = Text::Document->new( lowercase => 0 );
```

## NewFromString

Takes a string written by `WriteToString` (see below) and creates a new `Text::Document` with the same contents; calls `die` whenever the restore is impossible or ill-advised, for instance when the current version of the package differs from the original one, or the compression library is unavailable.

`my $b = Text::Document::NewFromString( $str );`

The return value is a blessed reference; put another way, this is an alternative constructor.

The string should have been written by `WriteToString`; you may of course tweak the string contents, but at that point you are entirely on your own.

# INSTANCE METHODS

## AddContent

Used as

```
$d->AddContent( 'foo bar baz foo9' );
$d->AddContent( 'mary had a little lamb' );
```

Successive calls accumulate content; there is currently no way of resetting the content to zero.

## Terms

Returns a list of all distinct terms in the document, in no particular order.

## Occurrences

Returns the number of occurrences of a given term.

```
$d->AddContent( 'foo baz bar foo foo');
my $n = $d->Occurrences( 'foo' ); # now $n is 3
```

## ScanV

Scan a string and return a list of terms.

Called internally as:

`my @terms = $self->ScanV( $text );`
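As a sketch of the kind of splitting the default `ScanV` might perform, here is a standalone tokenizer that lowercases the text and splits on runs of non-word characters. This rule is an assumption for illustration only; the module's actual default may differ, and you can subclass and redefine `ScanV` to impose your own.

```perl
use strict;
use warnings;

# Illustrative only: lowercase the text, then split on runs of
# non-word characters, discarding empty fields. The real default
# ScanV may use a different rule.
sub scan_terms {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

my @terms = scan_terms('Foo bar, baz; foo9!');
# @terms is ('foo', 'bar', 'baz', 'foo9')
```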

## KeywordFrequency

Returns a reference to a list of pairs *[term, frequency]*, sorted by ascending frequency.

```
my $listRef = $d->KeywordFrequency();
foreach my $pair (@{$listRef}){
my ($term,$frequency) = @{$pair};
...
}
```

Each distinct term in the document is counted, the resulting frequencies of occurrence are sorted in ascending order, and the list is returned to the caller.

## WriteToString

Converts the document (actually, some parameters and the term counters) into a string which can be saved and later restored with `NewFromString`.

`my $str = $d->WriteToString();`

The string begins with a header which encodes the originating package, its version, the parameters of the current instance.

Whenever possible, `Compress::Zlib` is used in order to compress the bit vector in the most efficient way. On systems without `Compress::Zlib`, the bit string is saved uncompressed.

## JaccardSimilarity

Computes the Jaccard measure of document similarity, which is defined as follows: given two documents *D* and *E*, let *Ds* and *Es* be the sets of terms occurring in *D* and *E*, respectively. Define *S* as the intersection of *Ds* and *Es*, and *T* as their union. Then the Jaccard similarity is the number of elements of *S* divided by the number of elements of *T*.

It is called as follows:

`my $sim = $d->JaccardSimilarity( $e );`

If neither document has any terms, the result is `undef` (a rare occurrence). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common).
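The definition above can be sketched in standalone Perl, using hashes as sets. This is a minimal illustration of the formula, not the module's actual implementation:

```perl
use strict;
use warnings;

# Jaccard similarity of two term lists: |intersection| / |union|.
# Returns undef when both lists are empty.
sub jaccard {
    my ($termsA, $termsB) = @_;
    my %a = map { $_ => 1 } @$termsA;
    my %b = map { $_ => 1 } @$termsB;
    my %union  = (%a, %b);
    my $common = grep { $b{$_} } keys %a;   # size of the intersection
    my $total  = keys %union;               # size of the union
    return undef unless $total;
    return $common / $total;
}

my $sim = jaccard( [qw(foo bar baz)], [qw(foo bar quux)] );
# 2 shared terms out of 4 distinct terms: $sim is 0.5
```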

## CosineSimilarity

Compute the cosine similarity between two documents *D* and *E*.

Let *Ds* and *Es* be the set of terms occurring in *D* and *E*, respectively. Define *T* as the union of *Ds* and *Es*, and let *ti* be the *i*-th element of *T*.

Then the term vectors of *D* and *E* are

```
Dv = (nD(t1), nD(t2), ..., nD(tN))
Ev = (nE(t1), nE(t2), ..., nE(tN))
```

where nD(ti) is the number of occurrences of term ti in *D*, and nE(ti) the same for *E*.

Now we are at last ready to define the cosine similarity *CS*:

`CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))`

Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares).

`CosineSimilarity` is called as

`$sim = $d->CosineSimilarity( $e );`

It is `undef` if either *D* or *E* has no occurrences of any term. Otherwise, it is a number between 0.0 and 1.0. Since term occurrences are always non-negative, the cosine is always non-negative as well.
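The formula above can be sketched in standalone Perl, representing each term vector as a hash mapping term to occurrence count. This mirrors the definition of *CS*, not the module's internal code:

```perl
use strict;
use warnings;

# Cosine similarity of two term-count hashes:
# dot product divided by the product of the Euclidean norms.
# Returns undef when either document has no occurrences at all.
sub cosine {
    my ($nD, $nE) = @_;
    my %union = (%$nD, %$nE);
    my ($dot, $normD, $normE) = (0, 0, 0);
    for my $t (keys %union) {
        my $d = $nD->{$t} // 0;
        my $e = $nE->{$t} // 0;
        $dot   += $d * $e;
        $normD += $d * $d;
        $normE += $e * $e;
    }
    return undef unless $normD && $normE;
    return $dot / ( sqrt($normD) * sqrt($normE) );
}

my $sim = cosine( { foo => 2, bar => 1 }, { foo => 1, baz => 1 } );
# dot = 2, norms = sqrt(5) and sqrt(2): $sim is 2/sqrt(10)
```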

## WeightedCosineSimilarity

Compute the weighted cosine similarity between two documents *D* and *E*.

In the setting of `CosineSimilarity`, the term vectors of *D* and *E* are

```
Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)
```

The weights are non-negative real numbers; each term has an associated weight. For generality, weights may be defined by a function, called like this:

```
my $wcs = $d->WeightedCosineSimilarity(
    $e,
    \&function,
    $rock
);
```

The `function` will be called as follows:

`my $weight = function( $rock, 'foo' );`

`$rock` is a 'constant' object used to pass a *context* to the function.

For instance, a common way of defining weights is the IDF (inverse document frequency), which is defined in `Text::DocumentCollection`. In this context, you can weigh terms with their IDF as follows:

```
$sim = $c->WeightedCosineSimilarity(
    $d,
    \&Text::DocumentCollection::IDF,
    $collection
);
```

`WeightedCosineSimilarity` will call

`$collection->IDF( 'foo' );`

which is what we expect.

Strictly speaking, the weight function should return the square root of the IDF, but this detail is not necessary here.
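The weighted variant can be sketched in standalone Perl in the same style as the plain cosine: each count is multiplied by its term's weight before computing the dot product and the norms. The weight function receives the rock and the term, as described above; this is an illustration of the formula, not the module's internal code:

```perl
use strict;
use warnings;

# Weighted cosine similarity of two term-count hashes.
# $weightFn is called as $weightFn->($rock, $term) and must
# return a non-negative weight for the term.
sub weighted_cosine {
    my ($nD, $nE, $weightFn, $rock) = @_;
    my %union = (%$nD, %$nE);
    my ($dot, $normD, $normE) = (0, 0, 0);
    for my $t (keys %union) {
        my $w = $weightFn->($rock, $t);
        my $d = ($nD->{$t} // 0) * $w;
        my $e = ($nE->{$t} // 0) * $w;
        $dot   += $d * $e;
        $normD += $d * $d;
        $normE += $e * $e;
    }
    return undef unless $normD && $normE;
    return $dot / ( sqrt($normD) * sqrt($normE) );
}

# With uniform weights this reduces to the plain cosine similarity.
my $sim = weighted_cosine( { foo => 2, bar => 1 },
                           { foo => 1, baz => 1 },
                           sub { 1 }, undef );
```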

# AUTHORS

```
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
```

# HISTORY

```
2001-11-02 - initial revision
2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>
```

We did not use `Storable` because we wanted to fine-tune compression and version compatibility. However, this choice may easily be reversed by redefining `WriteToString` and `NewFromString`.
