NAME

TM::Corpus - Topic Maps, Document Corpus

SYNOPSIS

   use TM;
   my $tm = ...

   use TM::Corpus;
   my $co = new TM::Corpus (map => $tm);    # bind with map

   $co->useragent (new LWP::UserAgent);     # would be the default anyway

   $co->update;                             # copy all content from the map
   $co->harvest;                            # add documents from the Internet

ABSTRACT

This package connects a topic map instance and a document corpus into one container.

DESCRIPTION

A corpus is normally a set of documents. A topic-map-based corpus is a set of documents that are either internal to a topic map (content stored in the map itself) or external to it (documents the map refers to by URL).

Once your topic map is stable, you can first update the corpus with the map's content and then let a user agent download all documents mentioned in the map. With this corpus you can then do any number of things, one of them being full-text search.

INTERFACE

Constructor

The constructor accepts a hash of parameters with the following keys:

map (mandatory)

The value must be a TM object. Any map should do.

ua (optional)

You can pass in your own LWP::UserAgent object. It is used when you ask to harvest the documents behind occurrence URLs. If you omit it, a stock object will be generated.
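A minimal sketch of passing a custom agent to the constructor, assuming the map and ua keys behave as described above. The helper name build_corpus and the agent string are illustrative and not part of TM::Corpus; the module is loaded lazily so that the agent setup can be tried without a map at hand.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (not part of TM::Corpus): bind a map to a corpus
# using a caller-supplied user agent instead of the stock one.
sub build_corpus {
    my ($tm, $ua) = @_;
    require TM::Corpus;                     # loaded on first use
    return TM::Corpus->new (map => $tm, ua => $ua);
}

# An agent that identifies itself with an illustrative name.
my $ua = LWP::UserAgent->new (agent => 'my-corpus-bot/0.1');

# With a TM object in $tm (any map should do):
#    my $co = build_corpus ($tm, $ua);
```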

Methods

useragent

   my $ua = $co->useragent

   $co->useragent ($ua)

Read/write accessor for the user agent component.

map

   my $tm = $co->map

Read-only access to the underlying map.

resources

Read-only access to the data. Probably not wise to use, but here it is.

update

   $co = $co->update

   $co->update

This method synchronizes all data from the map into the corpus. The underlying map is the authoritative source, but when it is modified, the corpus is NOT automatically updated. Instead, you should invoke this method at a suitable time.

harvest

   $co = $co->harvest

   $co->harvest

This method uses the defined user agent to resolve all URLs within the underlying map and to load their content locally. All network-related settings (timeouts, size limits, etc.) have to be configured via the user agent.
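Since harvest leaves all network policy to the user agent, a sketch like the following could be used to bound the download behaviour. The helper name polite_agent and all values are assumptions chosen for illustration; timeout, max_size, max_redirect and agent are standard LWP::UserAgent settings.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: a user agent with conservative limits.
sub polite_agent {
    return LWP::UserAgent->new (
        timeout      => 10,                       # abort after 10 seconds without activity
        max_size     => 1_000_000,                # abort responses larger than about 1 MB
        max_redirect => 3,                        # follow at most 3 redirects
        agent        => 'corpus-harvester/0.1',   # illustrative name
    );
}

my $ua = polite_agent ();

# With a corpus $co already bound to a map:
#    $co->useragent ($ua);
#    $co->harvest;
```

Whether a document cut off by max_size still ends up in the corpus depends on how TM::Corpus treats such responses, which is not specified here.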

SEE ALSO

TM::Corpus::SearchAble

COPYRIGHT AND LICENSE

Copyright 200[8] by Robert Barta, <drrho@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.