NAME

SWISH::Prog::Aggregator - document aggregation base class

SYNOPSIS

 package MyAggregator;
 use strict;
 use base qw( SWISH::Prog::Aggregator );
 
 sub get_doc {
    my ($self, $url) = @_;
    
    my $doc;
    # do something to create a SWISH::Prog::Doc object from $url
    # and assign it to $doc
    
    return $doc;
 }
 
 sub crawl {
    my ($self, @where) = @_;
    
    foreach my $place (@where) {
       
       # do something to search $place for docs to pass to get_doc()
       
    }
 }
 
 1;

DESCRIPTION

SWISH::Prog::Aggregator is a base class that defines the basic API for writing an aggregator. Only two methods are required: get_doc() and crawl(). See the SYNOPSIS for the prototypes.

See SWISH::Prog::Aggregator::FS and SWISH::Prog::Aggregator::Spider for examples of aggregators that crawl the filesystem and web, respectively.

METHODS

init

Set object flags per SWISH::Prog::Class API. These are also accessors, and include:

set_parser_from_type

This will set the parser() value in swish_filter() based on the MIME type of the doc_class() object.

indexer

A SWISH::Prog::Indexer object.

doc_class

The name of the SWISH::Prog::Doc-derived class to use in get_doc(). Default is SWISH::Prog::Doc.

swish_filter_obj

A SWISH::Filter object. If not passed in new() one is created for you.

test_mode

Dry-run mode: prints information to stderr but does not build an index.

filter

Value should be a CODE ref. This is passed through to set_filter(); there is no filter mutator method.

ok_if_newer_than

Value should be a Unix timestamp (epoch seconds). Default is undef. If set, aggregators should skip files that have a modification time older than the timestamp.

You may get/set the ok_if_newer_than value with the ok_if_newer_than() attribute method, but use set_ok_if_newer_than() to include validation of the supplied timestamp value.
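
As a rough illustration of the skip rule described above, the following core-Perl sketch compares a file's modification time against a cutoff. The helper is_fresh_enough() and the temporary files are hypothetical and are not part of SWISH::Prog. A real aggregator would presumably read the cutoff from ok_if_newer_than(). Treating a file whose mtime equals the cutoff as "not older" is also an assumption.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper, not part of SWISH::Prog: returns true if $file
# should be kept, given an optional epoch-seconds cutoff.
sub is_fresh_enough {
    my ( $file, $ok_if_newer_than ) = @_;

    # No cutoff set (the documented default is undef): keep everything.
    return 1 unless defined $ok_if_newer_than;

    my $mtime = ( stat($file) )[9];
    return 0 unless defined $mtime;    # missing or unreadable file

    # Skip files whose modification time is older than the cutoff.
    return $mtime >= $ok_if_newer_than ? 1 : 0;
}

my $cutoff = time() - 3600;            # one hour ago

my ( $fh_old, $old_file ) = tempfile( UNLINK => 1 );
my ( $fh_new, $new_file ) = tempfile( UNLINK => 1 );
close $fh_old;
close $fh_new;

# Backdate one file by a day so that it falls before the cutoff.
my $yesterday = time() - 86400;
utime( $yesterday, $yesterday, $old_file ) or die "utime failed: $!";

for my $file ( $old_file, $new_file ) {
    printf "%s: %s\n", $file,
        is_fresh_enough( $file, $cutoff ) ? 'index' : 'skip';
}
```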

progress_size( n )

If set (defaults to 1000), the Aggregator may choose to report progress every n docs crawl()ed. The FS Aggregator, for example, prints a line to stdout every n docs.
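
The reporting interval can be pictured with a plain counter. This is only a sketch of the idea, not the FS Aggregator's actual code: the loop, the message text and the @reports array are made up for illustration.

```perl
use strict;
use warnings;

my $progress_size = 1000;    # the documented default
my $count         = 0;
my @reports;                 # kept only so the sketch can be inspected

# Stand-in for a crawl loop over 2500 docs.
for my $i ( 1 .. 2500 ) {
    $count++;

    # Report once every $progress_size docs, if reporting is enabled.
    if ( $progress_size && $count % $progress_size == 0 ) {
        push @reports, "crawled $count docs";
        print "$reports[-1]\n";
    }
}
```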

config

Returns the SWISH::Prog::Config object from the Indexer being used. This is a read-only method (accessor not mutator).

count

Returns the total number of doc_class() objects returned by get_doc().

crawl( @where )

Override this method in your subclass. It does the aggregation, and passes each doc_class() object from get_doc() to indexer->process().

get_doc( url )

Override this method in your subclass. Should return a doc_class() object.

swish_filter( doc_class_object )

Passes the content() of the doc_class() object through SWISH::Filter and transforms it into something indexable. Returns the filtered doc_class_object.

NOTE: All aggregators should call this method after get_doc() and before passing the doc to the indexer().

See the SWISH::Filter documentation.
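
The call order described above (get_doc(), then swish_filter(), then indexer->process()) can be sketched with stand-in classes. Everything in the Sketch:: namespace is hypothetical and exists only so the example runs without SWISH::Prog installed. In particular, the swish_filter() shown here merely trims whitespace and does not use SWISH::Filter.

```perl
use strict;
use warnings;

# Stand-in classes; none of these are the real SWISH::Prog code.
{
    package Sketch::Doc;
    sub new { my ( $class, %args ) = @_; return bless { %args }, $class }
    sub url { $_[0]{url} }
    sub content {
        my $self = shift;
        $self->{content} = shift if @_;
        return $self->{content};
    }
}
{
    package Sketch::Indexer;
    sub new { my ($class) = @_; return bless { seen => [] }, $class }
    sub process {
        my ( $self, $doc ) = @_;
        push @{ $self->{seen} }, $doc->url;
        return $doc;
    }
}
{
    package Sketch::Aggregator;
    sub new {
        my ( $class, %args ) = @_;
        return bless { count => 0, %args }, $class;
    }
    sub indexer { $_[0]{indexer} }
    sub count   { $_[0]{count} }

    # Build one doc object per URL and count it.
    sub get_doc {
        my ( $self, $url ) = @_;
        $self->{count}++;
        return Sketch::Doc->new(
            url     => $url,
            content => "  raw text for $url  ",
        );
    }

    # Placeholder for the SWISH::Filter step: only trims whitespace.
    sub swish_filter {
        my ( $self, $doc ) = @_;
        ( my $text = $doc->content ) =~ s/^\s+|\s+$//g;
        $doc->content($text);
        return $doc;
    }

    # get_doc() -> swish_filter() -> indexer->process(), once per place.
    sub crawl {
        my ( $self, @where ) = @_;
        for my $place (@where) {
            my $doc = $self->get_doc($place);
            $self->swish_filter($doc);
            $self->indexer->process($doc);
        }
        return $self->count;
    }
}

my $indexer    = Sketch::Indexer->new;
my $aggregator = Sketch::Aggregator->new( indexer => $indexer );
my $total      = $aggregator->crawl(qw( a.html b.html ));
print "indexed $total docs\n";
```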

set_filter( code_ref )

Use code_ref as the doc_class filter. This method is called by init() if the filter param is set in the constructor.
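
A filter callback might look like the sketch below. It assumes the callback receives the doc object as its first argument, which this document does not state. The set_filter() function and Sketch::Doc class here are hypothetical stand-ins, written so that the example runs on its own, and do not reproduce the module's implementation.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Stand-in doc class; not the real SWISH::Prog::Doc.
{
    package Sketch::Doc;
    sub new { my ( $class, %args ) = @_; return bless { %args }, $class }
    sub content {
        my $self = shift;
        $self->{content} = shift if @_;
        return $self->{content};
    }
}

# Hypothetical stand-in for set_filter(): validate and store the callback.
my $filter;

sub set_filter {
    my ($code_ref) = @_;
    croak "filter must be a CODE ref" unless ref($code_ref) eq 'CODE';
    $filter = $code_ref;
    return $filter;
}

# Example callback: lowercases the content of each doc.
# Assumption: the doc object arrives as the first argument.
set_filter(
    sub {
        my ($doc) = @_;
        $doc->content( lc $doc->content );
        return $doc;
    }
);

my $doc = Sketch::Doc->new( content => 'Hello SWISH' );
$filter->($doc);
print $doc->content, "\n";
```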

set_ok_if_newer_than( timestamp )

Set the ok_if_newer_than attribute. timestamp should be a Unix epoch value.

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog

COPYRIGHT AND LICENSE

Copyright 2008-2009 by Peter Karman

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

http://swish-e.org/