NAME

SWISH::Prog::Aggregator::Spider - web aggregator

SYNOPSIS

 use SWISH::Prog::Aggregator::Spider;
 my $spider = SWISH::Prog::Aggregator::Spider->new(
        indexer => SWISH::Prog::Indexer->new
 );
 
 $spider->indexer->start;
 $spider->crawl( 'http://swish-e.org/' );
 $spider->indexer->finish;

DESCRIPTION

SWISH::Prog::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, SWISH::Prog::Aggregator::Spider uses WWW::Mechanize to do the hard work. See SWISH::Prog::Aggregator::Spider::UA.

METHODS

See SWISH::Prog::Aggregator.

new( params )

All params have their own get/set methods too. They include:

use_md5

Flag indicating whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
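The fingerprint-and-compare idea can be sketched in plain Perl with the core Digest::MD5 module. This is only an illustration of the technique, not the module's internal code; the real spider stores fingerprints in the md5_cache() object.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Two URIs serving identical content produce the same MD5 fingerprint,
# so the duplicate can be skipped at index time.
my %seen;

sub content_is_new {
    my ($content) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if $seen{$fingerprint}++;    # already indexed this content
    return 1;
}

print content_is_new("<html>same page</html>") ? "index\n" : "skip\n";    # index
print content_is_new("<html>same page</html>") ? "index\n" : "skip\n";    # skip
```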

uri_cache

Get/set the SWISH::Prog::Cache-derived object used to track which URIs have been fetched already.

md5_cache

If use_md5() is true, this SWISH::Prog::Cache-derived object tracks the URI fingerprints.

queue

Get/set the SWISH::Prog::Queue-derived object for tracking which URIs still need to be fetched.

ua

Get/set the SWISH::Prog::Aggregator::Spider::UA object.

max_depth

How many levels of links to follow. NOTE: This value describes the number of link levels followed from the first URI passed to crawl().

delay

Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).

timeout

Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.

init

Initializes a new spider object. Called by new().

uri_ok( uri )

Returns true if uri is acceptable for inclusion in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
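The base-URI part of that check can be sketched in core Perl. The function below is a hypothetical stand-in, not the module's uri_ok(): it accepts only URIs on the same host as the base and skips some obviously non-indexable file types, while the real method also consults robots rules and the spider configuration.

```perl
use strict;
use warnings;

# Hypothetical sketch of a base-URI filter like the one uri_ok() applies.
sub same_base_ok {
    my ( $base, $uri ) = @_;
    my ($base_host) = $base =~ m{^https?://([^/]+)} or return 0;
    my ($uri_host)  = $uri  =~ m{^https?://([^/]+)} or return 0;
    return 0 unless lc $uri_host eq lc $base_host;            # stay on the same site
    return 0 if $uri =~ m{\.(?:jpe?g|png|gif|css|js)$}i;      # skip non-document types
    return 1;
}

print same_base_ok( 'http://swish-e.org/', 'http://swish-e.org/docs/' ) ? "ok\n" : "skip\n";    # ok
print same_base_ok( 'http://swish-e.org/', 'http://example.com/page' )  ? "ok\n" : "skip\n";    # skip
```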

get_doc

Returns the next URI from the queue() as a SWISH::Prog::Doc object, or the error message if the fetch failed.

Returns undef if the queue is empty or max_depth() has been reached.
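The queue-and-depth bookkeeping behind get_doc() can be sketched in plain Perl. The names and the %links map below are hypothetical; the real spider uses a SWISH::Prog::Queue-derived object and fetches each URI over HTTP. Each queue entry carries its link depth, and entries past the limit are never returned.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real HTTP fetches.
my %links = (
    'http://swish-e.org/'      => ['http://swish-e.org/docs/'],
    'http://swish-e.org/docs/' => ['http://swish-e.org/docs/api/'],
);

my $max_depth = 1;
my @queue     = ( [ 'http://swish-e.org/', 0 ] );    # [ uri, depth ] pairs

my @fetched;
while ( my $entry = shift @queue ) {
    my ( $uri, $depth ) = @$entry;
    next if $depth > $max_depth;    # like get_doc() returning undef past max_depth
    push @fetched, $uri;
    push @queue, map { [ $_, $depth + 1 ] } @{ $links{$uri} || [] };
}

print "$_\n" for @fetched;    # the root URI and its direct children only
```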

crawl( uri )

Implements the required crawl() method. Recursively fetches uri and its child links to the depth set in max_depth().

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog

COPYRIGHT AND LICENSE

Copyright 2008-2009 by Peter Karman

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

http://swish-e.org/