SWISH::Prog::Aggregator::Spider - web aggregator
    use SWISH::Prog::Aggregator::Spider;
    my $spider = SWISH::Prog::Aggregator::Spider->new(
        indexer => SWISH::Prog::Indexer->new
    );

    $spider->indexer->start;
    $spider->crawl( 'http://swish-e.org/' );
    $spider->indexer->finish;
SWISH::Prog::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, SWISH::Prog::Aggregator::Spider uses WWW::Mechanize to do the hard work. See SWISH::Prog::Aggregator::Spider::UA.
See SWISH::Prog::Aggregator
All params have their own get/set methods too. They include:
Flag as to whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
Get/set the SWISH::Prog::Cache-derived object used to track which URIs have been fetched already.
If use_md5() is true, this SWISH::Prog::Cache-derived object tracks the URI fingerprints.
Get/set the SWISH::Prog::Queue-derived object for tracking which URIs still need to be fetched.
Get/set the SWISH::Prog::Aggregator::Spider::UA object.
How many levels of links to follow. NOTE: This value describes the number of links from the first argument passed to crawl.
Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).
Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.
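The fingerprinting behaviour described for use_md5() can be illustrated with a minimal, standalone sketch. It does not use the spider's own cache classes: the %seen_fingerprint hash, the is_new_content() helper, and the example.com URIs are all hypothetical stand-ins, and Digest::MD5 is assumed as the fingerprint function based only on the parameter's name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for the fingerprint cache: fingerprint => first URI seen.
my %seen_fingerprint;

# Returns true the first time a given body of content is seen,
# false when identical content was already recorded under another URI.
sub is_new_content {
    my ( $uri, $content ) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if exists $seen_fingerprint{$fingerprint};
    $seen_fingerprint{$fingerprint} = $uri;
    return 1;
}

# Illustrative data only: two URIs serve byte-identical content.
my %pages = (
    'http://example.com/'           => '<html>home</html>',
    'http://example.com/index.html' => '<html>home</html>',
    'http://example.com/about'      => '<html>about</html>',
);

for my $uri ( sort keys %pages ) {
    my $verdict = is_new_content( $uri, $pages{$uri} ) ? 'index' : 'skip';
    print "$verdict $uri\n";
}
```

Keying the cache on the content digest rather than on the URI is what lets the same document, reachable under several URIs, be indexed only once.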
Initializes a new spider object. Called by new().
Returns true if uri is acceptable for including in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
Returns the next URI from the queue() as a SWISH::Prog::Doc object, or the error message if there was one.
Returns undef if the queue is empty or max_depth() has been reached.
Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().
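How a queue, a cache of fetched URIs, and a depth limit interact can be shown with a rough standalone sketch. This is not the module's implementation: it walks a hypothetical in-memory link table (%links_on) instead of the network, uses a plain breadth-first loop rather than recursion, omits the delay between requests, and the crawl_sketch() name is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URI maps to the links found on that page.
my %links_on = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/c', 'http://example.com/' ],
    'http://example.com/b' => [],
    'http://example.com/c' => [ 'http://example.com/d' ],
    'http://example.com/d' => [],
);

# Breadth-first crawl from $start, following links at most $max_depth
# hops away from it. Returns the visited URIs in fetch order.
sub crawl_sketch {
    my ( $start, $max_depth ) = @_;
    my @queue   = ( [ $start, 0 ] );    # pending [uri, depth] pairs
    my %fetched = ( $start => 1 );      # stand-in for the URI cache
    my @visited;

    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;
        push @visited, $uri;            # a real spider would fetch and index here
        next if $depth >= $max_depth;   # do not queue links beyond the limit
        for my $child ( @{ $links_on{$uri} || [] } ) {
            next if $fetched{$child}++;    # skip URIs already queued or fetched
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @visited;
}

print join( "\n", crawl_sketch( 'http://example.com/', 2 ) ), "\n";
```

Depth is counted in links from the starting URI, matching the note on the depth parameter: with a limit of 2, a page three links away from the start is never queued.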
Peter Karman, <perl@peknet.com>
Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc SWISH::Prog
You can also look for information at:
Mailing list
http://lists.swish-e.org/listinfo/users
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=SWISH-Prog
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/SWISH-Prog
CPAN Ratings
http://cpanratings.perl.org/d/SWISH-Prog
Search CPAN
http://search.cpan.org/dist/SWISH-Prog/
Copyright 2008-2009 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
http://swish-e.org/