Dezi::Aggregator::Spider - web aggregator
 use Dezi::Aggregator::Spider;

 my $spider = Dezi::Aggregator::Spider->new(
     indexer => Dezi::Indexer->new
 );
 $spider->indexer->start;
 $spider->crawl( 'http://swish-e.org/' );
 $spider->indexer->finish;
Dezi::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, Dezi::Aggregator::Spider uses LWP::RobotUA to do the hard work. See Dezi::Aggregator::Spider::UA.
See Dezi::Aggregator.
All params have their own get/set methods too. They include:
Get/set the user-agent string reported by the user agent.
Get/set the email string reported by the user agent.
Flag as to whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
Get/set the Dezi::Cache-derived object used to track which URIs have been fetched already.
If use_md5() is true, this Dezi::Cache-derived object tracks the URI fingerprints.
Apply File::Rules object in uri_ok(). File_Rules_or_ARRAY should be a File::Rules object or an array of strings suitable to passing to File::Rules->new().
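As a sketch of the array form (the rule strings below are hypothetical examples of File::Rules syntax, not taken from this distribution; adjust them for your site):

```perl
use Dezi::Aggregator::Spider;
use Dezi::Indexer;

# Pass rule strings directly; the spider hands them to File::Rules->new()
# internally. These example rules are assumptions for illustration.
my $spider = Dezi::Aggregator::Spider->new(
    indexer    => Dezi::Indexer->new,
    file_rules => [
        'filename contains \.html',   # only consider HTML-ish filenames
        'dirname contains docs',      # only descend into docs/ paths
    ],
);
```

An already-constructed File::Rules object may be passed instead of the array of strings.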
Get/set the Dezi::Queue-derived object for tracking which URIs still need to be fetched.
Get/set the Dezi::Aggregator::Spider::UA object.
How many levels of links to follow. NOTE: This value describes the number of links from the first argument passed to crawl.
Default is unlimited depth.
This optional key sets the maximum time to spider. Spidering for this host will stop after max_time seconds, and move on to the next server, if any. The default is to not limit by time.
This optional key sets the max number of files to spider before aborting. The default is to not limit by number of files. This is the number of requests made to the remote server, not the total number of files to index (see max_indexed). This count is displayed at the end of indexing as Unique URLs.
This feature can (and perhaps should) be used when spidering a web site where dynamic content may generate unique URLs, to prevent run-away spidering.
This optional key sets the max size of a file read from the web server. This defaults to 5,000,000 bytes. If the size is exceeded the resource is truncated per LWP::UserAgent.
Set max_size to zero for unlimited size.
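The limit parameters above are all set at construction time. A minimal sketch (the specific values are illustrative, not defaults):

```perl
use Dezi::Aggregator::Spider;
use Dezi::Indexer;

my $spider = Dezi::Aggregator::Spider->new(
    indexer   => Dezi::Indexer->new,
    max_depth => 2,      # follow links two levels from the start URI
    max_time  => 300,    # stop a host after 300 seconds
    max_files => 1000,   # no more than 1000 requests per run
    max_size  => 0,      # zero means unlimited response size
);
```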
This optional parameter will skip any URIs that do not report having been modified since date. The Last-Modified HTTP header is used to determine modification time.
This optional parameter will enable keep-alive requests. This can dramatically speed up spidering and reduce the load on the server being spidered. The default is to not use keep-alives, although enabling it will probably be the right thing to do.
To get the most out of keep-alives, you may want to set up your web server to allow a lot of requests per single connection (e.g. MaxKeepAliveRequests on Apache). Apache's default is 100, which should be good.
When a connection is not closed the spider does not wait the "delay" time when making the next request. In other words, there is no delay in requesting documents while the connection is open.
Note: you must have at least libwww-perl-5.53_90 installed to use this feature.
Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).
Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.
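Keep-alive, delay and timeout interact as described above; a sketch of setting them together (values here are examples, not recommendations):

```perl
use Dezi::Aggregator::Spider;
use Dezi::Indexer;

my $spider = Dezi::Aggregator::Spider->new(
    indexer    => Dezi::Indexer->new,
    keep_alive => 1,    # reuse connections; no delay while a connection is open
    delay      => 2,    # seconds between requests on new connections (default 5)
    timeout    => 30,   # allow slow servers 30 seconds (default 10)
);
```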
CODE reference to fetch username/password credentials when necessary. See also credentials.
Number of seconds to wait before skipping manual prompt for username/password.
String with username:password pair to be used when prompted by the server.
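A sketch of supplying credentials statically and via a callback. Note the callback arguments are not documented here, so the sub below ignores them; consult the class source for the actual signature:

```perl
use Dezi::Aggregator::Spider;
use Dezi::Indexer;

my $spider = Dezi::Aggregator::Spider->new(
    indexer     => Dezi::Indexer->new,
    credentials => 'guest:secret',   # username:password used on 401/403

    # or compute credentials at request time:
    authn_callback => sub {
        # hypothetical body; return a username:password string
        return 'guest:secret';
    },
);
```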
By default, 3xx responses from the server will be followed when they are on the same hostname. Set to false (0) to not follow redirects.
TODO
Microsoft server hack.
ARRAY ref of hostnames to be treated as identical to the original host being spidered. By default the spider will not follow links to different hosts.
Initializes a new spider object. Called by new().
Returns true if uri is acceptable for including in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
Add uri to the queue.
Return next uri from queue.
Returns queue()->size().
Calls queue()->remove(uri).
Returns the next URI from the queue() as a Dezi::Indexer::Doc object, or the error message if there was one.
Returns undef if the queue is empty or max_depth() has been reached.
Called internally when the server returns a 401 or 403 response. Will attempt to determine the correct credentials for uri based on the previous attempt in response and what you have configured in credentials, authn_callback or when manually prompted.
Called internally to perform naive heuristics on http_response to determine whether it looks like an XML feed of some kind, rather than an HTML page.
Called internally to perform naive heuristics on http_response to determine whether it looks like an XML sitemap feed, rather than an HTML page.
Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().
Will quit after max_files() unless max_files==0.
Will quit after max_time() seconds unless max_time==0.
Passes args to Dezi::Utils::write_log().
Pass through to Dezi::Utils::write_log_line().
Peter Karman, <perl@peknet.com>
Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-App. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Dezi
You can also look for information at:
Mailing list
http://lists.swish-e.org/listinfo/users
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Dezi-App
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Dezi-App
CPAN Ratings
http://cpanratings.perl.org/d/Dezi-App
Search CPAN
http://search.cpan.org/dist/Dezi-App/
Copyright 2008-2018 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
http://swish-e.org/
To install Dezi::App, copy and paste the appropriate command into your terminal.
cpanm
cpanm Dezi::App
CPAN shell
 perl -MCPAN -e shell
 install Dezi::App
For more information on module installation, please visit the detailed CPAN module installation guide.