ElephantAgent - the agent that never forgets
This is the robot agent that never forgets. One of the major advantages of the original MOMspider link checker was that it did not need to re-fetch robots.txt files every time it was started. This agent does the same by keeping a disk cache of hosts and their status.
Why bother - couldn't we just use a caching server? Because the agent itself knows when to recall the cached robots.txt and when it needs re-fetching.
Each host keeps this state:

  last       - time of last visit
  count      - number of visits
  last_robot - time of last robots.txt check
  robot_stat - robot status (exclude, open, controlled)
  robots_txt - the robots.txt file itself
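As a sketch, one such per-host record might look like this as a plain Perl hash (field names follow the list above; the example host and values are illustrative, not part of any released module):

```perl
# Sketch only: one record per host, keyed by host name.  In the real
# agent the top-level hash would be tied to disk, e.g. with MLDBM.
my %host_state;

$host_state{'www.example.org'} = {
    last       => 1_000_000,    # time of last visit (epoch seconds)
    count      => 42,           # number of visits
    last_robot => 999_000,      # time of last robots.txt check
    robot_stat => 'controlled', # one of: exclude, open, controlled
    robots_txt => "User-agent: *\nDisallow: /private/\n",
};

# MLDBM can only store and fetch whole top-level values, so an update
# must replace the record rather than poke at nested fields in place.
my $rec = $host_state{'www.example.org'};
$rec->{count}++;
$host_state{'www.example.org'} = $rec;
```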
This has to be implemented as a complete rewrite of LWP::RobotUA, both because that class assumes multi-level hashes (which MLDBM cannot provide) and because it accesses the contents of its own hash directly.
It is possible that people will run several different user agents in one program (why?) and then wish to share robot exclusion information between them.
There are many decisions to make:

  Should I cache this robots.txt?

    - only if it is relatively short, or if we use this site often

  Should I recheck a robots.txt?

    - yes, if more than $max_hits requests have gone to the site
    - yes, if more than $max_time has passed since the last check
    - yes, if more than $max_size has been fetched from the site

  where

    $max_size = 1000 * $robots_txt_size
    $max_hits = 1000
    $max_time = three weeks

  (We should generally use HEAD for re-checking.)
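Using the thresholds above, the recheck decision might be sketched as a predicate over a per-host record (`should_recheck` and the `bytes` field for data fetched from the site are hypothetical names, not part of any released module):

```perl
# Sketch of the recheck rules above.  $host is a per-host record;
# $now is the current time in epoch seconds.
sub should_recheck {
    my ($host, $now) = @_;
    my $max_hits = 1000;
    my $max_time = 3 * 7 * 24 * 60 * 60;                  # three weeks
    my $max_size = 1000 * length($host->{robots_txt} || '');

    return 1 if $host->{count} > $max_hits;               # heavy use of the site
    return 1 if $now - $host->{last_robot} > $max_time;   # check has gone stale
    return 1 if ($host->{bytes} || 0) > $max_size;        # fetched a lot of data
    return 0;
}
```

A HEAD request would normally be enough for the recheck itself.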
  package LWP::ElephantUA;

  $REVISION = q$Revision: 1.3 $;
  $VERSION  = sprintf("%d.%02d", $REVISION =~ /(\d+)\.(\d+)/);
  require LWP::UserAgent;
  @ISA = qw(LWP::UserAgent);

  require WWW::RobotRules;
  require HTTP::Request;
  require HTTP::Response;

  use Carp ();
  use LWP::Debug ();
  use HTTP::Status ();
  use HTTP::Date qw(time2str);
LWP::RobotUA - A class for Web Robots
  require LWP::RobotUA;
  $ua = new LWP::RobotUA 'my-robot/0.1', 'firstname.lastname@example.org';
  $ua->delay(10);  # be very nice, go slowly
  ...
  # just use it just like a normal LWP::UserAgent
  $res = $ua->request($req);
This class implements a user agent suitable for robot applications. Robots should be nice to the servers they visit: they should consult the /robots.txt file to ensure that they are welcome, and they should not send requests too frequently.
But before you consider writing a robot, take a look at <URL:http://info.webcrawler.com/mak/projects/robots/robots.html>.
When you use an LWP::RobotUA as your user agent, you do not really have to think about these things yourself. Just send requests as you would with a normal LWP::UserAgent, and this special agent will make sure you are nice.
The LWP::RobotUA is a sub-class of LWP::UserAgent and implements the same methods. The use_alarm() method additionally decides whether we will wait if a request is tried too early (if true), or return an error response (if false).
In addition these methods are provided:
A name and the mail address of the human running the robot are required by the constructor. The name can be changed later through the agent() method, and the mail address through the from() method.
Set the minimum delay between requests to the same server. The default is 1 minute.
Returns the number of documents fetched from this server host.
Returns the number of seconds you must wait before you can make a new request to this host.
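The arithmetic behind this wait is simple enough to sketch standalone (a simplified illustration, not LWP's actual implementation): the remaining wait is the configured delay minus the time elapsed since the last visit, floored at zero.

```perl
# Simplified illustration: how long must we still wait before hitting
# this host again?  $delay_minutes is the configured per-host delay;
# $last_visit and $now are epoch seconds.
sub remaining_wait {
    my ($delay_minutes, $last_visit, $now) = @_;
    return 0 unless defined $last_visit;        # never visited: no wait
    my $wait = $delay_minutes * 60 - ($now - $last_visit);
    return $wait > 0 ? $wait : 0;
}
```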
Returns a string that describes the state of the UA. Mainly useful for debugging.
Gisle Aas <email@example.com>