NAME

NoStopRobot - a robot that doesn't stop, but remembers where it has been

SYNOPSIS

  use NoStopRobot;
  my $ua = NoStopRobot::new(....);
  $ua->check_wait;

DESCRIPTION

This module implements a user agent which remembers where it has been and when, so that the user can avoid visiting a site too quickly. It does not actually implement the wait itself.

ROBOT LOGIC

The robot logic implemented here is somewhat more aggressive than that implemented in LWP::RobotUA. We never actually sleep in any of the functions here. This means that once a request is initiated, it will complete, including robot checks and redirects, all in one go.

Instead, the user should implement waits outside the module using the `host_wait()' method. The key benefit is that the caller can check which request can be run first and reorder requests to work as fast as wanted, whilst still spreading the load well between different sites.
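The reordering described above might look like the following minimal sketch. It does not use the module itself: `%wait_for' is a hypothetical stand-in for the values that `$ua->host_wait($netloc)' would return, and `next_request' is an illustrative helper name, not part of NoStopRobot.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $ua->host_wait($netloc): seconds until each
# host may be contacted again. In real use these values would come from
# the user agent rather than from a fixed table.
my %wait_for = (
    'www.example.com:80' => 45,
    'www.example.org:80' => 0,
    'www.example.net:80' => 12,
);

# Return the pending netloc with the shortest wait, plus that wait.
# Ties are broken alphabetically so the result is deterministic.
sub next_request {
    my ($waits, @pending) = @_;
    my ($best) = sort { $waits->{$a} <=> $waits->{$b} or $a cmp $b } @pending;
    return ($best, $waits->{$best});
}

my @pending = sort keys %wait_for;
my ($netloc, $wait) = next_request(\%wait_for, @pending);
print "next: $netloc (wait ${wait}s)\n";
```

A caller would then sleep for `$wait' seconds if it is non-zero, issue the request for that host, and repeat with the remaining pending requests.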

A second point (and a direct consequence of this): if multiple requests to different sites end up being redirected to the same site, the wait time logic will not warn against this. This is reasonable, since each request can be considered a separate request to a separate site.

IMPLEMENTATION NOTES

Because LWP::RobotUA collapses completely when called with URLs other than HTTP, this is implemented on top of LWP::UserAgent (via LWP::Auth_UA) rather than as a subclass of LWP::RobotUA.

$self->robot_check($url)

robot_check - given a URL, carries out all actions needed to check whether a request to that URL will be allowed by the robot rules, but doesn't actually send a request to the URL itself. Thus, if the function host_wait is then called, it will accurately reflect the time before a request can be made to that URL.

simple_request

simple_request carries out one HTTP request. It does robot checks to ensure that the request is permitted; however, in contrast to LWP::RobotUA, it never sleeps. It merely records which sites it visits.

N.B. there is one theoretical hole in this logic. If multiple sites are redirected to the same site, it is possible for us to check

$ua->host_wait($url)

This function is like host_wait, but there are two differences. Firstly, it should be called with a URL (string or object). Secondly, it should work for any URL (actually URI), but will return undef for URLs from which a netloc cannot be derived.

$ua->host_wait($netloc)

Returns the number of seconds (from now) you must wait before you can make a new request to this host.
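As a rough illustration of the arithmetic, the sketch below assumes the wait is simply the configured delay minus the time elapsed since the last visit, clamped at zero. That is an assumption about the calculation, not a statement about this module's internals, and `seconds_to_wait' is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: seconds left before a host may be visited again.
# $delay_minutes corresponds to the value of $ua->delay; $last_visit and
# $now are epoch times in seconds.
sub seconds_to_wait {
    my ($delay_minutes, $last_visit, $now) = @_;
    return 0 unless defined $last_visit;    # never visited: no wait
    my $wait = int($delay_minutes * 60 - ($now - $last_visit));
    return $wait > 0 ? $wait : 0;           # never negative
}

my $now = time;
print seconds_to_wait(1, $now - 20, $now), "\n";   # visited 20 seconds ago
print seconds_to_wait(1, $now - 90, $now), "\n";   # delay already elapsed
print seconds_to_wait(1, undef,     $now), "\n";   # host not seen before
```

With a one-minute delay, a host visited 20 seconds ago yields a 40 second wait, while the other two cases yield zero.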

$ua->no_wait($regex)

Sets a regular expression matching links for which the robot agent should not wait. Typically these would be local pages, or servers in the same organisation as the one carrying out the link checking.
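The following sketch shows the kind of pattern that might be supplied, and how an exemption of this sort could be applied. It assumes the expression is matched against the full URL string, which the text above does not confirm; `$no_wait_re', `effective_wait' and the example.com hosts are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pattern of the kind that could be passed to no_wait():
# links on example.com or any of its subdomains, with an optional port.
my $no_wait_re = qr{^https?://(?:[^/]+\.)?example\.com(?::\d+)?(?:/|\z)}i;

# Illustrative helper: ignore the computed wait for exempt links,
# otherwise pass the wait through unchanged.
sub effective_wait {
    my ($url, $host_wait) = @_;
    return $url =~ $no_wait_re ? 0 : $host_wait;
}

print effective_wait('http://intranet.example.com/page.html', 30), "\n";
print effective_wait('http://www.example.org/', 30), "\n";
```

Anchoring the pattern at the start of the URL and at the end of the host part avoids accidentally exempting unrelated hosts whose names merely contain the organisation's domain.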

$ua = LWP::RobotUA->new($agent_name, $from, [$rules])

Your robot's name and the mail address of the human responsible for the robot (i.e. you) are required by the constructor.

Optionally it allows you to specify the WWW::RobotRules object to use.

$ua->delay([$minutes])

Set the minimum delay between requests to the same server. The default is 1 minute.

$ua->use_sleep([$boolean])

Get/set a value indicating whether the UA should sleep() if requests arrive too fast (before $ua->delay minutes have passed). The default is TRUE. If this value is FALSE then an internal SERVICE_UNAVAILABLE response will be generated. It will have a Retry-After header that indicates when it is OK to send another request to this server.

$ua->rules([$rules])

Set/get which WWW::RobotRules object to use.

$ua->no_visits($netloc)

Returns the number of documents fetched from this server host. Yes I know, this method should probably have been named num_visits() or something like that. :-(

$ua->as_string

Returns a string that describes the state of the UA. Mainly useful for debugging.
