NAME

Plack::Middleware::DetectRobots - Automatically set a flag in the environment if a robot client is detected

VERSION

version 0.03

SYNOPSIS

  use Plack::Builder;

  my $app = sub { ... };  # as usual

  builder {
      enable 'DetectRobots';
          # or: enable 'DetectRobots', env_key => 'psgix.robot_client';
          # or: enable 'DetectRobots', extended_check => 1, generic_check => 1;
      $app;
  };

  # ... and later ...
  
  if ( $env->{'robot_client'} ) {
      # ... do something ...
  }

DESCRIPTION

This Plack middleware uses the list of robots that is part of the AWStats log analyzer software package to analyse the User-Agent HTTP header and to set an environment flag to either a true or false value depending on the detection of a robot client.

Once activated it checks the User-Agent HTTP header against a basic list of patterns for common bots.

If you activate the appropriate options, it can also use an extended list to detect less common bots (cf. extended_check) and/or a list of quite generic patterns to detect unknown bots (cf. generic_check).

You may also pass in your own regular expression for further checks (cf. local_regexp).

The checks are executed in this order:

1. Local regular expression

2. Basic check

3. Extended check

4. Generic check

If a check yields a positive result (i.e. detects a bot), the remaining checks are skipped.

Depending on the check which detected a bot, the environment flag is set to one of these values: LOCAL, BASIC, EXTENDED, or GENERIC.

If no bot is detected, the flag is set to 0.
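
The order of the checks and the resulting flag values can be sketched roughly as follows. This is not the module's actual code: the helper name detect_robot and the three short patterns are placeholders invented for illustration, whereas the real lists are derived from AWStats and are far larger.

```perl
use strict;
use warnings;

# Placeholder patterns for illustration only (assumptions, not the
# lists shipped with the module).
my %pattern = (
    BASIC    => qr/googlebot|msnbot/i,
    EXTENDED => qr/bingbot/i,
    GENERIC  => qr/spider|crawler/i,
);

# Hypothetical helper: returns LOCAL, BASIC, EXTENDED, GENERIC, or 0.
sub detect_robot {
    my ( $ua, %opt ) = @_;
    $ua = '' unless defined $ua;

    # The basic check is on unless explicitly switched off.
    my $basic = exists $opt{basic_check} ? $opt{basic_check} : 1;

    # The first check that matches wins; the remaining ones are skipped.
    return 'LOCAL'    if $opt{local_regexp}   && $ua =~ $opt{local_regexp};
    return 'BASIC'    if $basic               && $ua =~ $pattern{BASIC};
    return 'EXTENDED' if $opt{extended_check} && $ua =~ $pattern{EXTENDED};
    return 'GENERIC'  if $opt{generic_check}  && $ua =~ $pattern{GENERIC};
    return 0;
}
```

Because the local regular expression is tried first, a User-Agent that matches both a local pattern and the basic list is reported as LOCAL in this sketch.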

The default name of the flag in the environment is robot_client, but this can be customized by setting the env_key option when enabling this middleware.

It might make sense to use psgix.robot_client by default instead, but the PSGI spec states that the "'psgix.' prefix is reserved for officially blessed extensions" - which does not apply to this module. You may, however, set the key to psgix.robot_client yourself by using the env_key option mentioned before.
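
An application behind the middleware might read the flag roughly like this. The sketch assumes the default key robot_client; the response texts are made up for illustration.

```perl
use strict;
use warnings;

# A plain PSGI application: a code reference that receives the
# environment hash. Assumes the middleware has already set the flag
# under the default key, robot_client.
my $app = sub {
    my ($env) = @_;
    my $robot = $env->{robot_client};

    # The flag is 0 for regular clients, otherwise the name of the
    # check that matched (LOCAL, BASIC, EXTENDED, or GENERIC).
    my $body = $robot
        ? "Hello, robot (matched by the $robot check)\n"
        : "Hello, human\n";

    return [ 200, [ 'Content-Type' => 'text/plain' ], [$body] ];
};
```

If env_key is set to a different name, the lookup in the application has to use that name as well.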

WARNING

This software is currently considered BETA and still needs to be seriously tested!

ROBOTS LIST

Based on Revision 2d289e, 2014-11-20 of http://sourceforge.net/p/awstats/code/ci/develop/tree/wwwroot/cgi-bin/lib/robots.pm.

Note: that list might be somewhat dated, as I did not find bingbot in the list of common bots (only in the extended list) while its predecessor msnbot was considered common.

CONFIGURATION

You may specify the following options when enabling the middleware:

env_key

Set the name of the entry in the environment hash.

basic_check

You may deactivate the standard checks by setting this option to a false value, e.g. if you are only interested in obscure bots or in your local pattern checks.

By setting this option to a false value while simultaneously passing a regular expression to local_regexp one can imitate the behaviour of Plack::Middleware::BotDetector.

extended_check

Determines if an extended list of less often seen robots is also checked for. By default, only common robots are checked for, because the extended check requires a rather large and complex regular expression. Set this param to a true value to change the default behaviour.

generic_check

Determines if the User-Agent string is also analysed for strings that generically identify the client as a bot, e.g. "spider" or "crawler". By default, this check is not performed, even though it uses only a relatively short and simple regex. Set this param to a true value to change the default behaviour.

local_regexp

You may optionally pass in your own regular expression (as a Regexp object using qr//) to check for additional patterns in the User-Agent string.
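
As a rough illustration, several of these options might be combined as shown below. The pattern my-internal-monitor is a made-up example, and the snippet assumes that Plack and this module are installed.

```perl
use strict;
use warnings;
use Plack::Builder;

my $app = sub {
    my ($env) = @_;
    return [ 200, [ 'Content-Type' => 'text/plain' ], ["ok\n"] ];
};

builder {
    enable 'DetectRobots',
        env_key        => 'psgix.robot_client',
        extended_check => 1,
        generic_check  => 1,
        local_regexp   => qr/my-internal-monitor/i;    # hypothetical pattern
        # add basic_check => 0 to skip the list of common bots
    $app;
};
```

With this configuration the flag is stored under psgix.robot_client instead of robot_client.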

SEE ALSO

Plack, Plack::Middleware, Plack::Middleware::BotDetector, http://awstats.org/

The functionality provided by Plack::Middleware::BotDetector is basically the same as that of this module, but it requires you to pass in your own regular expression and does not include a default list of known bots.

AUTHOR

Heiko Jansen <hjansen@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2015 by Heiko Jansen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.