NAME

CrawlerCommons::RobotRules - the result of a parsed robots.txt

SYNOPSIS

 use CrawlerCommons::RobotRules;
 use CrawlerCommons::RobotRulesParser;

 my $rules_parser = CrawlerCommons::RobotRulesParser->new;
 
 my $content = "User-agent: *\r\nDisallow: *images";
 my $content_type = "text/plain";
 my $robot_names = "any-old-robot";
 my $url = "http://domain.com/";

 my $robot_rules =
   $rules_parser->parse_content($url, $content, $content_type, $robot_names);

 # check the 'mode' of the robot rules object
 say "Anything Goes!!!!" if $robot_rules->is_allow_all;
 say "Nothing to see here!" if $robot_rules->is_allow_none;
 say "Default robot rules mode..." if $robot_rules->is_allow_some;

 # are we allowed to crawl a URL (returns 1 if so, 0 if not)
 say "We're allowed to crawl the index :)"
  if $robot_rules->is_allowed( "https://www.domain.com/index.html");

 say "Not allowed to crawl: $_" unless $robot_rules->is_allowed( $_ )
   for ("http://www.domain.com/images/some_file.png",
        "http://www.domain.com/images/another_file.png");

DESCRIPTION

This object is the result of parsing a single robots.txt file. It holds the parsed rules and is used to check whether a given URL may be crawled.

VERSION

Version 0.03

METHODS

my $true_or_false = $robot_rules->is_allowed( $url )

Returns 1 if we're allowed to crawl the URL represented by $url, and 0 otherwise. Always returns 1 if is_allow_all() is true, and 0 if is_allow_none() is true; otherwise, returns 1 if an allow rule matches the URL's path or if no disallow rule matches it. See the example following the parameter description below.

  • $url

    The URL whose path is used to search for a matching rule within the object for evaluation.
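As an illustration, here is a minimal sketch of how the mode check and the per-URL check combine, using the same rules as the SYNOPSIS (the URLs and robot name are placeholders):

 use feature 'say';
 use CrawlerCommons::RobotRulesParser;

 my $parser = CrawlerCommons::RobotRulesParser->new;

 # "Disallow: *images" for all agents, as in the SYNOPSIS
 my $rules = $parser->parse_content(
   "http://domain.com/",
   "User-agent: *\r\nDisallow: *images",
   "text/plain",
   "any-old-robot"
 );

 # An ordinary parse yields the default allow-some mode, so each
 # URL must be checked individually against the parsed rules.
 if ($rules->is_allow_some) {
   for my $url ("http://www.domain.com/index.html",
                "http://www.domain.com/images/logo.png") {
     say "$url => ", $rules->is_allowed($url) ? "allowed" : "disallowed";
   }
 }

With these rules, one would expect the index page to be allowed and the image URL to be disallowed, mirroring the SYNOPSIS.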

AUTHOR

Adam Robinson <akrobinson74@gmail.com>