CrawlerCommons::RobotRules - the result of a parsed robots.txt
    use 5.010;
    use CrawlerCommons::RobotRules;
    use CrawlerCommons::RobotRulesParser;

    my $rules_parser = CrawlerCommons::RobotRulesParser->new;

    my $content      = "User-agent: *\r\nDisallow: *images";
    my $content_type = "text/plain";
    my $robot_names  = "any-old-robot";
    my $url          = "http://domain.com/";

    my $robot_rules = $rules_parser->parse_content(
        $url, $content, $content_type, $robot_names);

    # obtain the 'mode' of the robot rules object
    say "Anything Goes!!!!" if $robot_rules->is_allow_all;
    say "Nothing to see here!" if $robot_rules->is_allow_none;
    say "Default robot rules mode..." if $robot_rules->is_allow_some;

    # are we allowed to crawl a URL (returns 1 if so, 0 if not)
    say "We're allowed to crawl the index :)"
      if $robot_rules->is_allowed("https://www.domain.com/index.html");

    for my $image_url ("http://www.domain.com/images/some_file.png",
                       "http://www.domain.com/images/another_file.png") {
        say "Not allowed to crawl: $image_url"
          unless $robot_rules->is_allowed($image_url);
    }
This object is the result of parsing a single robots.txt file.
Version 0.03
    my $true_or_false = $robot_rules->is_allowed( $url );
Returns 1 if we're allowed to crawl the URL represented by $url, and 0 otherwise. If is_allow_all() returns true, this method always returns 1; if is_allow_none() returns true, it always returns 0; otherwise it returns 1 if there is an allow rule, or no disallow rule, matching this URL.
$url

The URL whose path is used to search for a matching rule within the object for evaluation.
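As a sketch of how the modes and per-URL rules interact, the following example uses the parse_content() and is_allowed() calls shown in the synopsis against a hypothetical robots.txt body that disallows only the /private path (the robots.txt content, URLs, and robot name here are illustrative, not part of the module's documentation):

    use 5.010;
    use strict;
    use warnings;

    use CrawlerCommons::RobotRulesParser;

    # A hypothetical robots.txt that disallows only /private
    my $content = "User-agent: *\r\nDisallow: /private";

    my $parser = CrawlerCommons::RobotRulesParser->new;
    my $rules  = $parser->parse_content(
        "http://domain.com/robots.txt",   # URL the content was fetched from
        $content,
        "text/plain",
        "my-crawler",                     # robot name to match against User-agent
    );

    # Only the path portion of each URL is matched against the rules,
    # so the first URL should be allowed and the second disallowed.
    for my $url ("http://domain.com/index.html",
                 "http://domain.com/private/secret.html") {
        say $rules->is_allowed($url) ? "allowed: $url" : "blocked: $url";
    }
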
Adam Robinson <akrobinson74@gmail.com>
To install CrawlerCommons::RobotRulesParser, copy and paste the appropriate command into your terminal.
cpanm
    cpanm CrawlerCommons::RobotRulesParser
CPAN shell
    perl -MCPAN -e shell
    install CrawlerCommons::RobotRulesParser
For more information on module installation, please visit the detailed CPAN module installation guide.