WWW::RobotRules::Parser::MultiValue - Parse robots.txt
    use WWW::RobotRules::Parser::MultiValue;
    use LWP::Simple qw(get);

    my $url = 'http://example.com/robots.txt';
    my $robots_txt = get $url;

    my $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => 'TestBot/1.0',
    );
    $rules->parse($url, $robots_txt);

    if ($rules->allows('http://example.com/some/path')) {
        my $delay = $rules->delay_for('http://example.com/');
        sleep $delay;
        ...
    }

    my $hash = $rules->rules_for('http://example.com/');
    my @list_of_allowed_paths = $hash->get_all('allow');
    my @list_of_custom_rule_value = $hash->get_all('some-rule');
WWW::RobotRules::Parser::MultiValue is a parser for robots.txt.
Parsed rules for the specified user agent are stored as a Hash::MultiValue, where each key is a lowercase rule name.
The Request-rate rule is handled specially: it is normalized to a Crawl-delay rule.
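As a rough illustration of that normalization, the sketch below converts a Request-rate value into a per-request delay in seconds. It assumes the common `requests/period` form (for example `1/5`) with an optional `s`, `m`, or `h` unit suffix. The helper name `request_rate_to_delay` is hypothetical and not part of this module's API, and the module's actual parsing may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): turn a Request-rate value
# such as "1/5" or "30/1m" into an equivalent Crawl-delay in seconds.
sub request_rate_to_delay {
    my ($value) = @_;

    # Assumed format: <requests>/<period>[s|m|h]
    my ($requests, $period, $unit) =
        $value =~ m{^\s*(\d+)\s*/\s*(\d+)\s*([smh]?)\s*$}i
        or return;    # unparsable values yield no delay
    return unless $requests > 0;

    my %seconds_per = (s => 1, m => 60, h => 3600);
    my $seconds = $period * $seconds_per{ lc($unit) || 's' };

    # N requests per P seconds is one request every P/N seconds.
    return $seconds / $requests;
}

print request_rate_to_delay('1/5'),   "\n";    # 1 request per 5 seconds: prints 5
print request_rate_to_delay('30/1m'), "\n";    # 30 requests per minute: prints 2
```

Values that do not fit the assumed format return nothing, so a caller can fall back to an explicit Crawl-delay value or to no delay at all.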
    $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => $user_agent
    );
    $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => $user_agent,
        ignore_default => 1,
    );
Creates a new object to handle the rules in robots.txt. The object parses the rules that match $user_agent. Rules under User-agent: * always match but have lower precedence than rules that explicitly match $user_agent. If the ignore_default option is specified, rules under User-agent: * are ignored.
$rules->parse($uri, $text);
Parses $text, the content of the robots.txt file located at $uri.
$rules->match_ua($pattern);
Tests whether the user agent matches $pattern.
$hash = $rules->rules_for($uri);
Returns a Hash::MultiValue describing the rules for the domain of $uri.
$test = $rules->allows($uri);
Tests whether the user agent is allowed to visit $uri. If an 'Allow' rule matches the path of $uri, the visit is allowed. Otherwise, if a 'Disallow' rule matches the path of $uri, the visit is not allowed. If neither kind of rule matches, the visit is allowed.
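The order of those checks can be sketched in plain Perl. This is only an illustration of the decision order described above, not the module's implementation: the helper `is_allowed` is hypothetical, and treating a rule as a non-empty path prefix is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): decide whether $path may be
# visited, given array refs of Allow and Disallow path prefixes.
sub is_allowed {
    my ($path, $allow, $disallow) = @_;

    # Assumption: a rule applies when it is a non-empty prefix of the path.
    my $applies = sub {
        my ($rule) = @_;
        return length($rule) && index($path, $rule) == 0;
    };

    return 1 if grep { $applies->($_) } @$allow;       # Allow is checked first
    return 0 if grep { $applies->($_) } @$disallow;    # then Disallow
    return 1;                                          # no rule applies
}

my @allow    = ('/public/');
my @disallow = ('/');

for my $path ('/public/index.html', '/private/data') {
    printf "%s: %s\n", $path,
        is_allowed($path, \@allow, \@disallow) ? 'allowed' : 'denied';
}
```

With these example rules, `/public/index.html` is allowed because the Allow prefix is consulted before the blanket `Disallow: /`, while `/private/data` is denied.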
    $delay = $rules->delay_for($uri);
    $delay_in_milliseconds = $rules->delay_for($uri, 1000);
Calculates the crawl delay for the specified $uri. The value is determined by the 'Crawl-delay' rule or the 'Request-rate' rule. The optional second argument specifies the base of the return value; for example, a base of 1000 yields the delay in milliseconds.
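To make the role of the base concrete, here is a small sketch of the scaling. The helper `scale_delay` is hypothetical and stands in for the arithmetic only; it assumes the delay is held in seconds and that an omitted base means plain seconds, which is how the synopsis uses the result with `sleep`.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper (not part of the module): express a delay given in
# seconds in units of 1/$base seconds.
sub scale_delay {
    my ($delay_seconds, $base) = @_;
    $base = 1 unless defined $base;    # assumed default: plain seconds
    return $delay_seconds * $base;
}

my $seconds      = scale_delay(0.25);          # 0.25
my $milliseconds = scale_delay(0.25, 1000);    # 250

# Time::HiRes::usleep takes microseconds, so a base of 1_000_000 fits it.
usleep(scale_delay(0.25, 1_000_000));
```

Choosing the base to match the sleep function in use avoids losing sub-second delays, since the built-in `sleep` only handles whole seconds.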
Copyright (C) INA Lintaro
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
INA Lintaro <tarao.gnn@gmail.com>