
NAME

WWW::RobotRules::Parser::MultiValue - Parse robots.txt

SYNOPSIS

    use WWW::RobotRules::Parser::MultiValue;
    use LWP::Simple qw(get);

    my $url = 'http://example.com/robots.txt';
    my $robots_txt = get $url;

    my $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => 'TestBot/1.0',
    );
    $rules->parse($url, $robots_txt);

    if ($rules->allows('http://example.com/some/path')) {
        my $delay = $rules->delay_for('http://example.com/');
        sleep $delay;
        ...
    }

    my $hash = $rules->rules_for('http://example.com/');
    my @list_of_allowed_paths = $hash->get_all('allow');
    my @list_of_custom_rule_value = $hash->get_all('some-rule');

DESCRIPTION

WWW::RobotRules::Parser::MultiValue is a parser for robots.txt.

The rules parsed for the specified user agent are stored as a Hash::MultiValue whose keys are lowercase rule names.

The Request-rate rule is handled specially: it is normalized to a Crawl-delay rule.
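The normalization can be pictured with a small standalone sketch. It is not the module's code: the helper name request_rate_to_delay is hypothetical, and the accepted syntax ("requests/period" with an optional s, m, or h suffix) is an assumption about how Request-rate values are commonly written.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): convert a Request-rate
# value such as "1/5", "30/1m" or "100/1h" into a delay in seconds.
sub request_rate_to_delay {
    my ($value) = @_;

    # Assumed syntax: "<requests>/<period>" plus an optional unit (s, m, h).
    my ($requests, $period, $unit)
        = $value =~ m{^\s*(\d+)\s*/\s*(\d+)\s*([smh]?)\s*$}i
        or return;    # unparsable value: no delay can be derived
    return if $requests == 0;

    my %seconds_per = (s => 1, m => 60, h => 3600);
    my $period_in_seconds = $period * $seconds_per{ lc($unit) || 's' };

    # N requests per period means one request every period/N seconds.
    return $period_in_seconds / $requests;
}

printf "%s => %s\n", $_, request_rate_to_delay($_) for qw(1/5 30/1m 100/1h);
```

Under these assumptions, "1/5" (one request every five seconds) corresponds to a delay of 5 seconds.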

METHODS

new
    $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => $user_agent
    );
    $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => $user_agent,
        ignore_default => 1,
    );

Creates a new object to handle the rules in robots.txt. The object keeps the rules that match $user_agent. Rules under User-agent: * always match, but have lower precedence than rules that match $user_agent explicitly. If the ignore_default option is specified, rules under User-agent: * are ignored.

parse
    $rules->parse($uri, $text);

Parses $text, the content of the robots.txt file whose URI is $uri.

match_ua
    $rules->match_ua($pattern);

Tests whether the user agent matches $pattern.

rules_for
    $hash = $rules->rules_for($uri);

Returns a Hash::MultiValue that describes the rules for the domain of $uri.

allows
    $test = $rules->allows($uri);

Tests whether the user agent is allowed to visit $uri. If there is an 'Allow' rule for the path of $uri, the visit is allowed. If there is a 'Disallow' rule for the path of $uri, the visit is not allowed. Otherwise, the visit is allowed.
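The order of these checks can be illustrated with a minimal standalone sketch. The helper is_allowed is hypothetical and is not the module's implementation; in particular, treating each rule value as a plain path prefix is an assumption made only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether $path may be visited, given array
# references of Allow and Disallow values. Prefix matching is assumed.
sub is_allowed {
    my ($path, $allow, $disallow) = @_;

    # 1. A matching Allow rule permits the visit.
    return 1 if grep { length($_) && index($path, $_) == 0 } @$allow;

    # 2. Otherwise, a matching Disallow rule forbids it.
    #    Empty values (a bare "Disallow:") are skipped.
    return 0 if grep { length($_) && index($path, $_) == 0 } @$disallow;

    # 3. No rule matched: the visit is allowed by default.
    return 1;
}

my @allow    = ('/private/public/');
my @disallow = ('/private/');

for my $path ('/index.html', '/private/secret.html', '/private/public/a.html') {
    my $verdict = is_allowed($path, \@allow, \@disallow) ? 'allowed' : 'disallowed';
    printf "%-26s %s\n", $path, $verdict;
}
```

With these sample rules, only /private/secret.html is reported as disallowed, because the more specific path is covered by the Allow value.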

delay_for
    $delay = $rules->delay_for($uri);
    $delay_in_milliseconds = $rules->delay_for($uri, 1000);

Calculates the crawl delay for the specified $uri. The value is determined by the 'Crawl-delay' rule or the 'Request-rate' rule. The optional second argument specifies the base of the return value; for example, a base of 1000 yields the delay in milliseconds.
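The effect of the base can be sketched as simple arithmetic. The helper scale_delay below is hypothetical and standalone; it assumes that the delay is held in seconds, that the base is a plain multiplier, and that an omitted base means 1.

```perl
use strict;
use warnings;

# Hypothetical helper: scale a delay given in seconds by $base.
# Assumption: a missing base is treated as 1 (plain seconds).
sub scale_delay {
    my ($seconds, $base) = @_;
    $base = 1 unless defined $base;
    return $seconds * $base;
}

my $crawl_delay = 2.5;    # e.g. taken from "Crawl-delay: 2.5"

printf "seconds:      %s\n", scale_delay($crawl_delay);
printf "milliseconds: %s\n", scale_delay($crawl_delay, 1000);
printf "microseconds: %s\n", scale_delay($crawl_delay, 1_000_000);
```

A base of 1_000_000 would suit a sleep function that takes microseconds, such as usleep from Time::HiRes.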

SEE ALSO

Hash::MultiValue

LICENSE

Copyright (C) INA Lintaro

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

INA Lintaro <tarao.gnn@gmail.com>