.\" Automatically generated by Pod::Man 2.22 (Pod::Simple 3.07)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.el \{\
. de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "WWW::RobotRules 3"
.TH WWW::RobotRules 3 "2012-02-18" "perl v5.10.1" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
WWW::RobotRules \- database of robots.txt\-derived permissions
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 2
\& use WWW::RobotRules;
\& my $rules = WWW::RobotRules\->new(\*(AqMOMspider/1.0\*(Aq);
\&
\& use LWP::Simple qw(get);
\&
\& {
\&   my $url = "http://some.place/robots.txt";
\&   my $robots_txt = get $url;
\&   $rules\->parse($url, $robots_txt) if defined $robots_txt;
\& }
\&
\& {
\&   my $url = "http://some.other.place/robots.txt";
\&   my $robots_txt = get $url;
\&   $rules\->parse($url, $robots_txt) if defined $robots_txt;
\& }
\&
\& # Now we can check if a URL is valid for those servers
\& # whose "robots.txt" files we\*(Aqve gotten and parsed:
\& if($rules\->allowed($url)) {
\&     $c = get $url;
\&     ...
\& }
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This module parses \fI/robots.txt\fR files as specified in
\&\*(L"A Standard for Robot Exclusion\*(R", at
Webmasters can use the \fI/robots.txt\fR file to forbid conforming
robots from accessing parts of their web site.
.PP
The parsed files are kept in a WWW::RobotRules object, and this object
provides methods to check if access to a given \s-1URL\s0 is prohibited. The
same WWW::RobotRules object can be used for one or more parsed
\&\fI/robots.txt\fR files on any number of hosts.
.PP
The following methods are provided:
.ie n .IP "$rules = WWW::RobotRules\->new($robot_name)" 4
.el .IP "\f(CW$rules\fR = WWW::RobotRules\->new($robot_name)" 4
.IX Item "$rules = WWW::RobotRules->new($robot_name)"
This is the constructor for WWW::RobotRules objects. The first
argument given to \fInew()\fR is the name of the robot.
.ie n .IP "$rules\->parse($robot_txt_url, $content, $fresh_until)" 4
.el .IP "\f(CW$rules\fR\->parse($robot_txt_url, \f(CW$content\fR, \f(CW$fresh_until\fR)" 4
.IX Item "$rules->parse($robot_txt_url, $content, $fresh_until)"
The \fIparse()\fR method takes as arguments the \s-1URL\s0 that was used to
retrieve the \fI/robots.txt\fR file, and the contents of the file. An
optional third argument gives the time until which the parsed rules
should be considered fresh (see the sketch following this list).
.ie n .IP "$rules\->allowed($uri)" 4
.el .IP "\f(CW$rules\fR\->allowed($uri)" 4
.IX Item "$rules->allowed($uri)"
Returns \s-1TRUE\s0 if this robot is allowed to retrieve this \s-1URL\s0.
.ie n .IP "$rules\->agent([$name])" 4
.el .IP "\f(CW$rules\fR\->agent([$name])" 4
.IX Item "$rules->agent([$name])"
Get/set the agent name. \s-1NOTE:\s0 Changing the agent name will clear the robots.txt
rules and expire times out of the cache.
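.PP
The following is a brief sketch of how these methods combine; the
robot name, host name, and one\-hour freshness window are assumptions
for illustration, not part of the interface:
.PP
.Vb 13
\& use WWW::RobotRules;
\& use LWP::Simple qw(get);
\&
\& my $rules = WWW::RobotRules\->new(\*(AqExampleBot/0.1\*(Aq);
\&
\& my $url = \*(Aqhttp://www.example.com/robots.txt\*(Aq;
\& my $robots_txt = get $url;
\&
\& # Cache the rules for this host for one hour
\& $rules\->parse($url, $robots_txt, time + 3600) if defined $robots_txt;
\&
\& print "OK to fetch\en"
\&     if $rules\->allowed(\*(Aqhttp://www.example.com/doc.html\*(Aq);
.Ve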
.SH "ROBOTS.TXT"
.IX Header "ROBOTS.TXT"
The format and semantics of the \*(L"/robots.txt\*(R" file are as follows
(this is an edited abstract of
<http://www.robotstxt.org/wc/norobots.html>):
.PP
The file consists of one or more records separated by one or more
blank lines. Each record contains lines of the form
.PP
.Vb 1
\& <field\-name>: <value>
.Ve
.PP
The field name is case insensitive. Text after the '#' character on a
line is ignored during parsing. This is used for comments. The
following <field\-names> can be used:
.IP "User-Agent" 3
.IX Item "User-Agent"
The value of this field is the name of the robot the record is
describing access policy for. If more than one \fIUser-Agent\fR field is
present, the record describes an identical access policy for more than
one robot. At least one field needs to be present per record. If the
value is '*', the record describes the default access policy for any
robot that has not matched any of the other records.
.Sp
The \fIUser-Agent\fR fields must occur before the \fIDisallow\fR fields. If a
record contains a \fIUser-Agent\fR field after a \fIDisallow\fR field, that
constitutes a malformed record. This parser will assume that a blank
line should have been placed before that \fIUser-Agent\fR field, and will
break the record into two. All the fields before the \fIUser-Agent\fR field
will constitute a record, and the \fIUser-Agent\fR field will be the first
field in a new record.
.IP "Disallow" 3
.IX Item "Disallow"
The value of this field specifies a partial \s-1URL\s0 that is not to be
visited. This can be a full path, or a partial path; any \s-1URL\s0 that
starts with this value will not be retrieved (see the example below).
.PP
Unrecognized records are ignored.
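.PP
For example (the paths are illustrative), the following record keeps
conforming robots away from both \fI/help.html\fR and
\&\fI/help/index.html\fR, because both of those paths start with the
value \*(L"/help\*(R":
.PP
.Vb 2
\& User\-agent: *
\& Disallow: /help
.Ve
.PP
Writing \*(L"Disallow: /help/\*(R" instead would still permit
\&\fI/help.html\fR, since only this prefix match is applied.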
.SH "ROBOTS.TXT EXAMPLES"
.IX Header "ROBOTS.TXT EXAMPLES"
The following example \*(L"/robots.txt\*(R" file specifies that no robots
should visit any \s-1URL\s0 starting with \*(L"/cyberworld/map/\*(R" or \*(L"/tmp/\*(R":
.PP
.Vb 3
\& User\-agent: *
\& Disallow: /cyberworld/map/ # This is an infinite virtual URL space
\& Disallow: /tmp/ # these will soon disappear
.Ve
.PP
This example \*(L"/robots.txt\*(R" file specifies that no robots should visit
any \s-1URL\s0 starting with \*(L"/cyberworld/map/\*(R", except the robot called
\&\*(L"cybermapper\*(R":
.PP
.Vb 2
\& User\-agent: *
\& Disallow: /cyberworld/map/ # This is an infinite virtual URL space
\&
\& # Cybermapper knows where to go.
\& User\-agent: cybermapper
\& Disallow:
.Ve
.PP
This example indicates that no robots should visit this site further:
.PP
.Vb 3
\& # go away
\& User\-agent: *
\& Disallow: /
.Ve
.PP
This is an example of a malformed robots.txt file.
.PP
.Vb 10
\& # robots.txt for ancientcastle.example.com
\& # I\*(Aqve locked myself away.
\& User\-agent: *
\& Disallow: /
\& # The castle is your home now, so you can go anywhere you like.
\& User\-agent: Belle
\& Disallow: /west\-wing/ # except the west wing!
\& # It\*(Aqs good to be the Prince...
\& User\-agent: Beast
\& Disallow:
.Ve
.PP
This file is missing the required blank lines between records.
However, the intention is clear.
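.PP
A minimal sketch of how this parser\*(Aqs recovery rule plays out; the
host name is assumed for illustration and the file is abbreviated:
.PP
.Vb 16
\& use WWW::RobotRules;
\&
\& my $txt = <<\*(AqEOT\*(Aq;
\& User\-agent: *
\& Disallow: /
\& User\-agent: Belle
\& Disallow: /west\-wing/
\& EOT
\&
\& my $rules = WWW::RobotRules\->new(\*(AqBelle\*(Aq);
\& $rules\->parse(\*(Aqhttp://ancientcastle.example.com/robots.txt\*(Aq, $txt);
\&
\& # The misplaced User\-agent line starts a new record, so Belle is
\& # governed by "Disallow: /west\-wing/" rather than "Disallow: /".
\& print $rules\->allowed(\*(Aqhttp://ancientcastle.example.com/library/\*(Aq)
\&     ? "allowed\en" : "denied\en";    # prints "allowed"
.Ve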
.SH "SEE ALSO"
.IX Header "SEE ALSO"
LWP::RobotUA, WWW::RobotRules::AnyDBM_File
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
.Vb 2
\& Copyright 1995\-2009, Gisle Aas
\& Copyright 1995, Martijn Koster
.Ve
.PP
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.