NAME
WWW::SimpleRobot - a simple web robot for recursively following links on web pages.
SYNOPSIS
    use WWW::SimpleRobot;
    my $robot = WWW::SimpleRobot->new(
        DEPTH                => 1,
        TRAVERSAL            => 'depth',
        VISIT_CALLBACK       => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print STDERR "Visiting $url\n";
            print STDERR "Depth = $depth\n";
            print STDERR "HTML = $html\n";
            print STDERR "Links = @$links\n";
        },
        BROKEN_LINK_CALLBACK => sub {
            my ( $url, $linked_from, $depth ) = @_;
            print STDERR "$url looks like a broken link on $linked_from\n";
            print STDERR "Depth = $depth\n";
        },
    );
    $robot->traverse;
    my @urls  = @{ $robot->urls };
    my @pages = @{ $robot->pages };
    for my $page ( @pages )
    {
        my $url               = $page->{url};
        my $depth             = $page->{depth};
        my $modification_time = $page->{modification_time};
    }
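Note that the example above does not show where the robot starts crawling, or which links it is allowed to follow. Below is a minimal sketch of a fuller invocation, assuming the constructor also accepts a URLS option (a list of start pages - an assumption here) alongside the FOLLOW_REGEX option described under DESCRIPTION:

    # A minimal sketch only; URLS is an assumed constructor option,
    # FOLLOW_REGEX is described in the DESCRIPTION below.
    use WWW::SimpleRobot;

    my $robot = WWW::SimpleRobot->new(
        URLS           => [ 'http://www.example.com/' ],   # start page(s) - assumed option
        FOLLOW_REGEX   => '^http://www\.example\.com/',    # only follow on-site links
        DEPTH          => 2,
        TRAVERSAL      => 'depth',
        VISIT_CALLBACK => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print "Visited $url at depth $depth\n";
        },
    );
    $robot->traverse;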
DESCRIPTION
A simple perl module for doing robot stuff. For a more elaborate interface,
see WWW::Robot. This version uses LWP::Simple to grab pages, and
HTML::LinkExtor to extract the links from them. Only href attributes of
anchor tags are extracted. Extracted links are checked against the
FOLLOW_REGEX regex to see if they should be followed. A HEAD request is
made to these links, to check that they are 'text/html' type pages.
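The following is a rough sketch of that approach - not the module's actual source - showing how LWP::Simple and HTML::LinkExtor can be combined to fetch a page, pull the href attributes of anchor tags, and HEAD-check candidate links against a regex. The URL and regex are hypothetical.

    # Sketch of the approach outlined above, not WWW::SimpleRobot's own code.
    use LWP::Simple qw( get head );
    use HTML::LinkExtor;
    use URI;

    my $url  = 'http://www.example.com/';             # hypothetical start page
    my $html = get( $url ) or die "can't get $url";

    # Collect href attributes of anchor tags only
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, URI->new_abs( $attr{href}, $url )
                if $tag eq 'a' and $attr{href};
        }
    );
    $extor->parse( $html );

    # Follow only links matching a regex, and only 'text/html' pages
    my $follow_regex = qr{^http://www\.example\.com/};
    for my $link ( @links )
    {
        next unless $link =~ $follow_regex;
        my ( $content_type ) = head( $link ) or next;   # HEAD request
        next unless $content_type =~ m{^text/html};
        print "would follow $link\n";
    }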
BUGS
This robot doesn't respect the Robot Exclusion Protocol (naughty
robot!), and doesn't do any exception handling if it can't get pages - it
just ignores them and goes on to the next page!
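If you do want the exclusion protocol respected, one caller-side workaround (a sketch only, not part of this module) is to filter URLs through WWW::RobotRules before handing them to the robot:

    # Sketch of a caller-side robots.txt check using WWW::RobotRules;
    # the agent name and URL are hypothetical.
    use LWP::Simple qw( get );
    use WWW::RobotRules;
    use URI;

    my $rules = WWW::RobotRules->new( 'MySimpleRobot/0.1' );

    my $url        = 'http://www.example.com/some/page.html';
    my $robots_url = URI->new( $url );
    $robots_url->path( '/robots.txt' );

    if ( my $robots_txt = get( $robots_url ) )
    {
        $rules->parse( $robots_url, $robots_txt );
    }
    print "$url is disallowed by robots.txt\n"
        unless $rules->allowed( $url );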
AUTHOR
Ave Wrigley <Ave.Wrigley@itn.co.uk>
COPYRIGHT
Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.