WWW::Crawler::Lite - A single-threaded crawler/spider for the web.
  my %pages = ( );
  my $pattern = 'https?://example\.com\/';
  my %links = ( );
  my $downloaded = 0;
  my $crawler;
  $crawler = WWW::Crawler::Lite->new(
    agent       => 'MySuperBot/1.0',
    url_pattern => $pattern,
    http_accept => [qw( text/plain text/html application/xhtml+xml )],
    link_parser => 'default',
    on_response => sub {
      my ($url, $res) = @_;
      warn "$url contains " . $res->content;
      # Stop after a handful of pages:
      $downloaded++;
      $crawler->stop() if $downloaded > 5;
    },
    follow_ok   => sub {
      my ($url) = @_;
      # If you like this url and want to follow it, return a true value:
      return 1;
    },
    on_link     => sub {
      my ($from, $to, $text) = @_;
      return if exists($pages{$to}) && $pages{$to} eq 'BAD';
      $pages{$to}++;
      $links{$to} ||= [ ];
      push @{$links{$to}}, { from => $from, text => $text };
    },
    on_bad_url  => sub {
      my ($url) = @_;
      # Mark this url as 'bad':
      $pages{$url} = 'BAD';
    },
  );
  $crawler->crawl( url => "http://example.com/" );
  warn "DONE!!!!!";

  use Data::Dumper;
  warn "$_ ($pages{$_} incoming links) -> " . Dumper($links{$_})
    for sort keys %links;
WWW::Crawler::Lite is a single-threaded spider/crawler for the web. It can be used within a mod_perl, CGI or Catalyst-style environment because it does not fork or use threads.
The callback-based interface is fast and simple, letting you focus on processing the data that WWW::Crawler::Lite extracts from the target website.
new( %args ) - Creates and returns a new WWW::Crawler::Lite object.
The %args hash is not required, but may contain the following elements:
agent - Used as the user-agent string for HTTP requests.
Default Value: WWW-Crawler-Lite/$VERSION $^O
url_pattern - New links that do not match this pattern will not be added to the processing queue.
Default Value: https?://.+
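For instance, a pattern anchored to a single host keeps the crawler on that site. The snippet below is a standalone sketch of how such patterns match - the crawler applies url_pattern internally, and the URLs here are placeholders:

```perl
use strict;
use warnings;

# The documented default accepts any http(s) URL:
my $default_pattern = qr{https?://.+};

# A pattern like the one in the SYNOPSIS restricts crawling to one host:
my $site_pattern = qr{https?://example\.com/};

for my $url ( 'http://example.com/about', 'http://elsewhere.net/page' ) {
    my $verdict = $url =~ $site_pattern ? 'queued' : 'ignored';
    print "$url => $verdict\n";
}
```

Remember to escape regexp metacharacters (such as the dots in a hostname) if you build the pattern from a literal URL.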
http_accept - This can be used to filter out unwanted responses.
Default Value: [qw( text/html text/plain application/xhtml+xml )]

link_parser - Valid values: 'default' and 'HTML::LinkExtor'.
The default value is 'default', which uses a naive regexp to do the link parsing.
The upshot of using 'default' is that the regexp will also find the hyperlinked text (or the alt-text of a hyperlinked img tag) and pass that to your on_link handler.
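The difference can be illustrated with a naive anchor-matching regexp in the same spirit as the 'default' parser (the pattern below is an illustration, not the module's actual regexp):

```perl
use strict;
use warnings;

my $html = '<a href="http://example.com/next">Next page</a>';

# A naive pattern in the spirit of the 'default' parser: it captures the
# href target and the hyperlinked text in one pass, which is what lets the
# on_link handler receive the anchor text as its third argument.
if ( $html =~ m{<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>}is ) {
    my ( $to, $text ) = ( $1, $2 );
    print "link to $to, text '$text'\n";
}
```

HTML::LinkExtor, by contrast, parses tags properly but reports only link targets, so with that parser you should not rely on the anchor-text argument to on_link.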
on_response - Called whenever a successful response is received.
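The handler receives the URL and the response object. Here is a sketch of a handler that records page sizes - the handler name and the %bytes hash are invented for this example, and the sketch assumes the response offers an HTTP::Response-style content method, as the SYNOPSIS's use of $res->content suggests:

```perl
use strict;
use warnings;

# Illustrative handler: record the size of each successfully fetched page.
my %bytes;
my $on_response = sub {
    my ( $url, $res ) = @_;
    $bytes{$url} = length( $res->content );
};

# It would be passed to the constructor as: on_response => $on_response
```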
on_link - Called whenever a new link is found. Arguments are:
The URL that is linked *from*
The URL that is linked *to*
The anchor text (i.e. the HTML within the link: <a href="...">This Text Here</a>)
on_bad_url - Called whenever an unsuccessful response is received.
Indicates the length of time (in seconds) that the crawler should pause before making each request. This is useful when you want to spider a website rather than launch a denial-of-service attack on it.
stop( ) - Causes the crawler to stop processing its queue of URLs.
John Drago <jdrago_999@yahoo.com>
This software is Free software and may be used and redistributed under the same terms as Perl itself.
To install WWW::Crawler::Lite, copy and paste the appropriate command into your terminal.
cpanm
cpanm WWW::Crawler::Lite
CPAN shell
perl -MCPAN -e shell
install WWW::Crawler::Lite
For more information on module installation, please visit the detailed CPAN module installation guide.