WWW::Crawler::Mojo - A web crawling framework for Perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    $scrape->();
});

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    $enqueue->();
});

$bot->enqueue('http://example.com/');
$bot->crawl;
WWW::Crawler::Mojo is a web crawling framework for those who are familiar with the Mojo::* APIs.
Note that this module is aimed at trivial use cases, crawling within a moderate range of web pages, so DO NOT use it for persistent crawler jobs.
WWW::Crawler::Mojo inherits all attributes from Mojo::EventEmitter and implements the following new ones.
HTML element handlers used during scraping.
my $handlers = $bot->element_handlers;
$bot->element_handlers->{img} = sub {
    my $dom = shift;
    return $dom->{src};
};
A Mojo::UserAgent instance.
my $ua = $bot->ua; $bot->ua(Mojo::UserAgent->new);
Name of the crawler, used for the User-Agent header.
$bot->ua_name('my-bot/0.01 (+https://example.com/)'); say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'
The number of currently active connections.
$bot->active_conn($bot->active_conn + 1); say $bot->active_conn;
The number of currently active connections per host.
$bot->active_conns_per_host($bot->active_conns_per_host + 1); say $bot->active_conns_per_host;
A hash whose keys are MD5 hashes of enqueued URLs.
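The dedup bookkeeping can be reproduced with core Digest::MD5. The sketch below is illustrative only; the helper name mark_seen and the plain-string hashing are assumptions, not the module's exact algorithm:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Key a hash by the MD5 of each URL so duplicates are skipped.
my %fix;

sub mark_seen {
    my $url = shift;
    my $key = md5_hex($url);
    return 0 if $fix{$key};    # already enqueued
    $fix{$key} = 1;
    return 1;                  # newly seen
}
```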
The maximum number of concurrent connections.
$bot->max_conn(5); say $bot->max_conn; # 5
The maximum number of concurrent connections per host.
$bot->max_conn_per_host(5); say $bot->max_conn_per_host; # 5
A port number for the peeping monitor. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.
$bot->peeping_port(3001); say $bot->peeping_port; # 3001
The maximum length of the peeping monitor content.
$bot->peeping_max_length(100000); say $bot->peeping_max_length; # 100000
A FIFO array containing WWW::Crawler::Mojo::Job objects.
push(@{$bot->queue}, WWW::Crawler::Mojo::Job->new(...));
my $job = shift @{$bot->queue};
An interval in seconds at which to shuffle the job queue. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.
$bot->shuffle(5); say $bot->shuffle; # 5
WWW::Crawler::Mojo inherits all events from Mojo::EventEmitter and implements the following new ones.
Emitted when the crawler gets a response from a server. The callback takes 4 arguments.
$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    if (...) {
        $scrape->();
    } else {
        # DO NOTHING
    }
});
WWW::Crawler::Mojo instance.
A code reference that scrapes URLs out of the document. This is a shorthand for $bot->scrape($job).
WWW::Crawler::Mojo::Job instance.
Mojo::Message::Response instance.
Emitted when a new URI is found. You can enqueue the URI conditionally in the callback.
$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    if (...) {
        $enqueue->();
    } elsif (...) {
        $enqueue->(...); # maybe a different URL
    } else {
        # DO NOTHING
    }
});
A code reference that enqueues the found URI. This is a shorthand for $bot->enqueue($job).
Either Mojo::DOM or Mojo::URL instance.
Emitted when the queue length reaches zero. The length is checked every 5 seconds.
$bot->on(empty => sub { my ($bot) = @_; say "Queue is drained out."; });
Emitted when the user agent returns no status code for a request, typically caused by network errors or unresponsive servers.
$bot->on(error => sub {
    my ($bot, $error, $job) = @_;
    say "error: $error";
    if (...) { # until failure occurs 3 times
        $bot->requeue($job);
    }
});
Note that server errors such as 404 or 500 cannot be caught with this event. Consider the res event for that use case instead.
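One way to route both cases is to branch on the status code inside the res handler. The helper below is a hypothetical sketch of that decision; the name classify_status is not part of the module:

```perl
use strict;
use warnings;

# Decide how a res handler might treat a response by status code.
# 2xx: scrape the document; 3xx: follow the redirect;
# 4xx/5xx: skip (the error event will NOT fire for these).
sub classify_status {
    my $code = shift;
    return 'scrape'   if $code >= 200 && $code < 300;
    return 'redirect' if $code >= 300 && $code < 400;
    return 'skip';
}
```

Inside the res callback, $res->code supplies the status code to branch on.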
Emitted right before crawl is started.
$bot->on(start => sub { my $self = shift; ... });
WWW::Crawler::Mojo inherits all methods from Mojo::EventEmitter and implements the following new ones.
Start crawling loop.
$bot->crawl;
Initialize crawler settings.
$bot->init;
Process a job.
$bot->process_job;
Displays starting messages on STDOUT.
$bot->say_start;
The peeping API dispatcher.
$bot->peeping_handler($loop, $stream);
Parses a web page and discovers the links in it. Each link is appended to the FIFO queue.
$bot->scrape($res, $job);
Appends one or more URIs or WWW::Crawler::Mojo::Job objects to the queue.
$bot->enqueue('http://example.com/index1.html');
OR
$bot->enqueue($job1, $job2);
$bot->enqueue(
    'http://example.com/index1.html',
    'http://example.com/index2.html',
    'http://example.com/index3.html',
);
Appends one or more URLs or jobs for retry. This accepts the same arguments as the enqueue method.
$bot->on(error => sub {
    my ($bot, $msg, $job) = @_;
    if (...) { # until failure occurs 3 times
        $bot->requeue($job);
    }
});
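The "until failure occurs 3 times" condition above can be filled in with a plain hash of per-URL counters. This is a minimal sketch in core Perl; %retries, MAX_RETRIES, and should_requeue are illustrative names, not part of the module:

```perl
use strict;
use warnings;

use constant MAX_RETRIES => 3;

my %retries;

# Returns true while the URL has failed fewer than MAX_RETRIES times,
# incrementing its failure counter as a side effect.
sub should_requeue {
    my $url = shift;
    return ++$retries{$url} <= MAX_RETRIES;
}
```

In the error handler, the job's URL (e.g. $job->url) would be the natural key.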
Collects URLs out of HTML.
$bot->collect_urls_html($dom, sub { my ($uri, $dom) = @_; });
Collects URLs out of CSS.
@urls = collect_urls_css($dom);
Guesses the encoding of HTML or CSS from a given Mojo::Message::Response instance.
$encode = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';
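Part of what such a guess involves can be approximated by reading the charset parameter from a Content-Type header. The regex helper below is only an illustration, not the module's implementation (which may also inspect meta tags or @charset rules):

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type header value, if any.
sub charset_from_content_type {
    my $ct = shift // '';
    return $ct =~ /charset\s*=\s*["']?([\w.:-]+)/i ? lc $1 : undef;
}
```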
Resolves a URL against a base URL.
WWW::Crawler::Mojo::resolve_href($base, $uri);
https://github.com/jamadam/WWW-Flatten
Sugama Keita, <sugama@jamadam.com>
Copyright (C) jamadam
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install WWW::Crawler::Mojo, copy and paste the appropriate command into your terminal.
cpanm
cpanm WWW::Crawler::Mojo
CPAN shell
perl -MCPAN -e shell
install WWW::Crawler::Mojo
For more information on module installation, please visit the detailed CPAN module installation guide.