
NAME

WWW::Crawler::Mojo - A web crawling framework for Perl

SYNOPSIS

    use strict;
    use warnings;
    use WWW::Crawler::Mojo;
    
    my $bot = WWW::Crawler::Mojo->new;
    
    $bot->on(res => sub {
        my ($bot, $scrape, $job, $res) = @_;
        
        $scrape->();
    });
    
    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        
        $enqueue->();
    });
    
    $bot->enqueue('http://example.com/');
    $bot->crawl;

DESCRIPTION

WWW::Crawler::Mojo is a web crawling framework for those who are familiar with Mojo::* APIs.

Note that this module is aimed at casual crawling of a moderate number of web pages, so DO NOT use it for persistent crawler jobs.

ATTRIBUTES

WWW::Crawler::Mojo inherits all attributes from Mojo::EventEmitter and implements the following new ones.

element_handlers

HTML element handlers invoked during scraping.

    my $handlers = $bot->element_handlers;
    $bot->element_handlers->{img} = sub {
        my $dom = shift;
        return $dom->{src};
    };

ua

A Mojo::UserAgent instance.

    my $ua = $bot->ua;
    $bot->ua(Mojo::UserAgent->new);

ua_name

The crawler name used for the User-Agent header.

    $bot->ua_name('my-bot/0.01 (+https://example.com/)');
    say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'

active_conn

The number of currently active connections.

    $bot->active_conn($bot->active_conn + 1);
    say $bot->active_conn;

active_conns_per_host

The number of currently active connections per host.

    $bot->active_conns_per_host($bot->active_conns_per_host + 1);
    say $bot->active_conns_per_host;

fix

A hash whose keys are MD5 hashes of enqueued URLs.

max_conn

The maximum number of concurrent connections.

    $bot->max_conn(5);
    say $bot->max_conn; # 5

max_conn_per_host

The maximum number of concurrent connections per host.

    $bot->max_conn_per_host(5);
    say $bot->max_conn_per_host; # 5

peeping_port

A port number for the peeping monitor. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.

    $bot->peeping_port(3001);
    say $bot->peeping_port; # 3001

peeping_max_length

The maximum content length for the peeping monitor.

    $bot->peeping_max_length(100000);
    say $bot->peeping_max_length; # 100000

queue

A FIFO array containing WWW::Crawler::Mojo::Job objects.

    push(@{$bot->queue}, WWW::Crawler::Mojo::Job->new(...));
    my $job = shift @{$bot->queue};

shuffle

An interval in seconds at which to shuffle the job queue. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.

    $bot->shuffle(5);
    say $bot->shuffle; # 5

EVENTS

WWW::Crawler::Mojo inherits all events from Mojo::EventEmitter and implements the following new ones.

res

Emitted when the crawler receives a response from a server. The callback takes four arguments.

    $bot->on(res => sub {
        my ($bot, $scrape, $job, $res) = @_;
        if (...) {
            $scrape->();
        } else {
            # DO NOTHING
        }
    });
$bot

WWW::Crawler::Mojo instance.

$scrape

Scrapes URLs out of the document. This is a shorthand for $bot->scrape($res, $job).

$job

WWW::Crawler::Mojo::Job instance.

$res

Mojo::Message::Response instance.

refer

Emitted when a new URI is found. You can conditionally enqueue the URI from the callback.

    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        if (...) {
            $enqueue->();
        } elsif (...) {
            $enqueue->(...); # maybe a different URL
        } else {
            # DO NOTHING
        }
    });
$bot

WWW::Crawler::Mojo instance.

$enqueue

Enqueues the URI. This is a shorthand for $bot->enqueue($job).

$job

WWW::Crawler::Mojo::Job instance.

$context

Either Mojo::DOM or Mojo::URL instance.
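A common pattern is to restrict crawling to the seed host. The sketch below assumes the job object exposes its URL via a url accessor (check WWW::Crawler::Mojo::Job for the exact attribute name) and uses example.com as a placeholder host:

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

# Follow a link only when it stays on the seed host.
# NOTE: the url accessor on $job is an assumption; consult
# WWW::Crawler::Mojo::Job for the actual attribute name.
$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    $enqueue->() if $job->url =~ m{\Ahttps?://example\.com(?:/|\z)};
});

$bot->enqueue('http://example.com/');
```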

empty

Emitted when the queue length reaches zero. The length is checked every 5 seconds.

    $bot->on(empty => sub {
        my ($bot) = @_;
        say "Queue is drained out.";
    });

error

Emitted when the user agent gets no status code for a request, typically caused by network errors or unresponsive servers.

    $bot->on(error => sub {
        my ($bot, $error, $job) = @_;
        say "error: $error";
        if (...) { # e.g. until the failure occurs 3 times
            $bot->requeue($job);
        }
    });

Note that server errors such as 404 or 500 cannot be caught with this event. Consider the res event for those cases instead.

start

Emitted right before crawling starts.

    $bot->on(start => sub {
        my $self = shift;
        ...
    });

METHODS

WWW::Crawler::Mojo inherits all methods from Mojo::EventEmitter and implements the following new ones.

crawl

Starts the crawling loop.

    $bot->crawl;

init

Initializes the crawler settings.

    $bot->init;

process_job

Processes a single job.

    $bot->process_job;

say_start

Displays a startup message on STDOUT.

    $bot->say_start;

peeping_handler

Dispatcher for the peeping API.

    $bot->peeping_handler($loop, $stream);

scrape

Parses a web page and discovers links. Each link is appended to the FIFO queue.

    $bot->scrape($res, $job);

enqueue

Appends one or more URIs or WWW::Crawler::Mojo::Job objects to the queue.

    $bot->enqueue('http://example.com/index1.html');

OR

    $bot->enqueue($job1, $job2);

OR

    $bot->enqueue(
        'http://example.com/index1.html',
        'http://example.com/index2.html',
        'http://example.com/index3.html',
    );

requeue

Appends one or more URLs or jobs for retry. This accepts the same arguments as the enqueue method.

    $bot->on(error => sub {
        my ($bot, $error, $job) = @_;
        if (...) { # e.g. until the failure occurs 3 times
            $bot->requeue($job);
        }
    });
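One concrete way to fill in the retry condition is to count failures per URL in a hash. This is a sketch; the url accessor on $job is an assumption borrowed from WWW::Crawler::Mojo::Job:

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;
my %failures;    # failure count per URL

$bot->on(error => sub {
    my ($bot, $error, $job) = @_;
    my $url = $job->url;    # assumed accessor; see WWW::Crawler::Mojo::Job
    $bot->requeue($job) if ++$failures{$url} <= 3;
});
```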

collect_urls_html

Collects URLs out of HTML.

    $bot->collect_urls_html($dom, sub {
        my ($uri, $dom) = @_;
    });

collect_urls_css

Collects URLs out of CSS.

    my @urls = $bot->collect_urls_css($css);

guess_encoding

Guesses the encoding of HTML or CSS from a given Mojo::Message::Response instance.

    my $encode = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';
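A typical use is decoding a response body before inspecting its text. The sketch below builds a response by hand for illustration; in practice $res comes from the res event, and the charset value here is just an example:

```perl
use strict;
use warnings;
use Encode ();
use Mojo::Message::Response;
use WWW::Crawler::Mojo;

# A hand-made response standing in for one received from the res event.
my $res = Mojo::Message::Response->new;
$res->headers->content_type('text/html; charset=Shift_JIS');

# Fall back to UTF-8 when no charset can be guessed.
my $encoding = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';
my $text     = Encode::decode($encoding, $res->body);
```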

resolve_href

Resolves a URL against a base URL.

    WWW::Crawler::Mojo::resolve_href($base, $uri);
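For example, resolving a relative path against a page URL. This sketch assumes plain strings are accepted as arguments; the result shown follows standard RFC 3986 resolution:

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

my $base = 'http://example.com/dir/page.html';

# '../style.css' resolved against the base should point at the site root.
my $url = WWW::Crawler::Mojo::resolve_href($base, '../style.css');
```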

EXAMPLE

https://github.com/jamadam/WWW-Flatten

AUTHOR

Sugama Keita, <sugama@jamadam.com>

COPYRIGHT AND LICENSE

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.