
NAME

WWW::Leech::Walker - small web content grabbing framework

SYNOPSIS

  use WWW::Leech::Walker;

  my $walker = WWW::Leech::Walker->new({
        ua => LWP::UserAgent->new(),
        url => 'http://example.tld',

        parser => $www_leech_parser_params,

        state => {},

        logger => sub {print shift()},

        raw_preprocessor => sub{
                my $html = shift();
                my $walker_obj = shift;
                return $html;
        },

        filter => sub{
                my $urls = shift;
                my $walker_obj = shift;

                # ... filter urls

                return $urls;
        },

        processor => sub {
                my $data = shift;
                my $walker_obj = shift;

                # ... process grabbed data

        },

        error_handler => sub {
                my $item = shift;
                my $walker_obj = shift;
                my $error_text = shift;

                # ... handle error

        }
  });

  $walker->leech();

DESCRIPTION

WWW::Leech::Walker walks through a given website, parsing content and generating structured data. Its declarative interface makes Walker a small framework of sorts.

This module is designed to extract data from sites with a particular structure: an index page (or any other page provided as the root) contains links to individual pages representing the items to be grabbed. The index page may also contain 'paging' links (e.g. http://example.tld/?page=2) which lead to pages with a similar structure. The closest example is a product category page with links to individual products and links to its sub-pages.

All required parameters are set as constructor arguments. Other methods are used to start or stop the grabbing process and to invoke the logger (see below).

DETAILS

new($params)

$params must be a hashref providing all data required.

ua

An LWP-compatible user-agent object.

url

Starting url.

post_data

Url-encoded POST data. By default Walker fetches the items list using the GET method; the POST method is used if post_data is set. Requests for individual item pages still use GET.
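
For example (the form fields here are hypothetical):

  my $walker = WWW::Leech::Walker->new({
        # ...
        post_data => 'category=books&page_size=50', # hypothetical form fields
        # ...
  });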

parser

Parameters for WWW::Leech::Parser.

state

Optional user-filled value. Walker does not use it directly; it is passed to the user callbacks instead. Defaults to an empty hashref.

logger

Optional logging callback. Whenever something notable happens, Walker runs this subroutine, passing the message as its argument.
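
A minimal logger sketch that timestamps each message (the format is just one possible choice):

  logger => sub {
        my $message = shift;
        printf "[%s] %s\n", scalar(localtime), $message;
  },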

filter

Optional url-filtering callback. When Walker collects a list of item-page urls, it passes that list to this subroutine. The Walker itself is passed as the second argument and an arrayref with the links' text as the third. Walker expects it to return the filtered list (an empty list is okay).
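
For example, a sketch of a filter that drops links by their anchor text (the 'sold out' marker is hypothetical):

  filter => sub {
        my ($urls, $walker, $texts) = @_;

        my @keep;
        for my $i (0 .. $#$urls) {
            # keep only links whose text does not say 'sold out'
            push @keep, $urls->[$i] unless $texts->[$i] =~ /sold\s+out/i;
        }
        return \@keep;
  },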

processor

This callback is launched after an individual item is parsed and converted to a hashref. The hashref is passed to the processor to be saved or processed in some other way.
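
A minimal processor sketch that stashes each parsed item in 'state' (the 'items' key is an assumption of this example):

  processor => sub {
        my ($data, $walker) = @_;

        # $data is the hashref built by the parser
        push @{ $walker->{'state'}->{'items'} }, $data;
  },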

raw_preprocessor

This optional callback is launched after a page is retrieved but before parsing starts. Walker expects it to return a scalar containing the (possibly modified) page content.
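
For instance, a sketch that strips <script> blocks before parsing (a common cleanup; adjust to the site at hand):

  raw_preprocessor => sub {
        my ($html, $walker) = @_;

        # drop script blocks that may confuse the parser
        $html =~ s{<script\b.*?</script>}{}gis;
        return $html;
  },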

error_handler

Walker dies with an error message if something goes wrong while parsing. Providing this optional callback lets the caller handle such errors instead, e.g. to skip a page with broken HTML.
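
A sketch that records the failure and lets the walk continue:

  error_handler => sub {
        my ($item, $walker, $error_text) = @_;

        # log the problem instead of dying
        $walker->log("parsing failed, skipping: $error_text");
  },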

next_page_link_post_process

This optional callback allows the user to alter the next-page url. Usually these urls look like 'http://example.tld/list?page=2' and need no changes. But sometimes such links are javascript calls like 'javascript:gotoPageNumber(2)'. The source url is passed as-is, before Walker absolutizes it. Walker passes the current page url as the third argument, which may be useful for links like 'javascript:gotoNextPage()'.

Walker expects this callback to return a fixed url.
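
For example, a sketch turning the javascript pager call above into a plain url (the '/list?page=N' layout is hypothetical):

  next_page_link_post_process => sub {
        my ($url, $walker, $current_page_url) = @_;

        # turn 'javascript:gotoPageNumber(2)' into '/list?page=2';
        # Walker will absolutize the result
        if($url =~ /gotoPageNumber\((\d+)\)/){
            return "/list?page=$1";
        }
        return $url;
  },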

leech()

Starts the process.

stop()

Stops the process completely. By default Walker keeps working until it runs out of links. Some sites may contain zillions of pages while only the first million is required; this method allows stopping at some point. See the "CALLBACKS" section below.

If Walker is restarted via the leech() method it runs as if it were newly created (the 'state', however, is preserved).

log($message)

Runs the 'logger' callback with $message argument.

getCurrentDOM()

Returns the DOM currently being processed.
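
For example, from inside a callback (this assumes an XPath-capable DOM such as HTML::TreeBuilder::XPath; check which DOM class your parser produces):

  my $dom = $walker->getCurrentDOM();
  my $title = $dom->findvalue('//title'); # assumes an XPath-capable DOM
  $walker->log("processing page: $title");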

CALLBACKS

Walker passes callback-specific data as the first argument, itself as the second, and some additional data as the third, if any.

When grabbing large sites, the process should be stopped at some point (unless you need all the data, of course). This example shows how to do it using the 'state' property and the stop() method:

  #....
  state => {total_links_amount => 0},
  filter => sub{
    my $links = shift;
    my $walker = shift;

    if($walker->{'state'}->{'total_links_amount'} > 1_000_000){
        $walker->log("A million links collected. Enough.");
        $walker->stop();

        return [];
    }

    $walker->{'state'}->{'total_links_amount'} += scalar(@$links);

    return $links;
  }
  #....

AUTHOR

    Dmitry Selverstov
    CPAN ID: JAREDSPB
    jaredspb@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.