NAME

Gungho - Yet Another High Performance Web Crawler Framework

SYNOPSIS

  use Gungho;
  my $g = Gungho->new($config);
  $g->run;

DESCRIPTION

Gungho is Yet Another Web Crawler Framework, designed to be extensible and fast. It is meant to be a culmination of lessons learned while building Xango -- Xango was *fast*, but it was horribly hard to debug. Gungho tries to build from clean structures, based upon principles from the likes of Catalyst and Plagger.

All components (engine, provider, handler) are overridable and switchable. A plugin mechanism is available to add hooks that are executed during the run.

WARNING: *ALL* APIs are still subject to change.

STRUCTURE

Gungho is composed of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.
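
As a rough illustration of how the three roles divide the work, here is a minimal self-contained sketch. The Toy::* packages are hypothetical stand-ins written for this example, not Gungho's real classes, and the engine fakes its responses instead of fetching anything.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the three roles; not Gungho's real classes.
package Toy::Provider {
    sub new { my ($class, @urls) = @_; return bless { queue => [@urls] }, $class }
    # True while there are requests left to hand out.
    sub has_requests { return scalar @{ $_[0]{queue} } }
    # Hands out (and removes) every queued request.
    sub get_requests { my $self = shift; return splice @{ $self->{queue} } }
}

package Toy::Handler {
    sub new { my $class = shift; return bless { seen => [] }, $class }
    # Records each request/response pair it is given.
    sub handle_response {
        my ($self, $req, $res) = @_;
        push @{ $self->{seen} }, "$req => $res";
    }
}

package Toy::Engine {
    sub new { my $class = shift; return bless {}, $class }
    # A real engine would perform an HTTP fetch; this one fakes a response.
    sub send_request { my ($self, $req) = @_; return "fake body for $req" }
    # Drives the loop: pull requests, "fetch" them, pass results to the handler.
    sub run {
        my ($self, $provider, $handler) = @_;
        while ($provider->has_requests) {
            for my $req ($provider->get_requests) {
                $handler->handle_response($req, $self->send_request($req));
            }
        }
    }
}

package main;

my $provider = Toy::Provider->new('http://example.com/a', 'http://example.com/b');
my $handler  = Toy::Handler->new;
Toy::Engine->new->run($provider, $handler);
print "$_\n" for @{ $handler->{seen} };
```

The point of the split is that each part can be swapped independently: a different provider changes where requests come from without touching how they are fetched or handled.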

There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().

HOOKS

Currently available hooks are:

engine.send_request

engine.handle_response

METHODS

new($config)

Creates a new Gungho instance. It requires either the name of a config file or a hashref.

run

Starts the Gungho process.

has_feature($name)

Returns true if Gungho supports the feature named $name.

setup()

Sets up the Gungho environment, including calling the various setup_* methods to configure the provider, engine, handler, etc.

setup_components()

setup_engine()

setup_handler()

setup_log()

setup_provider()

setup_plugins()

Sets up the various components.

register_hook($hook_name => $coderef[, $hook_name => $coderef])

Registers one or more hooks, each to be run under its specified $hook_name.

run_hook($hook_name)

Runs all the hooks registered under $hook_name.
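
To make the register/run relationship concrete, the following is a minimal sketch of a hook registry with the same method names. Toy::Hooks is a hypothetical class written for this example; Gungho's real implementation may differ, and passing extra arguments through run_hook() is an assumption made here for illustration.

```perl
use strict;
use warnings;

# Hypothetical minimal registry; Gungho's real implementation may differ.
package Toy::Hooks {
    sub new { my $class = shift; return bless { hooks => {} }, $class }

    # Accepts one or more name => coderef pairs.
    sub register_hook {
        my ($self, @pairs) = @_;
        while (my ($name, $code) = splice @pairs, 0, 2) {
            push @{ $self->{hooks}{$name} }, $code;
        }
    }

    # Runs every hook registered under $name, in registration order.
    # Forwarding @args to the hooks is an assumption of this sketch.
    sub run_hook {
        my ($self, $name, @args) = @_;
        $_->(@args) for @{ $self->{hooks}{$name} || [] };
    }
}

package main;

my @log;
my $hooks = Toy::Hooks->new;
$hooks->register_hook(
    'engine.send_request'    => sub { push @log, "sending $_[0]" },
    'engine.handle_response' => sub { push @log, "handled $_[0]" },
);
$hooks->run_hook('engine.send_request',    'http://example.com/');
$hooks->run_hook('engine.handle_response', 'http://example.com/');
print "$_\n" for @log;
```

Running a hook name that has nothing registered is a no-op in this sketch.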

has_requests

Delegates to provider's has_requests

get_requests

Delegates to provider's get_requests

handle_response

Delegates to handler's handle_response

dispatch_requests

Calls provider->dispatch

prepare_request($req)

Given a request, prepares it before it is sent to the engine.

send_request

Delegates to engine's send_request

load_config($config)

Loads the config from $config via Config::Any.
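
The "filename or hashref" convention can be sketched as below. The normalize_config() helper is hypothetical and is not Gungho's own code; it only shows one way to accept both forms. The file branch uses Config::Any's load_files(), which returns an arrayref of single-key hashes mapping each filename to its parsed contents. The config key in the usage lines is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Gungho: accept a hashref or a filename.
sub normalize_config {
    my ($config) = @_;
    die "config must be a filename or a hashref\n" unless defined $config;

    # A hashref is already a loaded config; return it unchanged.
    return $config if ref($config) eq 'HASH';
    die "config must be a filename or a hashref\n" if ref($config);

    # Otherwise treat it as a filename and let Config::Any parse it,
    # picking the format from the file extension.
    require Config::Any;
    my $loaded = Config::Any->load_files({ files => [$config], use_ext => 1 });
    my ($entry) = @$loaded or die "could not load config from $config\n";
    my ($data) = values %$entry;
    return $data;
}

# The key below is illustrative only.
my $cfg = normalize_config({ example_key => 'example value' });
print "$cfg->{example_key}\n";
```

Config::Any is loaded lazily here, so the hashref path works even where that module is not installed.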

load_gungho_module($name, $prefix)

Loads a Gungho component. Complements the module name with the prefix 'Gungho::$prefix::', unless the name is prefixed with a '+'. In that case, no transformation is performed, and the module name is used as-is.
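
The naming rule can be sketched as a small function. resolve_module_name() is a hypothetical helper that only computes the name and does not load anything; that the leading '+' itself is dropped from the result is an assumption, and the module names in the usage lines are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: computes the full module name as described above.
sub resolve_module_name {
    my ($name, $prefix) = @_;
    # A leading '+' means "use as-is"; only the marker is stripped (assumed).
    return $name if $name =~ s/^\+//;
    return "Gungho::${prefix}::${name}";
}

print resolve_module_name('POE', 'Engine'), "\n";          # Gungho::Engine::POE
print resolve_module_name('+My::Engine', 'Engine'), "\n";  # My::Engine
```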

CODE

You can obtain the current code base from

  http://gungho-crawler.googlecode.com/svn/trunk

AUTHOR

Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>

All rights reserved.

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html