NAME

Gungho - High-Performance Web Crawler Framework

SYNOPSIS

  use Gungho;
  Gungho->run($config);

DESCRIPTION

Gungho is a high-performance web crawler framework. It is developed to perform fast HTTP processing while keeping a flexible structure that is easy to extend.

Gungho is currently in beta. Its features and specification are becoming relatively stable, but internal APIs may still change significantly, so please take care.

Installing Gungho automatically makes the following features available:

Event-Based Asynchronous Engine

Gungho crawls using asynchronous engines based on POE, Danga::Socket, IO::Async, and others. Choose the engine that suits your needs.

Asynchronous DNS Resolution

Since HTTP communication is asynchronous, DNS lookups are of course asynchronous as well. Gungho can carry on with other work without blocking while DNS resolution is in progress.

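Automatic robots.txt Handling

Every crawler should process robots.txt correctly and avoid accessing disallowed URLs. Gungho performs this relatively tedious robots.txt processing automatically. Used together with memcached, it also works in distributed environments.

Robot Directives in META Tags

Robot directives are control statements for robots embedded in HTML META tags. Gungho parses these directives automatically and makes them available to the user.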

Throttling

Putting excessive load on a crawl target and bringing the site down defeats the purpose of crawling it. With the throttling modules, Gungho can limit the number of requests it sends.

Blocking Private IP Addresses

If a crawled site's DNS is misconfigured, or if such URLs have been embedded deliberately, requests may end up directed at IP addresses inside your own internal network and cause a DoS. Gungho watches for this security risk.

Caching

If you want a cache similar to Catalyst's cache, simply using the Cache component makes a cache available from within your program.

Web::Scraper Support

Web::Scraper can be used easily from within Gungho. (This feature is not yet stable.)

Request Log

With the RequestLog plugin, the URLs that are fetched can be logged automatically.

HISTORY

First there were a bunch of scripts that scraped a bunch of RSS feeds. Then I got tired of writing scripts, so I decided a framework was the way to go, and Xango was born.

Xango was my first attempt at harnessing the full power of an event-based framework. It was fast. It wasn't fun to extend. It had a nightmarish way of dealing with robots.txt.

A couple more attempts later, with more inspiration and lessons learned from Catalyst, Plagger, and DBIx::Class, Gungho was born.

Since its inception, Gungho has been used successfully in crawlers that fetch anywhere from hundreds of thousands to a few million URLs per day.

PLEASE READ BEFORE USE

Gungho is designed so that it can handle massive amounts of traffic. If you're careful enough with your Provider and Handler implementation, you can in fact hit millions of URLs with this crawler.

So PLEASE DO NOT LET IT LOOSE. DO NOT OVERLOAD your crawl targets. You are STRONGLY advised to use Gungho::Component::Throttle to throttle your fetches.

Also PLEASE CHANGE THE USER AGENT NAME OF YOUR CRAWLER. If you hit your targets hard with the default name (Gungho/VERSION X.XXXX), it will look as though a service called Gungho is hitting their site, which really isn't the case. Whatever your crawler is, please specify at least a simple user agent in your config.
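
As a rough sketch of this advice, a user agent could be set in the config hash passed to run(). The key name user_agent and the example string are assumptions for illustration and do not come from this document; check the documentation of your installed version for the exact key.

  Gungho->run({
    # Assumed key name; replaces the default "Gungho/X.XXXX" name
    # with one that identifies your own crawler (hypothetical value)
    user_agent => 'ExampleCrawler/0.1',
    # ... engine, provider, handler and component settings go here
  });

Including a contact URL or address in the user agent string is a common courtesy, since it lets site operators reach you if your crawler misbehaves.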

STRUCTURE

Gungho consists of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.
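
A configuration reflecting these three parts might look roughly like the following. This is only a sketch: the module keys and the module names shown (POE, Simple, Null) are assumptions for illustration, so consult the documentation of the engine, provider, and handler you actually intend to use.

  Gungho->run({
    # Engine: controls the entire process (assumed module name)
    engine   => { module => 'POE' },
    # Provider: supplies the requests to process (assumed module name)
    provider => { module => 'Simple' },
    # Handler: receives each fetched page (assumed module name)
    handler  => { module => 'Null' },
  });

Keeping the three parts as separate config entries is what makes each of them switchable without touching the other two.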

There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
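
A hook registration might look like the sketch below. It assumes that register_hook() takes a hook name followed by a code reference, and that $c is the Gungho context available inside your component or plugin; both are assumptions, and the arguments passed to the callback are not described here, so the callback ignores them.

  # Hypothetical: emit a warning each time the engine is about to
  # send a request ('engine.send_request' is listed under HOOKS)
  $c->register_hook(
    'engine.send_request' => sub {
      warn "about to send a request\n";
    }
  );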

All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing them, be aware that Gungho uses Class::C3 extensively, so you may see warnings about the code you use.

HOW NOT TO USE GUNGHO

Gungho is designed to fetch a huge number of URLs on a continuous basis. Take care if you intend to use Gungho for only a handful of URLs or against only a handful of hosts.

When Gungho is run in such an environment, it is quite likely that it will not deliver adequate performance, and you may be better off using a module such as LWP::UserAgent.

Of course, it may still make sense to use Gungho for features that LWP::UserAgent does not have, but be aware that tuning will be required.

GLOBAL CONFIGURATION OPTIONS

debug
   ---
   debug: 1

Setting debug to a non-zero value will trigger debug messages to be displayed.

COMPONENTS

Components add new functionality to Gungho. They are loaded at startup time from the config file or hash given to the Gungho constructor.

  Gungho->run({
    components => [
      'Throttle::Simple'
    ],
    throttle => {
      max_interval => ...,
    }
  });

Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.

Please refer to each component's documentation for details.

Gungho::Component::Authentication::Basic
Gungho::Component::BlockPrivateIP
Gungho::Component::Cache
Gungho::Component::RobotRules
Gungho::Component::RobotsMETA
Gungho::Component::Scraper
Gungho::Component::Throttle::Domain
Gungho::Component::Throttle::Simple

INLINE

If you're looking to write simple crawlers, you may want to look at Gungho::Inline:

  Gungho::Inline->run({
    provider => sub { ... },
    handler  => sub { ... }
  });

See the manual for Gungho::Inline for details.

PLUGINS

Plugins differ from components: whereas components require the developer to call their methods explicitly, plugins are loaded and not touched afterwards.

Please refer to the documentation of each plugin for details.

RequestLog
Statistics

HOOKS

Currently available hooks are:

engine.send_request

engine.handle_response

METHODS

component_base_class

Used for Class::C3::Componentised

CODE

The code is hosted on Google Code. The repository is available at the following URL:

  http://gungho-crawler.googlecode.com/svn/trunk

AUTHOR

Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>

CONTRIBUTORS

Kazuho Oku
Keiichi Okabe

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
