Gungho - High-Performance Web Crawler Framework
    use Gungho;
    Gungho->run($config);
Gungho is a high-performance web crawler framework. It is developed to perform fast HTTP processing while keeping a flexible structure that is easy to extend.
Gungho is currently in beta. Its features and specification are becoming relatively stable, but internal APIs may still change significantly, so please take care.
Installing Gungho automatically gives you the following features:
Gungho crawls using asynchronous engines based on POE, Danga::Socket, IO::Async, and others. Choose the engine that fits your needs.
Not only HTTP communication but also DNS lookups can be performed asynchronously. Gungho can continue with other work, without blocking, while DNS resolution is in progress.
Every crawler should handle robots.txt correctly and avoid accessing disallowed URLs. Gungho performs this relatively tedious robots.txt handling automatically. When used together with memcached, it also works in a distributed environment.
Robot directives are control statements for robots embedded in HTML META tags. Gungho parses these directives automatically and makes them available to the user.
Bringing down a crawl target by placing excessive load on it defeats the purpose. With the throttling modules, Gungho can limit the number of requests it sends.
If the DNS of a crawled site is misconfigured, or if such URLs have been embedded deliberately, requests can end up pointing at IP addresses on your own internal network and cause a DoS. Gungho monitors for this security risk.
If you want a cache like the Catalyst cache, simply using the Cache component lets you work with the cache from within your program.
Web::Scraper can be used easily from within Gungho (this feature is not yet stable).
With the RequestLog plugin, the URLs being fetched can be logged automatically.
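The internal-network risk mentioned above comes down to checking a resolved address before connecting to it. The following is a minimal sketch of such a check, not Gungho's actual implementation: is_private_ipv4 is a hypothetical helper name, and it only covers IPv4 loopback and the RFC 1918 private ranges.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): returns 1 if the dotted-quad
# IPv4 address is loopback or in an RFC 1918 private range, 0 otherwise.
sub is_private_ipv4 {
    my ($ip) = @_;

    # Anything that is not a dotted quad (for example a hostname) is rejected.
    my @octet = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    return 0 if grep { $_ > 255 } @octet;

    return 1 if $octet[0] == 10;                                        # 10.0.0.0/8
    return 1 if $octet[0] == 127;                                       # 127.0.0.0/8
    return 1 if $octet[0] == 172 && $octet[1] >= 16 && $octet[1] <= 31; # 172.16.0.0/12
    return 1 if $octet[0] == 192 && $octet[1] == 168;                   # 192.168.0.0/16
    return 0;
}

# Example: decide whether a resolved address should be skipped.
for my $ip (qw(192.168.1.5 8.8.8.8)) {
    printf "%s: %s\n", $ip, is_private_ipv4($ip) ? 'skip' : 'fetch';
}
```

A crawler would run a check like this on the address returned by DNS, not on the hostname in the URL, since the hostname itself may look harmless.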
First there were a bunch of scripts that were used to scrape a bunch of RSS feeds. Then I got tired of writing scripts, so I decided a framework was the way to go, and Xango was born.
Xango was my first attempt at harnessing the full power of an event-based framework. It was fast. It wasn't fun to extend. It had a nightmarish way of dealing with robots.txt.
A couple more attempts later, with more inspiration and lessons learned from Catalyst, Plagger, and DBIx::Class, Gungho was born.
Since its inception, Gungho has been used successfully in crawlers that fetch anywhere from hundreds of thousands to a few million URLs per day.
Gungho is designed to handle massive amounts of traffic. If you're careful enough with your Provider and Handler implementation, you can in fact hit millions of URLs with this crawler.
So PLEASE DO NOT LET IT LOOSE. DO NOT OVERLOAD your crawl targets. You are STRONGLY advised to use Gungho::Component::Throttle to throttle your fetches.
Also PLEASE CHANGE THE USER AGENT NAME OF YOUR CRAWLER. If you hit your targets hard with the default name (Gungho/VERSION X.XXXX), it will look as though a service called Gungho is hitting their site, which really isn't the case. Whatever name you choose, please specify at least a simple user agent in your config.
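As a rough illustration of the advice above, a config hash might look like the following. This is a sketch under assumptions: the user_agent key name, the agent string, and the URL are illustrative and not confirmed by this document, so check the Gungho documentation for the exact setting. Throttle settings are left out here.

```perl
use strict;
use warnings;

# Assumption: "user_agent" is the config key Gungho reads for the agent name.
# The agent string and contact URL below are placeholders.
my %config = (
    user_agent => 'MyCrawler/0.1 (+http://example.com/crawler.html)',
    components => [ 'Throttle::Simple' ],   # component name taken from this document
);

# With Gungho installed, the hash would be passed to the crawler like this:
#   Gungho->run(\%config);
```

Including a contact URL in the agent string gives site operators a way to reach you if your crawler causes trouble.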
Gungho is composed of three parts: a Provider, which supplies Gungho with requests to process; a Handler, which handles the fetched pages; and an Engine, which controls the entire process.
There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
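To make the idea of hooks concrete, here is a minimal, self-contained sketch of a hook registry. It is a toy model and not Gungho's code: the My::HookRegistry class, the run_hook method, and the hook name 'request.sent' are all hypothetical, and the real register_hook() may take different arguments.

```perl
use strict;
use warnings;

package My::HookRegistry;   # hypothetical class, for illustration only

sub new { return bless { hooks => {} }, shift }

# Register one or more callbacks, keyed by the point at which they should run.
sub register_hook {
    my ($self, %args) = @_;
    while (my ($name, $code) = each %args) {
        push @{ $self->{hooks}{$name} }, $code;
    }
    return $self;
}

# Run every callback registered for the given point, in registration order.
sub run_hook {
    my ($self, $name, @args) = @_;
    $_->($self, @args) for @{ $self->{hooks}{$name} || [] };
    return;
}

package main;

my $registry = My::HookRegistry->new;
my @seen;

# 'request.sent' is a made-up hook name.
$registry->register_hook('request.sent' => sub {
    my ($r, $url) = @_;
    push @seen, $url;
});

$registry->run_hook('request.sent', 'http://example.com/');
print "hook saw: @seen\n";
```

The point of the pattern is that code anywhere in the application can attach behavior to a named point without the engine knowing about it in advance.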
All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing them, be aware that Gungho uses Class::C3 extensively, so you may see warnings about the code you use.
Gungho is designed to continuously fetch a huge number of URLs. Care is needed if you use Gungho against a single URL or a single host.
When running Gungho in that kind of setting, it is quite likely that you will not get adequate performance, and a module such as LWP::UserAgent might serve you better.
Of course, using Gungho for features that LWP::UserAgent lacks may still be worthwhile, but be aware that tuning will be required.
    ---
    debug: 1
Setting debug to a non-zero value will trigger debug messages to be displayed.
Components add new functionality to Gungho. Components are loaded at startup time from the config file or hash given to the Gungho constructor.
    Gungho->run({
        components => [ 'Throttle::Simple' ],
        throttle   => {
            max_interval => ...,
        },
    });
Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.
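The following sketch shows, in general terms, what modifying an inheritance structure at run time can look like in Perl. It is a toy model and not Gungho's loader: all package names and the load_components method are hypothetical, and it uses the core mro pragma where Gungho itself uses Class::C3.

```perl
use strict;
use warnings;

package My::Crawler::Base;          # hypothetical base class
sub new { return bless { log => [] }, shift }
sub send_request {
    my ($self, $url) = @_;
    return "fetched $url";
}

package My::Component::Logging;     # hypothetical component
use mro 'c3';
sub send_request {
    my ($self, $url) = @_;
    push @{ $self->{log} }, $url;       # extra behavior added by the component
    return $self->next::method($url);   # continue along the C3 method order
}

package My::Crawler;
use mro 'c3';
our @ISA = ('My::Crawler::Base');

# Put components in front of the base class so their methods are found first.
sub load_components {
    my ($class, @components) = @_;
    unshift @ISA, @components;
    return $class;
}

package main;

# Components are loaded before any crawler object starts working.
My::Crawler->load_components('My::Component::Logging');

my $crawler = My::Crawler->new;
print $crawler->send_request('http://example.com/'), "\n";
print scalar @{ $crawler->{log} }, " request(s) logged\n";
```

Because the class hierarchy itself changes, loading a component midway through a run would alter method resolution for objects already in use, which is one way to read the advice to load components only before the engine starts.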
Please refer to each component's documentation for details.
If you want to write a simple crawler, you may want to look at Gungho::Inline:
    Gungho::Inline->run({
        provider => sub { ... },
        handler  => sub { ... },
    });
See the manual for Gungho::Inline for details.
Plugins differ from components: whereas components require the developer to call their methods explicitly, plugins are loaded and not touched afterwards.
Please refer to the documentation of each plugin for details.
Currently available hooks are:
Used for Class::C3::Componentised
The code is managed on Google Code. The repository is kept at the following URL:
http://gungho-crawler.googlecode.com/svn/trunk
Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html