Xango::Manual::Intro - Learn How To Write Crawlers With Xango
There are two types of Xango::Broker -- the "pull" and "push" types. The "pull" crawler is implemented as Xango::Broker::Pull, and the "push" is implemented as Xango::Broker::Push.
Xango::Broker::Pull crawler periodically "pulls" the jobs that needs to be processed. Xango::Broker::Push crawler waits for jobs to be pushed to the queue.
You should decide on the type of crawler to use depending on the application that you want to build. If you have a constant pool of jobs that need to be processed, then usually the pull model is better. However, if your crawler's processing is triggered by an event (for example, a user submits a form that contains the URI to crawl) then the push crawler is the way to go.
A "pull" crawler goes into a loop that periodically checks for new jobs to process from a source. If you have a crawler that should be crawling a possibly infinite list of jobs, then this is probably the type of crawler that you want to use.
In this scenario, the handler component needs to implement the following states:
The retrieve_jobs state is called periodically to fetch new jobs to be processed. This is probably where you would be accessing a database or such.
In this state, just return a list of jobs:
package MyHandler; sub retrieve_jobs { # of course, you probably wouldn't be connecting to a database every # time retrieve_jobs is called in reality (too much overhead) my $dbh = DBI->connect(...); my @jobs; my $sth = $dbh->prepare("SELECT ...."); while ($sth->fetchrow_arrayref) { my $job = Xango::Job->new(uri => $uri, ...); push @jobs, $job; } return @jobs; }
Xango::Broker::Push->spawn( Alias => 'broker', ); # In your code elsewhere... POE::Kernel->post('broker', 'enqueue_job', $job);
For maximum performance, you will most likely need to re-write your HTTP fetcher component (by default POE::Component::Client::HTTP is used). This is due to the fact that, most crawler are daemon that perpetually retrieves data, and this requires maximum tuning from the code-writer to reduce the memory impact from downloading arbitrary content.
For example, http://xango.razil.jp uses a disk-based variation of POE::Component::Client::HTTP that writes data as soon as it can to minimize the memory usage.
For casual users, though, this may not be a problem.
Copyright (c) 2005 Daisuke Maki <dmaki@cpan.org<>
To install Xango, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Xango
CPAN shell
perl -MCPAN -e shell install Xango
For more information on module installation, please visit the detailed CPAN module installation guide.