The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Xango::Manual::Intro - Learn How To Write Crawlers With Xango

DECIDE ON BROKER TYPE

There are two types of Xango::Broker -- the "pull" and "push" types. The "pull" crawler is implemented as Xango::Broker::Pull, and the "push" is implemented as Xango::Broker::Push.

Xango::Broker::Pull crawler periodically "pulls" the jobs that needs to be processed. Xango::Broker::Push crawler waits for jobs to be pushed to the queue.

You should decide on the type of crawler to use depending on the application that you want to build. If you have a constant pool of jobs that need to be processed, then usually the pull model is better. However, if your crawler's processing is triggered by an event (for example, a user submits a form that contains the URI to crawl) then the push crawler is the way to go.

IMPLEMENTING THE PULL CRAWLER

A "pull" crawler goes into a loop that periodically checks for new jobs to process from a source. If you have a crawler that should be crawling a possibly infinite list of jobs, then this is probably the type of crawler that you want to use.

In this scenario, the handler component needs to implement the following states:

apply_policy
retrieve_jobs

The retrieve_jobs state is called periodically to fetch new jobs to be processed. This is probably where you would be accessing a database or such.

In this state, just return a list of jobs:

  package MyHandler;
  sub retrieve_jobs
  {
    # of course, you probably wouldn't be connecting to a database every
    # time retrieve_jobs is called in reality (too much overhead)
    my $dbh = DBI->connect(...);

    my @jobs;
    my $sth = $dbh->prepare("SELECT ....");
    while ($sth->fetchrow_arrayref) {
      my $job = Xango::Job->new(uri => $uri, ...);
      push @jobs, $job;
    }
    return @jobs;
  }
handle_response

IMPLEMENTING THE PUSH CRAWLER

  Xango::Broker::Push->spawn(
    Alias => 'broker',
    
  );

  # In your code elsewhere...
  POE::Kernel->post('broker', 'enqueue_job', $job);
apply_policy
handle_response

PERFORMANCE

CUSTOM HTTP COMPONENT

For maximum performance, you will most likely need to re-write your HTTP fetcher component (by default POE::Component::Client::HTTP is used). This is due to the fact that, most crawler are daemon that perpetually retrieves data, and this requires maximum tuning from the code-writer to reduce the memory impact from downloading arbitrary content.

For example, http://xango.razil.jp uses a disk-based variation of POE::Component::Client::HTTP that writes data as soon as it can to minimize the memory usage.

For casual users, though, this may not be a problem.

AUTHOR

Copyright (c) 2005 Daisuke Maki <dmaki@cpan.org<>