
NAME

Xango::Broker - Broker HTTP Requests

SYNOPSIS

  use Xango::Broker;
  use MyHandler;
  MyHandler->spawn();
  Xango::Broker->spawn();
  POE::Kernel->run();

  # or, from the command line:
  xango -h MyHandler

DESCRIPTION

Xango is a generic web crawler framework written using POE (http://poe.perl.org), a cooperative multitasking framework.

Xango::Broker is Xango's main POE component, but it doesn't do much by itself. Instead, you need to write a handler that does all the application-specific work, which is where most of the interesting bits happen.

Xango::Broker is mainly responsible for three things: (1) setting up the general environment, (2) providing the processing pipeline for the most common crawler behavior, and (3) handling the HTTP fetches as well as their states. Your handler will be part of (2) above, as the component that is responsible for the following things:

Provide the data to fetch

You need to tell Xango::Broker what to fetch :)

Handle the HTTP response.

...And you need to process the response that you get after Xango::Broker fetches the requested URI.

Please see the section HANDLER API for more details.

CONFIGURATION VARIABLES

Configuration variables are written in YAML format. Please see the documentation for YAML for more information on how to write the configuration file.

If your custom web crawler requires more configuration parameters, you can safely specify them in the same config file, as long as they do not clash with an existing parameter name that is required by Xango::Broker.

To use these configuration variables, you need to use Xango::Config:

  use Xango::Config qw(filename.conf);
  # or
  Xango::Config->init('filename.conf');

or you can pass the file name to Xango::Broker's spawn() method:

  Xango::Broker->spawn(conf => 'filename.conf');

Once initialized, you may refer to the same Xango::Config instance from anywhere in your code. Please see Xango::Config for more details.

HttpComponentClass (string)

    Class name of the POE component that handles HTTP communication. You may specify any class, as long as its interface matches that of POE::Component::Client::HTTP.

    Defaults to 'POE::Component::Client::HTTP'

HttpComponentArgs (list or hash)

    Arguments that are passed to the spawn() method of the HTTP component class. You almost always want to specify the 'Timeout' parameter if you're using POE::Component::Client::HTTP (or the like).

    Note that you may not specify the 'Alias' parameter, because it is used internally by Xango::Broker. If you specify it, it will be silently ignored.

DnsCacheClass (string)

    Xango internally caches DNS lookup results to avoid the overhead of having to query for IP address. This configuration variable specifies the class name of the cache object to hold DNS query results. Defaults to Cache::FileCache.

DnsCacheArgs (hash)

    Arguments to pass to the cache constructor. You must provide this if you are using anything other than Cache::FileCache as your cache class.

DnsCacheArgsDeref (boolean)

    If set to true, the value specified for DnsCacheArgs is dereferenced before it is passed to the constructor. Use this if you want to use a cache engine whose constructor requires a plain (non-reference) parameter list.

MaxHttpAgents (integer)

    The number of concurrent HTTP agents (i.e. the number of POE::Component::Client::HTTP sessions) that are allowed. The default is 10, but for anything other than a toy application, a value on the order of 50 to 100 is recommended.

    Unless this number is less than 10, the broker starts with 10 sessions, and successively grows the pool of agents when there are not enough agents to handle the currently available jobs, until the maximum is reached.

    If the max is less than 10, the starting number is equal to the max.
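
The starting pool size described above amounts to taking the smaller of 10 and MaxHttpAgents. A minimal sketch of that rule follows; initial_agent_count() is a hypothetical helper written for illustration and is not part of Xango:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of Xango): how many agents the broker
# would start with for a given MaxHttpAgents value, per the rule above.
sub initial_agent_count {
    my ($max_http_agents) = @_;
    return min(10, $max_http_agents);
}

# Print the starting pool size for a few sample settings.
printf "MaxHttpAgents=%d starts with %d agents\n", $_, initial_agent_count($_)
    for (5, 10, 50);
```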

MaxSilenceInterval (integer)

    The number of seconds that an agent is allowed to be inactive. Once a fetcher session has been inactive for this long, the session is stopped via detach_child(). The default is 300 seconds.

JobRetrievalDelay (integer)

    The number of seconds to wait between calls to the 'retrieve_jobs' state of the handler session. The default is 15 seconds.

ReloadConfig (integer)

    The number of seconds to wait before reloading configuration parameters from the config file. If set to 0, reload is disabled.
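
As a rough illustration, a configuration file using the variables above might look like the following. All values are examples rather than recommendations. The Cache::FileCache arguments shown are assumptions about a typical setup, and MyCrawlerUserAgent is a hypothetical application-specific parameter, not something Xango::Broker reads:

```yaml
# filename.conf (illustrative values only)
HttpComponentClass: 'POE::Component::Client::HTTP'
HttpComponentArgs:
  Timeout: 60
DnsCacheClass: 'Cache::FileCache'
DnsCacheArgs:
  namespace: xango_dns
  default_expires_in: 3600
MaxHttpAgents: 50
MaxSilenceInterval: 300
JobRetrievalDelay: 15
ReloadConfig: 600

# Application-specific parameter (hypothetical name, chosen so that it
# does not clash with the names used by Xango::Broker)
MyCrawlerUserAgent: 'MyCrawler/0.1'
```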

HANDLER API

The handler, which is where your application-specific logic goes, must implement the events listed below.

Note that the handler session must be aliased as 'handler'. Don't forget to put something like this in your handler session's _start() method so that the alias is set properly:

  sub _start
  {
     my($kernel) = @_[KERNEL];
     $kernel->alias_set('handler');
  }

  sub _stop
  {
     my($kernel) = @_[KERNEL];
     $kernel->alias_remove('handler');
  }

Below are the states that are recognized in the handler session. Those states with a (*) next to them are mandatory:

load_config

    This state is called whenever the configuration is (re)loaded from a file. Use this state to refresh variables that are specific to the handler.

retrieve_jobs (*)

    This state is responsible for retrieving the jobs to be processed by Xango from wherever you decide to store your original data (RDBMS, file system, manual user input, etc).

    It should return a list of hashrefs, each of which must contain at least an element named 'uri'. You may add any other elements, except 'id', 'fetcher', 'host_ip', and 'host_name', which are used internally by Xango. (However, you are welcome to use these values as read-only variables).

      sub retrieve_jobs
      {
         # get_next_uri(), $my_var and $my_other_var stand in for your own code
         my @jobs_to_be_processed;
         while (my $uri = get_next_uri()) {
            push @jobs_to_be_processed, {
                uri => $uri,
                my_var => $my_var,
                my_other_var => $my_other_var
            };
         }
         return @jobs_to_be_processed;
      }

    This state is called as a synchronous call via POE::Kernel->call(), so don't take forever to get the jobs to be processed!

apply_policy (*)

    This state receives a job hash and is supposed to decide whether that particular job should be processed at all. Use this to apply policy rules (such as blacklists) at the broker level. NOTE: if at all possible, do this at the storage level, such as in an RDBMS server's stored procedure, as complicated policies will probably slow the broker down significantly.

    At the very least, if you are not applying any policies, write a stub pass-through state like below so that you just call the next state in the processing chain:

      sub apply_policy
      {
         # the job hash arrives in ARG0
         my($kernel, $job) = @_[KERNEL, ARG0];
         $kernel->post('broker', 'send_fetcher', $job);
      }

    Note that you *have* to post to 'send_fetcher' in order for the job to be processed at all. If you do not wish to process the job, post to the broker session's 'finalize_job' state instead:

      sub apply_policy
      {
         my($kernel, $job) = @_[KERNEL, ARG0];
         if ( $DONT_PROCESS ) {
            $kernel->post('broker', 'finalize_job', $job);
         } else {
            $kernel->post('broker', 'send_fetcher', $job);
         }
      }

    The job hash will be available in ARG0
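
One way to keep the policy decision easy to test is to put it in a plain function and have the apply_policy state act as a thin wrapper around it. The sketch below assumes a simple host blacklist. %BLOCKED_HOST and should_process() are hypothetical names, and the regex-based host extraction is a simplification (the URI module would be more robust):

```perl
use strict;
use warnings;

# Hypothetical blacklist of host names. In a real crawler this might be
# loaded from the config file or from a database.
my %BLOCKED_HOST = map { $_ => 1 } qw(ads.example.com tracker.example.net);

# Returns 1 if the job should be fetched, 0 if it should be skipped.
sub should_process {
    my ($job) = @_;

    # Crude host extraction: scheme, then everything up to the first
    # '/', ':', '?' or '#'. Anything that is not http(s) is skipped.
    my ($host) = $job->{uri} =~ m{^https?://([^/:?#]+)}i
        or return 0;

    return $BLOCKED_HOST{ lc $host } ? 0 : 1;
}

# Inside the handler session (which needs "use POE;"), the state could
# then be written as:
#
#   sub apply_policy
#   {
#      my($kernel, $job) = @_[KERNEL, ARG0];
#      my $next = should_process($job) ? 'send_fetcher' : 'finalize_job';
#      $kernel->post('broker', $next, $job);
#   }
```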

prep_request

    This state is called right before the request is sent, giving you a chance to muck with the HTTP request.

    The job hash will be available in ARG0, the HTTP::Request object will be available in ARG1
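
For example, a prep_request state might set extra headers on the outgoing request. The sketch below assumes that modifying the HTTP::Request object in place is enough for the change to take effect. decorate_request() and the 'referer' job element are hypothetical, introduced only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: adjust the outgoing request for a job.
# $request is expected to behave like an HTTP::Request object.
sub decorate_request {
    my ($job, $request) = @_;

    # Identify the crawler (example value).
    $request->header('User-Agent' => 'MyCrawler/0.1 (+http://example.com/bot)');

    # 'referer' is a hypothetical extra element that retrieve_jobs
    # could have stored in the job hash.
    $request->header('Referer' => $job->{referer})
        if defined $job->{referer};

    return $request;
}

# Inside the handler session (which needs "use POE;"):
#
#   sub prep_request
#   {
#      my($job, $request) = @_[ARG0, ARG1];
#      decorate_request($job, $request);
#   }
```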

handle_response (*)

    As the name states, this state should handle the job, after the job's URI has been fetched. The HTTP::Response object is stored under the 'http_response' slot in the job, and you are free to do whatever you want with it -- because Xango doesn't do anything else with that job after this state.

    It is up to you to cook this piece of data, and store the results somewhere (or, discard them).

    The job hash will be available in ARG0
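
As a minimal sketch of what "cooking" the data could look like, the function below reduces a finished job to a small summary hash. summarize_job() and the keys of its return value are hypothetical. The only things taken from the description above are the 'uri' and 'http_response' elements of the job, and the methods called on the response are standard HTTP::Response methods:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a fetched job into a small summary record.
sub summarize_job {
    my ($job) = @_;
    my $response = $job->{http_response};

    # Be defensive in case no response object was stored in the job.
    return { uri => "$job->{uri}", ok => 0, status => undef, length => 0 }
        unless $response;

    my $ok = $response->is_success ? 1 : 0;
    return {
        uri    => "$job->{uri}",
        ok     => $ok,
        status => $response->code,
        # decoded_content() may return undef if decoding fails.
        length => $ok ? length($response->decoded_content // '') : 0,
    };
}

# Inside the handler session (which needs "use POE;"):
#
#   sub handle_response
#   {
#      my($job) = @_[ARG0];
#      my $summary = summarize_job($job);
#      # ... store $summary somewhere, or discard it ...
#   }
```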

finalize_job

    This is sort of like a destructor for the job. The broker does its own cleanup, and then sends the job to the handler's 'finalize_job' state so that application-specific cleanup can be performed.

    The job hash will be available in ARG0

TODO

Tests

We need tests...

How-To Docs

Documentation on how to implement a toy crawler is necessary

BUGS

Plenty, I'm sure. Please report bugs via RT http://rt.cpan.org/NoAuth/Bugs.html?Dist=Xango

SEE ALSO

POE

AUTHOR

Copyright 2005 Daisuke Maki <dmaki@cpan.org>. All rights reserved. Development funded by Brazil, Ltd. <http://b.razil.jp>
