NAME

WWW::Crawler::Lite - A single-threaded crawler/spider for the web.

SYNOPSIS

  use WWW::Crawler::Lite;

  my %pages = ( );
  my $pattern = 'https?://example\.com\/';
  my %links = ( );
  my $downloaded = 0;

  my $crawler;
  $crawler = WWW::Crawler::Lite->new(
    agent       => 'MySuperBot/1.0',
    url_pattern => $pattern,
    http_accept => [qw( text/plain text/html )],
    on_response => sub {
      my ($url, $res) = @_;
      
      warn "$url contains " . $res->content;
      $downloaded++;
      $crawler->stop() if $downloaded > 5;
    },
    on_link     => sub {
      my ($from, $to, $text) = @_;
      
      return if exists($pages{$to}) && $pages{$to} eq 'BAD';
      $pages{$to}++;
      $links{$to} ||= [ ];
      push @{$links{$to}}, { from => $from, text => $text };
    },
    on_bad_url => sub {
      my ($url) = @_;
      
      # Mark this url as 'bad':
      $pages{$url} = 'BAD';
    }
  );
  $crawler->crawl( url => "http://example.com/" );

  warn "DONE!!!!!";

  use Data::Dumper;
  foreach my $url ( sort keys %links ) {
    warn "$url ($pages{$url} incoming links) -> " . Dumper( $links{$url} );
  }

DESCRIPTION

WWW::Crawler::Lite is a single-threaded spider/crawler for the web. It can be used within a mod_perl, CGI or Catalyst-style environment because it does not fork or use threads.

The callback-based interface is fast and simple, letting you focus on processing the data that WWW::Crawler::Lite extracts from the target website.

PUBLIC METHODS

new( %args )

Creates and returns a new WWW::Crawler::Lite object.

The %args hash is not required, but may contain the following elements:

agent - String

Used as the user-agent string for HTTP requests.

Default Value: WWW-Crawler-Lite/$VERSION $^O

url_pattern - RegExp or String

New links that do not match this pattern will not be added to the processing queue.

Default Value: https?://.+
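
Since url_pattern accepts either a regex or a string, one way to keep a crawl on a single domain is to pass a precompiled pattern. A minimal sketch (the example.com hostname is illustrative):

  use WWW::Crawler::Lite;

  my $crawler = WWW::Crawler::Lite->new(
    url_pattern => qr{^https?://(?:www\.)?example\.com/},
    on_response => sub { },   # process each page here
  );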

http_accept - ArrayRef

This can be used to filter out unwanted responses.

Default Value: [qw( text/html text/plain text/xhtml )]
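
For example, to accept only HTML responses (a minimal sketch; every other argument is left at its default):

  my $crawler = WWW::Crawler::Lite->new(
    http_accept => [qw( text/html )],
  );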

on_response($url, $response) - CodeRef

Called whenever a successful response is returned.
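
As a sketch, a handler that simply reports the size of each page (using the same ->content call shown in the SYNOPSIS) might look like this:

  my $crawler = WWW::Crawler::Lite->new(
    on_response => sub {
      my ($url, $response) = @_;

      # Report how much data was fetched from each page:
      printf "%s: %d bytes\n", $url, length( $response->content );
    },
  );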

on_link($from, $to, $text) - CodeRef

Called whenever a new link is found. Arguments are:

$from

The URL that is linked *from*

$to

The URL that is linked *to*

$text

The anchor text (i.e. the HTML within the link: <a href="...">This Text Here</a>)

on_bad_url($url) - CodeRef

Called whenever an unsuccessful response is received.

delay_seconds - Number

Indicates the length of time (in seconds) that the crawler should pause before making each request. This can be useful when you want to spider a website, not launch a denial of service attack on it.
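
For example, a polite crawl that waits two seconds between requests (the bot name and URL below are illustrative) could be set up like this:

  use WWW::Crawler::Lite;

  my $crawler = WWW::Crawler::Lite->new(
    agent         => 'MyPoliteBot/1.0',
    delay_seconds => 2,
    on_response   => sub {
      my ($url, $response) = @_;
      print "Fetched $url\n";
    },
  );
  $crawler->crawl( url => 'http://example.com/' );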

stop( )

Causes the crawler to stop processing its queue of URLs.

AUTHOR

John Drago <jdrago_999@yahoo.com>

COPYRIGHT

This software is free software and may be used and redistributed under the same terms as Perl itself.