
NAME

WWW::Spyder - a simple non-persistent web crawler.

VERSION 0.19

SYNOPSIS

A web spider that returns plain text, HTML, and other information for each page it crawls, and that can decide which pages to fetch and parse by comparing supplied terms against link text and page content.
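
A minimal usage sketch (the seed URL is a placeholder):

use WWW::Spyder;

my $spyder = WWW::Spyder->new('http://example.com/');   # placeholder seed URL

while ( my $page = $spyder->crawl ) {
    print $page->title, "\n";
}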

METHODS

  • $spyder->new()

    Construct a new spyder object. Without at least a seed() set, or go_to_seed() turned on, the spyder isn't ready to crawl.

    $spyder = WWW::Spyder->new( shift || die "Gimme a URL!\n" );
       # ...or...
    $spyder = WWW::Spyder->new( %options );

    Options include: seed (a starting URL), sleep_base (in seconds), and exit_on (a hash of exit conditions and their settings). Examples below; a combined setup sketch follows this method list.

  • $spyder->seed($url)

    Adds a URL (or URLs) to the top of the queues for crawling. If the spyder is constructed with a single scalar argument, that is considered the seed.

  • $spyder->bell([bool])

    This will print a bell ("\a") to STDERR on every successfully crawled page. It might seem annoying, but it is an excellent way to know your spyder is behaving and working. A true value turns it on; right now it can't be turned off.

  • $spyder->spyder_time([bool])

    Returns raw seconds since the spyder was created if given a true value; otherwise returns "D day(s) HH:MM:SS."

  • $spyder->terms([list of terms to match])

    The more terms, the more the spyder is going to grasp at. If you give a plain list of strings, they will be turned into very loose regexes; e.g., "king" would match "sulking" and "kinglet" but not "King" (matching is case sensitive right now). If you want more precise matching or different behavior, pass your own regexes instead of strings.

    $spyder->terms( qr/\bkings?\b/i, qr/\bqueens?\b/i );

    terms() can only be set once right now; after that it's a done deal.

  • $spyder->spyder_data()

    A comma-formatted number of kilobytes retrieved so far. It's a set/get routine under the hood, so don't give it an argument.

  • $spyder->slept()

    Returns the total number of seconds the spyder has slept while running. Useful for getting accurate pages-per-time figures (spyder performance) that discount the added courtesy naps.

  • $spyder->UA->...

    The underlying LWP::UserAgent. You can adjust its settings by calling methods on the UA. Here are the initialized values you might want to tweak (see LWP::UserAgent for more information):

    $spyder->UA->timeout(30);
    $spyder->UA->max_size(250_000);
    $spyder->UA->agent('Mozilla/5.0');

    Changing the agent name can hurt your spyder because some servers won't return content unless it's requested by a "browser" they recognize.

    You should probably add your email with from() as well.

    $spyder->UA->from('bluefintuna@fish.net');

  • $spyder->cookie_file([local_file])

    Cookies live in $ENV{HOME}/spyderCookie by default, but you can set your own file if you prefer or want to save different cookie files for different spyders.
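
Putting several of the methods above together, here is a minimal setup sketch; the URL, email address, and option values are placeholders, not recommendations:

use WWW::Spyder;

my $spyder = WWW::Spyder->new(
    seed       => 'http://example.com/',            # placeholder starting URL
    sleep_base => 10,                               # courtesy sleep base, in seconds
    exit_on    => { pages => 100, time => '5min' }, # stop on whichever is met first
);

$spyder->terms( qr/\bperl\b/i );        # or plain strings for loose matching
$spyder->bell(1);                       # ring "\a" on each crawled page

# Tweak the underlying LWP::UserAgent and identify yourself.
$spyder->UA->timeout(20);
$spyder->UA->from('you@example.com');   # placeholder address

# Keep cookies somewhere other than the default $ENV{HOME}/spyderCookie.
$spyder->cookie_file("$ENV{HOME}/my-spyder-cookies");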

Weird courteous behavior

Courtesy didn't use to be weird, but that's another story. You will probably notice that the courtesy routines force a sleep when a recently seen domain is the only choice for a new link. The sleep is partially randomized. This is to prevent the spyder from being recognized in weblogs as a robot.

The web and courtesy

Please, I beg of thee, exercise the most courtesy you can. Don't let impatience get in the way. Bandwidth and server traffic are $MONEY for real. The web is an extremely disorganized and corrupted database at the root but companies and individuals pay to keep it available. The less pain you cause by banging away on a webserver with a web agent, the more welcome the next web agent will be.

Update: Google seems to be excluding generic LWP agents now. See, I told you so. A single parallel robot can really hammer a major server, even someone with as big a farm and as much bandwidth as Google.

VERBOSITY

  • $spyder->verbosity([1-6]) -OR-

  • $WWW::Spyder::VERBOSITY = ...

    Set it from 1 to 6 right now to get varying amounts of extra info to STDOUT. It's an uneven scale and will be straightened out pretty soon. If kids have a preference for sending the info to STDERR, I'll do that. I might anyway.
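
For example, either of the following turns on moderately chatty output (the level shown is arbitrary):

$spyder->verbosity(3);
   # ...or...
$WWW::Spyder::VERBOSITY = 3;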

SAMPLE USAGE

See "spyder-mini-bio" in this distribution

It's an extremely simple, but fairly cool pseudo bio-researcher.

Simple continually crawling spyder:

Consider the following code snippet:

use WWW::Spyder;

my $spyder = WWW::Spyder->new( shift || die "Give me a URL!\n" );

while ( my $page = $spyder->crawl ) {

    print '-' x 70, "\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || 'n/a', "\n";
    print '-' x 70, "\n";

    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
            length( $link->name ) > 22
                ? substr( $link->name, 0, 19 ) . '...'
                : $link->name,
            length($link) > 43
                ? substr( $link, 0, 40 ) . '...'
                : $link;
    }
}

As long as unique URLs are being found in the pages crawled, the spyder will never stop.

Each "crawl" returns a page object which gives the following methods to get information about the page.

  • $page->title

    Page's <TITLE> Title </TITLE> if there is one.

  • $page->text

    The parsed plain text out of the page. Uses HTML::Parser and tries to ignore non-readable stuff like comments and scripts.

  • $page->url

    The URL the page was fetched from.

  • $page->domain

    The domain of the page's URL.

  • $page->raw

    The content returned by the server. Should be HTML.

  • $page->description

    The META description of the page if there is one.

  • $page->links

    Returns a list of the URLs in the page. Note: next_link() will shift the available list of links() each time it's called.

  • $link = $page->next_link

    next_link() destructively returns the next URI-ish object in the page. They are objects with three accessors.

    • $link->url

      This is also overloaded so that interpolating "$link" will get the URL just as the method does.

    • $link->name

    • $link->domain
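
For instance, here is a sketch that pulls everything from a page at once instead of iterating with next_link(); the output formatting is only illustrative:

while ( my $page = $spyder->crawl ) {
    my @urls = $page->links;                    # grab all the URLs up front
    print scalar(@urls), " links on ", $page->domain, "\n";
    print length( $page->raw ), " bytes of raw HTML\n";
    print substr( $page->text, 0, 200 ), "\n";  # first bit of the parsed plain text
}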

Spyder that will give up the ghost...

The following spyder is initialized to stop crawling when either of its conditions is met: 10 minutes pass or 300 pages are crawled.

use WWW::Spyder;

my $url = shift || die "Please give me a URL to start!\n";

my $spyder = WWW::Spyder->new(
    seed       => $url,
    sleep_base => 10,
    exit_on    => { pages => 300,
                    time  => '10min' },
);

while ( my $page = $spyder->crawl ) {

    print '-' x 70, "\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || '', "\n";
    print '-' x 70, "\n";

    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
            length( $link->name ) > 22
                ? substr( $link->name, 0, 19 ) . '...'
                : $link->name,
            length($link) > 43
                ? substr( $link, 0, 40 ) . '...'
                : $link;
    }
}

Primitive page reader

use WWW::Spyder;
use Text::Wrap;

my $url = shift || die "Please give me a URL to start!\n";
@ARGV or die "Please also give me a search term.\n";
my $spyder = WWW::Spyder->new;
$spyder->seed($url);
$spyder->terms(@ARGV);

while ( my $page = $spyder->crawl ) {
    print '-' x 70, "\n * ";
    print $page->title, "\n";
    print '-' x 70, "\n";
    print wrap( '', '', $page->text );
    sleep 60;
}

TIPS

If you are going to do anything important with it, implement some signal blocking to prevent accidental problems and tie your gathered information to a DB_File or some such.

You might want to call POSIX::nice(40). It should top the niceness out at your system's maximum and keep your spyder from interfering with your system.

You might want to set $| = 1. A sketch combining these tips follows.
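
A minimal sketch of the above, assuming you want your results in a DB_File tie; the database filename and the signal-handling policy are illustrative, not part of WWW::Spyder:

use strict;
use POSIX ();
use DB_File;
use WWW::Spyder;

POSIX::nice(40);   # be as nice as the system allows
$| = 1;            # unbuffer STDOUT so progress prints immediately

# Keep what you gather on disk rather than only in memory.
tie my %title_for, 'DB_File', 'spyder-results.db'
    or die "Couldn't tie DB_File: $!";

# Finish the current page cleanly instead of dying mid-write.
my $stop = 0;
$SIG{INT} = $SIG{TERM} = sub { $stop = 1 };

my $spyder = WWW::Spyder->new( shift || die "Give me a URL!\n" );

while ( my $page = $spyder->crawl ) {
    last if $stop;
    $title_for{ $page->url } = $page->title;
}

untie %title_for;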

PRIVATE METHODS

are private, but hack away if you're so inclined.

TO DO

Spyder is conceived to live in a future namespace as a servant class for a complex web research agent, with simple interfaces to pre-designed grammars for research reports, or self-designed grammars/reports (might be implemented via Parse::FastDescent if that lazy-bones Conway would just find another 5 hours in the paltry 32-hour day he's presently working).

I'd like the thing to be able to parse RTF, PDF, and perhaps even resource sections of image files but that isn't on the radar right now.

TO DOABLE BY 1.0

Add 2-4 sample scripts that are a bit more useful.

There are many functions that should be under the programmer's control and not buried in the spyder. They will emerge soon. I'd like to put in hooks to allow the user to keep(), toss(), or exclude() URLs, link names, and domains while crawling.

Clean up some redundant, sloppy, and weird code. Probably change or remove the AUTOLOAD.

Put in a go_to_seed() method and a subclass, ::Seed, with rules to construct query URLs by search engine. It would be the autostart or the fallback for perpetual spyders that run out of links. It would hit a given or default search engine with the Spyder's terms as the query. Obviously this would only work with terms() defined.

Implement auto-exclusion for failure vs. success rates on names as well as domains (maybe URI suffixes too).

Turn the length of the courtesy queue into the breadth/depth setting? Make it automatically adjusting...?

Exclude consistently found link names from term-strength sorting? E.g.: "privacy policy," "read more," "copyright..."

Fix some image tag parsing problems and add area tag parsing.

Configuration for user:password by domain.

::Page objects become reusable so that a spyder only needs one.

::Enqueue objects become indexed so they are nixable from anywhere.

Expand exit_on routines to size, slept time, dwindling success ratio, and maybe more.

Make methods to set "skepticism" and "effort" which will influence the way the terms are used to keep, order, and toss URLs.

BE WARNED

This module already does some extremely useful things but it's in its infancy and it is conceived to live in a different namespace and perhaps become more private as a subservient part of a parent class. This may never happen but it's the idea. So don't put this into production code yet. I am endeavoring to keep its interface constant either way. That said, it could change completely.

Also!

This module saves cookies to the user's home. There will be more control over cookies in the future, but that's how it is right now. They live in $ENV{HOME}/spyderCookie.

Anche!

Robot Rules aren't respected. Spyder endeavors to be polite as far as server hits are concerned, but doesn't take "no" for an answer right now. I want to add this, and not just by domain, but by page settings.

UNDOCUMENTED FEATURES

A.k.a. Bugs. Don't be ridiculous! Bugs in my code?!

There is a bug that I think is causing image src attributes to be retrieved as links, but I haven't tracked it down yet. I also think the plain-text parsing has some problems, which will be remedied shortly.

If you are building more than one spyder in the same script, they are going to share the same exit_on parameters because it's a self-installing method. This will not always be so.

See the Bugs file for more open and past issues.

Let me know if you find any others. If you find one that is platform specific, please send a patch or suggestion, because I might not have any idea how to fix it.

WHY Spyder?

I didn't want to use the more appropriate Spider because I think there is a better one out there somewhere in the zeitgeist and the namespace future of Spyder is uncertain. It may end up a semi-private part of a bigger family. And I may be King of Kenya someday. One's got to dream.

If you like Spyder, have feedback, wishlist usage, better algorithms/implementations for any part of it, please let me know!

AUTHOR, AUTHOR

Ashley Pond V, ashley@cpan.org. Bob's your monkey's uncle.

COPYRIGHT

(c)2001-2002 Ashley Pond V. All rights reserved. This program is free software; you may redistribute or modify it under the same terms as Perl.

THANKS TO

Most all y'all. Especially Lincoln Stein, Gisle Aas, The Conway, Raphael Manfredi, Gurusamy Sarathy, and plenty of others.

COMPARE WITH (PROBABLY PREFER)

WWW::Robot, LWP::UserAgent, WWW::SimpleRobot, WWW::RobotRules, LWP::RobotUA, and other kith and kin.
