

WWW::Find - Web Resource Finder


    use LWP::UserAgent;
    use HTTP::Request;
    use WWW::Find;

    $agent = LWP::UserAgent->new;

    $request = HTTP::Request->new(GET => 'http://begin.url');

    $find = WWW::Find->new(
        AGENT      => $agent,
        REQUEST    => $request,
        MAX_DEPTH  => 2,
        MATCH_SUB  => \&match,
        FOLLOW_SUB => \&follow,
    );



HTML::LinkExtor, LWP::UserAgent, HTTP::Request, URI


WWW::Find simplifies the task of searching the web for specific types of information. The inspiration for this project came from the recursive website mirroring program, w3mir. WWW::Find is similar to w3mir, but with a more general feature set.

In a nutshell, a WWW::Find object extracts all the HREF links from an HTML document, creates an HTTP::Request object for each link, matches the HTTP::Response object against user-specified criteria, and then does something with the matching links (possibly performing the entire operation all over again on certain links). Be careful not to set the MAX_DEPTH parameter too high; otherwise you could easily begin the endless task of requesting every page on the net!
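The steps above can be illustrated with a minimal sketch. It is not WWW::Find's own implementation: it uses HTML::LinkExtor directly, replaces network access with a hypothetical in-memory %pages hash, and assumes the starting page counts as depth 1. The names extract_hrefs and crawl, and the example.test URLs, are invented for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical in-memory "web" so the sketch runs without network access.
my %pages = (
    'http://example.test/'       => '<a href="a.html">A</a> <a href="b.txt">B</a>',
    'http://example.test/a.html' => '<a href="c.html">C</a> <img src="x.png">',
    'http://example.test/c.html' => '<a href="d.html">D</a>',
);

# Return the absolute HREF targets of all <a> tags in $html.
sub extract_hrefs {
    my ($html, $base) = @_;
    my @hrefs;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @hrefs, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }, $base);
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# Depth-limited traversal: hand every link to $match, and recurse into
# the links that $follow accepts. Assumption: the start page is depth 1.
sub crawl {
    my ($url, $depth, $max_depth, $match, $follow, $seen) = @_;
    return if $depth > $max_depth || $seen->{$url}++;
    my $html = $pages{$url};
    return unless defined $html;
    for my $link (extract_hrefs($html, $url)) {
        $match->($link);
        crawl($link, $depth + 1, $max_depth, $match, $follow, $seen)
            if $follow->($link);
    }
    return;
}

my @found;
crawl(
    'http://example.test/', 1, 2,
    sub { push @found, $_[0] if $_[0] =~ /html?$/i },   # match callback
    sub { $_[0] =~ /html?$/i },                         # follow callback
    {},
);
print "$_\n" for @found;
```

With a depth limit of 2, the sketch reads the start page and a.html but never reads c.html, which shows how the depth limit bounds the number of pages requested.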

In addition to an LWP::UserAgent and an HTTP::Request object, you'll need to create two subroutines: a &match subroutine and a &follow subroutine.

The &follow subroutine should attempt to match the HTTP::Response object against user-defined criteria. If a match is found, the entire operation is performed all over again on the matching link. For example, the following subroutine matches links whose Content-Type header matches the regular expression /text/.

    sub follow {
        my $find_obj = shift;
        # Send a HEAD request so only the headers are fetched.
        my $header   = HTTP::Request->new(HEAD => $find_obj->{REQUEST}->uri);
        my $response = $find_obj->{AGENT}->request($header);
        return 0 unless $response->is_success;
        return $response->content_type =~ /text/i ? 1 : 0;
    }

The &match subroutine should perform some operation on links matching user-defined criteria. For example, the following subroutine simply prints the URL of every link matching the regular expression /html?$/.

    sub match {
        my $find_obj = shift;
        if ($find_obj->{REQUEST}->uri =~ /html?$/i) {
            print $find_obj->{REQUEST}->uri . "\n";
        }
        return;
    }
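Because a callback such as &match only reads the REQUEST entry of the object it receives, it can be tried out without running a crawl. The sketch below is an illustration under that assumption: it passes a plain hash reference as a stand-in for the real WWW::Find object, and it uses a variant of &match that collects URLs in the hypothetical array @matches instead of printing them as they are found.

```perl
use strict;
use warnings;
use HTTP::Request;

my @matches;

# Variant of &match that records matching URLs instead of printing them.
sub match {
    my $find_obj = shift;
    my $uri = $find_obj->{REQUEST}->uri;
    push @matches, "$uri" if $uri =~ /html?$/i;
    return;
}

# Stand-in for the object WWW::Find would pass in (assumption: this
# callback needs nothing beyond the REQUEST key).
for my $url (
    'http://begin.url/index.html',
    'http://begin.url/logo.png',
    'http://begin.url/page.htm',
) {
    match({ REQUEST => HTTP::Request->new(GET => $url) });
}

print "$_\n" for @matches;
```

Only the index.html and page.htm URLs end up in @matches; the .png URL does not satisfy the pattern.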


HTTP::Request, LWP::UserAgent


Nathaniel Graham, <<gt> is the official home page of WWW::Find


Copyright 2003 by Nathaniel Graham

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.