
Bryce Harrington


WWW::PkgFind - Spiders given URL(s) mirroring wanted files and triggering post-processing (e.g. tests) against them.


    use WWW::PkgFind;

    my $Pkg = WWW::PkgFind->new("my_package");

    $Pkg->depth(3);
    $Pkg->active_urls("ftp://ftp.somesite.com/pub/joe/foobar/");
    # Single quotes keep the backslashes in the patterns intact.
    $Pkg->wanted_regex('patch-2\.6\..*gz', 'linux-2\.6\.\d+\.tar\.bz2');
    $Pkg->set_create_queue("/testing/packages/QUEUE");
    $Pkg->retrieve();


This module provides a way to mirror new packages on the web and trigger post-processing operations against them. You point it at one or more URLs; it scans them for links matching (or not matching) the given patterns and downloads those files to a given location. Newly downloaded files are also listed in a queue so that other programs can perform post-processing operations on them, such as queuing test runs.


new([$pkg_name], [$agent_desc])

Creates a new WWW::PkgFind object, initializing all data members.

pkg_name is an optional argument that specifies the name of the package. WWW::PkgFind will place the files it downloads into a directory of this name. If not defined, it defaults to "unnamed_package".

agent_desc is an optional parameter to be appended to the user agent string that WWW::PkgFind uses when accessing remote websites.


Gets or sets the package name. When a file is downloaded, it will be placed into a sub-directory by this name.


Gets or sets the depth to spider below URLs. Set to 0 if only the specified URL should be scanned for new packages. Defaults to 5.

A typical use for this would be if you are watching a site where new patches are posted, and the patches are organized by the version of software they apply to, such as ".../linux/linux-2.6.17/*.dif".
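
The effect of the depth setting can be illustrated with a small stand-alone sketch. It does not use WWW::PkgFind or the network: the %links table and the scan() helper are hypothetical, and the way the module itself counts levels is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page maps to the links found on it.
my %links = (
    '/linux/'              => ['/linux/linux-2.6.16/', '/linux/linux-2.6.17/'],
    '/linux/linux-2.6.16/' => ['/linux/linux-2.6.16/a.dif'],
    '/linux/linux-2.6.17/' => ['/linux/linux-2.6.17/b.dif'],
);

# Collect file links reachable from $url, descending at most $depth levels.
sub scan {
    my ($url, $depth) = @_;
    my @found;
    for my $link (@{ $links{$url} || [] }) {
        if ($link =~ m{/$}) {
            # A directory: only descend while some depth budget remains.
            push @found, scan($link, $depth - 1) if $depth > 0;
        }
        else {
            push @found, $link;
        }
    }
    return @found;
}

my @shallow = scan('/linux/', 0);   # only the start page is scanned
my @deep    = scan('/linux/', 1);   # one level of subdirectories as well

print "depth 0: @shallow\n";
print "depth 1: @deep\n";
```

With a depth of 0 the per-version subdirectories are never visited, so the patches inside them are not seen; a depth of 1 is enough to reach them in this layout.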

wanted_regex($regex1, [$regex2, ...])

Gets or adds a regular expression to control what is downloaded from a page. For instance, a project might post source tarballs, binary tarballs, zip files, rpms, etc., but you may only be interested in the source tarballs. You might specify this by calling

    $self->wanted_regex('^.*\.tar\.gz$', '^.*\.tgz$');

By default, all files linked on the active URLs will be retrieved (including html and txt files).

You can call this function multiple times to add additional regexes.

The return value is the current array of regexes.
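
As a rough illustration of what such patterns select, the following stand-alone sketch applies the two patterns above to a hypothetical list of link names. It only mimics the matching step with plain Perl; it is not the module's own code.

```perl
use strict;
use warnings;

# Single-quoted so that the backslashes reach the regex engine unchanged.
my @wanted = ('^.*\.tar\.gz$', '^.*\.tgz$');

# Hypothetical link names found on a project's download page.
my @links = qw(foo-1.0.tar.gz foo-1.0.zip foo-1.0.tgz foo-1.0.rpm index.html);

# Keep a link if it matches at least one wanted pattern.
my @selected = grep {
    my $file = $_;
    grep { $file =~ /$_/ } @wanted;
} @links;

print "$_\n" for @selected;
```

Only the two source tarballs are kept; the zip, rpm and html links match neither pattern.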


Gets or adds a regular expression to control what is downloaded from a page. Unlike wanted_regex, this specifies what you do *not* want. These regexes are applied after the wanted regexes, which lets you fine-tune the selection.

A typical use of this might be to limit the range of release versions you're interested in, or to exclude certain packages (such as pre-release versions).

You can call this function multiple times to add additional regexes.

The return value is the current array of regexes.


Sets or gets the list of mirrors to use for the package. This causes the URL to be modified to include the mirror name prior to retrieval. The mirror used will be selected randomly from the list of mirrors provided.

This is designed for use with SourceForge's file mirror system, allowing WWW::PkgFind to watch a project's file download area on prdownloads.sourceforge.net and retrieve files through the mirrors.

You can call this function multiple times to add additional mirrors.


Gets or sets the URL template to use when fetching from a mirror system like SourceForge's. The strings "MIRROR" and "FILENAME" in the URL will be substituted appropriately when retrieve() is called.
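
The substitution described above might look roughly like the following sketch. The mirror names, the example.org template and the mirror_url_for() helper are all hypothetical; only the "MIRROR" and "FILENAME" placeholders and the random choice of mirror come from the description.

```perl
use strict;
use warnings;

# Hypothetical mirror names and URL template.
my @mirrors  = qw(easynews umn);
my $template = 'http://MIRROR.example.org/pub/FILENAME';

# Pick a mirror at random and fill in both placeholders.
sub mirror_url_for {
    my ($tmpl, $mirror_list, $filename) = @_;
    my $mirror = $mirror_list->[ int rand @$mirror_list ];
    (my $url = $tmpl) =~ s/MIRROR/$mirror/g;
    $url =~ s/FILENAME/$filename/g;
    return $url;
}

my $url = mirror_url_for($template, \@mirrors, 'foobar-1.0.tar.gz');
print "$url\n";
```

Because the mirror is chosen at random, repeated runs may print either of the two possible URLs.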

active_urls([$url1], [$url2], ...)

Gets or adds URLs to be scanned for new file releases.

You can call this function multiple times to add additional URLs.


Returns a list of the files that were found at the active URLs and that survived the wanted_regex and not_wanted_regex patterns. This is for informational purposes only.


Returns true if retrieve() has been called.


Specifies that the retrieve() routine should also create a symlink queue in the specified directory.


Turns on debug level. Set to 0 or undef to turn off.


Checks a file against the regular expressions in the Pkg hash. Returns 1 (true) if the file matches at least one wanted regex and none of the not_wanted regexes. If the file matches a not_wanted regex, it returns 0 (false). If it has no clue what the file is, it returns undef (false).
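
The three possible results can be sketched as follows. This is a minimal stand-alone approximation, not the module's implementation: want_file_sketch() and its argument list are hypothetical, and how the real function treats an empty wanted list is not covered.

```perl
use strict;
use warnings;

# Returns 0 on a not-wanted match, 1 on a wanted match, undef otherwise.
sub want_file_sketch {
    my ($file, $wanted, $not_wanted) = @_;

    # A not-wanted match always vetoes the file.
    for my $re (@$not_wanted) {
        return 0 if $file =~ /$re/;
    }
    # Otherwise a single wanted match is enough to accept it.
    for my $re (@$wanted) {
        return 1 if $file =~ /$re/;
    }
    # Neither list recognised the file.
    return;
}

my @wanted     = ('\.tar\.gz$');
my @not_wanted = ('-rc\d+');

my $release = want_file_sketch('foo-1.0.tar.gz',     \@wanted, \@not_wanted);
my $prerel  = want_file_sketch('foo-1.0-rc1.tar.gz', \@wanted, \@not_wanted);
my $unknown = want_file_sketch('README.txt',         \@wanted, \@not_wanted);

printf "%s\n", defined $_ ? $_ : 'undef' for $release, $prerel, $unknown;
```

Checking the not-wanted list first gives the same answers as applying it after the wanted list, since either way a not-wanted match rejects the file.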

get_file($url, $dest)

Retrieves the given URL, returning true if the file was successfully obtained and placed at $dest, false if something prevented this from happening.

get_file also checks for and respects robot rules, updating the $rules object as needed, and caching the URLs it has checked in %robot_urls. $robot_urls{$url} will be >0 if a robots.txt was found and parsed, <0 if no robots.txt was found, and undef if the URL has not yet been checked.
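
The three states of the cache can be modelled with a plain hash, as in the sketch below. It performs no network access and does not parse any robots.txt; record_robots_check() and robots_status() are hypothetical helpers, and the exact positive and negative values stored by the module are an assumption.

```perl
use strict;
use warnings;

my %robot_urls;

# Hypothetical stand-in for the result of looking for a robots.txt.
sub record_robots_check {
    my ($url, $found) = @_;
    $robot_urls{$url} = $found ? 1 : -1;
    return;
}

# Translate the cached value into one of the three documented states.
sub robots_status {
    my ($url) = @_;
    my $state = $robot_urls{$url};
    return 'unchecked' unless defined $state;
    return $state > 0 ? 'robots.txt parsed' : 'no robots.txt';
}

record_robots_check('http://a.example.org/', 1);
record_robots_check('http://b.example.org/', 0);

print robots_status($_), "\n"
    for 'http://a.example.org/', 'http://b.example.org/', 'http://c.example.org/';
```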


retrieve($destination)

This function performs the actual scanning and retrieval of packages. Call this once you've configured everything. The required parameter $destination specifies where on the local filesystem files should be stored. retrieve() will create a subdirectory for the package name under this location if it doesn't already exist.

The function will obey robot rules by checking for a robots.txt file, and can be made to navigate a mirror system like SourceForge (see mirrors() above).

If configured, it will also create a symbolic link to the newly downloaded file(s) in the directory specified by the set_create_queue() function.
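
What a symlink queue amounts to can be sketched with core modules only. The directory layout, file name and link naming below are assumptions made for illustration (the module may name its queue entries differently), and the example needs a filesystem that supports symbolic links.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# Throwaway directories standing in for the download area and the queue.
my $root  = tempdir(CLEANUP => 1);
my $dest  = File::Spec->catdir($root, 'my_package');
my $queue = File::Spec->catdir($root, 'QUEUE');
for my $dir ($dest, $queue) {
    mkdir $dir or die "mkdir $dir: $!";
}

# Pretend this file has just been downloaded.
my $file = File::Spec->catfile($dest, 'foobar-1.0.tar.gz');
open my $fh, '>', $file or die "open $file: $!";
print {$fh} "placeholder contents\n";
close $fh or die "close $file: $!";

# Queue it by linking to it from the queue directory.
my $link = File::Spec->catfile($queue, basename($file));
symlink $file, $link or die "symlink $link: $!";

print "queued: $link -> ", readlink($link), "\n";
```

A post-processing program can then watch the queue directory, follow each link to the downloaded file, and remove the link once it has handled it.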


Bryce Harrington <bryce@osdl.org>


Copyright (C) 2006 Bryce Harrington. All Rights Reserved.

This script is free software; you can redistribute it and/or modify it under the same terms as Perl itself.