WWW::Crawl - A simple web crawler for extracting links and more from web pages
This documentation refers to WWW::Crawl version 0.1.
    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();
    my $url     = 'https://example.com';

    my @visited = $crawler->crawl($url, \&process_page);

    sub process_page {
        my $url = shift;
        print "Visited: $url\n";
        # Your processing logic here
    }
The WWW::Crawl module provides a simple web crawling utility for extracting links and other resources from web pages within a single domain. It can be used to recursively explore a website and retrieve URLs, including those found in HTML href attributes, form actions, external JavaScript files, and JavaScript window.open links.
WWW::Crawl will not stray outside the supplied domain.
Creates a new WWW::Crawl object. You can optionally provide the following options as key-value pairs:
agent: The user agent string to use for HTTP requests. Defaults to "Perl-WWW-Crawl-VERSION" where VERSION is the module version.
timestamp: If a timestamp is added to external JavaScript files to ensure the browser loads the latest version, this option prevents multiple copies of the same file being indexed by ignoring the timestamp query parameter.
nolinks: Don't follow links found in the starting page. This option is provided for testing and prevents WWW::Crawl following the links it finds. It also affects the return value of the crawl method.
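A minimal sketch of constructing a crawler with these options; the agent string and option values shown here are illustrative, not defaults:

    use WWW::Crawl;

    # All options are key-value pairs passed to new()
    my $crawler = WWW::Crawl->new(
        agent     => 'MyBot/1.0',   # custom user agent string
        timestamp => 1,             # ignore timestamp query parameters
        nolinks   => 1,             # don't follow links (testing only)
    );
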
Starts crawling from the given URL. The $url parameter specifies the starting point of the crawl.
The optional $callback parameter is a reference to a subroutine that will be called for each visited page. It receives the URL of the visited page as an argument.
The crawl method will explore the provided URL and its linked resources. It will also follow links found in form actions, external JavaScript files, and JavaScript window.open links. The crawling process continues until no more unvisited links are found.
While exploring the website, crawl will ignore links to the following file types: .pdf, .css, .png, .jpg, .svg, and .webmanifest.
Returns an array of the URLs that were parsed during the crawl. If the nolinks option is passed to new, it instead returns an array of the links found on the initial page.
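A sketch of the two return behaviours described above; the URL and the use of wantarray-style list assignment are illustrative:

    use WWW::Crawl;

    # Default: @visited holds every URL parsed across the whole crawl
    my $crawler = WWW::Crawl->new();
    my @visited = $crawler->crawl('https://example.com');

    # With nolinks, crawl stops at the starting page and returns
    # only the links it found there
    my $tester = WWW::Crawl->new(nolinks => 1);
    my @links  = $tester->crawl('https://example.com');
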
Ian Boddison, <bod at cpan.org>
Please report any bugs or feature requests to bug-www-crawl at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-Crawl. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc WWW::Crawl
You can also look for information at:
GitHub
https://github.com/IanBod/WWW-Crawl
RT: CPAN's request tracker (report bugs here)
https://rt.cpan.org/NoAuth/Bugs.html?Dist=WWW-Crawl
Search CPAN
https://metacpan.org/release/WWW-Crawl
This software is Copyright (c) 2023 by Ian Boddison.
This program is released under the following license:
Perl
To install WWW::Crawl, copy and paste the appropriate command into your terminal.
cpanm
cpanm WWW::Crawl
CPAN shell
    perl -MCPAN -e shell
    install WWW::Crawl
For more information on module installation, please visit the detailed CPAN module installation guide.