NAME

WWW::Crawl - A simple web crawler for extracting links and more from web pages

VERSION

This documentation refers to WWW::Crawl version 0.1.

SYNOPSIS

    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();
    
    my $url = 'https://example.com';
    
    my @visited = $crawler->crawl($url, \&process_page);

    sub process_page {
        my $url = shift;
        print "Visited: $url\n";
        # Your processing logic here
    }

DESCRIPTION

The WWW::Crawl module provides a simple web crawling utility for extracting links and other resources from web pages within a single domain. It can be used to recursively explore a website and retrieve URLs, including those found in HTML href attributes, form actions, external JavaScript files, and JavaScript window.open links.

WWW::Crawl will not stray outside the supplied domain.

CONSTRUCTOR

new(%options)

Creates a new WWW::Crawl object. You can optionally provide the following options as key-value pairs:

  • agent: The user agent string to use for HTTP requests. Defaults to "Perl-WWW-Crawl-VERSION", where VERSION is the module version.

  • timestamp: If a timestamp is added to external JavaScript file URLs to ensure the latest version is loaded by the browser, this option prevents multiple copies of the same file from being indexed by ignoring the timestamp query parameter.

  • nolinks: Don't follow links found in the starting page. This option is provided for testing and prevents WWW::Crawl from following the links it finds. It also affects the return value of the crawl method.
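
For example, a crawler might be constructed with all three options (a minimal sketch; the agent string is illustrative, and the timestamp value is assumed here to name the query parameter to ignore):

    use WWW::Crawl;

    my $crawler = WWW::Crawl->new(
        agent     => 'MyCrawler/1.0',   # hypothetical user agent string
        timestamp => 'ts',              # assumed: name of the timestamp parameter
        nolinks   => 1,                 # parse the starting page only
    );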

METHODS

crawl($url, [$callback])

Starts crawling from the given URL. The $url parameter specifies the starting page.

The optional $callback parameter is a reference to a subroutine that will be called for each visited page. It receives the URL of the visited page as an argument.

The crawl method will explore the provided URL and its linked resources. It will also follow links found in form actions, external JavaScript files, and JavaScript window.open links. The crawling process continues until no more unvisited links are found.

In exploring the website, crawl will ignore links to the following file types: .pdf, .css, .png, .jpg, .svg, and .webmanifest.

Returns an array of the URLs that were parsed during the crawl. If the nolinks option was passed to new, it instead returns an array of the links found on the initial page.
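
For example, the nolinks option can be combined with crawl to list the links found on a single page without following them (a sketch; the callback argument is optional and omitted here):

    use WWW::Crawl;

    # With nolinks set, crawl() parses only the starting page and
    # returns the links found there instead of the pages visited.
    my $crawler = WWW::Crawl->new(nolinks => 1);
    my @links   = $crawler->crawl('https://example.com');
    print "$_\n" for @links;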

AUTHOR

Ian Boddison, <bod at cpan.org>

BUGS

Please report any bugs or feature requests to bug-www-crawl at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-Crawl. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc WWW::Crawl

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

This software is Copyright (c) 2023 by Ian Boddison.

This program is released under the same terms as Perl itself.