
NAME

WWW::Sitemap - functions for generating a site map for a given site URL.

SYNOPSIS

    use WWW::Sitemap;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $sitemap = WWW::Sitemap->new(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://www.my.com/'
    );

    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );

    my $root = $sitemap->root();
    for my $url ( $sitemap->urls() )
    {
        if ( $sitemap->is_internal_url( $url ) )
        {
            # do something ...
        }
        my @links = $sitemap->links( $url );
        my $title = $sitemap->title( $url );
        my $summary = $sitemap->summary( $url );
        my $depth = $sitemap->depth( $url );
    }
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );

DESCRIPTION

The WWW::Sitemap module creates a sitemap for a site by traversing the site with the WWW::Robot module. The sitemap object has methods to access a list of all the URLs in the site, and a list of all the links for each of these URLs. It is also possible to access the title of each URL, and a summary generated from each URL. The depth of each URL can also be accessed; the depth is the minimum number of links from the root URL to that page.
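
To make the definition of depth concrete, the following is a minimal sketch of how a minimum link count can be computed with a breadth-first search. It is not the module's implementation; the `%links` graph and the `min_depths` helper are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical link graph: each page maps to the pages it links to.
my %links = (
    '/'           => [ '/about', '/products' ],
    '/about'      => [ '/', '/contact' ],
    '/products'   => [ '/products/a', '/contact' ],
    '/products/a' => [ '/' ],
    '/contact'    => [],
);

# Breadth-first search from the root: the first time a page is reached
# is along a shortest path, so that is its minimum depth.
sub min_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( defined( my $url = shift @queue ) ) {
        for my $next ( @{ $links->{$url} || [] } ) {
            next if exists $depth{$next};    # already reached earlier
            $depth{$next} = $depth{$url} + 1;
            push @queue, $next;
        }
    }
    return \%depth;
}

my $depth = min_depths( '/', \%links );
printf "%-12s %d\n", $_, $depth->{$_} for sort keys %$depth;
```

In this graph '/contact' is linked from both '/about' and '/products', and either route gives it a depth of 2.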

METHODS

WWW::Sitemap->new [ $option => $value ] ...

Constructor. Possible options are:

USERAGENT

User agent used to do the robot traversal. Defaults to LWP::UserAgent.

VERBOSE

Verbose flag, for printing out useful messages during traversal [0|1]. Defaults to 0.

SUMMARY_LENGTH

Maximum length of (automatically generated) summary.

EMAIL

Email address the robot uses to identify itself. This option is required.

DEPTH

Maximum depth of traversal.

ROOT

Root URL of the site for which the sitemap is being created. This option is required.

traverse( \&callback )

    sub callback {
        my ( $sitemap, $url, $depth, $flag ) = @_;

        # ...
    }

The traverse method traverses the sitemap, starting at the root node, and visits each URL in the order in which it would be displayed in a sequential sitemap of the site. The callback is called at several points in the traversal, indicated by the $flag argument to the callback:

$flag = 0

Before each set of daughter URLs of a given URL.

$flag = 1

For each URL.

$flag = 2

After each set of daughter URLs of a given URL.

See the sitemapper.pl script distributed with this module for an example of the use of the traverse method.
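
As a rough illustration of the three flag values, the sketch below defines a callback that builds an indented text outline. It relies only on the flag meanings listed above: it keeps its own nesting counter and does not use the `$depth` argument. The `@events` list is a hand-written, hypothetical sequence standing in for a real traversal; the exact order and arguments the module passes are an assumption here.

```perl
use strict;
use warnings;

my $level = 0;    # current nesting level, maintained by the callback
my @lines;        # collected outline lines

# Callback with the documented signature. $flag selects the action:
# 0 = a list of sub-pages starts, 1 = one page, 2 = the list ends.
sub outline_callback {
    my ( $sitemap, $url, $depth, $flag ) = @_;
    if    ( $flag == 0 ) { $level++ }
    elsif ( $flag == 1 ) { push @lines, ( '  ' x $level ) . $url }
    elsif ( $flag == 2 ) { $level-- }
    return;
}

# Hypothetical event sequence: [ url, depth, flag ].
my @events = (
    [ 'http://www.my.com/',       0, 1 ],
    [ 'http://www.my.com/',       0, 0 ],
    [ 'http://www.my.com/a.html', 1, 1 ],
    [ 'http://www.my.com/b.html', 1, 1 ],
    [ 'http://www.my.com/',       0, 2 ],
);

# Drive the callback by hand; no sitemap object is needed for the demo.
outline_callback( undef, @$_ ) for @events;
print "$_\n" for @lines;
```

With a real sitemap object the same callback would be passed as `$sitemap->traverse( \&outline_callback )`.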

option( $option [, $value ] )

Interface to get / set options after object construction.

root()

Returns the root URL for the site.

urls()

Returns a list of all the URLs on the sitemap.

links( $url )

Returns a list of all the links from a given URL in the site map.

is_internal_link( $url )

Returns 1 if $url is an internal link for the site - 0 otherwise.
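
How the module decides that a link is internal is not described here. One simple rule, assumed purely for illustration, is that a URL is internal when it begins with the root URL. The `looks_internal` helper below is hypothetical and is not part of WWW::Sitemap.

```perl
use strict;
use warnings;

# Hypothetical helper: treats a URL as internal when it starts with the
# root URL (a case-sensitive prefix match). Returns 1 or 0.
sub looks_internal {
    my ( $root, $url ) = @_;
    return index( $url, $root ) == 0 ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url ( 'http://www.my.com/a.html', 'http://www.other.com/' ) {
    printf "%-28s %s\n", $url,
        looks_internal( $root, $url ) ? 'internal' : 'external';
}
```

A prefix match is deliberately naive: it ignores differences in host case, default ports, and alternative host names for the same site.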

depth( $url )

Returns the minimum number of links to traverse from the root URL of the site to this URL.

title( $url )

Returns the title of the URL.

summary( $url )

Returns a summary of the URL - either from the <META NAME=DESCRIPTION> tag or generated automatically using HTML::Summary.

SEE ALSO

KNOWN BUGS / RESTRICTIONS

AUTHOR

Ave Wrigley <wrigley@cre.canon.co.uk>

COPYRIGHT

Copyright (c) 1997 Canon Research Centre Europe (CRE). All rights reserved. This script and any associated documentation or files cannot be distributed outside of CRE without express prior permission from CRE.