NAME

WWW::Sitemap - functions for generating a site map for a given site URL.

SYNOPSIS

    use WWW::Sitemap;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $sitemap = WWW::Sitemap->new(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://www.my.com/'
    );

    $sitemap->url_callback(
        sub {
            my ( $url, $depth, $title, $summary ) = @_;
            print STDERR "URL: $url\n";
            print STDERR "DEPTH: $depth\n";
            print STDERR "TITLE: $title\n";
            print STDERR "SUMMARY: $summary\n";
            print STDERR "\n";
        }
    );
    $sitemap->generate();
    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );

    my $root = $sitemap->root();
    for my $url ( $sitemap->urls() )
    {
        if ( $sitemap->is_internal_url( $url ) )
        {
            # do something ...
        }
        my @links = $sitemap->links( $url );
        my $title = $sitemap->title( $url );
        my $summary = $sitemap->summary( $url );
        my $depth = $sitemap->depth( $url );
    }
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );

DESCRIPTION

The WWW::Sitemap module creates a sitemap for a site by traversing it with the WWW::Robot module. The sitemap object has methods to access a list of all the URLs in the site and a list of all the links from each of these URLs. It is also possible to access the title of each URL and a summary generated from it. The depth of each URL can also be accessed; the depth is the minimum number of links that must be followed from the root URL to reach that page.
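
To make the definition of depth concrete, here is a minimal sketch of computing the minimum number of links from a root with a breadth-first search. It does not use WWW::Sitemap at all: the %links graph and the min_depths helper are hypothetical, and the module's own implementation may work differently.

```perl
use strict;
use warnings;

# Hypothetical link graph: each URL maps to the URLs it links to.
my %links = (
    'http://www.my.com/'       => [ 'http://www.my.com/a.html', 'http://www.my.com/b.html' ],
    'http://www.my.com/a.html' => [ 'http://www.my.com/c.html' ],
    'http://www.my.com/b.html' => [ 'http://www.my.com/c.html', 'http://www.my.com/' ],
    'http://www.my.com/c.html' => [],
);

# Breadth-first search from the root: the first time a URL is reached,
# it has been reached by the fewest possible links.
sub min_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( @queue ) {
        my $url = shift @queue;
        for my $next ( @{ $links->{$url} || [] } ) {
            next if exists $depth{$next};    # already reached by a shorter path
            $depth{$next} = $depth{$url} + 1;
            push @queue, $next;
        }
    }
    return %depth;
}

my %depth = min_depths( 'http://www.my.com/', \%links );
printf "%s => %d\n", $_, $depth{$_} for sort keys %depth;
```

In this made-up graph, c.html is linked from both a.html and b.html, and the link from b.html back to the root does not change the root's depth of 0.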

CONSTRUCTOR

WWW::Sitemap->new [ $option => $value ] ...

Possible options are:

USERAGENT

User agent used to do the robot traversal. Defaults to LWP::UserAgent.

VERBOSE

Verbose flag, for printing out useful messages during traversal [0|1]. Defaults to 0.

SUMMARY_LENGTH

Maximum length of (automatically generated) summary.

EMAIL

Email address the robot uses to identify itself. This option is required.

DEPTH

Maximum depth of traversal.

ROOT

Root URL of the site for which the sitemap is being created. This option is required.

    my $sitemap = WWW::Sitemap->new(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://www.my.com/'
    );

METHODS

generate( )

Method for generating the sitemap, based on the constructor options.

    $sitemap->generate();

url_callback( sub { ... } )

This method allows you to define a callback that will be invoked on every URL that is traversed while generating the sitemap. Its main purpose is to allow custom verbose reporting. The callback should be of the form:

    sub {
        my ( $url, $depth, $title, $summary ) = @_;

        # do something ...

    }
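
As an illustration of a callback of that form, the following sketch collects one record per URL instead of printing. The @pages array, the $collect name and the fallback values for an undefined title or summary are assumptions made for this example; the callback is invoked by hand with made-up values so that the sketch runs without a network connection.

```perl
use strict;
use warnings;

my @pages;    # one record per traversed URL

my $collect = sub {
    my ( $url, $depth, $title, $summary ) = @_;
    push @pages, {
        url     => $url,
        depth   => $depth,
        # Assumption: a page may have no title or summary.
        title   => defined $title   ? $title   : '(untitled)',
        summary => defined $summary ? $summary : '',
    };
};

# In real use this would be registered with $sitemap->url_callback( $collect ).
# Here it is called directly with hypothetical arguments.
$collect->( 'http://www.my.com/',       0, 'Home', 'Welcome page' );
$collect->( 'http://www.my.com/a.html', 1, undef,  undef );

printf "%d %s (%s)\n", $_->{depth}, $_->{url}, $_->{title} for @pages;
```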

option( $option [ => $value ] )

Interface to get / set options after object construction.

    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );

root()

Returns the root URL for the site.

    my $root = $sitemap->root();

urls()

Returns a list of all the URLs on the sitemap.

    for my $url ( $sitemap->urls() )
    {
        # do something ...
    }

is_internal_url( $url )

Returns 1 if $url is an internal URL (i.e. if $url =~ /^$root/).

    if ( $sitemap->is_internal_url( $url ) )
    {
        # do something ...
    }
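
The prefix test described above can be sketched on its own as follows. The looks_internal helper is hypothetical and is not part of the module. It wraps the root in \Q...\E so that characters such as '.' are matched literally; whether the module itself quotes the root is not stated in this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $url starts with $root, 0 otherwise.
# \Q...\E quotes regex metacharacters in $root so it is matched literally.
sub looks_internal {
    my ( $root, $url ) = @_;
    return $url =~ /^\Q$root\E/ ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url (
    'http://www.my.com/about.html',
    'http://wwwXmy.com/',
    'http://www.other.com/',
) {
    printf "%-30s %s\n", $url,
        looks_internal( $root, $url ) ? 'internal' : 'external';
}
```

Without the quoting, the second URL would also match, because an unquoted '.' in the pattern matches any character.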

links( $url )

Returns a list of all the links from a given URL in the site map.

    my @links = $sitemap->links( $url );

title( $url )

Returns the title of the URL.

    my $title = $sitemap->title( $url );

summary( $url )

Returns a summary of the URL - either from the <META NAME=DESCRIPTION> tag or generated automatically using HTML::Summary.

    my $summary = $sitemap->summary( $url );
    

depth( $url )

Returns the minimum number of links to traverse from the root URL of the site to this URL.

    my $depth = $sitemap->depth( $url );

traverse( \&callback )

The traverse method traverses the sitemap, starting at the root node, and visits each URL in the order in which it would be displayed in a sequential sitemap of the site. The callback is called at several points in the traversal, indicated by the $flag argument to the callback:

$flag = 0

Before each set of daughter URLs of a given URL.

$flag = 1

For each URL.

$flag = 2

After each set of daughter URLs of a given URL.

See the sitemapper.pl script distributed with this module for an example of the use of the traverse method.

    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );
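
To show how the three flag values might fit together, here is a sketch of a callback that builds a nested HTML list. Because running the real traverse method needs a generated sitemap, the callback is driven by mock_traverse, a hypothetical stand-in written only from the flag description above; the %children tree is made up, and the real method may call the callback in a different order.

```perl
use strict;
use warnings;

# Hypothetical tree of pages: each URL maps to its daughter URLs.
my %children = (
    '/'           => [ '/a.html', '/b.html' ],
    '/a.html'     => [ '/a/one.html' ],
    '/b.html'     => [],
    '/a/one.html' => [],
);

# Stand-in for traverse(): flag 1 for the URL itself, then flag 0 and
# flag 2 around its daughters, if it has any.
sub mock_traverse {
    my ( $callback, $url, $depth ) = @_;
    $callback->( undef, $url, $depth, 1 );
    my @kids = @{ $children{$url} || [] };
    return unless @kids;
    $callback->( undef, $url, $depth, 0 );
    for my $kid ( @kids ) {
        mock_traverse( $callback, $kid, $depth + 1 );
    }
    $callback->( undef, $url, $depth, 2 );
}

my @out;
my $render = sub {
    my ( $sitemap, $url, $depth, $flag ) = @_;
    my $indent = '  ' x $depth;
    if    ( $flag == 0 ) { push @out, "$indent<ul>" }       # open sub-list
    elsif ( $flag == 1 ) { push @out, "$indent<li>$url" }   # one page
    elsif ( $flag == 2 ) { push @out, "$indent</ul>" }      # close sub-list
};

mock_traverse( $render, '/', 0 );
print "<ul>\n", map( { "$_\n" } @out ), "</ul>\n";
```

The outer list is opened and closed by hand so that the root entry also sits inside a list; the sitemapper.pl script shows how the module's author handles this.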

SEE ALSO

    LWP::UserAgent
    HTML::Summary
    WWW::Robot

AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>

COPYRIGHT

Copyright (c) 1997 Canon Research Centre Europe (CRE). All rights reserved. This script and any associated documentation or files cannot be distributed outside of CRE without express prior permission from CRE.
