
NAME

WWW::Sitemapper - Create text, HTML and XML sitemaps by scanning a web site.

VERSION

version 1.121160

SYNOPSIS

WWW::Sitemapper is meant to be subclassed:

    package MyWebSite::Map;
    use Moose;
    use DateTime;
    use LWP::UserAgent;
    use URI;

    use base qw( WWW::Sitemapper );

    # define attributes for your class
    has 'restricted_pages' => (
        is => 'ro',
        isa => 'ArrayRef[RegexpRef]',
        default => sub {
            [
                qr{^/cat/login},
                qr{^/cat/events},
                qr{\?_search_string=},
            ]
        },
    );

    # configuration options for WWW::Robot
    sub _build_robot_config {
        my $self = shift;

        return {
            NAME => 'MyRobot',
            EMAIL => 'me@domain.tld',
        };
    }

    # you need to provide a follow-url-test hook in your subclass
    sub url_test : Hook('follow-url-test') {
        my $self = shift;
        my ($robot, $hook_name, $uri) = @_;

        my $url = $uri->path_query;

        if ( $self->site->host eq $uri->host ) {
            for my $re ( @{ $self->restricted_pages } ) {
                if ( $url =~ /$re/ ) {
                    return 0;
                }
            }

            return 1;
        }

        return 0;
    }

    # you can add your own hooks as well
    sub run_till_first_auto_save : Hook('continue-test') {
        my $self = shift;
        my ($robot) = @_;

        if ( $self->run_started_time + $self->auto_save < DateTime->now ) {
            return 0;
        }
        return 1;
    }


    # as this is your class feel free to define your own methods
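    sub ping_google {
        my $self = shift;

        # pass the sitemap location as a query parameter
        my $ping = URI->new('http://www.google.com/webmasters/sitemaps/ping');
        $ping->query_form( sitemap => $self->site . 'google-sitemap.xml.gz' );

        my $ua = LWP::UserAgent->new;
        return $ua->get($ping);
    }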

and then

    package main;

    my $mapper = MyWebSite::Map->new(
        site => 'http://mywebsite.com/',
        status_storage => 'sitemap.data',
        auto_save => 10,
    );

    $mapper->run;


    open( my $html, '>', 'sitemap.html' ) or die "Cannot create sitemap.html: $!";
    print {$html} $mapper->html_sitemap;
    close($html);

    my $xml_sitemap = $mapper->xml_sitemap(
        priority => '0.7',
        changefreq => 'weekly',
    );

    $xml_sitemap->write('google-sitemap.xml.gz');

    # call your own method
    $mapper->ping_google();

and while the mapper is still running, take a peek at what has been mapped so far

    my $mapper = MyWebSite::Map->new(
        site => 'http://mywebsite.com/',
        status_storage => 'sitemap.data',
    );

    $mapper->restore_state();

    print $mapper->txt_sitemap();

ATTRIBUTES

site

Home page of the website to be mapped.

isa: "tURI" in WWW::Sitemapper::Types.

tree

Tree structure of the web site.

isa: WWW::Sitemapper::Tree.

Note: each page is mapped only once, so if multiple pages link to the same page, only the first one is counted as its parent.

Note: beware of pages serving the same content under different URLs (e.g. using different query string parameters), as this may lead to circular references. Besides, search engines penalise so-called "duplicate content". Use your subroutine with Hook('follow-url-test') to restrict access to those pages.
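
As a minimal sketch of such a restriction, the predicate below rejects URLs whose query string carries parameters that only vary the presentation of a page. The helper name should_follow and the parameter names are hypothetical and not part of WWW::Sitemapper; in a real subclass the same check would live inside the subroutine tagged with Hook('follow-url-test'), applied to $uri->path_query.

```perl
use strict;
use warnings;

# Hypothetical list of query parameters that only change presentation
# (sorting, paging, session ids) and therefore produce duplicate content.
my @duplicate_params = qw( sort page sessionid );

# Return 1 if the path-and-query string should be followed, 0 otherwise.
sub should_follow {
    my ($path_query) = @_;

    # Separate the query string from the path; no query means nothing to check.
    my ( undef, $query ) = split /\?/, $path_query, 2;
    return 1 unless defined $query && length $query;

    for my $pair ( split /[&;]/, $query ) {
        my ($name) = split /=/, $pair, 2;
        next unless defined $name;    # skip empty pairs such as "a=1&&b=2"
        return 0 if grep { $_ eq $name } @duplicate_params;
    }

    return 1;
}

print should_follow('/doc/index.html')        ? "follow\n" : "skip\n";
print should_follow('/doc/index.html?sort=a') ? "follow\n" : "skip\n";
```

Matching on parameter names rather than whole URLs keeps the list short, but which parameters count as duplicates depends entirely on your site.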

robot_config

WWW::Robot configuration options.

isa: HashRef.

Your subclass needs to define the builder method _build_robot_config, which must return a hashref. The most important options are:

  • EMAIL

    Your e-mail address - in case someone wishes to complain about the behaviour of your robot.

    mandatory.

  • DELAY

    Delay between each request in minutes.

    Default: 1

For more details and other options please see "ROBOT_ATTRIBUTES" in WWW::Robot.
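
A builder that sets both options might look like the sketch below. It is written as a plain subroutine so that it runs on its own; in your subclass it would be a method. The NAME and EMAIL values are the placeholders from the SYNOPSIS, and the DELAY of 2 minutes is an arbitrary example value.

```perl
use strict;
use warnings;

# Stand-in for the builder method of a WWW::Sitemapper subclass.
sub _build_robot_config {
    my $self = shift;

    return {
        NAME  => 'MyRobot',          # robot name, as in the SYNOPSIS
        EMAIL => 'me@domain.tld',    # contact address for complaints
        DELAY => 2,                  # wait 2 minutes between requests
    };
}

my $config = _build_robot_config();
printf "%s => %s\n", $_, $config->{$_} for sort keys %$config;
```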

status_storage

Path of the status storage file used to save the result of the web crawl. If defined, Storable will be used to store the current state.

isa: Str.

auto_save

Auto save current status every N minutes (defaults to 0 - do not auto save).

isa: "tDateTimeDuration" in WWW::Sitemapper::Types.

Note: "status_storage" has to be defined.

run_started_time

Time when "run" method was called.

isa: "tDateTime" in WWW::Sitemapper::Types.

html_sitemap_template

Template-Toolkit html sitemap template to be used by helper method "html_sitemap".

isa: Str.

Can be overridden by defining _build_html_sitemap_template in your subclass.

The template receives the main object ($self) under the name mapper.

Default value:

    <html>
    <head>
    <title>Sitemap for [% mapper.site.host %]</title>
    </head>
    <body>
    <ul>
    [%- INCLUDE branch node = mapper.tree -%]
    </ul>
    </body>
    </html>

    [%- BLOCK branch -%]
    <li><a href="[% node.loc %]">[% node.title || node.loc %]</a>
    [%     IF node.children.size -%]
    <ul>
    [%-
                FOREACH child IN node.children;
                    INCLUDE branch node = child;
                END;
    -%]
    </ul>
    [%     END -%]
    </li>
    [% END -%]

METHODS

run

    print $mapper->run();

Creates a WWW::Robot object and starts to map the website specified by "site".

Scans your subclass for methods with :Hook('name-of-the-hook') attributes and adds them to the robot object.

You need to define at least one subroutine with the follow-url-test hook, which is used to decide whether a page should be followed and added to the sitemap.

    sub url_test : Hook('follow-url-test') {
        my $self = shift;
        my ($robot, $hook_name, $uri) = @_;

        my $should_follow = ...

        return $should_follow;
    }

Please see "SUPPORTED_HOOKS" in WWW::Robot for the full list of supported hooks.

Note: you can name your subroutines however you want and add other attributes as well - WWW::Sitemapper looks only for Hook(...) ones.

txt_sitemap

    print $mapper->txt_sitemap();

Create plain text sitemap. Example output:

    * http://mywebsite.com/
      * http://mywebsite.com/page1.html
        * http://mywebsite.com/page11.html
        * http://mywebsite.com/page12.html
      * http://mywebsite.com/page2.html

Accepts the following parameters:

with_id => 0|1

    print $mapper->txt_sitemap( with_id => 1 );

Use id of each node instead of *.

Defaults to 0.

with_title => 0|1

    print $mapper->txt_sitemap( with_title => 1 );

Add node title after node location.

Defaults to 0.
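
To illustrate how such an indented listing can be produced from a tree, here is a minimal recursive sketch. It does not use the real WWW::Sitemapper::Tree API: the tree is a hypothetical structure of plain hashrefs, render_txt is an illustrative name, and the exact layout of the with_id and with_title output is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the crawled tree: plain hashrefs with
# id, loc, title and children keys.
my $tree = {
    id => 1, loc => 'http://mywebsite.com/', title => 'Home',
    children => [
        {
            id => 2, loc => 'http://mywebsite.com/page1.html', title => 'Page 1',
            children => [
                { id => 3, loc => 'http://mywebsite.com/page11.html', title => 'Page 11' },
            ],
        },
        { id => 4, loc => 'http://mywebsite.com/page2.html', title => 'Page 2' },
    ],
};

# Render a node and its children, indenting two spaces per level.
sub render_txt {
    my ( $node, %opt ) = @_;
    my $depth = $opt{depth} || 0;

    my $marker = $opt{with_id} ? $node->{id} : '*';
    my $line   = ( '  ' x $depth ) . "$marker $node->{loc}";
    $line .= " ($node->{title})" if $opt{with_title} && defined $node->{title};

    my $out = "$line\n";
    for my $child ( @{ $node->{children} || [] } ) {
        $out .= render_txt( $child, %opt, depth => $depth + 1 );
    }
    return $out;
}

print render_txt($tree);
print render_txt( $tree, with_id => 1, with_title => 1 );
```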

html_sitemap

    print $mapper->html_sitemap(%TT_CONF);

Create HTML sitemap using template defined in "html_sitemap_template".

Accepts Template-Toolkit configuration options; see "CONFIGURATION_SUMMARY" in Template.

xml_sitemap

    my $sitemap = $mapper->xml_sitemap();

    # print xml
    print $sitemap->as_xml->sprint;

    # write to file
    $sitemap->write('sitemap.xml');

Creates an XML sitemap. Returns a WWW::Sitemap::XML object.

Accepts the following parameters:

  • split_by

        my @sitemaps = $mapper->xml_sitemap(
            split_by => [
                '^/doc',
                '^/cat',
                '^/ila',
            ],
        );

    Arrayref of regular expressions used to split the final sitemap based on the page location - "loc" in WWW::Sitemapper::Tree. If this option is supplied, "xml_sitemap" returns an array of WWW::Sitemap::XML objects, plus an additional one for any URLs not matched by the conditions provided.

    Note: the first matching condition is used.

    Note: scheme and hostname are removed from the node URI before condition matching.

    Note: conditions can be regexps or strings.

  • priority

        my $sitemap = $mapper->xml_sitemap(
            priority => 0.6,
        );

    or

        my $sitemap = $mapper->xml_sitemap(
            priority => {
                '^/doc/' => '+0.2', # same as 0.7
                '^/ila/' => 0.4,
                '^/cat/' => 0.9,
                '^/$' => 1,
            },
        );

    or

        my $sitemap = $mapper->xml_sitemap(
            priority => [
                { '^/doc/' => '+0.2' },
                { '^/ila/' => 0.3    },
                { '^/cat/' => 0.9    },
                { '\.pdf$' => 0.8    }, # all pdfs 0.8 and in /doc/ 1.0
            ],
        );

    If priority is a scalar value it will be used as a default for all pages.

    Relative values (with a leading + or -) are supported and are added to or subtracted from the final priority.

    If it is a hashref or arrayref, all conditions are checked. All matching relative values are combined, while for absolute values only the last matching one is used - use an arrayref to chain your conditions.

    Final priority will be set to 0.0 if the calculated one is negative.

    Final priority will be set to 1.0 if the calculated one is higher than 1.

    Default priority is 0.5.

    Note: scheme and hostname are removed from the node URI before condition matching.

    Note: keys can be regexps or strings.

  • changefreq

        my $sitemap = $mapper->xml_sitemap(
            changefreq => 'daily',
        );

    or

        my $sitemap = $mapper->xml_sitemap(
            changefreq => {
                '^/doc/' => 'weekly',
                '^/ila/' => 'yearly',
                '^/cat/' => 'daily',
                '^/$' => 'always',
            },
        );

    or

        my $sitemap = $mapper->xml_sitemap(
            changefreq => [
                { '^/doc/' => 'weekly' },
                { '^/ila/' => 'yearly' },
                { '^/cat/' => 'daily'  },
                { '^/$' => 'always'    },
                { '\.pdf$' => 'never'  }, # pdfs will never change
            ],
        );

    If changefreq is a scalar value it will be used as a default for all pages.

    If it is a hashref or arrayref all conditions are checked and the last matching one is used - use arrayref to chain your conditions.

    Valid values are:

    • always

    • hourly

    • daily

    • weekly

    • monthly

    • yearly

    • never

    Default changefreq is 'weekly'.

    Note: scheme and hostname are removed from the node URI before condition matching.

    Note: keys can be regexps or strings.
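
The priority rules of "xml_sitemap" (relative values accumulate, the last matching absolute value wins, and the result is limited to the range 0.0 to 1.0) can be sketched as follows. This is an illustration based only on the description above, not the module's actual implementation; resolve_priority is a hypothetical helper and the paths are made up.

```perl
use strict;
use warnings;

# Resolve the priority of one path against a chain of { pattern => value }
# rules. Values starting with + or - are relative and accumulate; other
# values are absolute and the last match wins. The result is clamped to 0..1.
sub resolve_priority {
    my ( $path, $rules, $default ) = @_;
    $default = 0.5 unless defined $default;

    my $absolute = $default;
    my $relative = 0;

    for my $rule (@$rules) {
        my ( $pattern, $value ) = %$rule;    # each rule has a single key
        next unless $path =~ /$pattern/;

        if ( $value =~ /^[+-]/ ) {
            $relative += $value;
        }
        else {
            $absolute = $value;
        }
    }

    my $priority = $absolute + $relative;
    $priority = 0 if $priority < 0;
    $priority = 1 if $priority > 1;

    return sprintf '%.1f', $priority;
}

my $rules = [
    { '^/doc/' => '+0.2' },
    { '^/ila/' => 0.3 },
    { '^/cat/' => 0.9 },
    { '\.pdf$' => 0.8 },
];

for my $path (qw( /doc/index.html /doc/manual.pdf /cat/list.html /other.html )) {
    printf "%-18s %s\n", $path, resolve_priority( $path, $rules );
}
```

An arrayref of single-key hashrefs is used here because a plain hash does not preserve the order of its keys, and the order matters when several absolute rules match the same path.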

HOOKED METHODS

restore_state

    $mapper->restore_state();

Restore state from "status_storage" using "retrieve" in Storable.

Loads "tree" and the internal state of the web robot into the current object.

Uses hook "restore-state" in WWW::Robot.

save_state

    $mapper->save_state();

Saves the current content of "tree" and the internal state of the web robot into "status_storage" using "store" in Storable.

Uses hook "save-state" in WWW::Robot.
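
The underlying Storable round trip can be sketched in isolation as shown below. The snapshot structure is hypothetical and far simpler than what the module actually stores, and a temporary file stands in for "status_storage".

```perl
use strict;
use warnings;
use Storable qw( store retrieve );
use File::Temp qw( tempfile );

# Hypothetical snapshot; the real module stores its tree and robot state.
my $state = {
    tree  => { loc => 'http://mywebsite.com/', children => [] },
    robot => { seen => ['http://mywebsite.com/'] },
};

# A temporary file plays the role of the status storage file.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

store( $state, $file );            # what a save would do
my $restored = retrieve($file);    # what a restore would do

print "restored tree root: $restored->{tree}{loc}\n";
```

Because the state lives in a file, a second process can call retrieve on it while the crawl is still running, which is what the last SYNOPSIS example relies on.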

AUTHOR

Alex J. G. Burzyński <ajgb@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Alex J. G. Burzyński <ajgb@cpan.org>.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.