NAME

WWW::CheckSite::Validator - A spider that assesses 'kwalitee' for a site

SYNOPSIS

    use WWW::CheckSite::Validator;
    my $wcv = WWW::CheckSite::Validator->new(
        uri => 'http://www.test-smoke.org'
    );

    while ( my $info = $wcv->get_page ) {
        # handle the info
    }

DESCRIPTION

This is a subclass of WWW::CheckSite::Spider.

WWW::CheckSite::Validator starts its work after the spider has fetched the page. It will check these things:

  • links

    All links on the page (<a href>, <area href>, <frame src>) are checked for availability.

  • images

    All images on the page (<img src>, <input type=image>) are checked for availability.

  • stylesheets

    All stylesheets on the page (<link rel=stylesheet type=text/css>) are checked for availability.

  • W3 HTML validation

    The contents of the page are sent to http://validator.w3.org for validation.

METHODS

WWW::CheckSite::Validator->new( %args )

Extends WWW::CheckSite::Spider->new to check for Image::Info, so that a basic check can be done on the images.

$wcs->process_page

This method overrides the WWW::CheckSite::Spider::process_page() method to check the availability of links, images and stylesheets. When validation is requested, it also sends the page to W3.ORG for validation.

On top of the standard information it returns more:

  • links a list of links on the page, with some extra info

  • links_cnt the number of links on the page

  • links_ok the number of links that returned STATUS==200

  • images a list of images on the page, with some extra info

  • images_cnt the number of images on the page

  • images_ok the number of images that returned STATUS==200

  • styles a list of stylesheets on the page, with some extra info

  • styles_cnt the number of stylesheets on the page

  • styles_ok the number of stylesheets that returned STATUS==200

  • valid the result of validation at W3.ORG
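
The counters listed above lend themselves to a short per-page summary. The following is only a sketch: summarize_page() is a hypothetical helper, not part of the module, and it assumes the page information is a hash reference carrying the documented *_cnt and *_ok keys. Mock data stands in for a real page.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::CheckSite::Validator):
# build one summary line per kind from the documented counters.
# Assumes $info is a hash reference with the *_cnt and *_ok keys.
sub summarize_page {
    my ($info) = @_;
    my @lines;
    for my $kind (qw( links images styles )) {
        my $cnt = $info->{"${kind}_cnt"} || 0;
        my $ok  = $info->{"${kind}_ok"}  || 0;
        push @lines, sprintf "%-6s %d/%d ok", $kind, $ok, $cnt;
    }
    return @lines;
}

# Mock data standing in for the information about one page.
my %info = (
    links_cnt  => 12, links_ok  => 11,
    images_cnt => 4,  images_ok => 4,
    styles_cnt => 1,  styles_ok => 0,
);
print "$_\n" for summarize_page( \%info );
```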

$wcs->check_links( $stats )

The check_links() method gets information about the links on this page. If there is no return status, it will HEAD the uri and update the cache status for this link to prevent multiple HEADing.

NOTE: This method does not respect the exclusion rules, and it respects robot rules only when strictrules is enabled!
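
The idea of caching a status to avoid repeated HEAD requests can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: %status_cache and cached_status() are hypothetical names, and a code reference stands in for the real HEAD request so the sketch runs without network access.

```perl
use strict;
use warnings;

# Hypothetical cache keyed by URI; the module keeps its own cache.
my %status_cache;

# $head is a code reference standing in for the real HEAD request.
# It is only called when no status is cached for $uri yet.
sub cached_status {
    my ( $uri, $head ) = @_;
    $status_cache{$uri} = $head->($uri)
        unless defined $status_cache{$uri};
    return $status_cache{$uri};
}

# Fake HEAD that counts how often it is called.
my $requests  = 0;
my $fake_head = sub { $requests++; return 200 };

cached_status( 'http://www.test-smoke.org/', $fake_head ) for 1 .. 3;
print "HEAD requests made: $requests\n";
```

Asking for the same uri three times triggers the fake request only once; the later calls are answered from the cache.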

The structure for links:

  • link as set in the a/area tag

  • uri as returned after the HEAD request

  • tag set to 'A' or 'AREA'

  • text set to the text in the link

  • status the return status from the HEAD request

  • depth the depth in the "browse-tree"

  • action explanation of the action taken on this uri
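
With records shaped like the structure above, picking out the links that failed is a simple filter. The sketch below assumes the list is an array reference of hash references; broken_links() is a hypothetical helper and the records are mock data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the link records whose status is not 200.
# Assumes $links is an array reference of hash references.
sub broken_links {
    my ($links) = @_;
    return grep { !defined $_->{status} || $_->{status} != 200 } @$links;
}

# Mock records using the documented keys.
my @links = (
    { link => '/', uri => 'http://www.test-smoke.org/', tag => 'A',
      text => 'Home', status => 200, depth => 0 },
    { link => '/gone', uri => 'http://www.test-smoke.org/gone', tag => 'A',
      text => 'Old page', status => 404, depth => 1 },
);

printf "%s (%s): %s\n", $_->{uri}, $_->{text}, $_->{status} // 'no status'
    for broken_links( \@links );
```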

$wcs->check_images( $stats )

The check_images() method gets information about the images on the page. The list comes from the images() method of the mechanize object. It will only HEAD the uri.

The structure for images:

  • link as set in the img/input tag

  • uri as returned after the HEAD request

  • tag set to 'ALT'

  • text set to the text of the ALT attribute

  • status the return status from the HEAD request

  • ct the 'Content-Type' returned by the HEAD request
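
The status and ct fields can be combined into a quick plausibility check on an image record. This is a sketch only: image_looks_ok() is a hypothetical helper, and treating any image/* Content-Type as acceptable is an assumption, not something the module documents.

```perl
use strict;
use warnings;

# Hypothetical check: the HEAD request returned 200 and the
# Content-Type is some image/* type (an assumption for this sketch).
sub image_looks_ok {
    my ($img) = @_;
    return 0 unless defined $img->{status} && $img->{status} == 200;
    return ( $img->{ct} // '' ) =~ m{^image/}i ? 1 : 0;
}

# Mock records using the documented keys.
my @images = (
    { link => 'logo.png', uri => 'http://www.test-smoke.org/logo.png',
      tag => 'ALT', text => 'Logo', status => 200, ct => 'image/png' },
    { link => 'oops.png', uri => 'http://www.test-smoke.org/oops.png',
      tag => 'ALT', text => '', status => 200, ct => 'text/html' },
);

printf "%-10s %s\n", $_->{link}, image_looks_ok($_) ? 'ok' : 'suspect'
    for @images;
```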

$wcs->check_styles( $stats )

The check_styles() method checks the validity of stylesheets used in the page. We check for <link rel="stylesheet" type="text/css"> tags.

The structure for stylesheets:

  • link as set in the link tag

  • uri as returned after the HEAD request

  • tag set to 'link'

  • text set to empty for compatibility with links and images

  • status the return status from the HEAD request

  • ct the 'Content-Type' returned by the HEAD request

$wcs->validate

The validate() method sends the url/contents off to W3.org to validate.

$wcs->validate_by_none

The fallback do-not-validate method.

$wcs->validate_by_uri

Sends only the uri to W3.ORG and gets the validation result.

$wcs->validate_by_upload( $stats )

Create a temporary file (with File::Temp) from $agent->content, call the validator with that temporary file and save the result (as a boolean) in $stats->{validate}.
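
The temporary-file step might look roughly like the sketch below. content_to_tempfile() is a hypothetical helper, the '.html' suffix is an assumption, and the call to the validator itself is left out.

```perl
use strict;
use warnings;
use File::Temp;

# Hypothetical helper: write page content to a temporary file.
# The file is removed when the returned object goes out of scope.
sub content_to_tempfile {
    my ($content) = @_;
    my $tmp = File::Temp->new( SUFFIX => '.html' );
    print {$tmp} $content;
    $tmp->flush;    # make sure the content is on disk before it is read
    return $tmp;
}

my $tmp  = content_to_tempfile('<html><body><p>hi</p></body></html>');
my $size = -s $tmp->filename;
print "wrote $size bytes to ", $tmp->filename, "\n";
```

Keeping the File::Temp object alive for as long as the file is needed avoids having to clean up by hand.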

$wcs->validate_by_xmllint( $stats )

Use the xmllint(1) program to validate the (X)HTML.

$wcs->validate_style( $ua )

Dispatch the validation to the right method.
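
One common way to dispatch to methods named style_by_none, style_by_uri and so on is to build the method name and look it up with can(). The sketch below is an assumption about how such dispatch could work, not the module's code: the class My::StyleDispatch, the 'how' option and the stub return values are all hypothetical.

```perl
use strict;
use warnings;

package My::StyleDispatch;    # hypothetical class, not part of the module

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

# Stub validators; the real ones would contact a validation service.
sub style_by_none { return 'skipped' }
sub style_by_uri  { return 'checked by uri' }

# Look up "style_by_<how>" and fall back to style_by_none
# when no such method exists.
sub validate_style {
    my ( $self, @args ) = @_;
    my $how    = $self->{how} || 'none';
    my $method = $self->can("style_by_$how") || \&style_by_none;
    return $self->$method(@args);
}

package main;

print My::StyleDispatch->new( how => 'uri' )->validate_style,   "\n";
print My::StyleDispatch->new( how => 'bogus' )->validate_style, "\n";
```

An unknown setting ends up in the do-nothing fallback rather than dying, which matches the role the documentation gives the *_by_none methods.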

$wcs->style_by_none

The fallback do-not-validate-stylesheet method.

$wcs->style_by_uri( $ua )

Sends only the uri to JIGSAW.W3.ORG and gets the validation result.

$wcs->style_by_upload( $ua )

Create a temporary file (with File::Temp) from $ua->content, call the validator with that temporary file and return the result.

$wcs->validate_image( $ua )

This is more of a basic consistency check that uses Image::Info::image_info().

$wcs->ct_can_validate( $ua )

Check if the content-type is "validatable".

$wcs->set_action

Why?

SEE ALSO

WWW::CheckSite::Spider, WWW::CheckSite

AUTHOR

Abe Timmerman, <abeltje@cpan.org>

BUGS

Please report any bugs or feature requests to bug-WWW-CheckSite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT & LICENSE

Copyright MMV Abe Timmerman, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.