NAME

WWW::CheckSite::Manual - A description of the metrics used in this package

SYNOPSIS

This document contains a description of modules and tools in this suite.

Kwalitee
checksite
WWW::CheckSite

DESCRIPTION

Kwalitee

The idea behind this package is to provide an analysis of items contained in a web-site. We use the word kwalitee because it looks and sounds like quality but just isn't. The metrics used to assess kwalitee only give an indication of the technical state a web-site is in, and do not reflect the user experience or quality of that web-site.

At the heart of the package is the spider that fetches all the pages referred to within the web-site. For each page that is fetched, a number of things are checked. Here is an explanation of the kwalitee metrics:

* return status

The most basic check for a web-page is to see if it can be fetched. The HTTP return-status should be 200 OK.

SCORE: 0 for return status other than 200; 1 for return status 200

* title

The next check is to see if the <title></title> tag-pair has content.

SCORE: 0 for no content; 1 for content

* valid

The next check is to see if the (X)HTML in the page validates. The default behaviour is to use the validator available on http://validator.w3.org

SCORE: 0 for not valid, 1 for valid or validation disabled

* links

The next check is to see if the web-page does not contain "dead links".

All hyperlinks (<a href=>, <area href=>) are checked with an HTTP HEAD request to see if they can be "followed". URLs that have the same origin as the primary url will also be put on the "to-fetch-list" of the spider.

MAX SCORE: 1 (do not count urls excluded by robot-rules/exclude pattern)

* images

The next check is to see if the web-page does not contain "dead images".

All images (<img src=>, <input type=image>) are checked with an HTTP HEAD request to see if they exist on the server. If the Image::Info module is available, the image is fetched from the server and a basic sanity test on the image is done.

MAX SCORE: 1 (do not count images excluded by robot-rules/exclude pattern)

* styles

The next check is to see if the web-page does not contain "dead style references".

All styles referenced in <link rel=stylesheet type=text/css> are fetched and if validation is switched on, they will be sent to the css-validator at: http://jigsaw.w3.org/validator

TODO: Extract inline styles, and send them off for validation.

MAX SCORE: 1

kwalitee

Every individual page can have a maximum of 6 kwalitee points that lead to a kwalitee of 1.00. For the complete web-site the mean of the page scores is taken and presented as a fraction of 1.
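As a rough illustration of this arithmetic, the sketch below assumes that each page's six scores are kept in a hash keyed by hypothetical metric names, and that the links, images and styles scores may be fractions between 0 and 1. The names and helper functions are made up for this example and are not taken from the module's own code.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical metric names; the real module may store scores differently.
my @METRICS = qw(status title valid links images styles);

# Page kwalitee: points earned divided by the maximum of 6.
sub page_kwalitee {
    my ($scores) = @_;
    my $points = sum( map { $scores->{$_} // 0 } @METRICS );
    return $points / @METRICS;
}

# Site kwalitee: mean of the page kwalitees (0 for a site without pages).
sub site_kwalitee {
    my (@pages) = @_;
    return 0 unless @pages;
    return sum( map { page_kwalitee($_) } @pages ) / @pages;
}

my @pages = (
    { status => 1, title => 1, valid => 1, links => 1,   images => 1, styles => 1 },
    { status => 1, title => 0, valid => 1, links => 0.5, images => 1, styles => 1 },
);
printf "site kwalitee: %.3f\n", site_kwalitee(@pages);
```

The first page earns all 6 points and the second 4.5, so the site value is the mean of 1.00 and 0.75.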

checksite

This script is a wrapper around WWW::CheckSite that supports some command-line options to tweak the behaviour of the module.

Here is an explanation of these options:

[--uri|-u] <uri> (mandatory unless --load)

This specifies the uri to be spidered. The --uri option-qualifier is optional. --uri can be abbreviated to -u.

--prefix|-p <prefix> (mandatory)

This option specifies a prefix that will be used as a subdirectory name which is used to store the saved spider data and the reports. --prefix can be abbreviated to -p.

The subdirectory is created in the current directory, or in the directory specified with the --dir option. The data stored as a result of the --save option will be in this subdirectory with the name <prefix>.wcs

--dir|-d <directory>

This option specifies the base directory for storing the data. --dir can be abbreviated to -d.

--save or --nosave

This option specifies that the spider data should be saved. The default behaviour is to save the data; if you do not want that, use --nosave. The saved data can later be used to regenerate the reports with the --load option. The data is stored as <directory>/<prefix>/<prefix>.wcs with Storable::nstore(). --[no]save cannot be abbreviated.
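The save and load round trip can be sketched with Storable directly. The data structure and the wcs_filename() helper below are hypothetical and only serve the example; the real layout of the saved spider data is defined by the module.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Storable qw(nstore retrieve);

# Build <directory>/<prefix>/<prefix>.wcs from the two option values.
sub wcs_filename {
    my ($dir, $prefix) = @_;
    return File::Spec->catfile($dir, $prefix, "$prefix.wcs");
}

my $dir    = tempdir(CLEANUP => 1);   # stands in for --dir
my $prefix = 'mysite';                # stands in for --prefix
mkdir File::Spec->catdir($dir, $prefix) or die "mkdir failed: $!";

# Hypothetical spider data, for illustration only.
my $data = { 'http://www.mysite.org/' => { status => 200, title => 'Home' } };

my $file = wcs_filename($dir, $prefix);
nstore($data, $file);             # store in network byte order
my $loaded = retrieve($file);     # read the structure back
print "$loaded->{'http://www.mysite.org/'}{title}\n";
```

Because nstore() writes in network byte order, a file saved on one machine can be retrieved on another with a different architecture.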

See also: WWW::CheckSite Report-Templates

--load

This option specifies that you want to load the results of a previous run and not do an actual run of the programme. This option is useful to regenerate the reports. --load cannot be abbreviated.

See also: WWW::CheckSite Report-Templates

--html or --nohtml

This option specifies if (X)HTML-validation should be done. The default behaviour is to validate by_upload (see --html_upload). If you do not want the validation, use the --nohtml option. --[no]html cannot be abbreviated.

See also: checksite --html_uri, --html_upload, --xmllint and --html_validator

--html_validator <w3c-validator-uri>

As of version 0.20, the (X)HTML-validator at W3C is no longer used as the validator for (X)HTML as they do not allow robots!

The default w3c-validator-uri is now http://localhost/w3c-validator/. It is strongly advised to run your own copy of the W3C validator. --html_validator cannot be abbreviated.

The W3C (X)HTML-validator is widely available and runs smoothly on most systems with Apache and Perl running. See http://validator.w3.org/source/ for more information.

--html_uri

This option sets the validation method to use the uri interface (unless --nohtml is specified). You can optionally specify an alternative (X)HTML-validator site with --html_validator. --html_uri cannot be abbreviated.

--html_upload

This option sets the validation method to use the upload interface (unless --nohtml is specified). All the content to be validated is saved as a local file (using File::Temp). --html_upload cannot be abbreviated.

--xmllint <path/to/xmllint>

This option specifies that the validation of (X)HTML should be done with the xmllint(1) program (unless --nohtml is specified). You can optionally specify the full path to your xmllint program. --xmllint cannot be abbreviated.

--css or --nocss

This option specifies if CSS-validation should be done. The default behaviour is to validate by_upload (see --css_upload). If you do not want the validation, use the --nocss option. --[no]css cannot be abbreviated.

See also: checksite --css_uri, --css_upload and --css_validator

--css_validator <css-validator-uri>

As of version 0.20, the CSS-validator at W3C is no longer used as the validator for CSS as they do not allow robots!

The default css-validator-uri is now http://localhost/css-validator/. It is strongly advised to run your own copy of the W3C validator. --css_validator cannot be abbreviated.

The W3C CSS-validator is available and runs under Jigsaw on most systems with a working Java JDK. See http://www.w3.org/Jigsaw/#Getting for more information on the Jigsaw applet server, and http://jigsaw.w3.org/css-validator/DOWNLOAD.html for more information on the W3C CSS-validator.

--css_uri

This option sets the validation method to use the uri interface (unless --nocss is specified). You can optionally specify an alternative CSS-validator site with --css_validator. --css_uri cannot be abbreviated.

--css_upload

This option sets the validation method to use the upload interface (unless --nocss is specified). All the content to be validated is saved as a local file (using File::Temp). --css_upload cannot be abbreviated.

--lang|-l <accept-language>

This option can be used to ask a web-server to return web-pages in the specified language (if applicable). The accept-language argument can be a simple two-letter language code as specified in ISO 639, or a complete Accept-Language: field as described in section 14.4 of RFC 2616.

NOTE: My apache config says:

  # Note 3: In the case of 'ltz' we violate the RFC by using a three
  # char specifier. There is 'work in progress' to fix this and get
  # the reference data for rfc1766 cleaned up.

So there may be more weird stuff out there, but since you are supposed to be using this on your own web-sites only, you should know about that!

--lang can be abbreviated to -l.

--ua_class <ua_class>

This option can be used to override the default user-agent class WWW::Mechanize. The new user-agent class could be a WWW::Mechanize descendant that caters for your special needs:

    package BA_Mech;
    # This package sets credentials for basic authentication
    use base 'WWW::Mechanize';
    sub get_basic_credentials { ( 'abeltje', '********' ) }
    1;

and call checksite like

    checksite -p mysite --ua_class BA_Mech http://www.mysite.org

--verbose|-v (multiple)

Each --verbose option increases the verbosity. When $v==1 you will see the messages from WWW::CheckSite and when $v==2 you will also see the messages from WWW::CheckSite::Validator and WWW::CheckSite::Spider.
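A repeatable option like this can be counted with the incremental "+" specifier of Getopt::Long. The sketch below only illustrates the technique; that checksite parses its options this way is an assumption, and the verbosity() helper is hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Count how often --verbose or -v occurs in an argument list.
sub verbosity {
    my (@args) = @_;
    my $v = 0;
    # The trailing "+" makes every occurrence increment $v by one.
    GetOptionsFromArray(\@args, 'verbose|v+' => \$v)
        or die "Bad command-line options\n";
    return $v;
}

print verbosity(qw(-v)), "\n";
print verbosity(qw(--verbose -v)), "\n";
```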

configuration file

The checksite program supports Config::Auto. This means you can specify any of the command-line arguments as options (without the prefixing dashes) in a file.

The files searched are (and in this order):

./checksiteconfig
./checksite.config
./checksiterc
./.checksiterc
<bindir>/checksiteconfig
<bindir>/checksite.config
<bindir>/checksiterc
<bindir>/.checksiterc
$HOME/checksiteconfig
$HOME/checksite.config
$HOME/checksiterc
$HOME/.checksiterc
/etc/checksiteconfig
/etc/checksite.config
/etc/checksiterc
/etc/.checksiterc
/usr/local/etc/checksiteconfig
/usr/local/etc/checksite.config
/usr/local/etc/checksiterc
/usr/local/etc/.checksiterc
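The search order amounts to five directories, each tried with four file names. The sketch below reproduces that list and picks the first existing file. The helper names are hypothetical, and the actual lookup is done by Config::Auto, not by this code.

```perl
use strict;
use warnings;

# Candidate file names, in search order: directories outer, names inner.
sub config_candidates {
    my ($bindir, $home) = @_;
    my @dirs  = ('.', $bindir, $home, '/etc', '/usr/local/etc');
    my @names = qw(checksiteconfig checksite.config checksiterc .checksiterc);
    my @files;
    for my $dir (@dirs) {
        push @files, map { "$dir/$_" } @names;
    }
    return @files;
}

# Return the first candidate that exists as a plain file, or undef.
sub first_config {
    my (@candidates) = @_;
    for my $file (@candidates) {
        return $file if -f $file;
    }
    return;
}

# '/usr/local/bin' is only an example value for <bindir>.
my @candidates = config_candidates('/usr/local/bin', $ENV{HOME} // '/root');
my $config     = first_config(@candidates);
print defined $config ? "Using $config\n" : "No configuration file found\n";
```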

WWW::CheckSite

The WWW::CheckSite module uses the WWW::CheckSite::Validator module to get information about a website and assess its kwalitee. The findings are presented in two html reports, one with all the information and one with just the "errors".

The reports are created with the use of templates. The module caters for two template systems: Template (TT2) and HTML::Template. The template-toolkit templates are preferred if both modules are installed.

Your own report templates

The report templates have the base names: wcsfullrpt.EXT and wcssummrpt.EXT, where EXT eq 'tt' for template-toolkit and EXT eq 'tmpl' for HTML::Template.

First the current directory is searched, then the directory where checksite is installed and finally the directory where the WWW::CheckSite module is installed (and where the default templates are). If you put your own templates in one of the first two directories, they will override the default templates.
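The override behaviour follows from taking the first match in that order. A minimal sketch, with a hypothetical find_template() helper and temporary directories standing in for the three real locations:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return the template from the first directory that has it, or undef.
sub find_template {
    my ($basename, $ext, @dirs) = @_;
    for my $dir (@dirs) {
        my $file = "$dir/$basename.$ext";
        return $file if -f $file;
    }
    return;
}

my $cwd    = tempdir(CLEANUP => 1);   # stands in for the current directory
my $bindir = tempdir(CLEANUP => 1);   # stands in for where checksite is installed
my $libdir = tempdir(CLEANUP => 1);   # stands in for where the defaults live

# Put a template in the last two locations only.
for my $dir ($bindir, $libdir) {
    open my $fh, '>', "$dir/wcsfullrpt.tt" or die "Cannot write: $!";
    close $fh;
}

my $found = find_template('wcsfullrpt', 'tt', $cwd, $bindir, $libdir);
print "Using $found\n";
```

Here the copy next to checksite wins over the default, because the current directory has no template and the search stops at the first hit.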

Saving and loading validation data

Saving the validation data can help you develop your own templates.

AUTHOR

Abe Timmerman, <abeltje@cpan.org>

$Id: Manual.pod 675 2007-05-28 21:58:52Z abeltje $

COPYRIGHT & LICENSE

Copyright MMV-MMVII Abe Timmerman, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
