NAME

HTML::CGIChecker - A Perl module to detect dangerous HTML code

SYNOPSIS

        use HTML::CGIChecker;
        
        $feedback = '
        <TABLE CELLPADDING="2"><TR><TD>One column</TD></TR></TABLE><BR>
        " Arrays & variables "
        
        Dough > Hi, how are you ?
        
        And now some Perl code:
        <PRE>
                print "<HTML><BODY></BODY></HTML>";
        </PRE>
        ';

        # create the $checker object
        
        $checker = HTML::CGIChecker->new (
                mode => 'allow',
                allowclasses => [ qw( tables images ) ],
                allowtags => [ qw ( B I A U STRONG BR HR ) ],   
                jscript => 0,
                html => 0,
                pre => 1,
                debug => 0,
                err_tag => 'Tag {tag} is not allowed in {element}.'
        );

        # Now you can use it to check any string using its checkHTML()
        # method. It "remembers" its configuration, so you can reuse it.

        ($checked_feedback, $Warnings) = 
                $checker->checkHTML ($feedback);

        # Process the results ...

        if ($checked_feedback) {
                # save $checked_feedback to the database ....
        } 
        else {
                # print the warnings ...
                print join ("\n", @{$Warnings});
        }

The example above produces no warning messages and returns $feedback checked and properly HTML escaped. The only HTML "error" - the unescaped ">" bracket on the fourth line - is autocorrected. One warning message, err_tag, is set explicitly in the constructor to show how a message can be overridden by a customized version. Any warnings would be neither HTML formatted nor HTML escaped, because the 'html' parameter is not true.

MOTIVATION

Almost every modern website needs some way to get feedback from its users in the form of comments that are also visible to other visitors. It is convenient to allow the users a limited set of HTML tags in their posts so that they can make text bold, create hyperlinks or even include images.

The problem arises when a user posts HTML code that breaks the page on which the post is displayed. You must check the posts for dangerous HTML errors and javascripts to prevent a malicious user from rendering the rest of the page unusable. This module was created to fulfill this function and also to provide some extra features.

Typical HTML validators are poorly suited to this purpose, because they are far too strict and do not scale well. A small and fast checker that also lets a programmer deny and allow tags on an individual basis solves this problem. Another problem in building a web site that accepts HTML user posts is escaping those posts correctly before storing them in the database and displaying them to other users.

The currently available module HTML::QuickCheck, which is meant to serve the same purpose, does not offer some crucial features:

Checking of correct quoting - this problem can be fatal, because the common typo of forgetting to close the quotes in, for example, an HREF parameter almost always corrupts the rest of the page.

HTML escaping of the right parts of the posts - i.e. of the non-HTML parts.

Denying/allowing of javascripts.

Denying/allowing of images, applets, styles, forms and other similar functionality that requires a programmer to be able to deny/allow tags or entire tags classes on an individual basis.

Support for the special formatting PRE tag. Please note that the PRE tag has a special meaning for this module: everything placed inside a PRE block is automatically HTML escaped. Users can rely on this behaviour to post, for example, code snippets that contain unescaped HTML brackets. All they need to do is place the snippet inside a PRE block; they do not need to worry about escaping the brackets.

Ability to customize/localize the warning messages that are returned to the user when problematic HTML is detected in their post.

Autocorrection of some "common" errors, for example of chat messages containing unescaped HTML brackets - "peter > how are you ?". Both unmatched opening and closing HTML brackets are autocorrected.

Proper detection of some problems with table closing tags that can break the page in some browsers.

Conversion of images to appropriate hyperlinks.

Automatic prepending of "http://" to URLs which do not start with "http://".

DESCRIPTION

HTML::CGIChecker is a module for web developers that parses HTML and detects code that could break a page in some way. This module is not an HTML validator, but it allows one to check the HTML code that users post to a web application, for example to a discussion board, to prevent them from posting a piece of code that would render the rest of the page it is displayed on unusable.

It can also be used to deny javascripts, images, tables or any other tags on an individual basis, and to check for correct quoting and correct URLs.

The module can autocorrect some common user mistakes, for example the use of unescaped HTML brackets in a chat room.
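
The idea behind this autocorrection can be illustrated with a small sketch. It is not the module's own code: escape_stray_brackets() is a hypothetical helper that treats every complete <...> sequence as a tag and escapes any angle bracket left over in the text between tags.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::CGIChecker: escape angle brackets
# that do not belong to a complete <...> sequence, leaving such sequences as-is.
sub escape_stray_brackets {
    my ($text) = @_;

    # The capture group makes split() keep the tags as separate list items.
    my @parts = split /(<[^<>]*>)/, $text;

    for my $part (@parts) {
        next if $part =~ /\A<[^<>]*>\z/;    # a complete tag: leave it alone
        $part =~ s/</&lt;/g;                # unmatched opening bracket
        $part =~ s/>/&gt;/g;                # unmatched closing bracket
    }
    return join '', @parts;
}

print escape_stray_brackets('peter > how are you ? <B>fine</B>'), "\n";
```

The sketch only looks at brackets; it makes no attempt to decide whether a tag is allowed.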

It is easy to use and very useful in any CGI application in which you want its users to be able to use HTML in their posts to some customizable extent. It is object oriented and designed to be easily extensible.

This is not a validator; for validation you need another solution. This module does not care about the correctness of the parsed HTML code at all. All it cares about is whether the HTML code could break a page. HTML tags that are not paired correctly or that cannot be rendered at all can pass this checker. Element and attribute names are not case sensitive.

The checker object is created by calling the new() constructor of the HTML::CGIChecker class.

        $checker = HTML::CGIChecker->new (
                mode => 'allow',
                ....
        );

Then you can use the checkHTML() instance method to perform a check on a string using the settings this object has been configured with.

        ($checked_string, $Warnings) = 
                $checker->checkHTML ($string);

new() - the constructor

Creates and returns a new checker object that can be configured with the parameters described below. The default configuration allows only a few harmless inline tags to be used in the HTML code:

    B I A U STRONG BR
    EM CITE VAR ABBR Q DFN CODE SUB SUP SAMP KBD ACRONYM

Other tags, except the special PRE tag, are not allowed. Javascripts are also not allowed by default.

The parameters are passed in as a list of parameter => value pairs. The parameters and their default values are:

        mode => 'allow'
        allowclasses => []
        allowtags => [ qw ( 
        B I A U STRONG BR EM CITE VAR ABBR Q DFN CODE
        SUB SUP SAMP KBD ACRONYM
        ) ]     
        denyclasses => [ keys (%tagclasses) ]
        denytags => [ qw ( FONT ) ]
        jscript => 0
        html => 0
        pre => 1
        img_to_link => 0
        check_http => 1
        debug => 0
        nonpairtags => [ qw (
            IMG HR BR INPUT META AREA COL BASE LINK PARAM
        ) ]
        check_attribs => {}
        err_tag => 'Tag {tag} is not allowed in {element}.'
        err_javascript => 'Javascript is not allowed in {element}.'
        err_quote => 'Missing quote in {element}.'
        err_notclosed => 'Pair tag {tag} was not closed.'
        err_notopened => 'Pair tag {tag} was not opened.'

mode

Two modes are available: allow (default) and deny.

allow: An error is raised if any tag that is not explicitly allowed is found.

deny: An error is raised if an explicitly denied tag is found; any other tags are allowed.

allowclasses, allowtags

These parameters apply only in the 'allow' mode. They specify the tags the user is allowed to use. allowtags must be a reference to an array of tag names; allowclasses must be a reference to an array of class names. A tag class (tag group) is a set of tags that can be allowed or denied all at once by allowing or denying the class. These classes are available:

        base        FRAMESET FRAME HTML BODY HEAD TITLE BASE
                    STYLE SCRIPT META NOSCRIPT NOFRAMES
        externals   APPLET OBJECT LINK IFRAME PARAM
        forms       FORM TEXTAREA SELECT INPUT BUTTON LABEL
                    FIELDSET LEGEND OPTGROUP
        tables      TABLE TR TD TBODY THEAD TFOOT TH COLGROUP
                    COL CAPTION
        lists       UL OL LI DL DT DD
        images      IMG MAP AREA
        heading     H1 H2 H3 H4 H5 H6 H7 H8

By default only the above mentioned harmless inline tags are allowed. By default no classes are allowed.

denyclasses, denytags

These parameters apply only in the 'deny' mode. They work similarly to the allowclasses and allowtags parameters. By default all of the classes listed above plus the FONT tag are denied. All other tags are allowed by default in this mode.
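
The interplay of mode, classes and individual tags can be sketched as follows. This is a minimal illustration, not the module's own code: the %tagclasses hash here holds only three of the classes listed above, and build_tag_set() and tag_permitted() are hypothetical helper names.

```perl
use strict;
use warnings;

# Illustrative subset of the class table above (assumed data structure).
my %tagclasses = (
    tables => [qw(TABLE TR TD TBODY THEAD TFOOT TH COLGROUP COL CAPTION)],
    images => [qw(IMG MAP AREA)],
    lists  => [qw(UL OL LI DL DT DD)],
);

# Hypothetical helper: flatten class names and tag names into one lookup hash.
sub build_tag_set {
    my ($classes, $tags) = @_;
    my %set;
    for my $class (@$classes) {
        $set{ uc $_ } = 1 for @{ $tagclasses{$class} || [] };
    }
    $set{ uc $_ } = 1 for @$tags;
    return \%set;
}

# Hypothetical helper: in 'allow' mode a tag passes only if it is listed,
# in 'deny' mode it passes only if it is not listed.
sub tag_permitted {
    my ($mode, $set, $tag) = @_;
    my $listed = exists $set->{ uc $tag };
    return $mode eq 'allow' ? $listed : !$listed;
}

my $allowed = build_tag_set([qw(tables images)], [qw(B I A U STRONG BR HR)]);
my $denied  = build_tag_set([qw(tables)],        [qw(FONT)]);

printf "%-6s allow=%d deny=%d\n", $_,
    tag_permitted('allow', $allowed, $_) ? 1 : 0,
    tag_permitted('deny',  $denied,  $_) ? 1 : 0
    for qw(td script font b);
```

Tag names are upper-cased before the lookup, which mirrors the statement that element names are not case sensitive.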

jscript

This option controls whether javascript is permitted inside HTML elements. To block javascript completely, you must also make sure that the SCRIPT tag is not allowed.

        0: javascript is not allowed
        1: javascript is allowed
        Default: 0

html

        0: messages will not be in HTML format nor HTML escaped -
           useful for the command line mode
        1: all warning messages will be in HTML versions and also
           HTML escaped
        Default: 0

pre

        0: users will not be allowed to use the special PRE tag
        1: users will be allowed to use the special PRE tag
        Default: 1
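
The effect of the special PRE tag can be pictured with the following sketch. It is an assumption-laden illustration rather than the module's implementation: escape_html() and escape_pre_blocks() are hypothetical helpers, only plain <PRE> tags without attributes are recognized, and the exact set of characters the module escapes is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must run first so new entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: escape only the body of each <PRE>...</PRE> block.
sub escape_pre_blocks {
    my ($html) = @_;
    $html =~ s{(<PRE>)(.*?)(</PRE>)}{
        my ($open, $body, $close) = ($1, $2, $3);   # copy before calling a sub
        $open . escape_html($body) . $close;
    }gise;
    return $html;
}

print escape_pre_blocks(qq{<PRE>print "<HTML><BODY></BODY></HTML>";</PRE>\n});
```

Text outside PRE blocks is returned unchanged by this sketch, so a user only has to wrap a code snippet in PRE to have its brackets escaped.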

img_to_link

        0: do not alter images
        1: convert all images to appropriate links to these
           images: <IMG SRC="url">  ---->  <A HREF="url">url</A>
        Default: 0
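
The conversion shown above can be approximated with a single substitution. The sketch below is only an illustration: img_to_link() is a hypothetical helper, it assumes a double-quoted SRC attribute, and it drops every other attribute of the IMG element.

```perl
use strict;
use warnings;

# Hypothetical helper: replace <IMG SRC="url"> with <A HREF="url">url</A>.
# Assumes the SRC value is double-quoted; all other attributes are dropped.
sub img_to_link {
    my ($html) = @_;
    $html =~ s{<IMG\b[^>]*?\bSRC\s*=\s*"([^"]*)"[^>]*>}{<A HREF="$1">$1</A>}gi;
    return $html;
}

print img_to_link('<IMG SRC="http://example.com/a.png" ALT="a">'), "\n";
```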

check_http

        0: do not alter URLs
        1: prepend "http://" to URLs that do not start
           with "http://", "ftp://" or "mailto:"
        Default: 1
        
        Note: URLs are recognized only in HREF and SRC parameters.
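
The rule can be sketched like this. The helpers fix_url() and fix_urls_in_html() are hypothetical and assume double-quoted attribute values; the sketch follows the scheme list above literally, so a URL with any other scheme would also get the prefix.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend "http://" unless the URL already starts
# with one of the three prefixes named in the option description.
sub fix_url {
    my ($url) = @_;
    return $url if $url =~ m{\A(?:http://|ftp://|mailto:)}i;
    return "http://$url";
}

# Hypothetical helper: apply fix_url() to double-quoted HREF and SRC values only.
sub fix_urls_in_html {
    my ($html) = @_;
    $html =~ s{\b(HREF|SRC)(\s*=\s*")([^"]*)"}{
        my ($name, $eq, $url) = ($1, $2, $3);   # copy before calling a sub
        $name . $eq . fix_url($url) . '"';
    }gie;
    return $html;
}

print fix_urls_in_html('<A HREF="www.example.com">home</A>'), "\n";
```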

debug

        0: debugging to STDERR is disabled
        1: debugging to STDERR is enabled
        Default: 0      

nonpairtags

The tags that are processed as non-pair can be specified here via a reference to an anonymous array. By default these tags are processed as non-pair:

    IMG HR BR INPUT META AREA COL BASE LINK PARAM
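
The distinction matters for the err_notclosed and err_notopened warnings, which apply to pair tags only. The following sketch shows one simple way such a check could work. It is an assumption, not the module's algorithm: unbalanced_tags() is a hypothetical helper that merely compares the number of opening and closing tags and ignores their order.

```perl
use strict;
use warnings;

# Hypothetical helper: report pair tags whose opening and closing counts
# differ. Tags named in @nonpair are skipped entirely.
sub unbalanced_tags {
    my ($html, @nonpair) = @_;
    my %nonpair = map { (uc($_) => 1) } @nonpair;

    my %balance;
    while ($html =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}g) {
        my ($closing, $tag) = ($1, uc $2);
        next if $nonpair{$tag};               # non-pair tags never need closing
        $balance{$tag} += $closing ? -1 : 1;  # +1 for <TAG>, -1 for </TAG>
    }

    return map  { $balance{$_} > 0 ? "Pair tag $_ was not closed."
                                   : "Pair tag $_ was not opened." }
           grep { $balance{$_} != 0 } sort keys %balance;
}

print "$_\n" for unbalanced_tags('<B>bold<BR><I>x</I>', qw(IMG HR BR));
```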

check_attribs

The check_attribs parameter restricts the attributes a user may use in an element. It is a hash reference of key => value pairs in which the key is the name of an element and the value is a reference to an array of attribute names. For each element specified in this hash, the user will only be allowed to use the listed attributes.

For example, if you define the following hash reference:

        check_attribs => {
                        img => [ 'src', 'width', 'height', 'alt' ]
                }

then the user will be allowed to use ONLY the specified attributes in the <IMG> element. Any other elements are not affected and the user will be allowed to use any attributes in them. Names of the elements and of the attributes are not case sensitive.
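
A check of this kind can be sketched as follows. This is only an illustration under stated assumptions: disallowed_attribs() is a hypothetical helper, it receives one complete element such as '<IMG SRC="a.png">', and it only sees attributes written in name=value form.

```perl
use strict;
use warnings;

# Hypothetical helper: return the attribute names of one element that are
# not permitted by a check_attribs-style hash. Elements that are missing
# from the hash are unrestricted. All names are compared in lower case.
sub disallowed_attribs {
    my ($check_attribs, $element) = @_;
    return () unless $element =~ /\A<\s*([A-Za-z][A-Za-z0-9]*)(.*)>\z/s;
    my ($name, $rest) = (lc($1), $2);

    # Normalise the configured element names to lower case.
    my %rules = map { (lc($_) => $check_attribs->{$_}) } keys %$check_attribs;
    return () unless exists $rules{$name};
    my %ok = map { (lc($_) => 1) } @{ $rules{$name} };

    # Consume each name=value pair so "=" inside a value is not misread.
    my @found = map { lc $_ }
        $rest =~ /([A-Za-z-]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/g;
    return grep { !$ok{$_} } @found;
}

my %check_attribs = ( img => [ 'src', 'width', 'height', 'alt' ] );
print join(', ', disallowed_attribs(\%check_attribs,
    '<IMG SRC="a.png" ONCLICK="x()">')), "\n";
```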

Warning messages can be redefined by setting these parameters:

        err_tag          = 'Tag {tag} is not allowed in {element}.'
        err_javascript   = 'Javascript is not allowed in {element}.'
        err_quote        = 'Missing quote in {element}.'
        err_notclosed    = 'Pair tag {tag} was not closed.'
        err_notopened    = 'Pair tag {tag} was not opened.'

Messages displayed above are the defaults. Special tokens {tag} and {element} are replaced by the appropriate values. You can redefine these messages to localize them.
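
The token replacement can be pictured with a few lines of Perl. This is a sketch, not the module's code: format_warning() is a hypothetical helper that fills {tag} and {element} into a message template and leaves a token in place when no value is supplied for it.

```perl
use strict;
use warnings;

# Hypothetical helper: fill the {tag} and {element} tokens of a template.
sub format_warning {
    my ($template, %values) = @_;
    (my $message = $template)
        =~ s/\{(tag|element)\}/exists $values{$1} ? $values{$1} : '{' . $1 . '}'/ge;
    return $message;
}

my $err_tag = 'Tag {tag} is not allowed in {element}.';
print format_warning($err_tag, tag => 'SCRIPT', element => '<SCRIPT SRC="x.js">'), "\n";
```

A localized message only needs the same two tokens somewhere in its text; their order within the template does not matter.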

checkHTML() - the actual HTML check method

        ($checked_string, $Warnings) = 
                $checker->checkHTML ($string);

This method accepts only one parameter - the actual string to check.

If the string contains anything dangerous or not allowed, this method returns an undefined value and a reference to an array of warning messages that describe the problems that were detected.

If the string is safe, the checked and escaped version of the string is returned together with a reference to an empty array.

Please note that the warning messages are returned not as an array but as a reference to an array, which must be dereferenced before you use it as an array. The usual way to print all the warnings is with the join() function:

        print join ("<BR>\n", @{$Warnings});
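
Handling of the two return values can be sketched without running a real check. In the example below report_result() is a hypothetical helper, and the values passed to it are hand-written stand-ins shaped like the documented return values of checkHTML().

```perl
use strict;
use warnings;

# Hypothetical helper: turn the two values returned by checkHTML() into
# text for the user. $warnings is an array reference, so it is
# dereferenced with @{ ... } before it is used as a list.
sub report_result {
    my ($checked, $warnings) = @_;
    return "OK: $checked" if defined $checked;
    return join "\n", map { "Warning: $_" } @{$warnings};
}

# Stand-in values only; no real check is performed here.
print report_result('peter &gt; how are you ?', []), "\n";
print report_result(undef, ['Pair tag B was not closed.']), "\n";
```

Testing the first value with defined() rather than for truth keeps a checked string such as "0" from being mistaken for a failed check.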

SUPPORT

No official support is provided, but I welcome any comments, patches and suggestions by email. If you suggest a new feature, please explain how it will serve the purpose of this module - to provide fast checking for HTML code that breaks pages.

BUGS

I am aware of no bugs. But remember, this is NOT a validator - bad HTML may and will pass it. Please let me know if you find any chunk of code that passes it and also breaks a page.

AVAILABILITY

        http://www.geocities.com/tripiecz/

AUTHOR

Tomas Styblo, tripiecz@yahoo.com

Prague, the Czech Republic

LICENSE

HTML::CGIChecker - A Perl module to detect dangerous HTML code

Copyright (C) 2000 Tomas Styblo (tripiecz@yahoo.com)

This module is free software; you can redistribute it and/or modify it under the terms of either:

a) the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version, or

b) the "Artistic License" which comes with this module.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See either the GNU General Public License or the Artistic License for more details.

You should have received a copy of the Artistic License with this module, in the file Artistic. If not, I'll be glad to provide one.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

SEE ALSO

perl(1).