NAME

HTML::Valid::Tagset - data tables useful in parsing HTML

VERSION

This document describes HTML::Valid::Tagset version 0.00_04. This corresponds to HTML Tidy version 5.0.0.

SYNOPSIS

  use HTML::Valid::Tagset;
  # Then use any of the items in the HTML::Valid::Tagset package
  #  as need arises

DESCRIPTION

This module contains several data tables useful in various kinds of HTML parsing operations.

All tag names used are lowercase.

This module and HTML::Tagset

This is a drop-in replacement for HTML::Tagset. However, HTML::Valid::Tagset is not a fork of HTML::Tagset. It uses the tables of HTML elements from a C program called "HTML Tidy" (this is not the Perl module HTML::Tidy).

As far as it makes sense to do so, this module tries to be compatible with HTML::Tagset. Some errors in HTML::Tagset have been corrected in this module. See "Issues with HTML::Tagset".

Validation

If you need to validate tags, you should use, for example, "%isHTML5" for HTML 5 tags, or "%isKnown" if you want to check whether a tag is a known one.
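As a minimal sketch of the check described above, the following classifies a few tag names using both hashsets. The classify helper is hypothetical and not part of the module, and "nosuchtag" is a made-up name used only for illustration.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/%isHTML5 %isKnown/;

# Hypothetical helper: describe a tag name using the two hashsets.
sub classify
{
    my ($tag) = @_;
    # Tag names in the tables are lowercase, so normalize first.
    $tag = lc $tag;
    return 'valid HTML 5' if $isHTML5{$tag};
    return 'known, but not valid HTML 5' if $isKnown{$tag};
    return 'unknown';
}

for my $tag (qw/canvas PLAINTEXT nosuchtag/) {
    print "<$tag>: ", classify ($tag), "\n";
}
```

Checking "%isHTML5" first and falling back to "%isKnown" separates tags which are merely recognized from tags which are valid in the current standard.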

Terminology

In the following documentation, a "hashset" is a hash being used as a set: only the presence of a key is significant, and the actual values associated with the keys are not. (The values which are present are always true.)
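To illustrate the term, here is a small sketch of a hashset in plain Perl. The name %isVowel is invented for this example and has nothing to do with the module.

```perl
use strict;
use warnings;

# Build a hashset: the keys are the members, and every value is true.
my %isVowel = map { $_ => 1 } qw/a e i o u/;

# Membership is tested by looking up the key.
for my $letter (qw/a b/) {
    my $result = $isVowel{$letter} ? 'a member' : 'not a member';
    print "$letter is $result.\n";
}
```

The tables in this module can be tested in the same way, by looking up a lowercase tag name as the key.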

VARIABLES

None of these variables are exported by default. See "EXPORTS". Each entry below notes whether the variable is compatible with HTML::Tagset. In all cases, this refers to HTML::Tagset version 3.20.

@allTags

This contains all the tags as an array. It is exactly the same thing as the keys of "%isKnown".
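The relationship between the array and the hashset could be checked with a sketch like the following. The variable names @missing and $n_keys are chosen for this example only.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/@allTags %isKnown/;

# Tags in @allTags which are not keys of %isKnown (expected to be none).
my @missing = grep { !$isKnown{$_} } @allTags;
# The two counts should agree if both hold the same tags.
my $n_keys = scalar (keys %isKnown);
print scalar (@allTags), " tags in the array, $n_keys keys in the hashset, ",
    scalar (@missing), " missing.\n";
```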

%canTighten

This is a copy of the code from HTML::Tagset.

%emptyElement

This hashset has as keys the tag names of elements that cannot have content, for example "base", "br", or "hr".

    use HTML::Valid::Tagset '%emptyElement';
    for my $tag (qw/hr dl br snakeeyes/) {
        if ($emptyElement{$tag}) {
            print "<$tag> is empty.\n";
        }
        else {
            print "<$tag> is not empty.\n";
        }
    }
    

outputs

    <hr> is empty.
    <dl> is not empty.
    <br> is empty.
    <snakeeyes> is not empty.

This is compatible with HTML::Tagset.

%isBlock

This hashset contains all block elements.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isBodyElement

This hashset contains all elements that are to be found only in/under the "body" element of an HTML document.

This is compatible with the undocumented %HTML::Tagset::isBodyElement in HTML::Tagset and the documentation for %HTML::Tagset::isBodyMarkup. See also "Issues with HTML::Tagset". %isBodyMarkup is not implemented in HTML::Tagset, so it's not provided for compatibility here.

%isCDATA_Parent

This hashset includes all elements whose content is CDATA.

%isFormElement

This hashset contains all elements that are to be found only in/under a "form" element.

This is compatible with HTML::Tagset.

%isHeadElement

This hashset contains elements that can be present in the 'head' section of an HTML document.

This is compatible with the contents of %HTML::Tagset::isHeadElement, but not its documentation. See also "Issues with HTML::Tagset".

%isHeadOrBodyElement

This hashset includes all elements that can fall either in the head or in the body.

This is compatible with HTML::Tagset.

%isHTML2

This is true for elements which are in HTML 2.0, as described in RFC 1866. See http://www.ietf.org/rfc/rfc1866.txt.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML3

This is true for elements which are in HTML 3.2, as described in the HTML 3.2 Reference Specification. See http://www.w3.org/TR/REC-html32.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML4

This is true for elements which are in HTML 4.01, as described in the HTML 4.01 Specification. See http://www.w3.org/TR/html401/.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML5

    use HTML::Valid::Tagset '%isHTML5';
    if ($isHTML5{canvas}) {
        print "<canvas> is OK.\n";
    }
    if ($isHTML5{a}) {
        print "<a> is OK.\n";
    }
    if ($isHTML5{plaintext}) {
        print "OH NO!\n";
    }
    else {
        print "<plaintext> went out with scrambled eggs.\n";
    }

outputs

    <canvas> is OK.
    <a> is OK.
    <plaintext> went out with scrambled eggs.

This is true for elements which are valid in HTML5. See http://www.w3.org/TR/html5/, https://whatwg.org/, and, for an easier introduction, http://diveintohtml5.info/. It is not true for obsolete elements like the <plaintext> tag (see "%isObsolete"), or for proprietary elements such as the <blink> tag, which have never been part of any HTML standard (see "%isProprietary"). Further, some elements which are marked as neither obsolete nor proprietary are also absent from HTML5. For example, the <isindex> tag is not present in HTML5.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.
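The four per-standard hashsets can be combined to see which standards include a given tag. The following is a sketch only: the standards_for helper and the labels are made up for this example.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/%isHTML2 %isHTML3 %isHTML4 %isHTML5/;

# Pair each label with a reference to the corresponding hashset.
my @standards = (
    ['HTML 2.0' => \%isHTML2],
    ['HTML 3.2' => \%isHTML3],
    ['HTML 4.01' => \%isHTML4],
    ['HTML 5' => \%isHTML5],
);

# Hypothetical helper: list the labels of the standards containing $tag.
sub standards_for
{
    my ($tag) = @_;
    return map { $_->[0] } grep { $_->[1]{$tag} } @standards;
}

for my $tag (qw/a canvas isindex/) {
    my @in = standards_for ($tag);
    print "<$tag>: ", (@in ? join (', ', @in) : 'none of these'), "\n";
}
```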

%isKnown

This hashset lists all known HTML elements. See also "@allTags".

This is compatible with HTML::Tagset.

%isList

This hashset contains all elements that can contain "li" elements.

This is compatible with HTML::Tagset.

%isInline

This hashset contains all inline elements. It is identical to %isPhraseMarkup.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isObsolete

    use HTML::Valid::Tagset '%isObsolete';
    my $canvas = $isObsolete{canvas};
    # Undefined
    my $plaintext = $isObsolete{plaintext};
    # True

This is true for HTML elements which were once part of HTML standards, like plaintext, but have now been declared obsolete. Note that %isObsolete is not true for elements like the <blink> tag which were never part of any HTML standard. See "%isProprietary" for these tags.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isPhraseMarkup

This hashset contains all inline elements. It is identical to %isInline.

This is compatible with HTML::Tagset.

%isProprietary

This is true for elements which are not part of any HTML standard, but were introduced by individual browser vendors.

    use HTML::Valid::Tagset '%isProprietary';
    my @tags = qw/a blink plaintext marquee/;
    for my $tag (@tags) {
        if ($isProprietary{$tag}) {
            print "<$tag> is proprietary.\n";
        }
        else {
            print "<$tag> is not a proprietary tag.\n";
        }
    }
    

outputs

    <a> is not a proprietary tag.
    <blink> is proprietary.
    <plaintext> is not a proprietary tag.
    <marquee> is proprietary.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isTableElement

This hashset contains all elements that are to be found only in/under a "table" element.

This is compatible with HTML::Tagset.

%optionalEndTag

Elements in this hashset are not empty (see "%emptyElement"), but their end tags can generally be omitted safely.

    use HTML::Valid::Tagset qw/%optionalEndTag %emptyElement/;
    for my $tag (qw/li p a br/) {
        if ($optionalEndTag{$tag}) {
            print "OK to omit </$tag>.\n";
        }
        elsif ($emptyElement{$tag}) {
            print "<$tag> does not ever take '</$tag>'\n";
        }
        else {
            print "Cannot omit </$tag> after <$tag>.\n";
        }
    }
    

outputs

    OK to omit </li>.
    OK to omit </p>.
    Cannot omit </a> after <a>.
    <br> does not ever take '</br>'

This is compatible with HTML::Tagset.

FUNCTIONS

all_attributes

    my $attr = all_attributes ();

This returns an array reference containing all known attributes. The attributes are not sorted.

attributes

    my $attr = attributes ('a');

This returns an array reference containing all valid attributes for the specified tag (as decided by the WWW Consortium). The attributes are not sorted. By default this returns the valid attributes for HTML 5.

It is also possible to choose a value for standard which specifies which standard one wants:

    my $attr = attributes ('a', standard => 'html5');

Possible values for standard are

html5

This is the default.

html4

This returns valid attributes for HTML 4.01.

html3

This returns valid attributes for HTML 3.2.

html2

This returns valid attributes for HTML 2.0.
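As a sketch of how the standard option might be used, the following compares the attributes of the a element under two standards. The function is called by its full package name here because this document does not say whether the functions can be imported by name. The variable names are chosen for this example.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset;

# Assumption: the function can be reached through its full package name.
my $html5 = HTML::Valid::Tagset::attributes ('a');
my $html4 = HTML::Valid::Tagset::attributes ('a', standard => 'html4');

# Use a hashset to find the attributes present only in the HTML 5 list.
my %in_html4 = map { $_ => 1 } @$html4;
# The returned lists are not sorted, so sort before printing.
my @only_html5 = sort { $a cmp $b } grep { !$in_html4{$_} } @$html5;
print "Attributes of <a> in HTML 5 but not in HTML 4.01: @only_html5\n";
```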

COMPATIBILITY-ONLY VARIABLES

These variables are present in this module for compatibility with existing programs which use HTML::Tagset. However, they are fundamentally flawed and should not be used for new projects.

%is_Possible_Strict_P_Content

In HTML::Valid::Tagset, this is identical to "%isInline".

This is a mistake in HTML::Tagset which is preserved in name only for backwards compatibility. See also "Issues with HTML::Tagset".

@p_closure_barriers

In HTML::Valid::Tagset, this resolves to an empty list.

This is a mistake in HTML::Tagset which is preserved in name only for backwards compatibility. See also "Issues with HTML::Tagset".

UNIMPLEMENTED

The following parts of HTML::Tagset are not implemented in version 0.00_04 of HTML::Valid::Tagset.

%boolean_attr

This is not implemented in HTML::Valid::Tagset.

%linkElements

This is not implemented in HTML::Valid::Tagset.

SEE ALSO

HTML Tidy

HTML Tidy is a program originally by Dave Raggett of the WWW Consortium. See http://www.tidy-html.org/ and the repository at https://github.com/htacg/tidy-html5 for the version of HTML Tidy which HTML::Valid and this module are based on.

CPAN modules

HTML::Tagset, HTML::Element, HTML::TreeBuilder, HTML::LinkExtor

EXPORTS

The hashes and arrays are exported on demand. Everything can be exported with :all:

    use HTML::Valid::Tagset ':all';
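Importing only what is needed might look like the following sketch. The last line assumes that, as is usual for exported package variables, the tables can also be reached by their full names without being imported.

```perl
use strict;
use warnings;
# Import only the named hashsets.
use HTML::Valid::Tagset qw/%isKnown %emptyElement/;

print "<br> is known.\n" if $isKnown{br};
print "<br> is empty.\n" if $emptyElement{br};

# Assumption: the tables are package variables, so a full name also works.
print "<hr> is empty.\n" if $HTML::Valid::Tagset::emptyElement{hr};
```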

BUGS

Issues with HTML::Tagset

There are several problems with HTML::Tagset version 3.20 which mean that it's difficult to be fully compatible with it.

@p_closure_barriers should be an empty set

There is a long-winded argument in the documentation of HTML::Tagset, which has been there since version 3.01, released on Aug 21 2000, about why it's possible for a p element to contain another p element. However, the specification for HTML 4.01 from 1999, which HTML::Tagset seems to be based on, states:

    The P element represents a paragraph. It cannot contain block-level elements (including P itself).

Thus, it is simply not possible for any block element to legally be part of a paragraph, and the mechanism that HTML::Tagset suggests for how a paragraph element can contain a table which can contain a paragraph element, like this:

     <p>
     <table>

is not and was not legal HTML, since <table> itself is a block-level element. (See "%isBlock" for testing for block-level elements.)

So in this module, "@p_closure_barriers" is an empty set.
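The block-level test mentioned above could be sketched as follows. The list of tag names and the variable @blocks are chosen for this example only.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset '%isBlock';

# Pick out the block-level elements from a few candidate tag names.
my @blocks = grep { $isBlock{$_} } qw/p table em/;
print "Block-level, so not allowed inside <p>: @blocks\n";
```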

%isCDATA_Parent should be true for all XML elements

In HTML::Tagset, this contains true values for the elements script, style, xmp, listing, and plaintext. However, as far as I understand, this is just a mistake: CDATA is only relevant to XML, and any XML element whatsoever can contain a CDATA section. For compatibility, in this module "%isCDATA_Parent" contains all the tags.

%is_Possible_Strict_P_Content doesn't really make sense

The comments for HTML::Tagset version 3.20 read

    # I've no idea why there's these latter exceptions.
    # I'm just following the HTML4.01 DTD.

and following this it lists the form tag in this hash. However, the form tag is a block level element, so the purpose of this hash seems to be misguided. Since, as noted above, a p tag can contain any inline element, in this module, for compatibility, "%is_Possible_Strict_P_Content" is just the same thing as "%isInline".

%isBodyMarkup doesn't exist

The documented %isBodyMarkup doesn't exist; in its place is %isBodyElement.

This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109024.

The documentation of %isHeadElement is misleading

The documentation of %isHeadElement claims

    This hashset contains all elements that elements that should be present only in the 'head' element of an HTML document.

In fact, it contains elements that can be present either only in the head, like <title>, or in both the head and the body, like <script>. In this module, "%isHeadElement" copies the contents of the HTML::Tagset hash rather than its documentation.

The issue in HTML::Tagset is reported as https://rt.cpan.org/Ticket/Display.html?id=109044.

Some elements of %isHeadElement are not head elements

This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109018.

COPYRIGHT & LICENSE

Portions of the documentation and tests of this module are taken from HTML::Tagset, which bears the following copyright notice.

Copyright 1995-2000 Gisle Aas.

Copyright 2000-2005 Sean M. Burke.

Copyright 2005-2008 Andy Lester.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Please note that the functional part of this code is not a fork of HTML::Tagset, it is based on HTML Tidy (the C program, not the Perl module HTML::Tidy).

HTML::Valid is based on HTML Tidy, which is under the following copyright:

    Copyright (c) 1998-2008 World Wide Web Consortium
    (Massachusetts Institute of Technology, European Research
    Consortium for Informatics and Mathematics, Keio University).
    All Rights Reserved.

    COPYRIGHT NOTICE:

    This software and documentation is provided "as is," and
    the copyright holders and contributing author(s) make no
    representations or warranties, express or implied, including
    but not limited to, warranties of merchantability or fitness
    for any particular purpose or that the use of the software or
    documentation will not infringe any third party patents,
    copyrights, trademarks or other rights.

    The copyright holders and contributing author(s) will not be held
    liable for any direct, indirect, special or consequential damages
    arising out of any use of the software or documentation, even if
    advised of the possibility of such damage.

    Permission is hereby granted to use, copy, modify, and distribute
    this source code, or portions hereof, documentation and executables,
    for any purpose, without fee, subject to the following restrictions:

    1. The origin of this source code must not be misrepresented.
    2. Altered versions must be plainly marked as such and must
    not be misrepresented as being the original source.
    3. This Copyright notice may not be removed or altered from any
    source or altered source distribution.

The Perl parts of this distribution are copyright (C) 2015 Ben Bullock and may be used under either the above licence terms, or the usual Perl conditions, either the GNU General Public Licence or the Perl Artistic Licence.