NAME

XML::Handler::HTMLWriter - SAX Handler for writing HTML 4.0

SYNOPSIS

  use XML::Handler::HTMLWriter;
  use XML::SAX;
  
  my $writer = XML::Handler::HTMLWriter->new(...);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $writer);
  ...

DESCRIPTION

This module writes HTML following the rules for the html output method in the XSLT specification (http://www.w3.org/TR/xslt). It is a subclass of XML::SAX::Writer and is used in the same way as that module.

Usage

First create a new HTMLWriter object:

  my $writer = XML::Handler::HTMLWriter->new(...);

The ... indicates parameters to be passed in. These are all passed in using the hash syntax: Key => Value.

All parameters are from XML::SAX::Writer, so please see its documentation for more details.

Now pass $writer to a SAX chain, e.g. a SAX parser:

  my $parser = XML::SAX::ParserFactory->parser(Handler => $writer);

Or a SAX filter:

  my $tolower = XML::Filter::ToLower->new(Handler => $writer);

Or use in a SAX Machine:

  use XML::SAX::Machines qw(Pipeline);
  
  Pipeline(
     XML::Filter::XSLT->new(Source => { SystemId => 'foo.xsl' })
        =>
     XML::Handler::HTMLWriter->new
  )->parse_uri('foo.xml');

Initiate processing

XML::Handler::HTMLWriter never initiates processing itself, since it is just a receptacle for SAX events. You have to start processing on one of the modules higher up the chain. For example, in the XML::SAX parser case:

  $parser->parse(Source => { SystemId => "foo.xhtml" });

Get the results

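Results are delivered through the consumer interface defined in XML::SAX::Writer: the Output parameter passed to new() determines where the generated HTML goes. See the XML::SAX::Writer documentation for the kinds of consumer it accepts.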
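As a minimal sketch, assuming XML::SAX and XML::Handler::HTMLWriter are installed and that the Output parameter accepts a scalar reference as described in the XML::SAX::Writer documentation, the output can be collected in a string. The variable names and the sample document are made up for illustration.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;
use XML::Handler::HTMLWriter;

# Assumption: Output => \$scalar is the scalar-reference consumer
# documented by XML::SAX::Writer, which HTMLWriter subclasses.
my $html   = '';
my $writer = XML::Handler::HTMLWriter->new(Output => \$html);
my $parser = XML::SAX::ParserFactory->parser(Handler => $writer);

# Processing is started on the parser, not on the writer.
$parser->parse_string('<html><body><p>Hello<br/>world</p></body></html>');

print $html, "\n";
```

Nothing is written until the parser runs; once parse_string() returns, $html holds whatever the writer produced.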

HTML Output Methodology

Here is the relevant excerpt from the XSLT specification (http://www.w3.org/TR/xslt). Some understanding of XSLT helps when reading it, but don't worry, you don't need it to use this module :-)

The html output method should not output an element differently from the xml output method unless the expanded-name of the element has a null namespace URI; an element whose expanded-name has a non-null namespace URI should be output as XML. If the expanded-name of the element has a null namespace URI, but the local part of the expanded-name is not recognized as the name of an HTML element, the element should output in the same way as a non-empty, inline element such as span.

The html output method should not output an end-tag for empty elements. For HTML 4.0, the empty elements are area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta and param. For example, an element written as <br/> or <br></br> in the stylesheet should be output as <br>.

The html output method should recognize the names of HTML elements regardless of case. For example, elements named br, BR or Br should all be recognized as the HTML br element and output without an end-tag.
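As an illustration outside the quoted specification text, the two rules above (no end-tag for empty elements, names recognized regardless of case) can be sketched in plain Perl. The end_tag() helper is hypothetical and is not taken from this module's source.

```perl
use strict;
use warnings;

# The HTML 4.0 empty elements listed in the excerpt above.
my %EMPTY = map { $_ => 1 } qw(
    area base basefont br col frame hr img input isindex link meta param
);

# Hypothetical helper: return the end-tag text for an element in the
# null namespace. Empty elements get no end-tag; the lookup lowercases
# the name so that br, BR and Br are all treated alike.
sub end_tag {
    my ($name) = @_;
    return $EMPTY{ lc $name } ? '' : "</$name>";
}

print "[", end_tag('BR'),   "]\n";
print "[", end_tag('span'), "]\n";
```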

The html output method should not perform escaping for the content of the script and style elements. For example, a literal result element written in the stylesheet as

  <script>if (a &lt; b) foo()</script>

or

  <script><![CDATA[if (a < b) foo()]]></script>

should be output as

  <script>if (a < b) foo()</script>
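As an illustration outside the quoted specification text: by the time a SAX handler sees the text, both source forms above arrive as the same characters, "if (a < b) foo()", because the parser has already resolved the entity reference and the CDATA section. A writer then only has to decide whether to escape on output. The helper below is a hypothetical sketch of that decision, not this module's code.

```perl
use strict;
use warnings;

# Elements whose text content is written without escaping.
my %RAW_TEXT = map { $_ => 1 } qw(script style);

# Hypothetical helper: $parent is the name of the enclosing element,
# $text is the character data as delivered by the SAX parser.
sub text_for_output {
    my ($parent, $text) = @_;
    return $text if $RAW_TEXT{ lc $parent };   # script/style: as-is

    # Everywhere else, escape markup-significant characters.
    # The ampersand must be handled first.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print text_for_output('script', 'if (a < b) foo()'), "\n";
print text_for_output('p',      'if (a < b) foo()'), "\n";
```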

The html output method should not escape < characters occurring in attribute values.

If the indent attribute has the value yes, then the html output method may add or remove whitespace as it outputs the result tree, so long as it does not change how an HTML user agent would render the output. The default value is yes.

The html output method should escape non-ASCII characters in URI attribute values using the method recommended in Section B.2.1 of the HTML 4.0 Recommendation.
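As an illustration outside the quoted specification text: the method recommended in Section B.2.1 of the HTML 4.0 Recommendation is to represent each non-ASCII character as UTF-8 bytes and write each byte as %HH. A minimal sketch using the core Encode module follows; the helper names are hypothetical and this is not necessarily how the module implements it.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode one character via its UTF-8 bytes.
sub percent_encode_char {
    my ($char) = @_;
    return join '',
        map { sprintf('%%%02X', $_) } unpack('C*', encode('UTF-8', $char));
}

# Hypothetical helper: escape only the non-ASCII characters of a URI
# attribute value; ASCII characters are left untouched.
sub escape_uri_attr {
    my ($uri) = @_;
    $uri =~ s/([^\x00-\x7F])/percent_encode_char($1)/ge;
    return $uri;
}

print escape_uri_attr("caf\x{e9}.html"), "\n";
```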

The html output method may output a character using a character entity reference, if one is defined for it in the version of HTML that the output method is using.

The html output method should terminate processing instructions with > rather than ?>.

The html output method should output boolean attributes (that is attributes with only a single allowed value that is equal to the name of the attribute) in minimized form. For example, a start-tag written in the stylesheet as

  <OPTION selected="selected">

should be output as

  <OPTION selected>
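As an illustration outside the quoted specification text, attribute minimization can be sketched as below. The attribute list is the set of HTML 4 boolean attributes as assembled for this example, and format_attribute() is a hypothetical helper that leaves value escaping out for brevity.

```perl
use strict;
use warnings;

# HTML 4 boolean attributes (assembled for this example).
my %BOOLEAN = map { $_ => 1 } qw(
    checked compact declare defer disabled ismap multiple
    nohref noresize noshade nowrap readonly selected
);

# Hypothetical helper: write a boolean attribute in minimized form
# when its value equals its name, and as name="value" otherwise.
sub format_attribute {
    my ($name, $value) = @_;
    return $name if $BOOLEAN{ lc $name } && lc($value) eq lc($name);
    return qq{$name="$value"};
}

print '<OPTION ', format_attribute('selected', 'selected'), ">\n";
print '<A ',      format_attribute('href', 'foo.html'),     ">\n";
```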

The html output method should not escape a & character occurring in an attribute value immediately followed by a { character (see Section B.7.1 of the HTML 4.0 Recommendation). For example, a start-tag written in the stylesheet as

  <BODY bgcolor='&amp;{{randomrbg}};'>

should be output as

  <BODY bgcolor='&{randomrbg};'>
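As an illustration outside the quoted specification text: in the example above the attribute value reaching the output stage is &{randomrbg}; (the stylesheet's &amp; and {{ have already been resolved). An attribute escaper that honours this rule, and the earlier rule about not escaping < in attribute values, might look like the following hypothetical sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: escape an attribute value for HTML output.
sub escape_attr_value {
    my ($value) = @_;
    $value =~ s/&(?!\{)/&amp;/g;   # escape & unless followed by {
    $value =~ s/"/&quot;/g;        # value is written in double quotes
    return $value;                 # < is deliberately left alone
}

print escape_attr_value('&{randomrbg};'), "\n";
print escape_attr_value('a & b < c'),     "\n";
```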

The encoding attribute specifies the preferred encoding to be used. If there is a HEAD element, then the html output method should add a META element immediately after the start-tag of the HEAD element specifying the character encoding actually used. For example,

  <HEAD>
  <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
  ...

It is possible that the result tree will contain a character that cannot be represented in the encoding that the XSLT processor is using for output. In this case, if the character occurs in a context where HTML recognizes character references, then the character should be output as a character entity reference or decimal numeric character reference; otherwise (for example, in a script or style element or in a comment), the XSLT processor should signal an error.

If the doctype-public or doctype-system attributes are specified, then the html output method should output a document type declaration immediately before the first element. The name following <!DOCTYPE should be HTML or html. If the doctype-public attribute is specified, then the output method should output PUBLIC followed by the specified public identifier; if the doctype-system attribute is also specified, it should also output the specified system identifier following the public identifier. If the doctype-system attribute is specified but the doctype-public attribute is not specified, then the output method should output SYSTEM followed by the specified system identifier.
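As an illustration outside the quoted specification text, the three DOCTYPE cases can be sketched as one function. The function and its doctype_public and doctype_system keys are hypothetical names for this example, not parameters of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: build the document type declaration from the
# doctype-public and doctype-system values, or return '' if neither
# is given.
sub doctype_declaration {
    my (%opt) = @_;
    my ($public, $system) = @opt{qw(doctype_public doctype_system)};
    return '' unless defined($public) || defined($system);

    my $decl = '<!DOCTYPE html';
    if (defined $public) {
        $decl .= qq{ PUBLIC "$public"};
        $decl .= qq{ "$system"} if defined $system;
    }
    else {
        $decl .= qq{ SYSTEM "$system"};
    }
    return "$decl>";
}

print doctype_declaration(
    doctype_public => '-//W3C//DTD HTML 4.0//EN',
    doctype_system => 'http://www.w3.org/TR/REC-html40/strict.dtd',
), "\n";
```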

The media-type attribute is applicable for the html output method. The default value is text/html.

Entities

Characters are escaped for HTML output using HTML::Entities; see that module's documentation for details. XML::Handler::HTMLWriter calls HTML::Entities::encode() with its default parameters, but I would be willing to investigate whether it is worth allowing more parameters to be passed in.
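To show what the default encoding does, here is a small standalone example, assuming HTML::Entities is installed. It calls encode_entities() directly rather than going through XML::Handler::HTMLWriter, and the sample string is made up.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# With no second argument, encode_entities() uses its default set of
# unsafe characters, which includes &, < and high-bit characters.
my $text    = "caf\x{e9} & cr\x{e8}me < br\x{fb}l\x{e9}e";
my $encoded = encode_entities($text);

print $encoded, "\n";
```

Named entities such as &eacute; are used where HTML::Entities knows one for the character.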

SAX1 or SAX2?

Previous versions of this module worked with both SAX1 and SAX2, but the translation between the two was implemented in quite a broken manner. This module now works only with SAX2. See http://sax.perl.org for more details.

AUTHOR

Matt Sergeant, matt@sergeant.org

SEE ALSO

XML::SAX::Writer, XML::SAX::ParserFactory.