NAME

HTML::TreeBuilder - Parser that builds an HTML syntax tree

SYNOPSIS

  use HTML::TreeBuilder;

  foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    $tree->dump; # a method we inherit from HTML::Element
    print "And here it is, bizarrely rerendered as HTML:\n",
      $tree->as_HTML, "\n";

    # Now that we're done with it, we must destroy it.
    $tree = $tree->delete;
  }

DESCRIPTION

This class is for HTML syntax trees that get built out of HTML source. The way to use it is to:

1. start a new (empty) HTML::TreeBuilder object,

2. then use one of the methods from HTML::Parser (presumably with $tree->parse_file($filename) for files, or with $tree->parse($document_content) and $tree->eof if you've got the content in a string) to parse the HTML document into the tree $tree,

3. do whatever you need to do with the syntax tree, presumably involving traversing it looking for some bit of information in it,

4. and finally, when you're done with the tree, call $tree->delete to erase the contents of the tree from memory. This kind of thing usually isn't necessary with most Perl objects, but it's necessary for TreeBuilder objects. See HTML::Element for a more verbose explanation of why this is the case.
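The four steps above can be sketched for the case where the content is already in a string. This is only an illustration: the document content is made up, and find_by_tag_name and as_text are methods inherited from HTML::Element, so check that documentation for their exact behavior in your version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical document content, held in a string rather than a file.
my $document_content = '<html><head><title>Demo page</title></head>'
                     . '<body><h1>Hello</h1><p>Some text.</p></body></html>';

# 1. Start a new (empty) tree.
my $tree = HTML::TreeBuilder->new;

# 2. Parse the string, then signal the end of the input.
$tree->parse($document_content);
$tree->eof;

# 3. Look for a bit of information: the text of the first "title" element.
my $title      = $tree->find_by_tag_name('title');
my $title_text = $title ? $title->as_text : '';
print "Title: $title_text\n";

# 4. Erase the contents of the tree from memory.
$tree = $tree->delete;
```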

METHODS AND ATTRIBUTES

Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.

The following methods native to HTML::TreeBuilder all control how parsing takes place; they should be set before you try parsing into the given object. You can set the attributes by passing a TRUE or FALSE value as argument. E.g., $root->implicit_tags returns the current setting for the implicit_tags option, $root->implicit_tags(1) turns that option on, and $root->implicit_tags(0) turns it off.
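A minimal sketch of reading and setting one of these attributes, using implicit_tags as the example. The variable names are illustrative; the only assumption beyond the text above is that the getter is called again after each change rather than relying on what the setter call returns.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;

# With no argument, the method just reports the current setting.
my $default_setting = $root->implicit_tags;

# With an argument, it changes the setting; read it back separately.
$root->implicit_tags(0);
my $after_off = $root->implicit_tags;

$root->implicit_tags(1);
my $after_on = $root->implicit_tags;

printf "default: %s, after (0): %s, after (1): %s\n",
    map { $_ ? 'on' : 'off' } $default_setting, $after_off, $after_on;

$root = $root->delete;
```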

$root->implicit_tags(value)

Setting this attribute to true will instruct the parser to try to deduce implicit elements and implicit end tags. If it is false you get a parse tree that just reflects the text as it stands, which is unlikely to be useful for anything but quick and dirty parsing. (And, in current versions, $root-> ...) Default is true.

Implicit elements have the implicit() attribute set.
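As a sketch of what that means in practice, the following parses a fragment that never mentions "body" and then asks two elements whether they were implied. The input is made up, and the normalization to 1 or 0 is only there to make the values easy to print.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;   # implicit_tags is on by default
$tree->parse('<p>Hello');
$tree->eof;

# "body" never appears in the source, so the parser has to supply it.
my $body = $tree->find_by_tag_name('body');
my $p    = $tree->find_by_tag_name('p');

# implicit() is the HTML::Element accessor for the implicit attribute.
my $body_is_implicit = ($body && $body->implicit) ? 1 : 0;
my $p_is_implicit    = ($p    && $p->implicit)    ? 1 : 0;

print "body implicit: $body_is_implicit\n";
print "p implicit:    $p_is_implicit\n";

$tree = $tree->delete;
```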

$root->implicit_body_p_tag(value)

This controls an aspect of implicit element behavior, if implicit_tags is on: If a text element (PCDATA) or a phrasal element (such as "<em>") is to be inserted under "<body>", two things can happen: if implicit_body_p_tag is true, it's placed under a new, implicit "<p>" tag. (Past DTDs suggested this was the only correct behavior, and this is how past versions of this module behaved.) But if implicit_body_p_tag is false, nothing is implicated -- the PCDATA or phrasal element is simply placed under "<body>". Default is false.
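A small sketch of the difference, parsing the same bare text with the option off and on and counting the "p" elements that result. count_p_elements is a hypothetical helper written for this example, not part of the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse bare text plus a phrasal element and
# count how many "p" elements end up in the tree.
sub count_p_elements {
    my ($use_implicit_p) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_body_p_tag($use_implicit_p);
    $tree->parse('Hello <em>world</em>');
    $tree->eof;
    my @p_elements = $tree->find_by_tag_name('p');
    $tree->delete;
    return scalar @p_elements;
}

my $p_count_off = count_p_elements(0);   # content sits directly under "body"
my $p_count_on  = count_p_elements(1);   # content sits under an implicit "p"

print "implicit_body_p_tag off: $p_count_off p element(s)\n";
print "implicit_body_p_tag on:  $p_count_on p element(s)\n";
```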

$root->ignore_unknown(value)

This attribute controls whether unknown tags should be represented as elements in the parse tree, or whether they should be ignored. Default is true (that is, unknown tags are ignored).
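To illustrate, this sketch parses a made-up tag, "foo", under both settings and counts the resulting "foo" elements. count_foo_elements is a hypothetical helper; it assumes "foo" is not a tag the parser knows about.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: count "foo" elements in the tree built from
# a fragment containing an unknown tag.
sub count_foo_elements {
    my ($ignore) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown($ignore);
    $tree->parse('<body><foo>bar</foo></body>');
    $tree->eof;
    my @found = $tree->find_by_tag_name('foo');
    $tree->delete;
    return scalar @found;
}

my $foo_when_ignored = count_foo_elements(1);   # the default behavior
my $foo_when_kept    = count_foo_elements(0);

print "ignore_unknown on:  $foo_when_ignored foo element(s)\n";
print "ignore_unknown off: $foo_when_kept foo element(s)\n";
```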

$root->ignore_text(value)

Do not represent the text content of elements. This saves space if all you want is to examine the structure of the document. Default is false.
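A brief sketch of a structure-only parse. The input is made up; as_text and find_by_tag_name come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->ignore_text(1);               # keep elements, drop their text
$tree->parse('<p>Hello <b>world</b></p>');
$tree->eof;

# The elements are still there...
my $has_b = $tree->find_by_tag_name('b') ? 1 : 0;

# ...but there is no text left to collect.
my $all_text = $tree->as_text;

print "b element present: $has_b\n";
print "length of text:    ", length($all_text), "\n";

$tree = $tree->delete;
```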

$root->ignore_ignorable_whitespace(value)

If set to true, TreeBuilder will try to avoid creating ignorable whitespace text nodes in the tree. Default is true. (In fact, I'd be interested in hearing if there's ever a case where you need this off, or where leaving it on leads to incorrect behavior.)
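One way to see the effect is to count the children of a list whose source has line breaks between the items. This is only a sketch: count_ul_children is a hypothetical helper, and exactly which whitespace nodes appear with the option off may vary between versions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: count the content nodes (elements and text
# segments) directly under the first "ul" element.
sub count_ul_children {
    my ($ignore_whitespace) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_ignorable_whitespace($ignore_whitespace);
    $tree->parse("<ul>\n<li>a</li>\n<li>b</li>\n</ul>");
    $tree->eof;
    my $ul       = $tree->find_by_tag_name('ul');
    my @children = $ul ? $ul->content_list : ();
    $tree->delete;
    return scalar @children;
}

my $children_tight = count_ul_children(1);   # the default
my $children_loose = count_ul_children(0);

print "option on:  $children_tight node(s) under ul\n";
print "option off: $children_loose node(s) under ul\n";
```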

$root->p_strict(value)

If set to true (and it defaults to false), TreeBuilder will take a narrower than normal view of what can be under a "p" element; if it sees a non-phrasal element about to be inserted under a "p", it will close that "p". Otherwise it will close p elements only for other "p"'s, headings, and "form" (altho the latter may be removed in future versions).

For example, when going thru this snippet of code,

  <p>stuff
  <ul>

TreeBuilder will normally (with p_strict false) put the "ul" element under the "p" element. However, with p_strict set to true, it will close the "p" first.
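The same snippet can be checked from code by asking the "ul" element what its parent is under each setting. parent_tag_of_ul is a hypothetical helper for this sketch; parent and tag are HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse the snippet and report the tag name of
# the element that ends up containing the "ul".
sub parent_tag_of_ul {
    my ($strict) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->p_strict($strict);
    $tree->parse("<p>stuff\n<ul><li>item</li></ul>");
    $tree->eof;
    my $ul         = $tree->find_by_tag_name('ul');
    my $parent_tag = $ul ? $ul->parent->tag : '';
    $tree->delete;
    return $parent_tag;
}

my $parent_when_loose  = parent_tag_of_ul(0);   # the default
my $parent_when_strict = parent_tag_of_ul(1);

print "p_strict off: ul is under <$parent_when_loose>\n";
print "p_strict on:  ul is under <$parent_when_strict>\n";
```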

In theory, there should be strictness options like this for other/all elements besides just "p"; but I treat this as a special case simply because "p" occurs so frequently and its end-tag is omitted so often; and also because application of strictness rules at parse-time across all elements often makes tiny errors in HTML coding produce drastically bad parse-trees, in my experience.

If you find that you wish you had an option like this to enforce content-models on all elements, then I suggest that what you want is content-model checking as a stage after TreeBuilder has finished parsing.

$root->store_comments(value)

This determines whether TreeBuilder will normally store comments found while parsing content into $root. Currently, this is off by default.
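A sketch of turning comment storage on and finding the result. It assumes that stored comments appear as pseudo-elements with the tag name "~comment" and their content in a "text" attribute, which is the convention described in the HTML::Element documentation; verify that against your installed version. comments_found is a hypothetical helper.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every stored comment.
# Assumes comments are kept as "~comment" pseudo-elements.
sub comments_found {
    my ($store) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments($store);
    $tree->parse('<body><!-- note --><p>x</p></body>');
    $tree->eof;
    my @texts = map { $_->attr('text') } $tree->find_by_tag_name('~comment');
    $tree->delete;
    return @texts;
}

my @comments_off = comments_found(0);   # the default
my @comments_on  = comments_found(1);

print "store_comments off: ", scalar(@comments_off), " comment(s)\n";
print "store_comments on:  ", scalar(@comments_on),  " comment(s)\n";
```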

$root->store_declarations(value)

This determines whether TreeBuilder will normally store markup declarations found while parsing content into $root. Currently, this is off by default.

It is somewhat of a known bug (to be fixed one of these days, if anyone needs it?) that declarations in the preamble (before the "html" start-tag) end up actually under the "html" element.

$root->store_pis(value)

This determines whether TreeBuilder will normally store processing instructions found while parsing content into $root -- assuming a recent version of HTML::Parser (old versions won't parse PIs correctly). Currently, this is off (false) by default.

It is somewhat of a known bug (to be fixed one of these days, if anyone needs it?) that PIs in the preamble (before the "html" start-tag) end up actually under the "html" element.

$root->warn(value)

This determines whether syntax errors during parsing should generate warnings, emitted via Perl's warn function.

This is off (false) by default.
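Since the warnings go through Perl's warn function, they can be collected with a __WARN__ handler. The sketch below assumes that a skipped unknown tag is one of the things reported; the tag name "nosuchtag" and the helper count_parse_warnings are made up for the example, and the wording of any message is not something to rely on.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a fragment with an unknown tag and count
# how many warnings were emitted along the way.
sub count_parse_warnings {
    my ($warn_setting) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    my $root = HTML::TreeBuilder->new;
    $root->warn($warn_setting);
    $root->parse('<body><p>text<nosuchtag>more</p></body>');
    $root->eof;
    $root->delete;
    return $count;
}

my $warnings_off = count_parse_warnings(0);   # the default
my $warnings_on  = count_parse_warnings(1);

print "warn off: $warnings_off warning(s)\n";
print "warn on:  $warnings_on warning(s)\n";
```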

HTML AND ITS DISCONTENTS

HTML is rather harder to parse than people who write it generally suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization" and "implication". In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it), and if you use a tag that can't occur in the context you seem to be using it in, under certain conditions the parser will be able to realize you mean to leave the current context and enter the new one, that being the only one that your code could correctly be interpreted in.
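As a small illustration of minimization, here is a list whose "li" end-tags are all omitted. The input is made up; the point of the sketch is that each new "li" start-tag closes the previous item, so the items end up as siblings rather than nested.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul><li>one<li>two</ul>');   # no </li> anywhere
$tree->eof;

# Collect the element children of the list.
my $ul        = $tree->find_by_tag_name('ul');
my @items     = grep { ref $_ } $ul->content_list;
my @item_tags = map  { $_->tag } @items;
my @item_text = map  { $_->as_text } @items;

print "children of ul: @item_tags\n";
print "their text:     @item_text\n";

$tree = $tree->delete;
```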

Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules. That's why the internals of HTML::TreeBuilder consist of lots and lots of special cases -- instead of being just a generic SGML parser with HTML DTD rules plugged in.

BUGS

* Hopefully framesets behave correctly now. Email me if you find a strange parse of documents with framesets.

* Bad HTML code will, often as not, make for a bad parse tree. Regrettable, but unavoidably true.

* If you're running with implicit_tags off (God help you!), consider that $tree->content_list probably contains the tree or grove from the parse, and not $tree itself (which will, oddly enough, be an implicit 'html' element). This seems counter-intuitive and problematic; but seeing as how almost no HTML ever parses correctly with implicit_tags off, this interface oddity seems the least of your problems.

BUG REPORTS

When a document parses in a way different from how you think it should, I ask that you report this to me as a bug. The first thing you should do is copy the document, trim out as much of it as you can while still producing the bug in question, and then email me that mini-document and the code you're using to parse it, at sburke@cpan.org. Include a note as to how it parses (presumably including its $tree->dump output), and then a careful and clear explanation of where you think the parser is going astray, and how you would prefer that it work instead.

SEE ALSO

HTML::Parser, HTML::Element, HTML::Tagset

COPYRIGHT

Copyright 1995-1998 Gisle Aas; copyright 1999, 2000 Sean M. Burke.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Original author Gisle Aas <gisle@aas.no>; current maintainer Sean M. Burke, <sburke@cpan.org>