NAME

HTML::TreeBuilder - Parser that builds an HTML syntax tree

SYNOPSIS

  use HTML::TreeBuilder;
  foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    $tree->dump; # a method we inherit from HTML::Element
    print "And here it is, bizarrely rerendered as HTML:\n",
      $tree->as_HTML, "\n";
    
    # Now that we're done with it, we must delete it.
    $tree = $tree->delete;
  }

DESCRIPTION

This class is for HTML syntax trees. The way to use it is to:

1. start a new (empty) HTML::TreeBuilder object,

2. then use one of the methods from HTML::Parser (presumably with $tree->parse_file($filename) for files, or with $tree->parse($document_content) and $tree->eof if you've got the content in a string) to parse the HTML document into the tree $tree.

3. do whatever you need to do with the syntax tree, presumably involving traversing it looking for some bit of information in it,

4. and finally, when you're done with the tree, call $tree->delete to erase the contents of the tree from memory. This kind of thing usually isn't necessary with most Perl objects, but it's necessary for TreeBuilder objects. See HTML::Element for a more verbose explanation of why this is the case.
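The four steps above can be sketched for content held in a string rather than a file. This is a minimal illustration, not the only way to do it: the sample document is made up, and find_by_tag_name and as_text are HTML::Element methods assumed to be available in your installed version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample document, for illustration only.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Hello, <b>world</b>!</p></body></html>';

# 1. Start a new, empty tree.
my $tree = HTML::TreeBuilder->new;

# 2. Parse content held in a string, then signal end of input.
$tree->parse($html);
$tree->eof;

# 3. Traverse the tree looking for something; here, the first <title>.
my $title_elem = $tree->find_by_tag_name('title');
my $title      = $title_elem ? $title_elem->as_text : '';
print "Title: $title\n";

# 4. Erase the tree's contents from memory when finished.
$tree->delete;
```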

METHODS AND ATTRIBUTES

Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.

The following methods native to HTML::TreeBuilder all control how parsing takes place; they should be set before you try parsing into the given object. You can set the attributes by passing a TRUE or FALSE value as argument. E.g., $p->implicit_tags returns the current setting for the implicit_tags option, $p->implicit_tags(1) turns that option on, and $p->implicit_tags(0) turns it off.

$p->implicit_tags(value)

Setting this attribute to true will instruct the parser to try to deduce implicit elements and implicit end tags. If it is false you get a parse tree that just reflects the text as it stands, which is unlikely to be useful for anything but quick and dirty parsing. Default is true.

Implicit elements have the implicit() attribute set.
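As a rough sketch of what implicit_tags does, the fragment below has no "<html>", "<head>", or "<body>" tags and never closes its "<p>" tags. The sample markup is invented; find_by_tag_name and implicit come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

# Set parsing options before parsing. Calling the accessor with no
# argument reads the current setting.
$tree->implicit_tags(1);              # already the default
my $setting = $tree->implicit_tags;

# Invented fragment: no <html>/<head>/<body>, and unclosed <p> tags.
$tree->parse('<p>first paragraph<p>second paragraph');
$tree->eof;

# The parser should have supplied a <body> that was not in the source.
my $body             = $tree->find_by_tag_name('body');
my $body_is_implicit = ( $body && $body->implicit ) ? 1 : 0;

# The second <p> should have implicitly closed the first.
my @paragraphs      = $tree->find_by_tag_name('p');
my $paragraph_count = scalar @paragraphs;

print "implicit_tags is ", ( $setting ? "on" : "off" ), "\n";
print "<body> was implied\n" if $body_is_implicit;
print "Found $paragraph_count <p> elements\n";

$tree->delete;
```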

$p->implicit_body_p_tag(value)

This controls an aspect of implicit element behavior, if implicit_tags is on: If a text element (PCDATA) or a phrasal element (such as "<em>") is to be inserted under "<body>", two things can happen: if implicit_body_p_tag is true, it's placed under a new, implicit "<p>" tag. (Past DTDs suggested this was the only correct behavior, and this is how past versions of this module behaved.) But if implicit_body_p_tag is false, nothing is implied -- the PCDATA or phrasal element is simply placed under "<body>". Default is false.
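To see the difference, one can parse the same document with the option on and off and count the "<p>" elements that result. The helper count_p_elements is hypothetical, written only for this sketch, and the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse $html with implicit_body_p_tag set to $flag
# and report how many <p> elements ended up in the tree.
sub count_p_elements {
    my ( $html, $flag ) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_body_p_tag($flag);    # set before parsing
    $tree->parse($html);
    $tree->eof;
    my @p     = $tree->find_by_tag_name('p');
    my $count = scalar @p;
    $tree->delete;
    return $count;
}

# Bare text and a phrasal element directly under <body>.
my $html = '<html><body>Hello <em>there</em></body></html>';

my $with_p    = count_p_elements( $html, 1 );
my $without_p = count_p_elements( $html, 0 );

print "implicit_body_p_tag on:  $with_p <p> element(s)\n";
print "implicit_body_p_tag off: $without_p <p> element(s)\n";
```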

$p->ignore_unknown(value)

This attribute controls whether unknown tags should be represented as elements in the parse tree, or whether they should be ignored. Default is true (to ignore unknown tags.)
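A small sketch of the effect, using a made-up tag name ("blorp") that is not part of HTML. The helper has_element is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: report whether an element named $tag survives
# parsing when ignore_unknown is set to $ignore_unknown.
sub has_element {
    my ( $html, $tag, $ignore_unknown ) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown($ignore_unknown);    # set before parsing
    $tree->parse($html);
    $tree->eof;
    my $found = $tree->find_by_tag_name($tag) ? 1 : 0;
    $tree->delete;
    return $found;
}

# "blorp" is an invented tag name, chosen because it is not HTML.
my $html = '<html><body><p>Hi <blorp>there</blorp></p></body></html>';

my $kept_when_ignoring = has_element( $html, 'blorp', 1 );
my $kept_when_keeping  = has_element( $html, 'blorp', 0 );

print "ignore_unknown on:  blorp ",
    ( $kept_when_ignoring ? "kept" : "dropped" ), "\n";
print "ignore_unknown off: blorp ",
    ( $kept_when_keeping ? "kept" : "dropped" ), "\n";
```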

$p->ignore_text(value)

Do not represent the text content of elements. This saves space if all you want is to examine the structure of the document. Default is false.
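For instance, a structure-only parse might look like the following sketch. The sample markup is invented, and as_text is the HTML::Element method that collects the text under an element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->ignore_text(1);    # keep elements, drop their text content

$tree->parse(
    '<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>');
$tree->eof;

# The elements are still in the tree...
my @elements = $tree->find_by_tag_name( 'h1', 'p', 'b' );
my @tags     = map { $_->tag } @elements;

# ...but there is no text left to collect.
my $text = $tree->as_text;

print "Elements found: @tags\n";
print "Text length: ", length($text), "\n";

$tree->delete;
```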

$p->ignore_ignorable_whitespace(value)

If set to true, TreeBuilder will try to delete (and/or to avoid creating) ignorable whitespace text nodes in the tree. Default is true. (In fact, I'd be interested in hearing if there's ever a case where you need this off, or where leaving it on leads to incorrect behavior.)
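One way to observe this is to count the text nodes that sit directly under a "<ul>" whose source contains only line breaks and indentation between its items. This is a sketch on invented markup; text_nodes_under_ul is a hypothetical helper, and the count with the option off is printed rather than assumed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented markup: the only text directly under <ul> is whitespace.
my $html = "<html><body><ul>\n  <li>one</li>\n  <li>two</li>\n</ul></body></html>";

# Hypothetical helper: count the text (non-element) children of the
# first <ul> when ignore_ignorable_whitespace is set to $flag.
sub text_nodes_under_ul {
    my ($flag) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_ignorable_whitespace($flag);    # set before parsing
    $tree->parse($html);
    $tree->eof;
    my $ul         = $tree->find_by_tag_name('ul');
    my @text_nodes = grep { !ref $_ } $ul->content_list;
    my $count      = scalar @text_nodes;
    $tree->delete;
    return $count;
}

my $count_on  = text_nodes_under_ul(1);
my $count_off = text_nodes_under_ul(0);

print "option on:  $count_on whitespace text node(s) under <ul>\n";
print "option off: $count_off whitespace text node(s) under <ul>\n";
```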

$p->warn(value)

This determines whether syntax errors during parsing should generate warnings, emitted via Perl's warn function.
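Because the warnings go through Perl's warn, they can be collected with a local $SIG{__WARN__} handler, as in the sketch below. The malformed markup is invented, and which problems (if any) the parser reports for it depends on the installed version, so the output is printed rather than assumed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my @warnings;      # messages captured from Perl's warn
my $warn_is_on;    # the setting as read back from the parser

{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $tree = HTML::TreeBuilder->new;
    $tree->warn(1);    # set before parsing
    $warn_is_on = $tree->warn;

    # Invented markup with stray end tags.
    $tree->parse('<html><body><p>text</b></p></i></body></html>');
    $tree->eof;
    $tree->delete;
}

print scalar(@warnings), " warning(s) captured\n";
print "  $_" for @warnings;
```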

HTML AND ITS DISCONTENTS

HTML is rather harder to parse than people who write it generally suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization" and "implication". In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it). It also means that if you use a tag that can't occur in the context you seem to be using it in, then under certain conditions the parser will be able to realize you mean to leave the current context and enter the new one, that being the only one in which your code could correctly be interpreted.

Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules. That's why the internals of HTML::TreeBuilder consist of lots and lots of special cases -- instead of being just a generic SGML parser with HTML DTD rules plugged in.
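A small sketch of minimization and implication at work: the invented fragment below never opens "<html>", "<head>", or "<body>", and never closes its "<li>" tags, yet with the default settings the resulting tree should be fully structured.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

# Invented fragment: no <html>/<head>/<body>, and unclosed <li> tags.
$tree->parse('<title>Shopping</title><ul><li>milk<li>eggs</ul>');
$tree->eof;

# The top node is <html> even though the source never mentions it.
my $root_tag = $tree->tag;

# Each <li> was implicitly closed by the next one, so the items are
# siblings under <ul> rather than nested inside each other.
my @li      = $tree->find_by_tag_name('li');
my @items   = map { $_->as_text } @li;
my @parents = map { $_->parent->tag } @li;

print "Root element: $root_tag\n";
print "Items: @items\n";
print "Parents: @parents\n";

# Show the whole tree, implied elements included.
$tree->dump;

$tree->delete;
```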

BUGS

* Currently, it's assumed that "HTML" is the top node in the tree, and that "HEAD" and "BODY" must be right under "HTML". Framesets are therefore coerced into being under "BODY", even if the document in question has the "BODY" inside a "NOFRAMES" element. This may change in a future version, particularly if anyone points out a case where this is troublesome for them.

* Bad HTML code will, often as not, make for a bad parse tree. Regrettable, but unavoidably true.

BUG REPORTS

When a document parses in a way different from how you think it should, I ask that you report this to me as a bug. The first thing you should do is copy the document, trim out as much of it as you can while still producing the bug in question, and then email me that mini-document at sburke@netadventure.net, with a note as to how it parses (presumably including its $tree->dump output), and then a careful and clear explanation of where you think the parser is going astray, and how you would prefer that it work instead.

SEE ALSO

HTML::Parser, HTML::Element

COPYRIGHT

Copyright 1995-1998 Gisle Aas, copyright 1999 Sean M. Burke.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Original author Gisle Aas <gisle@aas.no>; current maintainer Sean M. Burke, <sburke@netadventure.net>