NAME

HTML::Element - Class for objects that represent HTML elements

SYNOPSIS

  use HTML::Element;
  $a = HTML::Element->new('a', href => 'http://www.perl.com/');
  $a->push_content("The Perl Homepage");

  $tag = $a->tag;
  print "$tag starts out as:",  $a->starttag, "\n";
  print "$tag ends as:",  $a->endtag, "\n";
  print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

  $links_r = $a->extract_links();
  print "Hey, I found ", scalar(@$links_r), " links.\n";
  
  print "And that, as HTML, is: ", $a->as_HTML, "\n";
  $a = $a->delete;

DESCRIPTION

Objects of the HTML::Element class can be used to represent elements of HTML. These objects have attributes, notably attributes that designate the element's parent and content. The content is an array of text segments and other HTML::Element objects. A tree with HTML::Element objects as nodes can represent the syntax tree for an HTML document.

HOW WE REPRESENT TREES

It may occur to you to wonder what exactly a "tree" is, and how it's represented in memory. Consider this HTML document:

  <html lang='en-US'>
    <head>
      <title>Stuff</title>
      <meta name='author' content='Jojo'>
    </head>
    <body>
     <h1>I like potatoes!</h1>
    </body>
  </html>

Building a syntax tree out of it makes a tree-structure in memory that could be diagrammed as:

                     html (lang='en-US')
                      / \
                    /     \
                  /         \
                head        body
               /\               \
             /    \               \
           /        \               \
         title     meta              h1
          |       (name='author',     |
       "Stuff"    content='Jojo')    "I like potatoes"

This is the traditional way to diagram a tree, with the "root" at the top, and it's this kind of diagram that people have in mind when they say, for example, that "the meta element is under the head element instead of under the body element". (The same is also said with "inside" instead of "under" -- the use of "inside" makes more sense when you're looking at the HTML source.)

Another way to represent the above tree is with indenting:

  html (attributes: lang='en-US')
    head
      title
        "Stuff"
      meta (attributes: name='author' content='Jojo')
    body
      h1
        "I like potatoes"

Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. The $tree->dump method uses indentation just that way.

However you diagram the tree, it's stored the same in memory -- it's a network of objects, each of which has attributes like so:

  element #1:  _tag: 'html'
               _parent: none
               _content: [element #2, element #5]
               lang: 'en-US'

  element #2:  _tag: 'head'
               _parent: element #1
               _content: [element #3, element #4]

  element #3:  _tag: 'title'
               _parent: element #2
               _content: [text segment "Stuff"]

  element #4:  _tag: 'meta'
               _parent: element #2
               _content: none
               name: author
               content: Jojo

  element #5:  _tag: 'body'
               _parent: element #1
               _content: [element #6]

  element #6:  _tag: 'h1'
               _parent: element #5
               _content: [text segment "I like potatoes"]

The "treeness" of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.

While you could access the content of a tree by writing code that says "access the 'src' attribute of the root's first child's seventh child's third child", you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to "traverse" it; an HTML::Element method ($h->traverse) is provided for this purpose; and several other HTML::Element methods are based on it.
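As a minimal sketch of the structure described above, the example document can be built by hand and then walked through the parent and content accessors. The variable names are only illustrative, and only methods documented in this file are used.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example document one element at a time.
my $html  = HTML::Element->new('html', lang => 'en-US');
my $head  = HTML::Element->new('head');
my $title = HTML::Element->new('title');
my $meta  = HTML::Element->new('meta', name => 'author', content => 'Jojo');
my $body  = HTML::Element->new('body');
my $h1    = HTML::Element->new('h1');

$title->push_content('Stuff');
$h1->push_content('I like potatoes!');
$head->push_content($title, $meta);
$body->push_content($h1);
$html->push_content($head, $body);

# Go down through a content list, and back up through parents.
my $first_child_tag = $html->content->[0]->tag;    # the head element
my $grandparent_tag = $h1->parent->parent->tag;    # body, then html

print "first child of root: $first_child_tag\n";
print "grandparent of h1:   $grandparent_tag\n";

$html->delete;    # free the tree explicitly
```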

(For everything you ever wanted to know about trees, and then some, see Donald Knuth's The Art of Computer Programming, Volume 1.)

BASIC METHODS

$h = HTML::Element->new('tag', 'attrname' => 'value', ... )

This constructor method returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.

$h->tag() or $h->tag('tagname')

Returns (optionally sets) the tag name (also known as the generic identifier) for the element $h. The tag name is always converted to lower case.

$h->content()

Returns the content of this element -- i.e., what is inside/under this element. The return value is either undef (which you should understand to mean no content), or a reference to the array of content items, each of which is either a text segment, or an HTML::Element object.
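Because the return value may be undef, code that loops over content usually needs a fallback. The following is a small sketch (the element and variable names are hypothetical) that also shows how ref() tells elements apart from text segments.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('Hello, ', HTML::Element->new('br'), 'world');

my @kinds;
# Fall back to an empty list in case content() returns undef.
for my $item (@{ $p->content || [] }) {
    # Elements are references; text segments are plain strings.
    push @kinds, ref($item) ? 'element:' . $item->tag : "text:$item";
}
print "$_\n" for @kinds;

$p->delete;
```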

$h->parent() or $h->parent($new_parent)

Returns (optionally sets) the parent for this element. (If you're thinking about using this to attach or detach nodes, instead consider $new_parent->push_content($h), $new_parent->unshift_content($h), or $h->detach.)

$h->implicit() or $h->implicit($bool)

Returns (optionally sets) the implicit attribute. This attribute is used to indicate that the element was not originally present in the source, but was added to the parse tree (by HTML::TreeBuilder, for example) in order to conform to the rules of HTML structure.

$h->pos() or $h->pos($element)

Returns (and optionally sets) the "current position" pointer of $h. This "pos" attribute is a pointer used during some parsing operations, whose value is whatever HTML::Element element at or under $h is currently "open", where $h->insert_element(NEW) will actually insert a new element.

(This has nothing to do with the Perl function called "pos", for controlling where regular expression matching starts.)

If you set $h->pos($element), be sure that $element is either $h, or an element under $h.

If you've been modifying the tree under $h and are no longer sure $h->pos is valid, you can enforce validity with:

    $h->pos(undef) unless $h->pos->is_inside($h);

$h->attr('attr') or $h->attr('attr', 'value')

Returns (optionally sets) the value of the given attribute of $h. The attribute name (but not the value, if provided) is forced to lowercase. If setting a new value, the old value of that attribute is returned.
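A short sketch of the get and set forms, assuming an "img" element with made-up file names; it illustrates the lowercasing of the attribute name and the old value coming back from a set.

```perl
use strict;
use warnings;
use HTML::Element;

my $img = HTML::Element->new('img', SRC => 'old.png');

my $before   = $img->attr('src');               # name was forced to lowercase
my $replaced = $img->attr('SRC', 'new.png');    # set; returns the old value
my $after    = $img->attr('src');

print "before=$before replaced=$replaced after=$after\n";

$img->delete;
```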

$h->all_attr()

Returns all this element's attributes and values, as attribute-value pairs.

STRUCTURE-MODIFYING METHODS

While you could, in theory, modify a tree by directly manipulating objects' parent and content attributes, it's much simpler (and less error-prone) to use these methods:

$h->insert_element($element, $implicit)

Inserts a new element under the element at $h->pos(). Then updates $h->pos() to point to the inserted element, unless $element is a prototypically empty element like "br", "hr", "img", etc. The new $h->pos() is returned.
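The way the "pos" pointer moves can be sketched as follows. The "div" root and the variable names are assumptions made for illustration; the point is that inserting a "p" moves the pointer, while inserting a "br" leaves it where it was.

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new('div');
my $p    = HTML::Element->new('p');
my $br   = HTML::Element->new('br');

my $pos_after_p  = $root->insert_element($p);     # pos now points at the p
my $pos_after_br = $root->insert_element($br);    # br is empty, pos stays put

my $pos_after_p_tag  = $pos_after_p->tag;
my $pos_after_br_tag = $pos_after_br->tag;
my $br_parent_tag    = $br->parent->tag;          # br went under the p

print "$pos_after_p_tag $pos_after_br_tag $br_parent_tag\n";

$root->delete;
```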

$h->push_content($element_or_text, ...)

Adds the specified items to the end of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.

$h->unshift_content($element_or_text, ...)

Adds the specified items to the beginning of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.

$h->splice_content($offset, $length, $element_or_text, ...)

Removes the elements designated by $offset and $length from the content-list of element $h, and replaces them with the elements of the following list, if any. Returns the elements removed from the array. If $offset is negative, then it starts that far from the end of the array. If $length and the following list are omitted, removes everything from $offset onward.

The items of content to be added should each be either a text segment (a string) or an HTML::Element object, and should not already be children of $h.
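Here is a hedged sketch of replacing one item in a content list. The list items are invented for the example, and the code deliberately does not rely on the return value of splice_content; it keeps its own reference to the item being removed instead.

```perl
use strict;
use warnings;
use HTML::Element;

# Build <ul> with three <li> children.
my @items;
for my $label (qw(one two three)) {
    my $li = HTML::Element->new('li');
    $li->push_content($label);
    push @items, $li;
}
my $ul = HTML::Element->new('ul');
$ul->push_content(@items);

my $replacement = HTML::Element->new('li');
$replacement->push_content('TWO');

# Remove one item at offset 1 and put the replacement there.
$ul->splice_content(1, 1, $replacement);

my @labels = map { $_->as_text } @{ $ul->content };
print "@labels\n";

$items[1]->delete;    # the removed item is no longer part of the tree
$ul->delete;
```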

$h->detach()

This unlinks $h from its parent, by setting its 'parent' attribute to undef, and by removing it from the content list of its parent (if it had one). The return value is the parent that was detached from (or undef, if $h had no parent to start with). Note that neither $h nor its parent is explicitly destroyed.

$h->replace_with_content()

This replaces $h in its parent's content list with its own content. The element $h (which by then has no parent or content of its own) is returned. This causes a fatal error if $h has no parent. Also, note that this does not destroy $h -- use $h->replace_with_content->delete if you need that.
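The two unlinking methods, detach and replace_with_content, can be compared in a small sketch. The sentence and the element names are made up; the checks mirror what the descriptions above promise.

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em');
$em->push_content('very');
my $p = HTML::Element->new('p');
$p->push_content('This is ', $em, ' important.');

# Hoist the content of em into p; em itself is left behind, empty.
$em->replace_with_content;
my $text_after  = $p->as_text;
my $em_orphaned = defined $em->parent ? 0 : 1;
my $em_emptied  = $em->is_empty ? 1 : 0;

# detach just unlinks an element and hands back its former parent.
my $span = HTML::Element->new('span');
$p->push_content($span);
my $old_parent      = $span->detach;
my $detached_from_p = ($old_parent == $p) ? 1 : 0;

print "$text_after\n";

# Neither method destroys anything, so clean up explicitly.
$_->delete for $em, $span, $p;
```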

$h->delete_content()

Clears the content of $h, calling $i->delete for each content element.

Returns $h.

$h->delete()

Removes this element from its parent (if it has one) and explicitly destroys the element and all its descendants. The return value is undef.

Perl uses garbage collection based on reference counting; when no references to a data structure exist, it's implicitly destroyed -- i.e., when no value anywhere points to a given object anymore, Perl knows it can free up the memory that the now-unused object occupies.

But this fails with HTML::Element trees, because a parent element always holds references to its children, and its child elements hold references to the parent, so no element ever looks like it's not in use. So, to destroy those elements, you need to call $h->delete on the parent.

$h->clone()

Returns a copy of the element (whose children are clones (recursively) of the original's children, if any).

The returned element is parentless. Any pos attributes present in the source element/tree will be absent in the copy.
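A brief sketch of what "copy" means here, using invented element names: the clone has no parent, its children are distinct objects, and changing it leaves the original alone.

```perl
use strict;
use warnings;
use HTML::Element;

my $bold = HTML::Element->new('b');
$bold->push_content('bold');
my $orig = HTML::Element->new('p', class => 'note');
$orig->push_content('Some ', $bold);
my $wrapper = HTML::Element->new('div');
$wrapper->push_content($orig);

my $copy = $orig->clone;
$copy->attr('class', 'changed');    # does not touch $orig

my $copy_is_parentless = defined $copy->parent ? 0 : 1;
my $orig_class         = $orig->attr('class');
my $copy_text          = $copy->as_text;
my $child_is_distinct  = ($copy->content->[1] == $bold) ? 0 : 1;

print "orig class: $orig_class, copy text: $copy_text\n";

$wrapper->delete;
$copy->delete;
```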

HTML::Element->clone_list(...nodes...)
or: ref($h)->clone_list(...nodes...)

Returns a list consisting of a copy of each node given. Text segments are simply copied; elements are cloned by calling $it->clone on each of them.

DUMPING METHODS

$h->dump()

Prints the element and all its children to STDOUT, in a format useful only for debugging. The structure of the document is shown by indentation (no end tags).

$h->as_HTML() or $h->as_HTML($entities)
or $h->as_HTML($entities, $indent_char)

Returns a string representing in HTML the element and its children. The optional argument $entities specifies a string of the entities to encode. For compatibility with previous versions, specify '<>&' here. If omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details.

If $indent_char is specified and defined, the HTML to be output is indented, using the string you specify (which you probably should set to "\t", or some number of spaces, if you specify it). This feature is currently somewhat experimental. But try it, and feel free to email me any bug reports. (Note that output, although indented, is not wrapped. Patches welcome.)
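As a rough sketch of the $entities argument, the snippet below serializes a paragraph containing a "<" character both ways. The exact whitespace and end-tag handling of the output may differ between versions, so the example only looks for the encoded entity rather than comparing whole strings.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p', class => 'par123');
$p->push_content('um, p < 4!');

my $default = $p->as_HTML;           # all unsafe characters encoded
my $compat  = $p->as_HTML('<>&');    # only these three characters encoded

# In both forms the "<" in the text should come out as an entity.
my $default_encodes_lt = index($default, '&lt;') >= 0 ? 1 : 0;
my $compat_encodes_lt  = index($compat,  '&lt;') >= 0 ? 1 : 0;

print $default, "\n";

$p->delete;
```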

$h->as_text()
$h->as_text(skip_dels => 1)

Returns a string that represents only the text parts of the element's descendants. Entities are decoded to corresponding ISO-8859-1 (Latin-1) characters. See HTML::Entities for more information.

If skip_dels is true, then text content under "del" nodes is not included in what's returned.
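The effect of skip_dels can be seen in a tiny sketch with one "del" and one "ins" element; the prices are of course made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $del = HTML::Element->new('del');
$del->push_content('10');
my $ins = HTML::Element->new('ins');
$ins->push_content('8');

my $p = HTML::Element->new('p');
$p->push_content('Price: ', $del, $ins);

my $all_text  = $p->as_text;                    # includes the deleted text
my $kept_text = $p->as_text(skip_dels => 1);    # leaves it out

print "all:  $all_text\n";
print "kept: $kept_text\n";

$p->delete;
```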

$h->starttag() or $h->starttag($entities)

Returns a string representing the complete start tag for the element. I.e., leading "<", tag name, attributes, and trailing ">". Attribute values that don't consist entirely of digits are surrounded with double-quotes, and appropriate characters are encoded. If $entities is omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details. If you specify some value for $entities, remember to include the double-quote character in it. (Previous versions of this module would basically behave as if '&">' were specified for $entities.)

$h->endtag()

Returns a string representing the complete end tag for this element. I.e., "</", tag name, and ">".

SECONDARY STRUCTURAL METHODS

These methods all involve some structural aspect of the tree; either they report some aspect of the tree's structure, or they involve traversal down the tree, or walking up the tree.

$h->is_inside('tag', ...) or $h->is_inside($element, ...)

Returns true if $h is, or is contained anywhere inside, one of the elements listed or an element whose tag name is one of the tag names listed.

$h->is_empty()

Returns true if $h has no content, i.e., has no elements or text segments under it. In other words, this returns true if $h is a leaf node, AKA a terminal node. Do not confuse this sense of "empty" with another sense that it can have in SGML/HTML/XML terminology, which means that the element in question is of the type (like HTML's "hr", "br", "img", etc.) that can't have any content.

That is, a particular "p" element may happen to have no content, so $that_p_element->is_empty will be true -- even though the prototypical "p" element isn't "empty" (in the way that the prototypical "hr" element is).
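To make is_inside and is_empty concrete, here is a small sketch on a hypothetical table fragment, built with the new_from_lol constructor documented later in this file. Note that is_inside also counts the element itself.

```perl
use strict;
use warnings;
use HTML::Element;

my $table = HTML::Element->new_from_lol(
    ['table', ['tr', ['td', ['p'], ['hr']]]]
);
my $p = $table->find_by_tag_name('p');

my $in_table = $p->is_inside('table') ? 1 : 0;    # an ancestor matches
my $in_self  = $p->is_inside($p)      ? 1 : 0;    # the element itself counts
my $in_list  = $p->is_inside('ul')    ? 1 : 0;    # no such ancestor

# This particular p has no content, although "p" is not an
# empty element type in the SGML sense.
my $p_is_empty = $p->is_empty ? 1 : 0;

print "in_table=$in_table in_self=$in_self in_list=$in_list empty=$p_is_empty\n";

$table->delete;
```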

$h->pindex()

Returns the index of the element in its parent's contents array, such that $h would equal $h->parent->content->[$h->pindex], assuming $h isn't root. If the element $h is root, then $h->pindex returns undef.

$h->address()

Returns a string representing the location of this node in the tree. The address consists of numbers joined by a '.', starting with '0', and followed by the pindexes of the nodes in the tree that are ancestors of $h, starting from the top.

So if the way to get to a node starting at the root is to go to child 2 of the root, then child 10 of that, and then child 0 of that, and then you're there -- then that node's address is "0.2.10.0".

As a bit of a special case, the address of the root is simply "0".

I foresee this being used mainly for debugging.

$h->address($address)

This returns the node (whether element or text-segment) at the given address in the tree that $h is a part of. (That is, the address is resolved starting from $h->root.)

If there is no node at the given address, this returns undef.
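Both forms of address can be tried on a small tree. This is only a sketch, with a document shaped like the new_from_lol example later in this file; the addresses follow from the child positions.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'I like stuff!']],
        ['body', 'stuff', ['p', 'um'], ['div', '123']],
    ]
);
my $div = $tree->find_by_tag_name('div');

my $div_address  = $div->address;     # root, child 1 (body), child 2 (div)
my $root_address = $tree->address;    # the special case for the root

# Resolving the address again should lead back to the same node.
my $round_trip_ok = ($tree->address($div_address) == $div) ? 1 : 0;

# Addresses can also designate text segments.
my $text_node = $tree->address('0.1.0');

print "div is at $div_address, root is at $root_address\n";

$tree->delete;
```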

$h->depth()

Returns a number expressing $h's depth within its tree, i.e., how many steps away it is from the root. If $h has no parent (i.e., is root), its depth is 0.

$h->root()

Returns the element that's the top of $h's tree. If $h is root, this just returns $h. (If you want to test whether $h is the root, instead of asking what its root is, just test not($h->parent).)

$h->lineage()

Returns the list of $h's ancestors, starting with its parent, and then that parent's parent, and so on, up to the root. If $h is root, this returns an empty list.

If you simply want a count of the number of elements in $h's lineage, use $h->depth.

$h->lineage_tag_names()

Returns the list of the tag names of $h's ancestors, starting with its parent, and that parent's parent, and so on, up to the root. If $h is root, this returns an empty list. Example output: ('em', 'td', 'tr', 'table', 'body', 'html')
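The upward-looking methods (depth, root, lineage, lineage_tag_names) can be sketched together on a deliberately deep, made-up tree. The nearest ancestor comes first in the lists.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html', ['body', ['table', ['tr', ['td', ['em', 'hi']]]]]]
);
my $em = $tree->find_by_tag_name('em');

my @ancestor_tags  = $em->lineage_tag_names;    # parent first, root last
my $ancestor_count = scalar(my @ancestors = $em->lineage);
my $depth          = $em->depth;                # same as the count above
my $root_is_tree   = ($em->root == $tree) ? 1 : 0;

print "ancestors: @ancestor_tags (depth $depth)\n";

$tree->delete;
```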

$h->descendants()

In list context, returns the list of all $h's descendant elements, listed in pre-order (i.e., an element appears before its content-elements). Text segments do not appear in the list. In scalar context, returns a count of all such elements.

$h->traverse(\&callback)
or $h->traverse(\&callback, $ignore_text)

Traverse the element and all of its children. For each node visited, the callback routine is called with these arguments:

    $_[0] : the node (element or text segment),
    $_[1] : a startflag, and
    $_[2] : the depth

If the $ignore_text parameter is given and true, then the callback will not be called for text content.

The startflag is 1 when we enter a node (i.e., in pre-order calls) and 0 when we leave the node (in post-order calls). Note, however, that post-order calls don't happen for nodes that are text segments or elements that are prototypically empty (like "br", "hr", etc.).

If the returned value is false from the pre-order call to the callback, then the children will not be traversed, nor will the callback be called in post-order for that node.

If $ignore_text is given and false (so we do visit text nodes, instead of ignoring them), then when text nodes are visited, we will also pass two extra arguments to the callback:

    $_[3] : the element that's the parent
             of this text node
    $_[4] : the index of this text node
             in its parent's content list

The source code for HTML::Element and HTML::TreeBuilder contain several examples of the use of the "traverse" method.

(Note: you should not change the structure of a tree while you are traversing it.)
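A minimal sketch of a traverse callback follows. It records pre-order and post-order visits separately so that the missing post-order call for "br" is visible; the tree and the array names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['div', ['p', 'one', ['br'], 'two'], ['p', 'three']]
);

my (@entered, @left, @texts);
$tree->traverse(sub {
    my ($node, $startflag, $depth) = @_;
    if (ref $node) {
        # Elements: pre-order call has startflag 1, post-order has 0.
        if ($startflag) { push @entered, $node->tag }
        else            { push @left,    $node->tag }
    }
    else {
        push @texts, $node;    # text segments are visited only once
    }
    return 1;                  # true, so the children get traversed
});

print "entered: @entered\n";
print "left:    @left\n";
print "texts:   @texts\n";

$tree->delete;
```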

$h->find_by_tag_name('tag', ...)

In list context, returns a list of elements at or under $h that have any of the specified tag names. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.

$h->find_by_attribute('attribute', 'value')

In a list context, returns a list of elements at or under $h that have the specified attribute, and have the given value for that attribute. In a scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
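The two finder methods behave differently in list and scalar context, which a short sketch can show. The class names and text below are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body',
        ['h1', {class => 'top'},   'Title'],
        ['p',  {class => 'intro'}, 'First'],
        ['p',  'Second'],
    ]
);

my @paragraphs  = $body->find_by_tag_name('p');          # list context: all
my $first_match = $body->find_by_tag_name('p', 'h1');    # scalar: first found
my $intro       = $body->find_by_attribute('class', 'intro');

my $paragraph_count = scalar @paragraphs;
my $first_tag       = $first_match->tag;    # h1 comes first in pre-order
my $intro_text      = $intro->as_text;

print "$paragraph_count paragraphs; first match is $first_tag; intro says $intro_text\n";

$body->delete;
```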

$h->attr_get_i('attribute')

In list context, returns a list consisting of the values of the given attribute for $self and for all its ancestors starting from $self and working its way up. Nodes with no such attribute are skipped. ("attr_get_i" stands for "attribute get, with inheritance".) In scalar context, returns the first such value, or undef if none.

Consider a document consisting of:

   <html lang='i-klingon'>
     <head><title>Pati Pata</title></head>
     <body>
       <h1 lang='la'>Stuff</h1>
       <p lang='es-MX' align='center'>
         Foo bar baz <cite>Quux</cite>.
       </p>
       <p>Hooboy.</p>
     </body>
   </html>

If $h is the "cite" element, $h->attr_get_i("lang") in list context will return the list ('es-MX', 'i-klingon'). In scalar context, it will return the value 'es-MX'.
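The same document can be reconstructed with new_from_lol to check this behaviour. This is a sketch; only the variable names are new.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html', {lang => 'i-klingon'},
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', {lang => 'la'}, 'Stuff'],
            ['p', {lang => 'es-MX', align => 'center'},
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);
my $cite = $tree->find_by_tag_name('cite');

my @all_langs    = $cite->attr_get_i('lang');    # list context
my $nearest_lang = $cite->attr_get_i('lang');    # scalar context

print "all: @all_langs; nearest: $nearest_lang\n";

$tree->delete;
```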

$h->extract_links() or $h->extract_links(@wantedTypes)

Returns links found by traversing the element and all of its children and looking for attributes (like "href" in an "a" element, or "src" in an "img" element) whose values represent links. The return value is a reference to an array. Each element of the array is a reference to an array with two items: the link-value and the element that has the attribute with that link-value. You may or may not end up using the element itself -- for some purposes, you may use only the link value.

You might specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only "a" and "img" elements, you could code it like this:

  for (@{  $e->extract_links('a', 'img')  }) {
      my($link, $element) = @$_;
      print
        "Hey, there's a ", $element->tag,
        " that links to $link\n";
  }

$h->same_as($i)

Returns true if $h and $i are both elements representing the same tree of elements, each with the same tag name, with the same explicit attributes (i.e., not counting attributes whose names start with "_"), and with the same content (textual, comments, etc.).

Sameness of descendant elements is tested, recursively, with $child1->same_as($child_2), and sameness of text segments is tested with $segment1 eq $segment2.
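One way to see same_as at work is to compare an element with its clone, then change an explicit attribute on the clone. This is a sketch with invented content.

```perl
use strict;
use warnings;
use HTML::Element;

my $first  = HTML::Element->new_from_lol(
    ['p', {class => 'x'}, 'hi ', ['b', 'there']]
);
my $second = $first->clone;

# A fresh clone has the same tags, attributes and content.
my $same_before = $first->same_as($second) ? 1 : 0;

# Changing an explicit attribute breaks the sameness.
$second->attr('class', 'y');
my $same_after = $first->same_as($second) ? 1 : 0;

print "before: $same_before, after: $same_after\n";

$first->delete;
$second->delete;
```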

$h = HTML::Element->new_from_lol(ARRAYREF)

Recursively constructs a tree of nodes, based on the (non-cyclic) data structure represented by ARRAYREF, where that is a reference to an array of arrays (of arrays (of arrays (etc.))). In each arrayref in that structure: arrayrefs are considered to designate a sub-tree representing children for the node constructed from the current arrayref; hashrefs are considered to contain attribute-value pairs to add to the element to be constructed from the current arrayref; text segments at the start of any arrayref will be considered to specify the name of the element to be constructed from the current arrayref; all other text segments will be considered to specify text segments as children for the current arrayref.

An example will hopefully make this more obvious:

  my $h = HTML::Element->new_from_lol(
    ['html',
      ['head',
        [ 'title', 'I like stuff!' ],
      ],
      ['body',
        {'lang', 'en-JP', _implicit => 1},
        'stuff',
        ['p', 'um, p < 4!', {'class' => 'par123'}],
        ['div', {foo => 'bar'}, '123'],
      ]
    ]
  );
  $h->dump;

Will print this:

  <html> @0
    <head> @0.0
      <title> @0.0.0
        "I like stuff!"
    <body lang="en-JP"> @0.1 (IMPLICIT)
      "stuff"
      <p class="par123"> @0.1.1
        "um, p < 4!"
      <div foo="bar"> @0.1.2
        "123"

And printing $h->as_HTML will give something like:

  <html><head><title>I like stuff!</title></head>
  <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
  <div foo="bar">123</div></body></html>

BUGS

* If you want to free the memory associated with a tree built of HTML::Element nodes, then you will have to delete it explicitly. See the $h->delete method, above.

* There's almost nothing to stop you from making a "tree" with cyclicities (loops) in it, which could, for example, make the traverse method go into an infinite loop. So don't make cyclicities! (If all you're doing is parsing HTML files, and looking at the resulting trees, this will never be a problem for you.)

* There's no way to represent comments or processing directives in a tree with HTML::Elements. Not yet, at least.

SEE ALSO

HTML::AsSubs, HTML::TreeBuilder

COPYRIGHT

Copyright 1995-1998 Gisle Aas, 1999-2000 Sean M. Burke.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Original author Gisle Aas <gisle@aas.no>; current maintainer Sean M. Burke, <sburke@netadventure.net>