
NAME

HTML::DOM - A Perl implementation of the HTML Document Object Model

VERSION

Version 0.005 (alpha)

WARNING: This module is still at an experimental stage. Only a few features have been implemented so far. The API is subject to change without notice.

SYNOPSIS

  use HTML::DOM;
  
  my $dom_tree = new HTML::DOM; # empty tree
  $dom_tree->parse_file($filename);
  
  $dom_tree->getElementsByTagName('body')->[0]->appendChild(
           $dom_tree->createElement('input')
  );
  
  print $dom_tree->documentElement->as_HTML, "\n";
  # (inherited from HTML::Element)

  my $text = $dom_tree->createTextNode('text');
  $text->data;              # get attribute
  $text->data('new value'); # set attribute
  

DESCRIPTION

This module implements the HTML Document Object Model by extending the HTML::Tree modules. The HTML::DOM class serves both as an HTML parser and as the document class.

METHODS

Non-DOM Methods

$tree = new HTML::DOM %options;

This class method constructs and returns a new HTML::DOM object. The %options, which are all optional, are as follows:

url

The value that the URL method will return. This value is also used by the domain method.

referrer

The value that the referrer method will return.

response

An HTTP::Response object. This will be used for information needed for writing cookies. It is expected to have a reference to a request object (accessible via its request method--see HTTP::Response). Passing a parameter to the 'cookie' method will be a no-op without this.

An HTTP::Cookies object. As with response, if you omit this, arguments passed to the cookie method will be ignored.

If referrer and url are omitted, they can be inferred from response.

$tree = new_from_file HTML::DOM
$tree = new_from_content HTML::DOM

Not yet implemented.

$tree->elem_handler($elem_name => sub { ... })

This method has no effect unless you call it before building the DOM tree. If you call this method, then, when the DOM tree is in the process of being built, the subroutine will be called after each $elem_name element is added to the tree. If you give '*' as the element name, the subroutine will be called for each element that does not have a handler. The subroutine's two arguments will be the tree itself and the element in question. The subroutine can call the DOM object's write method to insert HTML code into the source after the element.

Here is a lame example (which does not take Content-Script-Type headers or security into account):

  $tree->elem_handler(script => sub {
      my($document,$elem) = @_;
      return unless $elem->attr('type') eq 'application/x-perl';
      eval($elem->firstChild->data);
  });

  $tree->parse(
      '<p>The time is
           <script type="application/x-perl">
                $document->write(scalar localtime)
           </script>
           precisely.
       </p>'
  );
  $tree->eof;

  print $tree->documentElement->as_text, "\n";

$tree->parse_file($file)
$tree->parse(...)
$tree->eof()

These three methods simply call HTML::TreeBuilder's methods of the same name (q.v., and see also HTML::Element). Note, however, that parse_file may be called only once for each HTML::DOM object (since the object deletes its parser when it no longer needs it), unless you reset the object by calling the open method. Similarly, parse may not be called after eof, again unless you call open first. (That is what write does automatically, so I don't know why I even bother keeping the parse method at all; maybe I should do away with it.)

$tree->event_attr_handler
$tree->default_event_handler

See "EVENT HANDLING", below.

DOM Methods

(This section needs to be written.)

etc. etc. etc.
alinkColor
background
bgColor
fgColor
linkColor
vlinkColor

These six methods return (optionally set) the corresponding attributes of the body element. Note that most of the names do not map directly to the names of the attributes. fgColor refers to the text attribute. Those that end with 'linkColor' refer to the attributes of the same name but without the 'Color' on the end.

These don't work yet, and won't work until HTML::DOM::Element::Body is implemented.

title

Returns (or optionally sets) the title of the page.

referrer

Returns the page's referrer.

domain

Returns the domain name portion of the document's URL.

URL

Returns the document's URL.

body

Returns the body element, or the outermost frame set if the document has frames. You can set the body by passing an element as an argument, in which case the old body element is returned. In this case you should call delete on the return value to remove circular references, unless you plan to use it still. E.g.,

  $doc->body($new_body)->delete;

images
applets
links
forms
anchors

These five methods return a list of the appropriate elements in list context, or an HTML::DOM::Collection object in scalar context. In this latter case, the object will update automatically when the document is modified.
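The difference between the two calling contexts can be illustrated with a toy model in plain Perl. This is only a sketch: Toy::Collection, the images sub and the @document array below are hypothetical stand-ins, not HTML::DOM's own classes or code. The point is that a list is a snapshot, whereas a collection object looks the elements up again each time it is asked.

```perl
use strict;
use warnings;

# Toy model only: Toy::Collection and images() are hypothetical and are
# not HTML::DOM's classes or code.
{
    package Toy::Collection;
    # Holds a closure that re-scans the "document" on every access.
    sub new    { my ($class, $finder) = @_; return bless { finder => $finder }, $class }
    sub length { my @found = $_[0]{finder}->(); return scalar @found }
    sub item   { my @found = $_[0]{finder}->(); return $found[ $_[1] ] }
}

my @document = qw(p img p img);   # stand-in for a tree of elements

sub images {
    my $finder = sub { grep { $_ eq 'img' } @document };
    # List context: return the elements themselves.
    # Scalar context: return an object that searches again when queried.
    return wantarray ? $finder->() : Toy::Collection->new($finder);
}

my @snapshot   = images();   # list context: a plain list, fixed from now on
my $collection = images();   # scalar context: a live collection object

push @document, 'img';       # modify the "document"

# prints "list: 2, collection: 3"
printf "list: %d, collection: %d\n", scalar(@snapshot), $collection->length;
```

The list taken before the modification still holds two items, while the collection object reports three.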

TO DO: I need to make these methods cache the HTML collection objects that they create. Once I've done this, I can make list context use those objects, as well as scalar context.

cookie

This returns a string containing the document's cookies (the format may still change). If you pass an argument, it will set a cookie as well. Both Netscape-style and RFC2965-style cookie headers are supported.

open

Resets the document to the state it was in immediately after calling new. If you have a subclass that has its own attributes inside the object, they will be wiped out.

close

An alias to eof. It flushes any HTML code that might be buffered after calling write or parse, and makes the next call to write call open first.

write

When this is called from an element handler (see elem_handler, above), the value passed to it will be inserted into the HTML code after the current element when the element handler returns.

Otherwise it appends the HTML code to the current document (via parse), unless eof has been called, in which case it calls open before calling parse.

writeln

Just like write except that it appends "\n" to its argument. (Rather pointless, if you ask me. :-)

getElementById
getElementsByName

These two do what their names imply. The latter will return a list in list context, or a node list object in scalar context. Calling it in list context is probably more efficient.

createEvent

This currently ignores its args. Later the arg passed to it will determine into which class the newly-created event object is blessed.

EVENT HANDLING

HTML::DOM supports both the DOM Level 2 event model and the HTML 4 event model, though only in part so far. In particular, the Event base class is implemented, but none of its subclasses are, and no events are triggered automatically yet.

An event listener (aka handler) is a coderef, an object with a handleEvent method or an object with &{} overloading. HTML::DOM does not implement any classes that provide a handleEvent method, but will support any object that has one.
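The three kinds of listener can be sketched in plain Perl as follows. Everything here is illustrative: call_listener, My::Listener and My::Callable are hypothetical names, and the string 'click' stands in for a real event object. This is not how HTML::DOM invokes listeners internally, merely one way the three forms could be told apart.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of HTML::DOM): invoke a listener in
# whichever of the three accepted forms it was supplied.
sub call_listener {
    my ($listener, $event) = @_;
    if (blessed($listener) && $listener->can('handleEvent')) {
        return $listener->handleEvent($event);   # object with handleEvent
    }
    return $listener->($event);   # coderef, or object overloading &{}
}

{
    # An object with a handleEvent method.
    package My::Listener;
    sub new         { return bless {}, shift }
    sub handleEvent { my ($self, $event) = @_; return "handleEvent saw $event" }
}
{
    # An object that can be called like a coderef via &{} overloading.
    package My::Callable;
    use overload '&{}' => sub { my $self = shift; return sub { "overloaded &{} saw $_[0]" } };
    sub new { return bless {}, shift }
}

my @listeners = (
    sub { "coderef saw $_[0]" },   # a plain coderef
    My::Listener->new,
    My::Callable->new,
);

my @results = map { call_listener($_, 'click') } @listeners;
print "$_\n" for @results;
```

Each of the three listeners receives the same argument and reports which form handled it.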

To specify the default actions associated with an event, provide a subroutine via the default_event_handler method. The first argument will be the event object. For instance:

  $dom_tree->default_event_handler(sub {
         my($event) = @_;
         my $type = $event->type;
         my $tag = (my $target = $event->target)->nodeName;
         if ($type eq 'click' && $tag eq 'A') {
                # ...
         }
         # etc.
  });

default_event_handler without any arguments will return the currently assigned coderef. With an argument it will return the old one after assigning the new one.

HTML::DOM::Node's dispatchEvent method triggers the appropriate event listeners, but does not call any default actions associated with it. The return value is a boolean that indicates whether the default action should be taken.

HTML::DOM::Node's trigger_event method will trigger the event for real. It will call dispatchEvent and, provided it returns true, will call the default event handler.
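The relationship between the two methods can be modelled in a few lines of plain Perl. This is a toy sketch of the control flow only. Toy::Event and Toy::Node are hypothetical, add_listener is a simplified stand-in for a real listener registration method, and the default handler is stored on the node itself for brevity rather than on the document.

```perl
use strict;
use warnings;

# Toy model only: these packages are hypothetical and merely mimic the
# control flow described above; they are not HTML::DOM's implementation.
{
    package Toy::Event;
    sub new            { my ($class, $type) = @_; return bless { type => $type, cancelled => 0 }, $class }
    sub type           { return $_[0]{type} }
    sub preventDefault { $_[0]{cancelled} = 1; return }
    sub cancelled      { return $_[0]{cancelled} }
}
{
    package Toy::Node;
    sub new { return bless { listeners => [], default => undef }, shift }

    # Getter/setter: returns the current handler, or the old one when setting.
    sub default_event_handler {
        my $self = shift;
        my $old  = $self->{default};
        $self->{default} = shift if @_;
        return $old;
    }

    sub add_listener { my ($self, $code) = @_; push @{ $self->{listeners} }, $code; return }

    # Runs the listeners only; true means "go ahead with the default action".
    sub dispatchEvent {
        my ($self, $event) = @_;
        $_->($event) for @{ $self->{listeners} };
        return !$event->cancelled;
    }

    # Runs the listeners, then the default handler unless it was prevented.
    sub trigger_event {
        my ($self, $event) = @_;
        return unless $self->dispatchEvent($event);
        my $default = $self->{default} or return;
        return $default->($event);
    }
}

my @log;
my $node    = Toy::Node->new;
my $handler = sub { push @log, 'default action for ' . $_[0]->type };
$node->default_event_handler($handler);           # no previous handler yet

$node->trigger_event(Toy::Event->new('click'));   # default action runs

$node->add_listener(sub { $_[0]->preventDefault });
$node->trigger_event(Toy::Event->new('click'));   # a listener cancels it

# prints "1 default action(s) ran"
print scalar(@log), " default action(s) ran\n";
```

Only the first trigger reaches the default handler, because in the second a listener calls preventDefault and dispatchEvent therefore returns false.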

The event_attr_handler method can be used to assign a coderef that will turn text assigned to an event attribute (e.g., onclick) into a listener. The arguments to the routine will be (0) the element, (1) the name (aka type) of the event (without the initial 'on') and (2) the value of the attribute. As with default_event_handler, you can replace an existing handler with a new one, in which case the old handler is returned. If you call this method without arguments, it returns the current handler. Here is an example of its use, which assumes that handlers are Perl code:

  $dom_tree->event_attr_handler(sub {
          my($elem, $name, $code) = @_;
          my $sub = eval "sub { $code }";
          return sub {
                  my($event) = @_;
                  local *_ = \$elem;
                  my $ret = &$sub;
                  defined $ret and !$ret and
                          $event->preventDefault;
          };
  });

The event attribute handler will be called whenever an element attribute whose name begins with 'on' (case-tolerant) is modified.

CLASSES AND DOM INTERFACES

Here is the inheritance hierarchy of HTML::DOM's various classes, along with the DOM interfaces those classes implement. The classes in the left column all begin with 'HTML::', which is omitted for brevity. Items in brackets have not yet been implemented.

  Class Inheritance Hierarchy             Interfaces
  ---------------------------             ----------
  
  DOM::Exception                          DOMException, EventException
  DOM::Implementation                     DOMImplementation
  Element
      DOM::Node                           Node, EventTarget
          DOM::DocumentFragment           DocumentFragment
          DOM                             Document, HTMLDocument,
                                            DocumentEvent
          DOM::CharacterData              CharacterData
              DOM::Text                   Text
              DOM::Comment                Comment
          DOM::Element                    Element, HTMLElement
              DOM::Element::HTML          HTMLHtmlElement
              DOM::Element::Head          HTMLHeadElement
              DOM::Element::Link          HTMLLinkElement
              DOM::Element::Title         HTMLTitleElement
              DOM::Element::Meta          HTMLMetaElement
              DOM::Element::Base          HTMLBaseElement
              DOM::Element::IsIndex       HTMLIsIndexElement
              DOM::Element::Style         HTMLStyleElement
              DOM::Element::Body          HTMLBodyElement
             [DOM::Element::Form          HTMLFormElement]
             [DOM::Element::Select        HTMLSelectElement]
             [DOM::Element::OptGroup      HTMLOptGroupElement]
             [DOM::Element::Option        HTMLOptionElement]
             [DOM::Element::Input         HTMLInputElement]
             [DOM::Element::TextArea      HTMLTextAreaElement]
             [DOM::Element::Button        HTMLButtonElement]
             [DOM::Element::Label         HTMLLabelElement]
             [DOM::Element::FieldSet      HTMLFieldSetElement]
             [DOM::Element::Legend        HTMLLegendElement]
             [DOM::Element::UL            HTMLUListElement]
             [DOM::Element::OL            HTMLOListElement]
             [DOM::Element::DL            HTMLDListElement]
             [DOM::Element::Dir           HTMLDirectoryElement]
             [DOM::Element::Menu          HTMLMenuElement]
             [DOM::Element::LI            HTMLLIElement]
             [DOM::Element::Div           HTMLDivElement]
             [DOM::Element::P             HTMLParagraphElement]
             [DOM::Element::Heading       HTMLHeadingElement]
             [DOM::Element::Quote         HTMLQuoteElement]
             [DOM::Element::Pre           HTMLPreElement]
             [DOM::Element::Br            HTMLBRElement]
             [DOM::Element::BaseFont      HTMLBaseFontElement]
             [DOM::Element::Font          HTMLFontElement]
             [DOM::Element::HR            HTMLHRElement]
             [DOM::Element::Mod           HTMLModElement]
             [DOM::Element::A             HTMLAnchorElement]
             [DOM::Element::Img           HTMLImageElement]
             [DOM::Element::Object        HTMLObjectElement]
             [DOM::Element::Param         HTMLParamElement]
             [DOM::Element::Applet        HTMLAppletElement]
             [DOM::Element::Map           HTMLMapElement]
             [DOM::Element::Area          HTMLAreaElement]
             [DOM::Element::Script        HTMLScriptElement]
             [DOM::Element::Table         HTMLTableElement]
             [DOM::Element::Caption       HTMLTableCaptionElement]
             [DOM::Element::TableColumn   HTMLTableColElement]
             [DOM::Element::TableSection  HTMLTableSectionElement]
             [DOM::Element::TR            HTMLTableRowElement]
             [DOM::Element::TableCell     HTMLTableCellElement]
             [DOM::Element::FrameSet      HTMLFrameSetElement]
             [DOM::Element::Frame         HTMLFrameElement]
             [DOM::Element::IFrame        HTMLIFrameElement]
  DOM::NodeList                           NodeList
  DOM::NodeList::Magic                    NodeList
  DOM::NamedNodeMap                       NamedNodeMap
  DOM::Attr                               Node, Attr
  DOM::Collection                         HTMLCollection
  DOM::Event                              Event

Although HTML::DOM::Node inherits from HTML::Element, the interface is not entirely compatible, so don't rely on any HTML::Element methods.

The EventListener interface is not implemented by HTML::DOM, but is supported. See "EVENT HANDLING", above.

IMPLEMENTATION NOTES

  • Node attributes are accessed via methods of the same name. When the method is invoked, the current value is returned. If an argument is supplied, the attribute is set (unless it is read-only) and its old value returned.

  • Where the DOM spec. says to use null, HTML::DOM uses undef or an empty list.

  • Instead of UTF-16 strings, HTML::DOM uses Perl's Unicode strings (which happen to be stored as UTF-8 internally). The only significant difference this makes is to length, substringData and other methods of Text and Comment nodes. These methods behave in a Perlish way (i.e., the offsets and lengths are specified in Unicode characters, not in UTF-16 code units). The alternate methods length16, substringData16 et al. use UTF-16 for offsets and are standards-compliant in that regard (but the string returned by substringData is still a regular Perl string).
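To see why the two ways of counting differ, here is a small plain-Perl sketch using the core Encode module. The substr16 helper is hypothetical and is not HTML::DOM's substringData16; it only illustrates what counting in UTF-16 means for a character outside the Basic Multilingual Plane.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane, so it
# is one Perl character but two UTF-16 code units (a surrogate pair).
my $str = "a\x{1D11E}b";

my $chars   = length $str;                            # Perl characters
my $units16 = length(encode('UTF-16LE', $str)) / 2;   # UTF-16 code units

# Hypothetical helper, not HTML::DOM's code: substr with UTF-16 offsets.
sub substr16 {
    my ($string, $offset, $count) = @_;
    my $bytes = encode('UTF-16LE', $string);
    my $slice = substr($bytes, $offset * 2, $count * 2);
    return decode('UTF-16LE', $slice);
}

# prints "characters: 3, UTF-16 units: 4"
printf "characters: %d, UTF-16 units: %d\n", $chars, $units16;

my $by_chars = substr($str, 2, 1);     # offset 2, counted in characters
my $by_units = substr16($str, 3, 1);   # offset 3, counted in UTF-16 units
print "both offsets select the same letter\n" if $by_chars eq $by_units;
```

The final 'b' sits at offset 2 when counting characters but at offset 3 when counting in UTF-16, which is the kind of discrepancy the alternate methods exist to accommodate.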
