
NAME

HTML::DOM - A Perl implementation of the HTML Document Object Model

VERSION

Version 0.047 (alpha)

WARNING: This module is still at an experimental stage. The API is subject to change without notice.

SYNOPSIS

  use HTML::DOM;
  
  my $dom_tree = new HTML::DOM; # empty tree
  $dom_tree->write($source_code);
  $dom_tree->close;
  
  my $other_dom_tree = new HTML::DOM;
  $other_dom_tree->parse_file($filename);
  
  $dom_tree->getElementsByTagName('body')->[0]->appendChild(
           $dom_tree->createElement('input')
  );
  
  print $dom_tree->innerHTML, "\n";

  my $text = $dom_tree->createTextNode('text');
  $text->data;              # get attribute
  $text->data('new value'); # set attribute
  

DESCRIPTION

This module implements the HTML Document Object Model by extending the HTML::Tree modules. The HTML::DOM class serves both as an HTML parser and as the document class.

The following DOM modules are currently supported:

  Feature         Version (aka level)
  -------         -------------------
  HTML            2.0
  Core            2.0
  Events          2.0
  UIEvents        2.0
  MouseEvents     2.0
  MutationEvents  2.0
  HTMLEvents      2.0
  StyleSheets     2.0
  CSS             2.0 (partially)
  CSS2            2.0
  Views           2.0

StyleSheets, CSS and CSS2 are actually provided by CSS::DOM. This list corresponds to CSS::DOM versions 0.02 to 0.08.

METHODS

Construction and Parsing

$tree = new HTML::DOM %options;

This class method constructs and returns a new HTML::DOM object. The %options, which are all optional, are as follows:

url

The value that the URL method will return. This value is also used by the domain method.

referrer

The value that the referrer method will return.

response

An HTTP::Response object. This will be used for information needed for writing cookies. It is expected to have a reference to a request object (accessible via its request method--see HTTP::Response). Passing a parameter to the 'cookie' method will be a no-op without this.

weaken_response

If this is passed a true value, then the HTML::DOM object will hold a weak reference to the response.

cookie_jar

An HTTP::Cookies object. As with response, if you omit this, arguments passed to the cookie method will be ignored.

charset

The original character set of the document. This does not affect parsing via the write method (which always assumes Unicode). parse_file will use this, if specified, or HTML::Encoding otherwise. HTML::DOM::Form's make_request method uses this to encode form data unless the form has a valid 'accept-charset' attribute.

If referrer and url are omitted, they can be inferred from response.
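As a minimal sketch of the constructor options described above (the URLs are placeholders chosen for this example, not values the module requires):

```perl
use strict;
use warnings;
use HTML::DOM;

# All values below are example placeholders.
my $tree = HTML::DOM->new(
    url      => 'http://www.example.com/docs/index.html',
    referrer => 'http://www.example.com/',
    charset  => 'utf-8',
);

print $tree->URL,      "\n";   # the value of the url option
print $tree->referrer, "\n";   # the value of the referrer option
print $tree->domain,   "\n";   # host part of the url option
print $tree->charset,  "\n";   # the value of the charset option
```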

$tree->elem_handler($elem_name => sub { ... })

If you call this method first, then, when the DOM tree is in the process of being built (as a result of a call to write or parse_file), the subroutine will be called after each $elem_name element is added to the tree. If you give '*' as the element name, the subroutine will be called for each element that does not have a handler. The subroutine's two arguments will be the tree itself and the element in question. The subroutine can call the DOM object's write method to insert HTML code into the source after the element.

Here is a lame example (which does not take Content-Script-Type headers or security into account):

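  $tree->elem_handler(script => sub {
      my($document,$elem) = @_;
      # Skip scripts of other types (or with no type attribute at all)
      return unless ($elem->attr('type') || '') eq 'application/x-perl';
      # The evaluated code can see the lexical $document declared above
      eval($elem->firstChild->data);
  });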

  $tree->write(
      '<p>The time is
           <script type="application/x-perl">
                $document->write(scalar localtime)
           </script>
           precisely.
       </p>'
  );
  $tree->close;

  print $tree->documentElement->as_text, "\n";

(Note: HTML::DOM::Element's content_offset method might come in handy for reporting line numbers for script errors.)

css_url_fetcher( \&sub )

With this method you can provide a subroutine that fetches URLs referenced by 'link' tags. Its sole argument is the URL, which is made absolute based on the HTML page's own base URL (it is assumed that this is absolute). It should return undef or an empty list on failure. Upon success, it should return just the CSS code, if it has been decoded (and is in Unicode), or, if it has not been decoded, the CSS code followed by decode => 1. See "STYLE SHEET ENCODING" in CSS::DOM for details on when you should or should not decode it. (Note that HTML::DOM automatically provides an encoding hint based on the HTML document.)

HTML::DOM passes the result of the url fetcher to CSS::DOM and turns it into a style sheet object accessible via the link element's sheet method.
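The following is a rough sketch of a fetcher that serves style sheets from an in-memory hash instead of the network. The %css_for hash, the @requested log and the URLs are assumptions made up for this example; a real fetcher would typically use LWP or similar. It returns already-decoded CSS text on success and an empty list on failure, as described above.

```perl
use strict;
use warnings;
use HTML::DOM;

# Hypothetical in-memory "web": absolute URL => decoded CSS text.
my %css_for = (
    'http://www.example.com/css/site.css' => 'p { color: green }',
);
my @requested;   # URLs the fetcher was asked for (for illustration only)

my $tree = HTML::DOM->new(url => 'http://www.example.com/index.html');
$tree->css_url_fetcher(sub {
    my $url = shift;                        # made absolute by HTML::DOM
    push @requested, "$url";
    return unless exists $css_for{$url};    # undef/empty list on failure
    return $css_for{$url};                  # decoded CSS, so no decode => 1
});

$tree->write('<link rel="stylesheet" href="css/site.css"><p>text</p>');
$tree->close;

# The resulting style sheet object is available from the link element.
my $link  = $tree->getElementsByTagName('link')->[0];
my $sheet = $link->sheet;
```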

$tree->write(...) (DOM method)

This parses the HTML code passed to it, adding it to the end of the document. It assumes that its input is a normal Perl Unicode string. Like HTML::TreeBuilder's parse method, it can take a coderef.

When it is called from an element handler (see elem_handler, above), the value passed to it will be inserted into the HTML code after the current element when the element handler returns. (In this case a coderef won't do--maybe that will be added later.)

If the close method has been called, write will call open before parsing the HTML code passed to it.
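A short sketch of the behaviour described above: the source is fed to write in two chunks, and a later write after close starts a fresh document. The markup is invented for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;

# The source may arrive in pieces; nothing is final until close is called.
$tree->write('<p>Hello, ');
$tree->write('world!</p>');
$tree->close;
my $paragraphs_before = $tree->getElementsByTagName('p')->length;

# close has been called, so this write calls open first,
# which discards the old tree.
$tree->write('<h1>Replaced</h1>');
$tree->close;
my $paragraphs_after = $tree->getElementsByTagName('p')->length;
my $headings_after   = $tree->getElementsByTagName('h1')->length;

print "$paragraphs_before $paragraphs_after $headings_after\n";
```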

$tree->writeln(...) (DOM method)

Just like write except that it appends "\n" to its argument and does not work with code refs. (Rather pointless, if you ask me. :-)

$tree->close() (DOM method)

Call this method to signal to the parser that the end of the HTML code has been reached. It will then parse any residual HTML that happens to be buffered. It also makes the next write call open.

$tree->open (DOM method)

Deletes the HTML tree, resetting it so that it has just an <html> element, and a parser hungry for HTML code.

$tree->parse_file($file)

This method takes a file name or handle and parses the content, (effectively) calling close afterwards. In the former case (a file name), HTML::Encoding will be used to detect the encoding. In the latter (a file handle), you'll have to binmode it yourself. This could be considered a bug. If you have a solution to this (how to make HTML::Encoding detect an encoding from a file handle), please let me know.

As of version 0.12, this method returns true upon success, or undef/empty list on failure.

$tree->charset

This method returns the name of the character set that was passed to new, or, if that was not given, that which parse_file used.

It returns undef if new was not given a charset and if parse_file was not used or was passed a file handle.

You can also set the charset by passing an argument, in which case the old value is returned.
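Here is a small sketch combining parse_file and charset. It writes a temporary file first so that the example is self-contained; the file contents are made up. Because a charset is passed to new, HTML::Encoding should not be needed here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::DOM;

# Create a small UTF-8 encoded HTML file to parse.
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print $fh "<title>caf\x{e9}</title><p>text</p>";
close $fh or die "Cannot close $filename: $!";

# The charset option tells parse_file how to decode the file.
my $tree = HTML::DOM->new(charset => 'utf-8');
$tree->parse_file($filename)
    or die "Could not parse $filename";

# Getting the charset, then setting it (the old value is returned).
my $current = $tree->charset;
my $old     = $tree->charset('iso-8859-1');
```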

Other DOM Methods

doctype

Returns nothing

implementation

Returns the HTML::DOM::Implementation object.

documentElement

Returns the <html> element.

createElement ( $tag )
createDocumentFragment
createTextNode ( $text )
createComment ( $text )
createAttribute ( $name )

Each of these creates a node of the appropriate type.

createProcessingInstruction
createEntityReference

These two throw an exception.

getElementsByTagName ( $name )

$name can be the name of the tag, or '*', to match all tag names. This returns a node list object in scalar context, or a list in list context.
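For instance, this sketch (with invented markup) shows the two calling contexts side by side:

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<ul><li>one</li><li>two</li><li>three</li></ul>');
$tree->close;

# List context: a plain Perl list of elements.
my @items = $tree->getElementsByTagName('li');

# Scalar context: a node list object, usable via methods or as an array ref.
my $list  = $tree->getElementsByTagName('li');
my $count = $list->length;
my $first = $list->item(0)->firstChild->data;   # text of the first item
my $last  = $list->[2]->firstChild->data;       # array-ref style access

print scalar(@items), " $count $first $last\n";
```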

importNode ( $node, $deep )

Clones the $node, setting its ownerDocument attribute to the document with which this method is called. If $deep is true, the $node will be cloned recursively.
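A sketch of copying a subtree from one document into another, assuming two small documents invented for the purpose. Scalar::Util's refaddr is used for the identity check so that any overloading on the objects is bypassed.

```perl
use strict;
use warnings;
use Scalar::Util 'refaddr';
use HTML::DOM;

my $source = HTML::DOM->new;
$source->write('<div id="box"><p>payload</p></div>');
$source->close;

my $dest = HTML::DOM->new;
$dest->write('<body></body>');
$dest->close;

# Deep-clone the div; the copy belongs to $dest, the original stays put.
my $original = $source->getElementById('box');
my $copy     = $dest->importNode($original, 1);
$dest->body->appendChild($copy);

print "copy is owned by the destination document\n"
    if refaddr($copy->ownerDocument) == refaddr($dest);
```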

alinkColor
background
bgColor
fgColor
linkColor
vlinkColor

These six methods return (and optionally set) the corresponding attributes of the body element. Note that most of the names do not map directly to the names of the attributes. fgColor refers to the text attribute. Those that end with 'linkColor' refer to the attributes of the same name but without the 'Color' on the end.
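To illustrate the name mapping, here is a brief sketch (the colour values and markup are arbitrary):

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<body><p>text</p></body>');
$tree->close;

# Set three of the attributes through the document object.
$tree->fgColor('#000000');     # sets the body's text attribute
$tree->linkColor('#0000ff');   # sets the body's link attribute
$tree->bgColor('#ffffff');     # sets the body's bgcolor attribute

# Read them back directly from the body element.
my $body = $tree->body;
print join(' ', map { $body->getAttribute($_) } qw(text link bgcolor)), "\n";
```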

title

Returns (or optionally sets) the title of the page.

referrer

Returns the page's referrer.

domain

Returns the domain name portion of the document's URL.

URL

Returns the document's URL.

body

Returns the body element, or the outermost frame set if the document has frames. You can set the body by passing an element as an argument, in which case the old body element is returned.

images
applets
links
forms
anchors

These five methods each return a list of the appropriate elements in list context, or an HTML::DOM::Collection object in scalar context. In this latter case, the object will update automatically when the document is modified.

In the case of forms you can access those by using the HTML::DOM object itself as a hash. I.e., you can write $doc->{f} instead of $doc->forms->{f}.

cookie

This returns a string containing the document's cookies (the format may still change). If you pass an argument, it will set a cookie as well. Both Netscape-style and RFC2965-style cookie headers are supported.

getElementById
getElementsByName
getElementsByClassName

These three do what their names imply. The latter two will return a list in list context, or a node list object in scalar context. Calling them in list context is probably more efficient.
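A quick sketch of the lookup methods, using markup invented for the example:

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write(
    '<div id="main">'
  . '<p class="note">a</p><p class="note big">b</p>'
  . '<input name="q">'
  . '</div>'
);
$tree->close;

my $main  = $tree->getElementById('main');          # a single element
my @notes = $tree->getElementsByClassName('note');  # list context
my @named = $tree->getElementsByName('q');          # list context

print scalar(@notes), ' ', scalar(@named), "\n";
```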

createEvent ( $category )

Creates a new event object, believe it or not.

The $category is the DOM event category, which determines what type of event object will be returned. The currently supported event categories are MouseEvents, UIEvents, HTMLEvents and MutationEvents.

You can omit the $category to create an instance of the event base class (not officially part of the DOM).

defaultView

Returns the HTML::DOM::View object associated with the document.

There is no such object by default; you have to put one there yourself.

Although it is supposed to be read-only according to the DOM, you can set this attribute by passing an argument to it. It is still marked as read-only in %HTML::DOM::Interface.

If you do set it, it is recommended that the object be a subclass of HTML::DOM::View.

This attribute holds a weak reference to the object.

styleSheets

Returns a CSS::DOM::StyleSheetList of the document's style sheets, or a simple list in list context.

innerHTML

Serialises and returns the HTML document. If you pass an argument, it will set the contents of the document via open, write and close, returning a serialisation of the old contents.
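A minimal sketch of reading and replacing the whole document through innerHTML (the markup is made up; the exact serialisation is not shown because it depends on the serialiser):

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p>old content</p>');
$tree->close;

# Getter: serialise the current document.
my $html = $tree->innerHTML;

# Setter: replace the document; the old serialisation is returned.
my $previous = $tree->innerHTML('<h1>new content</h1>');

print "replaced\n" if $tree->getElementsByTagName('h1')->length;
```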

location
set_location_object (non-DOM)

location returns the location object, if you've put one there with set_location_object. HTML::DOM doesn't actually implement such an object itself, but provides the appropriate magic to make $doc->location($foo) translate into $doc->location->href($foo).

BTW, the location object had better be true when used as a boolean, or HTML::DOM will think it doesn't exist.

lastModified

This method returns the document's modification date as gleaned from the response object passed to the constructor, in MM/DD/YYYY HH:MM:SS format.

If there is no modification date, an empty string is returned, but this may change in the future.

Other (Non-DOM) Methods

(See also "EVENT HANDLING", below.)

$tree->base

Returns the base URL of the page; either from a <base href=...> tag or the URL passed to new.

$tree->magic_forms

This is mainly for internal use. This returns a boolean indicating whether the parser needed to associate form controls ('formies') with a form that did not contain them. This happens when a closing </form> tag is missing and the form is closed implicitly, but a form control is encountered later.

HASH ACCESS

You can use an HTML::DOM object as a hash ref to access its form elements by name. So $doc->{yayaya} is short for $doc->forms->{yayaya}.
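A small sketch of the shortcut, assuming a form named 'search' that was invented for this example. Scalar::Util's refaddr compares the two results without going through any overloading.

```perl
use strict;
use warnings;
use Scalar::Util 'refaddr';
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<form name="search"><input name="q" value="perl"></form>');
$doc->close;

my $short = $doc->{search};          # hash access on the document
my $long  = $doc->forms->{search};   # the long way round

print "same form element\n" if refaddr($short) == refaddr($long);
```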

EVENT HANDLING

HTML::DOM supports both the DOM Level 2 event model and the HTML 4 event model.

Throughout this documentation, we make use of HTML 5's distinction between handlers and listeners: An event handler is the result of an HTML attribute beginning with 'on', e.g. onsubmit. These are also accessible via the DOM. (We also use the word 'handler' in other contexts, such as the 'default event handler'.) Event listeners are registered solely with the addEventListener method and can be removed with removeEventListener.

HTML::DOM accepts as an event handler a coderef, an object with a call_with method, or an object with &{} overloading. If the call_with method is present, it is called with the current event target as the first argument and the event object as the second. This is to allow for objects that wrap JavaScript functions (which must be called with the event target as the this value).

An event listener is a coderef, an object with a handleEvent method or an object with &{} overloading. HTML::DOM does not implement any classes that provide a handleEvent method, but will support any object that has one.

Listeners and handlers differ in one important aspect. A listener has to call preventDefault on the event object to cancel the default action. A handler simply returns a defined false value (except for the mouseover event, which must return a true value to cancel the default).
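The listener side of this can be sketched as follows. The markup and the choice of a 'change' event are assumptions for the example; the listener cancels the default action by calling preventDefault, so dispatchEvent is expected to report that the default should not be taken.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p id="target">text</p>');
$tree->close;
my $elem = $tree->getElementById('target');

# Register a listener (third argument: capture phase or not).
my $seen = 0;
$elem->addEventListener(change => sub {
    my $event = shift;
    $seen++;
    $event->preventDefault;   # a listener cancels the default this way
}, 0);

# Create and dispatch a cancelable, bubbling event.
my $event = $tree->createEvent('HTMLEvents');
$event->initEvent('change', 1, 1);   # type, bubbles, cancelable
my $proceed = $elem->dispatchEvent($event);

print "listener ran $seen time(s); default ",
      ($proceed ? "allowed" : "cancelled"), "\n";
```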

Default Actions

Default actions that HTML::DOM is capable of handling internally (such as triggering a DOMActivate event when an element is clicked, and triggering a form's submit event when the submit button is activated) are dealt with automatically. You don't have to worry about those. For others, read on....

To specify the default actions associated with an event, provide a subroutine via the default_event_handler_for and default_event_handler methods. (Since this is not part of the DOM, you can't use an object with a handleEvent method here.)

With the former, you can specify the default action to be taken when a particular type of event occurs. The currently supported types are:

  submit         when a form is submitted
  link           called when a link is activated (DOMActivate event)

Pass the type of event as the first argument and a code ref as the second argument. When the code ref is called, its sole argument will be the event object. For instance:

  $dom_tree->default_event_handler_for( link => sub {
         my $event = shift;
         go_to( $event->target->href );
  });
  sub go_to { ... }

default_event_handler_for with just one argument returns the currently assigned coderef. With two arguments it returns the old one after assigning the new one.

Use default_event_handler (without the _for) to specify a fallback subroutine that will be used for events not in the list above, and for events in the list above that do not have subroutines assigned to them. Without any arguments it will return the currently assigned coderef. With an argument it will return the old one after assigning the new one.

Dispatching Events

HTML::DOM::Node's dispatchEvent method triggers the appropriate event listeners, but does not call any default actions associated with it. The return value is a boolean that indicates whether the default action should be taken.

HTML::DOM::Node's trigger_event method will trigger the event for real. It will call dispatchEvent and, provided it returns true, will call the default event handler.
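The difference between the two methods might be sketched like this. The markup, the 'change' event type, the @log array and the new_event helper are all made up for the example. Going by the description above, only the trigger_event call should reach the fallback default handler.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p id="target">text</p>');
$tree->close;
my $elem = $tree->getElementById('target');

# Fallback default action, used for event types without a specific one.
my @log;
$tree->default_event_handler(sub {
    my $event = shift;
    push @log, 'default action for ' . $event->type;
});

# Hypothetical helper that builds a fresh cancelable 'change' event.
sub new_event {
    my $doc   = shift;
    my $event = $doc->createEvent('HTMLEvents');
    $event->initEvent('change', 1, 1);
    return $event;
}

$elem->dispatchEvent(new_event($tree));   # listeners only
$elem->trigger_event(new_event($tree));   # listeners, then default action

print "$_\n" for @log;
```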

HTML Event Attributes

The event_attr_handler can be used to assign a coderef that will turn text assigned to an event attribute (e.g., onclick) into an event handler. The arguments to the routine will be (0) the element, (1) the name (aka type) of the event (without the initial 'on'), (2) the value of the attribute and (3) the offset within the source of the attribute's value. (Actually, if the value is within quotes, it is the offset of the first quotation mark. Also, it will be undef for generated HTML [source code passed to the write method by an element handler].) As with default_event_handler, you can replace an existing handler with a new one, in which case the old handler is returned. If you call this method without arguments, it returns the current handler. Here is an example of its use, that assumes that handlers are Perl code:

  $dom_tree->event_attr_handler(sub {
          my($elem, $name, $code, $offset) = @_;
          my $sub = eval "sub { $code }";
          return sub {
                  local *_ = \$elem;
                  &$sub;
          };
  });

The event attribute handler will be called whenever an element attribute whose name begins with 'on' (case-tolerant) is modified. (For efficiency's sake, I may change it to call the event attribute handler only when the event is triggered, so it is not called unnecessarily.)

When an Event Handler Dies

Use error_handler to assign a coderef that will be called whenever an event listener (or handler) raises an error. The error will be contained in $@.
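As a sketch, the following collects errors from a listener that dies. The @errors array, the markup and the 'change' event type are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p id="target">text</p>');
$tree->close;
my $elem = $tree->getElementById('target');

# Collect errors instead of letting them pass unnoticed.
my @errors;
$tree->error_handler(sub { push @errors, "$@" });

# A listener that always fails.
$elem->addEventListener(change => sub { die "oops\n" }, 0);

my $event = $tree->createEvent('HTMLEvents');
$event->initEvent('change', 1, 1);   # type, bubbles, cancelable
$elem->dispatchEvent($event);

print "caught: $_" for @errors;
```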

$tree->event_parent
$tree->event_parent( $new_val )

This method lets you provide an object that is added to the top of the event dispatch chain. E.g., if you want the view object (the value of defaultView, aka the window) to have event handlers called before the document in the capture phase, and after it in the bubbling phase, you can set it like this (see also "defaultView", above):

  $tree->event_parent( $tree->defaultView );

This holds a weak reference.

$tree->event_listeners_enabled
$tree->event_listeners_enabled( $new_val )

This attribute, which is true by default, can be used to disable event handlers and listeners. (Default event handlers [see "Default Actions", above] still run, though.)
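A rough sketch of switching listeners off and on again. The fire helper, the counter and the 'change' event type are invented for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p id="target">text</p>');
$tree->close;
my $elem = $tree->getElementById('target');

my $calls = 0;
$elem->addEventListener(change => sub { $calls++ }, 0);

# Hypothetical helper: dispatch a fresh 'change' event at the element.
sub fire {
    my ($doc, $node) = @_;
    my $event = $doc->createEvent('HTMLEvents');
    $event->initEvent('change', 1, 1);
    return $node->dispatchEvent($event);
}

$tree->event_listeners_enabled(0);   # listeners are now ignored
fire($tree, $elem);
my $calls_while_disabled = $calls;

$tree->event_listeners_enabled(1);   # back to the default
fire($tree, $elem);

print "$calls_while_disabled $calls\n";
```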

CLASSES AND DOM INTERFACES

Here is the inheritance hierarchy of HTML::DOM's various classes, along with the DOM interfaces those classes implement. The classes in the left column all begin with 'HTML::', which is omitted for brevity. Items in brackets have not yet been implemented. (See also HTML::DOM::Interface for a machine-readable list of standard methods.)

  Class Inheritance Hierarchy             Interfaces
  ---------------------------             ----------
  
  DOM::Exception                          DOMException, EventException
  DOM::Implementation                     DOMImplementation,
                                           [DOMImplementationCSS]
  Element
      DOM::Node                           Node, EventTarget
          DOM::DocumentFragment           DocumentFragment
          DOM                             Document, HTMLDocument,
                                            DocumentEvent, DocumentView,
                                            DocumentStyle, [DocumentCSS]
          DOM::CharacterData              CharacterData
              DOM::Text                   Text
              DOM::Comment                Comment
          DOM::Element                    Element, HTMLElement,
                                            ElementCSSInlineStyle
              DOM::Element::HTML          HTMLHtmlElement
              DOM::Element::Head          HTMLHeadElement
              DOM::Element::Link          HTMLLinkElement, LinkStyle
              DOM::Element::Title         HTMLTitleElement
              DOM::Element::Meta          HTMLMetaElement
              DOM::Element::Base          HTMLBaseElement
              DOM::Element::IsIndex       HTMLIsIndexElement
              DOM::Element::Style         HTMLStyleElement, LinkStyle
              DOM::Element::Body          HTMLBodyElement
              DOM::Element::Form          HTMLFormElement
              DOM::Element::Select        HTMLSelectElement
              DOM::Element::OptGroup      HTMLOptGroupElement
              DOM::Element::Option        HTMLOptionElement
              DOM::Element::Input         HTMLInputElement
              DOM::Element::TextArea      HTMLTextAreaElement
              DOM::Element::Button        HTMLButtonElement
              DOM::Element::Label         HTMLLabelElement
              DOM::Element::FieldSet      HTMLFieldSetElement
              DOM::Element::Legend        HTMLLegendElement
              DOM::Element::UL            HTMLUListElement
              DOM::Element::OL            HTMLOListElement
              DOM::Element::DL            HTMLDListElement
              DOM::Element::Dir           HTMLDirectoryElement
              DOM::Element::Menu          HTMLMenuElement
              DOM::Element::LI            HTMLLIElement
              DOM::Element::Div           HTMLDivElement
              DOM::Element::P             HTMLParagraphElement
              DOM::Element::Heading       HTMLHeadingElement
              DOM::Element::Quote         HTMLQuoteElement
              DOM::Element::Pre           HTMLPreElement
              DOM::Element::Br            HTMLBRElement
              DOM::Element::BaseFont      HTMLBaseFontElement
              DOM::Element::Font          HTMLFontElement
              DOM::Element::HR            HTMLHRElement
              DOM::Element::Mod           HTMLModElement
              DOM::Element::A             HTMLAnchorElement
              DOM::Element::Img           HTMLImageElement
              DOM::Element::Object        HTMLObjectElement
              DOM::Element::Param         HTMLParamElement
              DOM::Element::Applet        HTMLAppletElement
              DOM::Element::Map           HTMLMapElement
              DOM::Element::Area          HTMLAreaElement
              DOM::Element::Script        HTMLScriptElement
              DOM::Element::Table         HTMLTableElement
              DOM::Element::Caption       HTMLTableCaptionElement
              DOM::Element::TableColumn   HTMLTableColElement
              DOM::Element::TableSection  HTMLTableSectionElement
              DOM::Element::TR            HTMLTableRowElement
              DOM::Element::TableCell     HTMLTableCellElement
              DOM::Element::FrameSet      HTMLFrameSetElement
              DOM::Element::Frame         HTMLFrameElement
              DOM::Element::IFrame        HTMLIFrameElement
  DOM::NodeList                           NodeList
      DOM::NodeList::Radio
  DOM::NodeList::Magic                    NodeList
  DOM::NamedNodeMap                       NamedNodeMap
  DOM::Attr                               Node, Attr, EventTarget
  DOM::Collection                         HTMLCollection
      DOM::Collection::Elements
      DOM::Collection::Options
  DOM::Event                              Event
      DOM::Event::UI                      UIEvent
          DOM::Event::Mouse               MouseEvent
      DOM::Event::Mutation                MutationEvent
  DOM::View                               AbstractView, ViewCSS

The EventListener interface is not implemented by HTML::DOM, but is supported. See "EVENT HANDLING", above.

Not listed above is HTML::DOM::EventTarget, which is a base class both for HTML::DOM::Node and HTML::DOM::Attr. The format I'm using above doesn't allow for multiple inheritance, so I probably need to redo it.

Although HTML::DOM::Node inherits from HTML::Element, the interface is not entirely compatible. In particular:

  • Any methods that expect text nodes to be just strings are unreliable. See the note under "objectify_text" in HTML::Element.

  • HTML::Element's tree-manipulation methods don't trigger mutation events.

  • HTML::Element's delete method is not necessary, because HTML::DOM uses weak references (for 'upward' references in the object tree).

IMPLEMENTATION NOTES

  • Objects' attributes are accessed via methods of the same name. When the method is invoked, the current value is returned. If an argument is supplied, the attribute is set (unless it is read-only) and its old value returned.

  • Where the DOM spec. says to use null, undef or an empty list is used.

  • Instead of UTF-16 strings, HTML::DOM uses Perl's Unicode strings (which happen to be stored as UTF-8 internally). The only significant difference this makes is to length, substringData and other methods of Text and Comment nodes. These methods behave in a Perlish way (i.e., the offsets and lengths are specified in Unicode characters, not in UTF-16 code units). The alternate methods length16, substringData16 et al. use UTF-16 for offsets and are standards-compliant in that regard (but the string returned by substringData is still a regular Perl string).

  • Each method that returns a NodeList will return a NodeList object in scalar context, or a simple list in list context. You can use the object as an array ref in addition to calling its item and length methods.

  • In cases where a method is supposed to return something implementing the DOMTimeStamp interface, a simple Perl scalar is returned, containing the time as returned by Perl’s built-in time function.
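To make the note on UTF-16 above concrete, here is a small sketch using a text node that contains one character outside the Basic Multilingual Plane (U+1F600, chosen arbitrarily). Such a character counts once in Perl but takes two UTF-16 code units.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
my $text = $tree->createTextNode("a\x{1F600}b");

my $chars = $text->length;     # counted in Unicode characters
my $units = $text->length16;   # counted in UTF-16 code units

print "$chars characters, $units UTF-16 code units\n";
```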

PREREQUISITES

perl 5.8.3 or later

Exporter 5.57 or later

HTML::TreeBuilder and HTML::Element (both part of the HTML::Tree distribution)

URI.pm

LWP 5.13 or later

CSS::DOM 0.06 or later

Scalar::Util 1.14 or later

HTML::Encoding is required if a file name is passed to parse_file.

Tie::RefHash::Weak 0.08 or higher, if you are using perl 5.8.x

BUGS

  • Element handlers are not currently called during assignments to innerHTML.

  • HTML::DOM::View's getComputedStyle does not currently return a read-only style object; nor are lengths converted to absolute values. Currently there is no way to specify the medium. Any style rules that apply to specific media are ignored.

To report bugs, please e-mail the author.

AUTHOR, COPYRIGHT & LICENSE

Copyright (C) 2007-11 Father Chrysostomos

  $text = new HTML::DOM ->createTextNode('sprout');
  $text->appendData('@');
  $text->appendData('cpan.org');
  print $text->data, "\n";

This program is free software; you may redistribute it and/or modify it under the same terms as perl.

SEE ALSO

Each of the classes listed above under "CLASSES AND DOM INTERFACES"

HTML::DOM::Exception, HTML::DOM::Node, HTML::DOM::Event, HTML::DOM::Interface

HTML::Tree, HTML::TreeBuilder, HTML::Element, HTML::Parser, LWP, WWW::Mechanize, HTTP::Cookies, WWW::Mechanize::Plugin::JavaScript, HTML::Form, HTML::Encoding

The DOM Level 1 specification at http://www.w3.org/TR/REC-DOM-Level-1

The DOM Level 2 Core specification at http://www.w3.org/TR/DOM-Level-2-Core

The DOM Level 2 Events specification at http://www.w3.org/TR/DOM-Level-2-Events

etc.
