HTML::DOM - A Perl implementation of the HTML Document Object Model
Version 0.010 (alpha)
WARNING: This module is still at an experimental stage. The API is subject to change without notice.
    use HTML::DOM;

    my $dom_tree = new HTML::DOM; # empty tree
    $dom_tree->parse_file($filename);

    $dom_tree->getElementsByTagName('body')->[0]->appendChild(
        $dom_tree->createElement('input')
    );

    print $dom_tree->documentElement->as_HTML, "\n";
    # (inherited from HTML::Element)

    my $text = $dom_tree->createTextNode('text');
    $text->data;              # get attribute
    $text->data('new value'); # set attribute
This module implements the HTML Document Object Model by extending the HTML::Tree modules. The HTML::DOM class serves both as an HTML parser and as the document class.
The following DOM modules are currently supported:
    Feature       Version (aka level)
    -------       -------------------
    HTML          1.0
    Core          2.0
    Events        2.0 (partially)
    StyleSheets   2.0 (partially)
    CSS           2.0 (partially)
    CSS2          2.0
    Views         2.0
StyleSheets, CSS and CSS2 are actually provided by CSS::DOM. This list corresponds to CSS::DOM version 0.01.
This class method constructs and returns a new HTML::DOM object. The %options, which are all optional, are as follows:
url
    The value that the URL method will return. This value is also used by the domain method.
referrer
    The value that the referrer method will return.
response
    An HTTP::Response object. This will be used for information needed for writing cookies. It is expected to have a reference to a request object (accessible via its request method--see HTTP::Response). Passing a parameter to the 'cookie' method will be a no-op without this.
cookie_jar
    An HTTP::Cookies object. As with response, if you omit this, arguments passed to the cookie method will be ignored.

If referrer and url are omitted, they can be inferred from response.
Not yet implemented.
This method has no effect unless you call it before building the DOM tree. If you call this method, then, when the DOM tree is in the process of being built, the subroutine will be called after each $elem_name element is added to the tree. If you give '*' as the element name, the subroutine will be called for each element that does not have a handler. The subroutine's two arguments will be the tree itself and the element in question. The subroutine can call the DOM object's write method to insert HTML code into the source after the element.
Here is a lame example (which does not take Content-Script-Type headers or security into account):
    $tree->elem_handler(script => sub {
        my($document,$elem) = @_;

        return unless $elem->attr('type') eq 'application/x-perl';

        eval($elem->firstChild->data);
    });

    $tree->write(
        '<p>The time is
             <script type="application/x-perl">
                  $document->write(scalar localtime)
             </script>
             precisely.
         </p>'
    );
    $tree->close;

    print $tree->documentElement->as_text, "\n";
(Note: HTML::DOM::Element's content_offset method might come in handy for reporting line numbers for script errors.)
BUG: The 'open' method currently undoes what this method does.
The parse_file method simply calls HTML::TreeBuilder's method of the same name (q.v., and see also HTML::Element). It takes a file name or handle and parses the content, (effectively) calling close afterwards.
The write method parses the HTML code passed to it, adding it to the end of the document. Like HTML::TreeBuilder's parse method, it can take a coderef.
When it is called from an element handler (see elem_handler, above), the value passed to it will be inserted into the HTML code after the current element when the element handler returns. (In this case a coderef won't do--maybe that will be added later.)
If the close method has been called, write will call open before parsing the HTML code passed to it.
Just like write except that it appends "\n" to its argument and does not work with code refs. (Rather pointless, if you ask me. :-)
Call this method to signal to the parser that the end of the HTML code has been reached. It will then parse any residual HTML that happens to be buffered. It also makes the next write call open.
Deletes the HTML tree, resetting it so that it has just an <html> element, and a parser hungry for HTML code.
Returns nothing
Returns the HTML::DOM::Implementation object.
Returns the <html> element.
Each of these creates a node of the appropriate type.
These two throw an exception.
$name can be the name of the tag, or '*', to match all tag names. This returns a node list object in scalar context, or a list in list context.
Clones the $node, setting its ownerDocument attribute to the document with which this method is called. If $deep is true, the $node will be cloned recursively.
These six methods return (optionally set) the corresponding attributes of the body element. Note that most of the names do not map directly to the names of the attributes. fgColor refers to the text attribute. Those that end with 'linkColor' refer to the attributes of the same name but without the 'Color' on the end.
Returns (or optionally sets) the title of the page.
Returns the page's referrer.
Returns the domain name portion of the document's URL.
Returns the document's URL.
Returns the body element, or the outermost frame set if the document has frames. You can set the body by passing an element as an argument, in which case the old body element is returned. In this case you should call delete on the return value to remove circular references, unless you plan to use it still. E.g.,
$doc->body($new_body)->delete;
These five methods return a list of the appropriate elements in list context, or an HTML::DOM::Collection object in scalar context. In this latter case, the object will update automatically when the document is modified.
In the case of forms you can access those by using the HTML::DOM object itself as a hash. I.e., you can write $doc->{f} instead of $doc->forms->{f}.
TO DO: I need to make these methods cache the HTML collection objects that they create. Once I've done this, I can make list context use those objects, as well as scalar context.
This returns a string containing the document's cookies (the format may still change). If you pass an argument, it will set a cookie as well. Both Netscape-style and RFC2965-style cookie headers are supported.
These two do what their names imply. The latter will return a list in list context, or a node list object in scalar context. Calling it in list context is probably more efficient.
This currently ignores its args. Later the arg passed to it will determine into which class the newly-created event object is blessed.
Returns the HTML::DOM::View object associated with the document.
See "EVENT HANDLING", below.
Returns the base URL of the page; either from a <base href=...> tag or the URL passed to new.
You can use an HTML::DOM object as a hash ref to access its form elements by name. So $doc->{yayaya} is short for $doc->forms->{yayaya}.
HTML::DOM supports both the DOM Level 2 event model and the HTML 4 event model (at least in part, so far [in particular, the Event base class is implemented, but none of its subclasses; no events are triggered automatically yet]).
An event listener (aka handler) is a coderef, an object with a handleEvent method or an object with &{} overloading. HTML::DOM does not implement any classes that provide a handleEvent method, but will support any object that has one.
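The three listener forms described above can be told apart at call time. The following is a minimal sketch in plain Perl, not HTML::DOM's own dispatch code: Demo::Handler, Demo::Callable and call_listener are hypothetical names invented for illustration, and the order in which the forms are checked is an assumption.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical listener classes; neither is provided by HTML::DOM.
package Demo::Handler;
sub new         { return bless {}, shift }
sub handleEvent { my ($self, $event) = @_; return "handleEvent saw $event" }

package Demo::Callable;
use overload
    '&{}'    => sub { return sub { "overloaded saw $_[0]" } },
    fallback => 1;
sub new { return bless {}, shift }

package main;

# Hypothetical helper: invoke a listener whichever of the three forms it takes.
sub call_listener {
    my ($listener, $event) = @_;
    return $listener->handleEvent($event)
        if blessed($listener) && $listener->can('handleEvent');
    return $listener->($event);    # plain coderef, or &{} overloading
}

# A string stands in for the event object to keep the sketch self-contained.
my @results = map { call_listener($_, 'click') } (
    sub { "coderef saw $_[0]" },
    Demo::Handler->new,
    Demo::Callable->new,
);
print "$_\n" for @results;
```

An object with &{} overloading needs no special case here, because calling it through `->(...)` goes through the overload exactly as it would for a plain coderef.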
To specify the default actions associated with an event, provide a subroutine via the default_event_handler method. The sole argument will be the event object. For instance:
    $dom_tree->default_event_handler(sub {
        my $event = shift;
        my $type = $event->type;
        my $tag = (my $target = $event->target)->nodeName;
        if ($type eq 'click' && $tag eq 'A') {
            # ...
        }
        # etc.
    });
default_event_handler without any arguments will return the currently assigned coderef. With an argument it will return the old one after assigning the new one.
Currently no default actions are taken when events are triggered. It is up to the default event handler to do that. Later I will allow for multiple default event handlers to be assigned to more specific events, and a few will be in place to begin with (e.g., for a submit button's 'click' event, the form's 'submit' event will be triggered; currently it is not).
HTML::DOM::Node's dispatchEvent method triggers the appropriate event listeners, but does not call any default actions associated with it. The return value is a boolean that indicates whether the default action should be taken.
HTML::DOM::Node's trigger_event method will trigger the event for real. It will call dispatchEvent and, provided it returns true, will call the default event handler.
The event_attr_handler can be used to assign a coderef that will turn text assigned to an event attribute (e.g., onclick) into a listener. The arguments to the routine will be (0) the element, (1) the name (aka type) of the event (without the initial 'on') and (2) the value of the attribute. As with default_event_handler, you can replace an existing handler with a new one, in which case the old handler is returned. If you call this method without arguments, it returns the current handler. Here is an example of its use, that assumes that handlers are Perl code:
    $dom_tree->event_attr_handler(sub {
        my($elem, $name, $code) = @_;
        my $sub = eval "sub { $code }";
        return sub {
            my($event) = @_;
            local *_ = \$elem;
            my $ret = &$sub;
            defined $ret and !$ret and $event->preventDefault;
        };
    });
The event attribute handler will be called whenever an element attribute whose name begins with 'on' (case-tolerant) is modified.
Use error_handler to assign a coderef that will be called whenever an event listener raises an error. The error will be contained in $@.
Here is the inheritance hierarchy of HTML::DOM's various classes, together with the DOM interfaces those classes implement. The classes in the left column all begin with 'HTML::', which is omitted for brevity. Items in brackets have not yet been implemented. (See also HTML::DOM::Interface for a machine-readable list of standard methods.)
    Class Inheritance Hierarchy   Interfaces
    ---------------------------   ----------
    DOM::Exception                DOMException, EventException
    DOM::Implementation           DOMImplementation, [DOMImplementationCSS]
    Element
    DOM::Node                     Node, EventTarget
    DOM::DocumentFragment         DocumentFragment
    DOM                           Document, HTMLDocument, DocumentEvent,
                                  DocumentView, [DocumentStyle, DocumentCSS]
    DOM::CharacterData            CharacterData
    DOM::Text                     Text
    DOM::Comment                  Comment
    DOM::Element                  Element, HTMLElement, ElementCSSInlineStyle
    DOM::Element::HTML            HTMLHtmlElement
    DOM::Element::Head            HTMLHeadElement
    DOM::Element::Link            HTMLLinkElement, [LinkStyle]
    DOM::Element::Title           HTMLTitleElement
    DOM::Element::Meta            HTMLMetaElement
    DOM::Element::Base            HTMLBaseElement
    DOM::Element::IsIndex         HTMLIsIndexElement
    DOM::Element::Style           HTMLStyleElement, [LinkStyle]
    DOM::Element::Body            HTMLBodyElement
    DOM::Element::Form            HTMLFormElement
    DOM::Element::Select          HTMLSelectElement
    DOM::Element::OptGroup        HTMLOptGroupElement
    DOM::Element::Option          HTMLOptionElement
    DOM::Element::Input           HTMLInputElement
    DOM::Element::TextArea        HTMLTextAreaElement
    DOM::Element::Button          HTMLButtonElement
    DOM::Element::Label           HTMLLabelElement
    DOM::Element::FieldSet        HTMLFieldSetElement
    DOM::Element::Legend          HTMLLegendElement
    DOM::Element::UL              HTMLUListElement
    DOM::Element::OL              HTMLOListElement
    DOM::Element::DL              HTMLDListElement
    DOM::Element::Dir             HTMLDirectoryElement
    DOM::Element::Menu            HTMLMenuElement
    DOM::Element::LI              HTMLLIElement
    DOM::Element::Div             HTMLDivElement
    DOM::Element::P               HTMLParagraphElement
    DOM::Element::Heading         HTMLHeadingElement
    DOM::Element::Quote           HTMLQuoteElement
    DOM::Element::Pre             HTMLPreElement
    DOM::Element::Br              HTMLBRElement
    DOM::Element::BaseFont        HTMLBaseFontElement
    DOM::Element::Font            HTMLFontElement
    DOM::Element::HR              HTMLHRElement
    DOM::Element::Mod             HTMLModElement
    DOM::Element::A               HTMLAnchorElement
    DOM::Element::Img             HTMLImageElement
    DOM::Element::Object          HTMLObjectElement
    DOM::Element::Param           HTMLParamElement
    DOM::Element::Applet          HTMLAppletElement
    DOM::Element::Map             HTMLMapElement
    DOM::Element::Area            HTMLAreaElement
    DOM::Element::Script          HTMLScriptElement
    DOM::Element::Table           HTMLTableElement
    DOM::Element::Caption         HTMLTableCaptionElement
    DOM::Element::TableColumn     HTMLTableColElement
    DOM::Element::TableSection    HTMLTableSectionElement
    DOM::Element::TR              HTMLTableRowElement
    DOM::Element::TableCell       HTMLTableCellElement
    DOM::Element::FrameSet        HTMLFrameSetElement
    DOM::Element::Frame           HTMLFrameElement
    DOM::Element::IFrame          HTMLIFrameElement
    DOM::NodeList                 NodeList
    DOM::NodeList::Radio
    DOM::NodeList::Magic          NodeList
    DOM::NamedNodeMap             NamedNodeMap
    DOM::Attr                     Node, Attr
    DOM::Collection               HTMLCollection
    DOM::Collection::Elements
    DOM::Collection::Options
    DOM::Event                    Event
    [DOM::Event::UI               UIEvent]
    [DOM::Event::Mouse            MouseEvent]
    [DOM::Event::Mutation         MutationEvent]
    DOM::View                     AbstractView, [ViewCSS]
Although HTML::DOM::Node inherits from HTML::Element, the interface is not entirely compatible, so don't rely on any HTML::Element methods.
The EventListener interface is not implemented by HTML::DOM, but is supported. See "EVENT HANDLING", above.
Objects' attributes are accessed via methods of the same name. When the method is invoked, the current value is returned. If an argument is supplied, the attribute is set (unless it is read-only) and its old value returned.
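The get/set convention just described can be pictured with a small stand-alone class. This is only a sketch of the calling convention, assuming a hypothetical Demo::Node class with a single writable data attribute; it is not how HTML::DOM itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; not part of HTML::DOM.
package Demo::Node;

sub new { my ($class, %attrs) = @_; return bless {%attrs}, $class }

# Combined getter/setter: always returns the value held before the call.
sub data {
    my $self = shift;
    my $old  = $self->{data};
    $self->{data} = shift if @_;
    return $old;
}

package main;

my $node     = Demo::Node->new(data => 'text');
my $current  = $node->data;                 # plain read
my $previous = $node->data('new value');    # write; the old value comes back
my $updated  = $node->data;                 # read again after the write

print "$current, $previous, $updated\n";
```

Returning the old value lets a caller save and later restore an attribute in a single call.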
Where the DOM spec. says to use null, undef or an empty list is used.
Instead of UTF-16 strings, HTML::DOM uses Perl's Unicode strings (which happen to be stored as UTF-8 internally). The only significant difference this makes is to length, substringData and other methods of Text and Comment nodes. These methods behave in a Perlish way (i.e., the offsets and lengths are specified in Unicode characters, not in 16-bit UTF-16 code units). The alternate methods length16, substringData16 et al. use UTF-16 for offsets and are standards-compliant in that regard (but the string returned by substringData is still a regular Perl string).
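The two ways of counting only disagree for characters outside the Basic Multilingual Plane, which UTF-16 stores as two 16-bit units. The sketch below shows the difference using core Perl and the Encode module rather than HTML::DOM's own methods; the sample string is an arbitrary choice for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character outside the Basic Multilingual Plane (U+1D11E, a musical
# clef) sits between two ASCII letters.
my $string = "a\x{1D11E}b";

# Perl's length() counts Unicode characters.
my $char_length = length $string;

# Encoding as UTF-16BE (no byte-order mark) yields two bytes per 16-bit
# unit, so halving the byte count gives the length in UTF-16 units.
my $utf16_length = length(encode('UTF-16BE', $string)) / 2;

print "characters: $char_length, UTF-16 units: $utf16_length\n";
```

Here the character count is 3 while the UTF-16 count is 4, because the clef occupies a surrogate pair.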
Each method that returns a NodeList will return a NodeList object in scalar context, or a simple list in list context. You can use the object as an array ref in addition to calling its item and length methods.
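One way to get this context-sensitive behaviour in Perl is wantarray combined with array-dereference overloading. The following sketch uses a hypothetical Demo::NodeList class and find_nodes function with plain strings in place of nodes; it illustrates the technique and is not HTML::DOM::NodeList's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a node list; not HTML::DOM's own class.
package Demo::NodeList;

# Allow the object to be used as an array ref ($list->[0], @$list).
use overload '@{}' => sub { $_[0]{items} }, fallback => 1;

sub new    { my ($class, @items) = @_; return bless { items => [@items] }, $class }
sub item   { return $_[0]{items}[ $_[1] ] }
sub length { return scalar @{ $_[0]{items} } }

package main;

# Returns a plain list in list context, a list object in scalar context.
sub find_nodes {
    my @found = qw(p div span);
    return wantarray ? @found : Demo::NodeList->new(@found);
}

my @as_list = find_nodes();    # list context
my $as_obj  = find_nodes();    # scalar context

my $count  = $as_obj->length;
my $first  = $as_obj->item(0);
my $second = $as_obj->[1];     # array-ref style access via overloading

print scalar(@as_list), " $count $first $second\n";
```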
perl 5.6.0 or later
Exporter 5.57 or later
HTML::TreeBuilder and HTML::Element (both part of the HTML::Tree distribution) (tested with 3.23)
URI.pm (tested with 1.35)
HTTP::Headers::Util is required if you pass an argument to the cookie method after passing an HTTP::Response and a cookie jar to the constructor (in which case you most certainly already have HTTP::Headers::Util :-). (tested with 1.13)
HTML::Form 1.054 or later if any of the methods provided for WWW::Mechanize compatibility are called.
CSS::DOM is required if you use any of the style sheet features.
Scalar::Util 1.08 or later
(See also the BUGS sections of HTML::DOM::Collection::Options and HTML::DOM::Element::Option.)
I really don't know what will happen if an element handler goes and deletes parent elements of the element for which the handler is called.
The values of attributes whose data type is a value list in the HTML DTD are currently not normalised when accessed through the Level 0 interface, though they should be.
The values of boolean attributes are currently not normalised when accessed through the Level 0 interface, though they should be.
Certain HTML attributes are supposed to have default values if they are not specified in the document. This is not implemented yet, except for a few cases here and there.
The open method currently deletes any of the HTML::DOM object's references to subroutines that were passed to elem_handler.
The removeChild method of an HTML::DOM object currently throws a 'Can't call method "_modified" on an undefined value' error.
To report bugs, please e-mail the author.
Copyright (C) 2007 Father Chrysostomos
    $text = new HTML::DOM ->createTextNode('sprout');
    $text->appendData('@');
    $text->appendData('cpan.org');
    print $text->data, "\n";
This program is free software; you may redistribute it and/or modify it under the same terms as perl.
HTML::DOM::Exception, HTML::DOM::Node, HTML::DOM::Event, HTML::DOM::Interface
HTML::Tree, HTML::TreeBuilder, HTML::Element, HTML::Parser, LWP, WWW::Mechanize, HTTP::Cookies, WWW::Mechanize::Plugin::JavaScript, HTML::Form
The DOM Level 1 specification at http://www.w3.org/TR/REC-DOM-Level-1
The DOM Level 2 Core specification at http://www.w3.org/TR/DOM-Level-2-Core
The DOM Level 2 Events specification at http://www.w3.org/TR/DOM-Level-2-Events
etc.