HTML::HTML5::Parser - Parse HTML reliably with Perl.
Version 0.01
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

my $fdoc = $parser->parse_file( $html_file_name );
my $fhdoc = $parser->parse_fh( $html_file_handle );
This library is substantially the same as the non-CPAN module Whatpm::HTML. Changes include:
Provides an XML::LibXML-like DOM interface. If you usually use XML::LibXML's DOM parser, this should be a drop-in solution for tag soup HTML.
Constructs an XML::LibXML::Document as the result of parsing.
Removes external dependencies on non-CPAN packages through bundling and modifications.
NOTE: This module uses Inline::Python, including the Python "chardet" package which can be installed using easy_install.
new
$parser = HTML::HTML5::Parser->new;
The constructor does not do anything interesting.
parse_file
parse_html_file
$doc = $parser->parse_file( $html_file_name [,\%opts] );
This function parses an HTML document from a file or the network; $html_file_name can be either a filename or a URL.
Options include 'encoding' to indicate file encoding (e.g. 'utf-8') and 'user_agent' which should be a blessed LWP::UserAgent object to be used when retrieving URLs.
If requesting a URL and the response Content-Type header indicates an XML-based media type (such as XHTML), XML::LibXML::Parser will be used automatically (instead of the tag soup parser). Tag soup parsing can be forced using the option 'force_html'. If an options hashref was passed, parse_file will set $options->{'parser_used'} to the name of the class used to parse the URL, to allow the calling code to double-check which parser was used afterwards.
If an options hashref was passed, parse_file will set $options->{'response'} to the HTTP::Response object obtained by retrieving the URI.
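The option handling described above can be sketched as follows. This is a minimal illustration, not the module's own test code: it parses a temporary local file rather than a URL so that it needs no network access, and the file contents are invented for the example. It also does not assume that parser_used and response are filled in for local files, so both are checked before use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small HTML file to parse (a stand-in for a real file or URL).
my ($fh, $filename) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} "<!doctype html>\n<title>Local file</title>\n<p>Hello.\n";
close $fh or die "Cannot close $filename: $!";

my $parser = HTML::HTML5::Parser->new;

# Options are passed as a hashref; parse_file may write keys back into it.
my %opts = (
    encoding   => 'utf-8',   # declared encoding of the file
    force_html => 1,         # request the tag soup parser
);
my $doc = $parser->parse_file( $filename, \%opts );

# Keys documented as being set by parse_file. They are guarded here
# because this sketch does not assume they exist for local files.
printf "Parser used: %s\n",
    defined $opts{parser_used} ? $opts{parser_used} : '(not reported)';
printf "Response:    %s\n",
    $opts{response} ? $opts{response}->status_line : '(no response object)';
```

When retrieving a URL instead, a blessed LWP::UserAgent object can be supplied under the 'user_agent' key of the same hashref.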
parse_fh
parse_html_fh
$doc = $parser->parse_fh( $io_fh [,\%opts] );
parse_fh() parses an IOREF or a subclass of IO::Handle.
Options include 'encoding' to indicate file encoding (e.g. 'utf-8').
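A short sketch of parse_fh() with an IO::Handle subclass is given below. It uses IO::File (a core subclass of IO::Handle) opened on a temporary file; the file contents are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::File;
use HTML::HTML5::Parser;

# Create a throwaway HTML file so there is something to read from.
my ($out, $filename) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$out} "<!doctype html>\n<title>From a handle</title>\n<p>Baz</br>Quux.\n";
close $out or die "Cannot close $filename: $!";

# IO::File is a subclass of IO::Handle, one of the accepted argument types.
my $in = IO::File->new( $filename, 'r' )
    or die "Cannot open $filename: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh( $in, { encoding => 'utf-8' } );
$in->close;

# The result is an XML::LibXML::Document, so DOM calls are available.
print "Root element: ", $doc->documentElement->nodeName, "\n";
```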
parse_string
parse_html_string
$doc = $parser->parse_string( $html_string [,\%opts] );
This function is similar to parse_fh(), but it parses an HTML document that is available as a single string in memory.
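As a rough illustration of the XML::LibXML-style interface, the sketch below parses a tag soup string and walks the result with ordinary XML::LibXML calls. The sample markup and the local-name() XPath tests are choices made for this example; matching on local-name() avoids assuming which namespace, if any, the parser assigns to elements.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Deliberately sloppy markup: no <html>, <head> or <body> tags,
# and the <b>/<i> elements are closed in the wrong order.
my $html = <<'EOT';
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
EOT

my $doc = $parser->parse_string($html);

# The result is documented to be an XML::LibXML::Document, so the
# usual XML::LibXML methods (findnodes, textContent, ...) apply.
my ($title) = $doc->findnodes('//*[local-name()="title"]');
print "Title: ", $title->textContent, "\n" if $title;

for my $p ( $doc->findnodes('//*[local-name()="p"]') ) {
    print "Paragraph: ", $p->textContent, "\n";
}
```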
The push parser and SAX-based parser are not supported. Trying to change an option (such as recover_silently) will make HTML::HTML5::Parser carp a warning. (But you can inspect the options.)
The module provides a few extra methods for obtaining non-DOM data from DOM nodes.
compat_mode
$mode = $parser->compat_mode( $doc );
Returns 'quirks', 'limited quirks' or undef (standards mode).
dtd_public_id
$pubid = $parser->dtd_public_id( $doc );
For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the Public Identifier of the DTD used (if any).
dtd_system_id
$sysid = $parser->dtd_system_id( $doc );
For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the System Identifier of the DTD used (if any).
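The three document-level accessors above can be tried together, as in this sketch. The HTML 4.01 Strict doctype is chosen for the example only because it carries both a public and a system identifier; the values actually returned are not asserted here, and each accessor is treated as possibly returning undef.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# A doctype with both a public and a system identifier.
my $doc = $parser->parse_string(<<'EOT');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Doctype demo</title>
<p>Hello.
EOT

# All three methods take the document returned by the parser.
my $mode  = $parser->compat_mode($doc);
my $pubid = $parser->dtd_public_id($doc);
my $sysid = $parser->dtd_system_id($doc);

# Each accessor may return undef, so substitute a label before printing.
printf "Mode:      %s\n", defined $mode  ? $mode  : 'standards (undef)';
printf "Public ID: %s\n", defined $pubid ? $pubid : '(none)';
printf "System ID: %s\n", defined $sysid ? $sysid : '(none)';
```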
source_line
($line, $col) = $parser->source_line( $node );
$line = $parser->source_line( $node );
In scalar context, source_line returns the line number of the source code that started a particular node (element, attribute or comment).
In list context, returns a line/column pair.
THIS FUNCTION USUALLY DOESN'T WORK.
http://suika.fam.cx/www/markup/html/whatpm/Whatpm/HTML.html
Toby Inkster, <tobyink@cpan.org>
Copyright (C) 2007-2009 by Wakaba
Copyright (C) 2009 by Toby Inkster
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.