NAME

HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis

        #!/usr/bin/perl

        use strict;
        use warnings;

        use HTML::Parser::Simple;

        # -------------------------

        # Method 1:

        my($p) = HTML::Parser::Simple -> new
        (
         {
                input_dir  => '/source/dir',
                output_dir => '/dest/dir',
         }
        );

        $p -> parse_file('in.html', 'out.html');

        # Method 2:

        $p = HTML::Parser::Simple -> new(); # Re-use $p; a 2nd my($p) would mask the 1st.

        $p -> parse('<html>...</html>');
        $p -> traverse($p -> get_root() );
        print $p -> result();

Description

HTML::Parser::Simple is a pure Perl module.

It parses HTML V 4 files, and generates a tree of nodes, one node per HTML tag.

The data associated with each node is documented in the FAQ.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules.html for details.

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.

Constructor and initialization

new(...) returns an object of type HTML::Parser::Simple.

This is the class's constructor.

Usage: HTML::Parser::Simple -> new().

This method takes a hashref of options.

Call new() as new({option_1 => value_1, option_2 => value_2, ...}).

Available options:

input_dir

This takes the path where the input file is to be read from.

The default value is '' (the empty string).

output_dir

This takes the path where the output file is to be written.

The default value is '' (the empty string).

verbose

This takes either a 0 or a 1.

Write more or less progress messages to STDERR.

The default value is 0.

Note: Currently, setting verbose does nothing.

xhtml

This takes either a 0 or a 1.

0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?> at the start of the input file, and some other XHTML features.

1 means accept it.

The default value is 0.

Warning: The only XHTML changes to this code, so far, are:

Accept the XML declaration

E.g.: <?xml version="1.0" standalone='yes'?>.

Accept attribute names containing the ':' char

E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.

Method: get_current_node()

Returns the Tree::Simple object which the parser calls the current node.

Method: get_depth()

Returns the nesting depth of the current tag.

It's just there in case you need it.

Method: get_input_dir()

Returns the input_dir parameter, as passed in to new().

Method: get_output_dir()

Returns the output_dir parameter, as passed in to new().

Method: get_node_type()

Returns the type of the most recently created node, 'global', 'head', or 'body'.

See the first question in the FAQ for details.

Method: result()

Returns the result so far of the parse.

Method: get_root()

Returns the node which the parser calls the root of the tree of nodes.

Method: get_verbose()

Returns the verbose parameter, as passed in to new().

Method: get_xhtml()

Returns the xhtml parameter, as passed in to new().

Method: log($msg)

Print $msg to STDERR if new() was called as new({verbose => 1}), or if $p -> set_verbose(1) was called.

Otherwise, print nothing.

Method: parse($html)

Parses the string of HTML in $html, and builds a tree of nodes.

After calling $p -> parse(), you must call $p -> traverse($p -> get_root() ) before calling $p -> result().

Alternatively, call $p -> parse_file(), which calls all these methods for you.

Note: parse() may be called directly or via parse_file().

Method: parse_file($input_file_name, $output_file_name)

Parses the HTML in the input file, and writes the result to the output file.

Method: set_current_node($node)

Sets the node which the parser calls the current node.

Returns undef.

Method: set_depth($depth)

Sets the nesting depth of the current node.

Returns undef.

It's just there in case you need it.

Method: set_input_dir($dir_name)

Sets the input_dir parameter, as though it was passed in to new().

Returns undef.

Method: set_output_dir($dir_name)

Sets the output_dir parameter, as though it was passed in to new().

Returns undef.

Method: set_node_type($node_type)

Sets the type of the next node to be created, 'global', 'head', or 'body'.

See the first question in the FAQ for details.

Returns undef.

Method: set_root($node)

Sets the node which the parser calls the root of the tree of nodes.

Returns undef.

Method: set_verbose($Boolean)

Sets the verbose parameter, as though it was passed in to new().

Returns undef.

Method: set_xhtml($Boolean)

Sets the xhtml parameter, as though it was passed in to new().

Returns undef.

FAQ

What is the format of the data stored in each node of the tree?

The data of each node is a hash ref. The keys/values of this hash ref are:

attributes

This is the string of HTML attributes associated with the HTML tag.

So, <table align = 'center' bgColor = '#80c0ff' summary = 'Body'> will have an attributes string of " align = 'center' bgColor = '#80c0ff' summary = 'Body'".

Note the leading space.
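
Since the attributes are kept as one raw string, client code which wants individual names and values has to split the string itself. The following is a minimal sketch of one way to do that. It is not part of HTML::Parser::Simple: the helper name parse_attributes() and its regexp are assumptions made for illustration, and only quoted values are recognised.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper, not part of HTML::Parser::Simple:
# Split a raw attributes string into a hashref of name => value.
# Only single- or double-quoted values are recognised.

sub parse_attributes
{
    my($string) = @_;
    my(%attr);

    # $1 is the name, $2 the opening quote, $3 the value up to the same quote.

    while ($string =~ /([\w:-]+)\s*=\s*(["'])(.*?)\2/g)
    {
        $attr{$1} = $3;
    }

    return \%attr;
}

my($attr) = parse_attributes(" align = 'center' bgColor = '#80c0ff' summary = 'Body'");

print "$_ => $$attr{$_}\n" for sort keys %$attr;
```

The leading space in the string is harmless here, since the regexp simply skips anything which is not part of a name = 'value' pair.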

content

This is an array ref of bits and pieces of content.

Consider this fragment of HTML:

<p>I did <i>not</i> say I <i>liked</i> debugging.</p>

When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.

So, 'I did ' is stored in the 0th element of the array ref.

Likewise, 'not' is stored in the 0th element of the array ref belonging to the node 'i'.

Next, ' say I ' is stored in the 1st element of the array ref, because it follows the 1st child node (<i>).

Likewise, ' debugging' is stored in the 2nd element.

This way, the input string can be reproduced by successively outputting the elements of the array ref of content interspersed with the contents of the child nodes (processed recursively).

Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.

Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.
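
The interleaving rule described above can be sketched as follows. This is an illustration only: it uses plain nested hashes, with a hypothetical 'children' key, in place of the Tree::Simple nodes the module really builds, and the function name rebuild() is an assumption.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Stand-in for the parsed tree of
# <p>I did <i>not</i> say I <i>liked</i> debugging.</p>
# The 'children' key is hypothetical; the real tree keeps children
# in Tree::Simple nodes, not inside the node's data.

my($tree) =
{
    name       => 'p',
    attributes => '',
    content    => ['I did ', ' say I ', ' debugging.'],
    children   =>
    [
        {name => 'i', attributes => '', content => ['not'],   children => []},
        {name => 'i', attributes => '', content => ['liked'], children => []},
    ],
};

# Emit content[0], child 0, content[1], child 1, ..., and finally any
# content which follows the last child.

sub rebuild
{
    my($node)     = @_;
    my(@content)  = @{$$node{content} };
    my(@children) = @{$$node{children} };
    my($html)     = "<$$node{name}$$node{attributes}>";

    for my $i (0 .. $#children)
    {
        $html .= $content[$i] if (defined $content[$i]);
        $html .= rebuild($children[$i]);
    }

    # Content after the last child, or all content if there are no children.

    for my $i (scalar @children .. $#content)
    {
        $html .= $content[$i] if (defined $content[$i]);
    }

    return "$html</$$node{name}>";
}

print rebuild($tree), "\n";
```

The second loop is the part which is easy to forget: it picks up content such as ' debugging.', which sits after the last child but before the closing tag.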

depth

The nesting depth of the tag within the document.

The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.

It's just there in case you need it.

name

The name of the HTML tag.

So, the tag '<html>' will mean the name is 'html'.

The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.

The root has the node 'html' as the only child, of course.

node_type

This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.

It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.

It's just there in case you need it.

How are HTML comments handled?

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

How is DOCTYPE handled?

It is treated as content belonging to the root of the tree.

How is the XML declaration handled?

It is treated as content belonging to the root of the tree.

Does this module handle all HTML pages?

No, never.

Which versions of HTML does this module handle?

Up to V 4.

What do I do if this module does not handle my HTML page?

Make yourself a nice cup of tea, and then fix your page.

Does this validate the HTML input?

No.

For example, if you feed in a HTML page without the title tag, this module does not care.

How do I view the output HTML?

By installing HTML::Revelation, of course!

Sample output:

http://savage.net.au/Perl-modules/html/CreateTable.html

How do I test this module (or my file)?

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.

Select a HTML file to test

Call this input.html.

Run input.html thru reveal.pl

The script reveal.pl ships with HTML::Revelation.

Call the output file output.1.html.

Run input.html thru parse.html.pl

The script parse.html.pl ships with HTML::Parser::Simple.

Call the output file parsed.html.

Run parsed.html thru reveal.pl

Call the output file output.2.html.

Compare output.1.html and output.2.html

If they match, or even if they don't match, you're finished.

Will you implement a 'quirks' mode to handle my special HTML file?

No, never.

Help with quirks:

http://www.quirksmode.org/sitemap.html

Is there anything I should be aware of?

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I don't define 'a' to be inline; others do, e.g. http://www.w3.org/TR/html401/ and hence HTML::Tagset.

Inline means:

        <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

will not be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

        <a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>

To achieve what was presumably intended, use 'span':

        <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.
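
The nesting rule behind the 'a' and 'div' example above can be sketched in a few lines. This is a simplified illustration, not the module's actual code: the two tag lists are tiny hand-picked subsets, only lower-case tag names are handled, and the function name fix_nesting() is hypothetical.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Tiny, hand-picked subsets of the inline and block-level tags.

my(%inline) = map{($_ => 1)} qw/a b i span/;
my(%block)  = map{($_ => 1)} qw/div p table/;

# Hypothetical helper: Re-emit $html, closing any open inline tags
# before a block-level tag is opened.

sub fix_nesting
{
    my($html) = @_;
    my(@stack);
    my($out) = '';

    while ($html =~ m{<(/?)(\w+)([^>]*)>|([^<]+)}g)
    {
        my($slash, $name, $attr, $text) = ($1, $2, $3, $4);

        if (defined $text)
        {
            $out .= $text;
        }
        elsif ($slash eq '')
        {
            # Opening a block tag closes all inline tags on top of the stack.

            if ($block{$name})
            {
                $out .= '</' . pop(@stack) . '>' while (@stack && $inline{$stack[-1]});
            }

            push @stack, $name;

            $out .= "<$name$attr>";
        }
        else
        {
            # Find the nearest matching open tag. If there is none,
            # the tag was closed already, so the end tag is ignored.

            my($pos) = $#stack;

            $pos-- while ( ($pos >= 0) && ($stack[$pos] ne $name) );

            next if ($pos < 0);

            $out .= '</' . pop(@stack) . '>' while ($#stack >= $pos);
        }
    }

    return $out;
}

print fix_nesting(qq|<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>|), "\n";
```

Fed the 'span' version of the same fragment, this sketch returns its input unchanged, because 'span' is in the inline list and so never forces the 'a' to close.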

Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an array ref stored in the metadata, so I've switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.

Why isn't this module called HTML::Parser::PurePerl?

The API

That name sounds like a pure Perl version of the same API as used by HTML::Parser.

But the APIs are not, and are not meant to be, compatible.

The tie-in

Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.

How do I output my own stuff while traversing the tree?

The sophisticated way

As always with OO code, sub-class! In this case, you write a new version of the traverse() method.

The crude way

Alternatively, implement another method in your sub-class, e.g. process(), which recurses like traverse(). Then call parse() and process().
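
A process() routine along those lines might look like the sketch below. It is an illustration resting on stated assumptions, not code from the distribution. It relies only on the Tree::Simple methods getNodeValue() and getAllChildren(), and on the 'name' and 'depth' keys documented above. So that it runs without HTML::Parser::Simple installed, it walks a tiny stand-in class, Demo::Node, invented here, which offers the same two methods. In a real sub-class, process() would be a method, called as $p -> process($p -> get_root(), []) after $p -> parse().

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Demo::Node is invented for this example. It mimics the two
# Tree::Simple methods used below, so the sketch is self-contained.

package Demo::Node;

sub new
{
    my($class, $value, @children) = @_;

    return bless {value => $value, children => [@children]}, $class;
}

sub getNodeValue   {return $_[0]{value} }
sub getAllChildren {return @{$_[0]{children} } }

package main;

# Recurse like traverse(), but collect an indented outline of tag names
# instead of rebuilding the HTML.

sub process
{
    my($node, $outline) = @_;
    my($data) = $node -> getNodeValue();

    push @$outline, ('  ' x $$data{depth}) . $$data{name};

    process($_, $outline) for $node -> getAllChildren();

    return $outline;
}

my($root) = Demo::Node -> new
(
    {name => 'root', depth => 0},
    Demo::Node -> new
    (
        {name => 'html', depth => 1},
        Demo::Node -> new({name => 'head', depth => 2}),
        Demo::Node -> new({name => 'body', depth => 2}),
    ),
);

print "$_\n" for @{process($root, [])};
```

Collecting the lines in an array ref, rather than printing inside process(), keeps the recursion free of side effects and leaves the caller to decide where the output goes.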

Is the code on github?

Yes. See: git://github.com/ronsavage/html--parser--simple.git

How is the source formatted?

I edit with Emacs, using the default formatting for Perl.

That means, in general, leading 4-space tabs. Hashrefs use a leading tab and then a space.

All vertical alignment within lines is done manually with spaces.

Perl::Critic is off the agenda.

Credits

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

http://ejohn.org/files/htmlparser.js

Well done John!

Note also the comments published here:

http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58

Author

HTML::Parser::Simple was written by Ron Savage <ron@savage.net.au> in 2009.

Home page: http://savage.net.au/index.html

Copyright

Australian copyright (c) 2009 Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html