NAME

HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis

        #!/usr/bin/perl
        
        use strict;
        use warnings;
        
        use HTML::Parser::Simple;
        
        # -------------------------
        
        # Method 1:
        
        my($p) = HTML::Parser::Simple -> new
        (
         {
                input_dir  => '/source/dir',
                output_dir => '/dest/dir',
         }
        );
        
        $p -> parse_file('in.html', 'out.html');
        
        # Method 2 (reuses $p from above; declare it with my() if used on its own):
        
        $p = HTML::Parser::Simple -> new();
        
        $p -> parse('<html>...</html>');
        $p -> traverse($p -> root() );
        print $p -> result();

Description

HTML::Parser::Simple is a pure Perl module.

It parses HTML V 4 files, and generates a tree of nodes, one per HTML tag.

The data associated with each node is documented in the FAQ.

Warning: Use only the documented methods.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules.html for details.

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.

Constructor and initialization

new(...) returns an object of type HTML::Parser::Simple.

This is the class's constructor.

Usage: HTML::Parser::Simple -> new().

This method takes a hashref of options.

Call new() as new({option_1 => value_1, option_2 => value_2, ...}).

Available options:

input_dir

This takes the path from which the input file is to be read.

The default value is '' (the empty string).

output_dir

This takes the path where the output file is to be written.

The default value is '' (the empty string).

verbose

This takes either a 0 or a 1.

Write more or less progress messages to STDERR.

The default value is 0.

Note: Currently, setting verbose does nothing.

Method: log($msg)

Print $msg to STDERR if new() was called as new({verbose => 1}).

Otherwise, print nothing.

Method: parse($html)

Parses the string of HTML in $html, and builds a tree of nodes.

After calling $p -> parse(), you must call $p -> traverse($p -> root() ) before calling $p -> result().

Alternatively, call $p -> parse_file(), which calls all these methods for you.

Method: parse_file($input_file_name, $output_file_name)

Parses the HTML in the input file, and writes the result to the output file.

Method: result()

Returns a string which is the result of calling $p -> traverse($p -> root() ).

Method: root()

Returns the root of the tree constructed by calling $p -> parse().

Note: parse() may be called directly or via parse_file().

FAQ

What is the format of the data stored in each node of the tree?

The data of each node is a hash ref:

The keys/values of this hash ref are:

attributes

This is the string of HTML attributes associated with the HTML tag.

So, <table align = 'center' bgColor = '#80c0ff' summary = 'Body'> will have an attributes string of " align = 'center' bgColor = '#80c0ff' summary = 'Body'".

Note the leading space.

content

This is an array ref of bits and pieces of content.

Consider this fragment of HTML:

<p>I did <i>not</i> say I <i>liked</i> debugging.</p>

When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.

So, 'I did ' is stored in the 0th element of the array ref.

Likewise, 'not' is stored in the 0th element of the array ref belonging to the node 'i'.

Next, ' say I ' is stored in the 1st element of the array ref, because it follows the 1st child node (<i>).

Likewise, ' debugging' is stored in the 2nd element.

This way, the input string can be reproduced by successively outputting the elements of the array ref of content interspersed with the contents of the child nodes (processed recursively).

Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.

Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.

name

This is the name of the HTML tag.

So, the tag '<html>' will mean the name is 'html'.

The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.

The root has the node 'html' as the only child, of course.

node_type

This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.

It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.

It's just there in case you need it.
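
The interleaving of content and child nodes described under 'content' can be sketched as follows. This is only an illustration: it uses plain hashes with a hypothetical 'children' key and two hypothetical helpers, node() and render(), rather than the tree objects the module really builds, and it ignores tags which have no closing tag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one tree node: a plain hash holding the documented
# keys 'name', 'attributes' and 'content', plus an assumed 'children' array.
sub node
{
	my($name, $attributes, $content, @children) = @_;

	return
	{
		name       => $name,
		attributes => $attributes,
		content    => $content,
		children   => [@children],
	};
}

# Emit content[0], then for each child i (counting from 0) the rendered child
# followed by content[i + 1]. Missing content slots are treated as empty.
sub render
{
	my($node)    = @_;
	my(@content) = @{$node->{content} };
	my(@child)   = @{$node->{children} };
	my($is_root) = $node->{name} eq 'root';

	# The root is not a real tag, so it gets no '<root>' wrapper.
	my($html) = $is_root ? '' : "<$node->{name}$node->{attributes}>";
	$html    .= defined $content[0] ? $content[0] : '';

	for my $i (0 .. $#child)
	{
		$html .= render($child[$i]);
		$html .= defined $content[$i + 1] ? $content[$i + 1] : '';
	}

	$html .= "</$node->{name}>" if (! $is_root);

	return $html;
}

# The fragment from the text: three pieces of content around two children.
my($para) = node('p', '', ['I did ', ' say I ', ' debugging.'],
	node('i', '', ['not']),
	node('i', '', ['liked']),
);

# A root holding a DOCTYPE as the 0th element of its content.
my($root) = node('root', '', ['<!DOCTYPE html>'], node('html', '', [], $para) );

print render($para), "\n";
print render($root), "\n";
```

Note how ' debugging.' is only emitted because the loop outputs a piece of content after the last child, which is the case the note above warns about.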

How are HTML comments handled?

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

How is DOCTYPE handled?

It is treated as content belonging to the root of the tree.

Does this module handle all HTML pages?

No, never.

Which versions of HTML does this module handle?

Up to V 4.

What do I do if this module does not handle my HTML page?

Make yourself a nice cup of tea, and then fix your page.

Does this validate the HTML input?

No.

For example, if you feed in an HTML page without the title tag, this module does not care.

How do I view the output HTML?

By installing HTML::Revelation, of course!

Sample output:

http://savage.net.au/Perl-modules/html/CreateTable.html

How do I test this module (or my file)?

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.

Select an HTML file to test

Call this input.html.

Run input.html thru reveal.pl

Reveal.pl ships with HTML::Revelation.

Call the output file output.1.html.

Run input.html thru parse.file.pl

Parse.file.pl ships with HTML::Parser::Simple.

Call the output file parsed.html.

Run parsed.html thru reveal.pl

Call the output file output.2.html.

Compare output.1.html and output.2.html

If they match, or even if they don't match, you're finished.

Will you implement a 'quirks' mode to handle my special HTML file?

No, never.

Help with quirks:

http://www.quirksmode.org/sitemap.html

Is there anything I should be aware of?

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I don't define 'a' to be inline; others do, e.g. http://www.w3.org/TR/html401/ and hence HTML::Tagset.

Inline means:

        <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

will not be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

        <a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>

To achieve what was presumably intended, use 'span':

        <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.
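
To make the 'a' versus 'div' behaviour concrete, here is a minimal sketch of one way such a rule can work: keep a stack of open tags, and close any open inline tags when a block-level tag starts. The renest() function, the token format and the two tiny tag tables are all assumptions made for this illustration; they are not taken from the module's source.

```perl
use strict;
use warnings;

# Tiny illustrative tag sets. Real tables (e.g. in HTML::Tagset) are much larger.
my(%block)  = map{($_ => 1)} qw/div p table/;
my(%inline) = map{($_ => 1)} qw/a i span/;

# Re-nest a flat list of tokens. Each token is [type, tag name, raw text],
# where type is 'open', 'close' or 'text'.
sub renest
{
	my(@tokens) = @_;
	my(@stack);
	my($html) = '';

	for my $t (@tokens)
	{
		my($type, $name, $text) = @$t;

		if ($type eq 'open')
		{
			# A block-level tag forces any open inline tags closed first.
			if ($block{$name})
			{
				$html .= '</' . pop(@stack) . '>' while (@stack && $inline{$stack[-1]});
			}

			push @stack, $name;

			$html .= $text;
		}
		elsif ($type eq 'close')
		{
			# Ignore a close tag whose opener is no longer on the stack.
			next if (! grep{$_ eq $name} @stack);

			# Close everything down to and including the nearest matching opener.
			while (@stack)
			{
				my($open) = pop @stack;
				$html    .= "</$open>";

				last if ($open eq $name);
			}
		}
		else
		{
			$html .= $text;
		}
	}

	return $html;
}

my($with_div) = renest
(
	['open',  'a',   '<a href = "#NAME">'],
	['open',  'div', "<div class = 'global_toc_text'>"],
	['text',  '',    'NAME'],
	['close', 'div', '</div>'],
	['close', 'a',   '</a>'],
);

my($with_span) = renest
(
	['open',  'a',    '<a href = "#NAME">'],
	['open',  'span', "<span class = 'global_toc_text'>"],
	['text',  '',     'NAME'],
	['close', 'span', '</span>'],
	['close', 'a',    '</a>'],
);

print "$with_div\n$with_span\n";
```

With the 'div', the 'a' is closed early and its original closing tag is silently dropped; with the 'span', the nesting survives unchanged.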

Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an array ref stored in the metadata, so I've switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.

Why isn't this module called HTML::Parser::PurePerl?

The API

That name sounds like a pure Perl version of the same API as used by HTML::Parser.

But the APIs are not, and are not meant to be, compatible.

The tie-in

Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.

Required Modules

Carp
Tree::DAG_Node

Credits

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

http://ejohn.org/files/htmlparser.js

Well done John!

Note also the comments published here:

http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58

Author

HTML::Parser::Simple was written by Ron Savage <ron@savage.net.au> in 2009.

Home page: http://savage.net.au/index.html

Copyright

Australian copyright (c) 2009 Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html