Mail::Webmail::MessageParser -- class to parse HTML webmail messages.


        use Mail::Webmail::MessageParser;

        my $p = Mail::Webmail::MessageParser->new();

        $p->message_start(_tag => 'div', id => 'message');

        my $body_text = $p->parse_body($html, $style);

        # %html_fields_from_somewhere maps field names to their HTML data.
        my @headers;
        while (my ($field, $data) = each %html_fields_from_somewhere) {
                my $header = $p->parse_header($field, $data);
                push @headers, $header if $header;
        }


Parses header and body HTML and converts both to text, or optionally (for body text) to simpler fully-formed HTML.

The package extends HTML::TreeBuilder to include functionality for parsing email elements from an HTML string.



$parser->message_start(@message_start_tokens);

Sets the tokens to watch for that denote the beginning of a message. This allows email messages to be embedded within a DIV or other enclosing HTML tag, or simply to follow a particular sequence of tags.

The @message_start_tokens array is passed verbatim to the HTML::TreeBuilder/HTML::Element functions for traversing the HTML tree. This is typically a list of items such as

  '_tag', 'a', 'href', ''

which is interpreted to mean "look for an 'anchor' tag with an 'href' attribute of ''".
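As a rough illustration of how such a token list behaves, the sketch below passes a flat list straight to HTML::Element's look_down(). The markup and the href value '/inbox/42' are hypothetical, chosen only so that the example has something to match; this is not the module's own code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup and href value, for illustration only.
my $html = '<html><body><p>Intro</p>'
         . '<a href="/inbox/42">Message 42</a></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# A flat list of attribute names and values, as described above.
my @message_start_tokens = ('_tag', 'a', 'href', '/inbox/42');

# In scalar context look_down() returns the first matching element, or undef.
my $start = $tree->look_down(@message_start_tokens);
my $found = $start ? $start->as_text : '';
print "Found: $found\n";

$tree->delete;
```

Each name/value pair in the list must match for an element to be selected, so adding more pairs narrows the search.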

Since this is a list or array, I typically use the slightly easier-to-read notation of

  '_tag' => 'a', 'href' => ''

$hdr_text = $parser->parse_header($field, $data);

Attempts to find a valid email header name in $field, and a corresponding value in $data. Potential header names are compared to those in @mail_header_names if and only if $field matches the $LOOKS_LIKE_A_HEADER regexp.

If a valid field name is found, the returned string contains the header in the form 'Name: Value', for example 'To: "A User" <>'. If no such field name is found, undef is returned.
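The two-stage check described above might look something like the following minimal sketch. The contents of @mail_header_names and $LOOKS_LIKE_A_HEADER here are hypothetical stand-ins (the real values are not given in this document), and sketch_parse_header() is an illustrative name, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the module's @mail_header_names and
# $LOOKS_LIKE_A_HEADER; the real values may differ.
my @mail_header_names   = qw(From To Cc Bcc Subject Date Reply-To);
my $LOOKS_LIKE_A_HEADER = qr/^\s*([A-Za-z][A-Za-z-]*)\s*:?\s*$/;

sub sketch_parse_header {
    my ($field, $data) = @_;
    return unless defined $field && defined $data;

    # Only consult the name list if the field has the shape of a header.
    my ($candidate) = $field =~ $LOOKS_LIKE_A_HEADER or return;

    # Case-insensitive comparison against the known header names.
    my ($name) = grep { lc $_ eq lc $candidate } @mail_header_names;
    return unless defined $name;

    # Trim surrounding whitespace from the value.
    $data =~ s/^\s+|\s+$//g;
    return "$name: $data";
}

my $header = sketch_parse_header('to:', ' "A User" <> ');
print "$header\n" if defined $header;
```

Checking the cheap regexp first means arbitrary page text never reaches the name comparison, which is presumably why the documented method is structured this way.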


Reads the body text out of $html, and stores it for later processing. This method will probably be folded into something else at a future date.

$parser->body_as_html($message);
$parser->body_as_plain($message);
$parser->body_as_text($message);
$parser->body_as_appropriate($message);

Reads the (parsed and stored) message body and returns it in the specified format. Normally you would only want to call body_as_appropriate(), since this will handle the message's Content-Type correctly. The other methods are just wrappers for body().

$normalised_html = $parser->parse_body_as_html($html);
$text = $parser->parse_body_as_text($html);
$text = $parser->parse_body($html, $style);

Deprecated methods; will be removed in a future version.

$parser->extract_body(extraction criteria..);

Extracts the body from the currently stored message, removing the 'extraction criteria' from around it. The extraction criteria are a series of arrayrefs, each containing tags to pass verbatim to HTML::Element::look_down(). This method may be called serially with different criteria each time. It returns 0 if the criteria were not found, 1 otherwise.

This method may be folded into something else at a future date.
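To make the idea of "removing the criteria from around the body" concrete, here is a minimal sketch using plain HTML::TreeBuilder. The sketch_extract() helper and the sample markup are hypothetical, and treating a single missing criterion as failure is an assumption; the module's own logic may differ.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical wrapper markup around a message body.
my $html = '<html><body><div id="wrapper"><div class="chrome">'
         . '<p>Hello there</p></div></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# A series of arrayrefs, each one a look_down() argument list.
my @criteria = (
    [ _tag => 'div', id    => 'wrapper' ],
    [ _tag => 'div', class => 'chrome'  ],
);

sub sketch_extract {
    my ($tree, @criteria) = @_;
    for my $c (@criteria) {
        # Assumption: give up as soon as one criterion is not found.
        my $wrapper = $tree->look_down(@$c) or return 0;
        # Splice the wrapper's children into its place, then discard it.
        $wrapper->replace_with_content->delete;
    }
    return 1;
}

my $ok        = sketch_extract($tree, @criteria);
my @divs_left = $tree->look_down(_tag => 'div');
my $text      = $tree->as_text;
print "ok=$ok text=$text\n";
```

The wrappers disappear but their contents stay in the tree, which is the difference between this operation and outright removal of matched elements.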

my $body = $parser->body($message, $style);

Returns the parsed-and-stored message body.

How the message is returned depends on the value (if any) in $style, the message's Content-Type (if any), and the current parsing and rendering capabilities of HTML::TreeBuilder, according to the following rules:

1. TODO: rules.

$parser->remove_matching(match criteria);

Removes the elements matching the provided criteria (given as a list to pass to HTML::TreeBuilder) from the message body. No other processing is performed on the contents. Note that any elements contained within a matched element are removed along with it.
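The effect can be approximated with HTML::TreeBuilder alone, as in the sketch below. The markup and the 'advert' class are hypothetical, and this is only an illustration of the behaviour described, not the method's actual implementation.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical message markup with an unwanted block.
my $html = '<html><body><p>Keep me</p>'
         . '<div class="advert"><p>Buy <b>now</b></p></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @match_criteria = (_tag => 'div', class => 'advert');

# delete() detaches each matched element and destroys it with its children.
$_->delete for $tree->look_down(@match_criteria);

my $remaining = $tree->as_text;
print "$remaining\n";
```

Note how the nested paragraph and bold text vanish together with the matched div, in line with the warning about contained elements.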

my $flag = $parser->might_be_html($message, $text);

Returns 'html' if $text looks like HTML, based on the presence of a matching tag </foo> for any tag <foo>; 'plain' otherwise. If debugging is on, adds conversion info 'X-' headers to $message (which should be of type 'Mail::Internet'), if conversion is performed.
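A minimal sketch of such a matching-tag heuristic is shown here. The function name and the exact regexp are assumptions made for illustration, and the sketch leaves out the optional 'X-' debugging headers entirely.

```perl
use strict;
use warnings;

sub sketch_might_be_html {
    my ($text) = @_;
    # Look for any opening tag <foo ...> followed later by its closing </foo>.
    return $text =~ m{<([A-Za-z][A-Za-z0-9]*)\b[^>]*>.*?</\1\s*>}is
        ? 'html'
        : 'plain';
}

print sketch_might_be_html('<b>bold</b>'), "\n";
print sketch_might_be_html('a < b and c > d'), "\n";
```

A heuristic of this kind classifies text containing only unpaired tags, such as a lone <br>, as plain.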

my $text = $parser->html2text($html);
my $text = Mail::Webmail::MessageParser::html2text($html);

'Converts' the provided html into plain text. 'Converts' is in quotes because the conversion is pretty simplistic - in the worst case, <br> tags are replaced with newlines, and no other conversion is performed.
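The worst case mentioned above amounts to something like the following sketch. The function name is hypothetical and the substitution is an assumption about what "replaced with newlines" means in practice; the real method may do more.

```perl
use strict;
use warnings;

sub sketch_html2text {
    my ($html) = @_;
    # Replace <br>, <br/> and <br /> (any case) with a newline; touch nothing else.
    (my $text = $html) =~ s{<br\s*/?>}{\n}gi;
    return $text;
}

print sketch_html2text('first line<br>second line<BR />third line'), "\n";
```

Any other markup passes through untouched, which is why the results can be poor for heavily formatted messages.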

$parser->start($tagname, $attr, $attrseq, $origtext);
$parser->end($tagname);

Override the corresponding methods in HTML::TreeBuilder, which itself overrides those in HTML::Parser. These methods should not be called directly from an application. They are here mainly to remove surplus HTML tags from around the message body; these tags confuse HTML::TreeBuilder and thus result in poor behaviour.




o There may be some issues with the HTML entities being decoded.

o Message bodies should really be enclosed in container tags; I have not tested what happens if a non-contained tag is passed to message_start().

o Conversion from HTML to text in some cases produces very poor results. Generally it's best to let the parser figure out the most desirable output format (it gives very good results if the Content-Type is set correctly).


  Simon Drabble  <sdrabble@cpan.org>