NAME

HTML::TableContentParser - Do interesting things with the contents of tables.

SYNOPSIS

  use HTML::TableContentParser;
  my $p = HTML::TableContentParser->new();
  my $html = read_html_from_somewhere();
  my $tables = $p->parse( $html );
  for my $t (@$tables) {
    for my $r (@{$t->{rows}}) {
      print 'Row:';
      for my $c (@{$r->{cells}}) {
        print " [$c->{data}]";
      }
      print "\n";
    }
  }

DESCRIPTION

This package parses tables out of HTML. The parse() method returns a reference to an array containing the tables found.

Tables appear in the output in the order in which they are encountered. If a table is nested inside a cell of another table, it will appear after the containing table in the output, and any connection between the two will be lost. As of version 0.200_01, the appearance of a nested table should not cause any truncation of the containing table.
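The flattening of nested tables described above can be seen in a small sketch. The HTML and the id values here are made up for illustration; the only module calls used are new() and parse().

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Illustrative HTML (an assumption, not taken from the module's tests):
# an outer table whose only cell contains an inner table.
my $html = join '',
    '<table id="outer"><tr><td>',
    '<table id="inner"><tr><td>nested</td></tr></table>',
    '</td></tr></table>';

my $tables = HTML::TableContentParser->new()->parse( $html );

# Both tables come back in one flat list, the containing table first.
# Nothing in either hash records that one was nested inside the other.
print scalar( @$tables ), " tables found\n";
print "table $_: id=$tables->[$_]{id}\n" for 0 .. $#$tables;
```

If the relationship between the two tables matters, one of the tree-building parsers listed under SEE ALSO is probably a better fit.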

The following tags are processed by this module: <table>, <caption>, <tr>, <th>, and <td>. In the data returned by the parse method, each tag is represented by a hash reference whose keys are the tag's attribute names and whose values are the corresponding attribute values. In addition, the following keys will be provided:

<table>
caption

the <caption> tag, if any

headers

a reference to an array containing all the <th> tags, in the order encountered

rows

a reference to an array containing all the <tr> tags, in the order encountered

<caption>
data

the content of the <caption> tag

<tr>
cells

a reference to an array containing all the <td> tags, in the order encountered, with undef representing any <th> tags encountered. Trailing undef values will be dropped, and the entire key will be absent unless actual <td> tags are found in the row.

Note that prior to version 0.299_01, <th> tags were not represented at all.

headers

new with version 0.299_01, this is a reference to an array containing all the <th> tags in the row, in the order encountered, with undef representing any <td> tags. Trailing undef values will be dropped, and the entire key will be absent unless actual <th> tags are found in the row.

It is the understanding of the current author (TRW) that in valid HTML <th> tags must occur inside a <tr> element, so they need to be recognized there, rather than (or in addition to) in isolation.

<th>
data

the content of the <th> tag

<td>
data

the content of the <td> tag
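As a rough illustration of the structure described above, the following sketch parses a small made-up table and walks the resulting hashes. The HTML, the attribute values, and the '(td)' and '(th)' placeholder strings are assumptions chosen for the example; classic => 0 is passed explicitly so that the behavior documented for 0.299_01 and later is requested regardless of the global default.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Illustrative HTML, written without whitespace between tags.
my $html = join '',
    '<table id="stooges" border="1">',
    '<caption>The Stooges</caption>',
    '<tr><th>First</th><th>Last</th></tr>',
    '<tr><td class="given">Moe</td><td>Howard</td></tr>',
    '</table>';

my $p = HTML::TableContentParser->new( classic => 0 );
my ( $table ) = @{ $p->parse( $html ) };

# Attributes of a tag become keys of the hash that represents it.
print "id:      $table->{id}\n";
print "caption: $table->{caption}{data}\n";

# Table-level list of every <th> in the table.
print "headers: ", join( ', ', map { $_->{data} } @{ $table->{headers} } ), "\n";

for my $row ( @{ $table->{rows} } ) {
    # Either key may be absent, so fall back to an empty list.
    my @th = map { defined $_ ? $_->{data} : '(td)' } @{ $row->{headers} || [] };
    my @td = map { defined $_ ? $_->{data} : '(th)' } @{ $row->{cells}   || [] };
    print "row: headers=[@th] cells=[@td]\n";
}
```

Because the {cells} and {headers} keys are omitted when a row has no cells of that kind, code that walks rows should guard against a missing key, as the loop above does.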

METHODS

This module is a subclass of HTML::Parser. It provides only one new method, classic(), which is an accessor for the attribute of the same name. The following inherited (or overridden) methods may profitably be called by the user.

new

 my $p = HTML::TableContentParser->new();

This static method instantiates the parser object. The only supported argument is:

classic

If this argument is set to 1, <th> tags are handled in the pre-0.299_01 way. That is, the <tr> hash will not contain a {headers} key, and its {cells} key will not contain any undef values corresponding to <th> elements.

If this argument is set to 0, you get the behavior documented for 0.299_01 and after.

If this argument is undef or omitted, the value of $HTML::TableContentParser::CLASSIC is used.

No other values are supported -- that is, the author reserves them, and the behavior when you use them may change without warning.

classic

This method returns the value of the classic attribute, whether specified or defaulted.
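The difference between the two supported values of the classic argument, and the use of the classic() accessor, can be sketched as follows. The HTML and the describe_row() helper are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# One row that mixes a header cell and a data cell.
my $html = '<table><tr><th>Name</th><td>Moe</td></tr></table>';

# Hypothetical helper (not part of the module): parse $html with the
# given classic setting and summarize the first row of the first table.
sub describe_row {
    my ( $classic ) = @_;
    my $p   = HTML::TableContentParser->new( classic => $classic );
    my $row = $p->parse( $html )->[0]{rows}[0];
    return {
        classic     => $p->classic(),
        has_headers => exists $row->{headers} ? 1 : 0,
        cells       => [
            map { defined $_ ? $_->{data} : 'undef' } @{ $row->{cells} || [] }
        ],
    };
}

for my $classic ( 0, 1 ) {
    my $info = describe_row( $classic );
    print "classic=$info->{classic} ",
        "row has {headers}: $info->{has_headers} ",
        "cells: @{ $info->{cells} }\n";
}
```

With classic set to 0 the <th> occupies a slot in {cells} as undef, so cell positions line up with column positions; with classic set to 1 the {cells} array holds only the <td> entries.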

parse

 my $tables = $p->parse( $html );

This method parses the given HTML. The return is a reference to an array containing all the tables found.
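A minimal sketch of calling parse() on a complete document follows. The document text is invented for the example; markup outside the tables is simply not represented in the result.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Illustrative document: two tables separated by non-table markup.
my $html = join '',
    '<html><body><h1>Report</h1>',
    '<table id="first"><tr><td>a</td></tr></table>',
    '<p>Some text between the tables.</p>',
    '<table id="second"><tr><td>b</td></tr></table>',
    '</body></html>';

my $p      = HTML::TableContentParser->new();
my $tables = $p->parse( $html );

# The return value is an array reference, one element per table.
printf "parse() returned an %s reference holding %d table(s)\n",
    ref( $tables ), scalar @$tables;
print "ids in document order: ", join( ', ', map { $_->{id} } @$tables ), "\n";
```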

GLOBALS

The following global variables, properly localized, can be used to modify the behavior of this module.

$HTML::TableContentParser::CLASSIC

This variable provides the default value of the classic argument to new(), and is subject to the same restrictions.

$HTML::TableContentParser::DEBUG

If set to 1, causes debug output to be written to STDERR (via warn()). Setting this to any true value (including 1) is unsupported in the sense that the behavior of this module in response to any true value is explicitly undocumented, and can change without notice.
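What "properly localized" might look like for $HTML::TableContentParser::CLASSIC is sketched below. The HTML is illustrative; the point is that local confines the changed default to one block, so other users of the module in the same program are unaffected.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

my $html = '<table><tr><th>Name</th><td>Moe</td></tr></table>';

my ( $classic_seen, $row );
{
    # local restores the previous value when this block exits.
    local $HTML::TableContentParser::CLASSIC = 1;

    # No classic argument is given, so the global supplies the default.
    my $p = HTML::TableContentParser->new();
    $classic_seen = $p->classic();
    $row          = $p->parse( $html )->[0]{rows}[0];
}

print "classic attribute inside the block: $classic_seen\n";
print "row has a {headers} key: ",
    ( exists $row->{headers} ? 'yes' : 'no' ), "\n";
```

Passing classic explicitly to new() is usually simpler; the global is mainly useful when the call to new() is in code you do not control.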

EXPORTS

Nothing.

CAVEATS, BUGS, and TODO

The rowspan and colspan attributes are reported but ignored. That is,

 <tr><td colspan="2">Moe</td><td>Howard</td></tr>

occupies three columns in the HTML table, but only two entries are made in the {cells} value of the hash that represents this row.
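Since colspan is reported as an ordinary attribute of the cell, a caller who needs the column count can work it out. The sketch below is one possible approach; column_span() is a hypothetical helper, not part of this module, and it makes no attempt to handle rowspan.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

my $html = '<table><tr><td colspan="2">Moe</td><td>Howard</td></tr></table>';

my $row = HTML::TableContentParser->new( classic => 0 )
    ->parse( $html )->[0]{rows}[0];

# Hypothetical helper: total columns occupied by the <td> cells of a row,
# using each cell's colspan attribute and defaulting to 1. An undef
# placeholder (standing in for a <th>) is counted as a single column.
sub column_span {
    my ( $r ) = @_;
    my $total = 0;
    for my $cell ( @{ $r->{cells} || [] } ) {
        my $span = ( defined $cell && defined $cell->{colspan} )
            ? $cell->{colspan}
            : 1;
        $total += $span;
    }
    return $total;
}

print "entries in {cells}: ", scalar @{ $row->{cells} }, "\n";
print "columns occupied:   ", column_span( $row ), "\n";
```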

Please file bug reports at https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-TableContentParser, https://github.com/trwyant/perl-HTML-TableContentParser/issues, or in electronic mail to wyant at cpan dot org.

SEE ALSO

This module is a very specific tool to address a very specific problem. One of the following modules may better address your needs.

HTML::Parser. This is a general HTML parser, which forms the basis for this module.

HTML::TreeBuilder. This is a general HTML parser, with methods to search and traverse the parse tree once generated.

Mojo::DOM in the Mojolicious distribution. This is a general HTML/XML DOM parser, with methods to search the parse tree using CSS selectors.

AUTHOR

Simon Drabble <sdrabble@cpan.org>

Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

Copyright (C) 2002 Simon Drabble

Copyright (C) 2017-2021 Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.