NAME

HTML::TableContentParser - Do interesting things with the contents of tables.

SYNOPSIS

  use HTML::TableContentParser;
  $p = HTML::TableContentParser->new();
  $tables = $p->parse($html);

DESCRIPTION

This package extracts the contents of tables from a string containing HTML. Each table encountered is stored as a hash in an array. The hash holds whatever was discovered about the table (id, name, border, cell spacing, and so on) and, of course, the data contained within the table.

Tables appear in the output in the order in which they are encountered. If a table is nested inside a cell of another table, it will appear after the containing table in the output, and any connection between the two will be lost. As of version 0.200_01, the appearance of a nested table should not cause any truncation of the containing table.

The format of each hash will look something like

  attributes            keys from the attributes of the <table> tag
  @{$table_headers}     array of table headers, in order found
  @{$table_rows}        rows discovered, in order

If the table has a caption, this will be provided as

  caption               keys from the caption tag's attributes
    data                the text of the <caption>..</caption> element

then for each table row,

  @{$table_data}        td's found, in order
  other attributes      the ... in <tr ...>

then for each data cell,

  data                  what comes between <td> and </td>
  other attributes      the ... in <td ...>
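
To make the layout above concrete, here is a hand-built sketch of what one table hash might look like, and how to reach into it. It does not call the module. The hash keys rows, cells, data and caption follow this documentation, while the headers key name and all attribute values are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hand-built stand-in for one element of the arrayref returned by parse().
# The 'headers' key name and every attribute value here are assumptions.
my $table = {
    id      => 'fruit',                      # attributes of the <table> tag
    border  => '1',
    caption => { data => 'Fruit prices' },   # caption text, plus any attributes
    headers => [ { data => 'Name' }, { data => 'Price' } ],
    rows    => [
        { cells => [ { data => 'Apple' }, { data => '0.50', align => 'right' } ] },
        { class => 'odd',                    # attribute of the <tr> tag
          cells => [ { data => 'Pear' },  { data => '0.75', align => 'right' } ] },
    ],
};

# Each level is an ordinary hash or array reference.
my $caption    = $table->{caption}{data};
my @header_txt = map { $_->{data} } @{ $table->{headers} };
my $last_cell  = $table->{rows}[1]{cells}[1]{data};

print "$caption: @header_txt; last cell = $last_cell\n";
```

Tag attributes and the module's own keys share one hash at each level, so an attribute that happens to be called data or cells could collide with them.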

EXAMPLE

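  use strict;
  use warnings;
  use HTML::TableContentParser;

  my $p = HTML::TableContentParser->new();
  # read_html_from_somewhere() is a placeholder for your own HTML source
  my $html = read_html_from_somewhere();
  my $tables = $p->parse($html);
  for my $t (@$tables) {
    for my $r (@{$t->{rows}}) {
      print "Row: ";
      for my $c (@{$r->{cells}}) {
        # a cell may carry no text, so guard against undef
        my $data = defined $c->{data} ? $c->{data} : '';
        print "[$data] ";
      }
      print "\n";
    }
  }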
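
A common next step after a loop like the one above is to pair each cell with its column header. The following is a minimal sketch of that idea, not part of the module: rows_as_hashes() is a hypothetical helper, the sample data is hand-built rather than parsed, and it assumes header cells are stored under a headers key.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each row of one parsed table into a hash keyed
# by header text. Assumes headers are stored under the 'headers' key.
sub rows_as_hashes {
    my ($table) = @_;
    my @names = map { defined $_->{data} ? $_->{data} : '' }
                @{ $table->{headers} || [] };
    my @records;
    for my $row (@{ $table->{rows} || [] }) {
        my @cells = @{ $row->{cells} || [] };
        next unless @cells;    # skip rows without data cells
        my %record;
        for my $i (0 .. $#cells) {
            # fall back to a positional key when there is no matching header
            my $key = defined $names[$i] ? $names[$i] : "col$i";
            $record{$key} = $cells[$i]{data};
        }
        push @records, \%record;
    }
    return \@records;
}

# Hand-built sample standing in for one element of parse()'s return value.
my $sample = {
    headers => [ { data => 'Name' }, { data => 'Price' } ],
    rows    => [
        {},    # assumed shape of a row that held only header cells
        { cells => [ { data => 'Apple' }, { data => '0.50' } ] },
        { cells => [ { data => 'Pear' },  { data => '0.75' }, { data => 'new' } ] },
    ],
};

my $records = rows_as_hashes($sample);
print "$_->{Name} costs $_->{Price}\n" for @$records;
```

Duplicate or empty header texts would overwrite one another in this sketch, so real tables may need a more careful choice of keys.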

METHODS

start($parser, $tag, $attr, $attrseq, $origtext);

Called whenever a particular start tag has been recognised. This is called automatically by the parser and should not be called from the application.

text($parser, $content);

Called whenever a piece of content is encountered. This is called automatically by the parser and should not be called from the application.

end($parser, $tag, $origtext);

Called whenever a particular end tag is encountered. This is called automatically by the parser and should not be called from the application.

$tables_ref = $p->parse($html);

Called with the HTML to parse. This is all the application needs to do. The return value will be an arrayref containing each table encountered, in the format detailed above.

This method will croak() if the argument is undefined or omitted.
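
Because parse() can croak, an application that receives HTML from an unreliable source may want to guard the call. The sketch below shows one way to do that. parse_tables_safely() is a hypothetical wrapper, not something the module provides, and returning an empty arrayref on failure is a design choice made for this example.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Hypothetical wrapper: never dies, always returns an arrayref.
sub parse_tables_safely {
    my ($html) = @_;
    return [] unless defined $html;    # sidestep the documented croak()

    my $p = HTML::TableContentParser->new();
    my $tables;
    my $ok = eval { $tables = $p->parse($html); 1 };
    unless ($ok) {
        warn "parse failed: " . ($@ || "unknown error\n");
        return [];
    }
    return $tables || [];
}

my $tables = parse_tables_safely('<table><tr><td>a</td><td>b</td></tr></table>');
my $none   = parse_tables_safely(undef);
printf "%d table(s), %d from undefined input\n", scalar @$tables, scalar @$none;
```

Callers can then loop over the result without first checking whether the parse succeeded.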

DEBUG

Not a method, but a class variable. Set to 1 to cause debugging output (basically the structure and content of the table) to be sent to STDERR via warn().

EXPORTS

Nothing.

CAVEATS, BUGS, and TODO

SEE ALSO

This module is a very specific tool to address a very specific problem. One of the following modules may better address your needs.

HTML::Parser. This is a general HTML parser, which forms the basis for this module.

HTML::TreeBuilder. This is a general HTML parser, with methods to search and traverse the parse tree once generated.

Mojo::DOM in the Mojolicious distribution. This is a general HTML/XML DOM parser, with methods to search the parse tree using CSS selectors.

AUTHOR

Simon Drabble <sdrabble@cpan.org>

Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

Copyright (C) 2002 Simon Drabble

Copyright (C) 2017-2018 Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.