HTML::ExtractMain - Extract the main content of a web page
Version 0.62
    use HTML::ExtractMain qw( extract_main_html );

    my $html = <<'END';
    <div id="header">Header</div>
    <div id="nav"><a href="/">Home</a></div>
    <div id="body">
        <p>Foo</p>
        <p>Baz</p>
    </div>
    <div id="footer">Footer</div>
    END

    my $main_html = extract_main_html($html);
    if (defined $main_html) {
        # do something with $main_html here
        # $main_html is '<div id="body"><p>Foo</p><p>Baz</p></div>'
    }
extract_main_html is exported only on request; nothing is exported by default.
extract_main_html
extract_main_html takes HTML content, and uses the Readability algorithm to detect the main body of the page, usually skipping headers, footers, navigation, etc.
It takes a single argument, either an HTML string, or an HTML::TreeBuilder tree. (If passed a tree, the tree will be modified and destroyed.)
If the HTML's main content is found, it's returned as an XHTML snippet. The returned HTML will not look like what you put in. (Source formatting, e.g. indentation, will be removed, and you may get back XHTML when you put in HTML.)
If no block of main content can be identified, extract_main_html returns undef.
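The two accepted argument types and the undef return value can be sketched as follows. This is only an illustration under stated assumptions: the helper main_html_or and its fallback argument are hypothetical and not part of HTML::ExtractMain, and the sample markup is the one from the synopsis.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ExtractMain qw( extract_main_html );

# Hypothetical helper (not part of HTML::ExtractMain): return the extracted
# main content, or the caller-supplied fallback when extraction yields undef.
sub main_html_or {
    my ($input, $fallback) = @_;
    my $main_html = extract_main_html($input);
    return defined $main_html ? $main_html : $fallback;
}

my $html = <<'END';
<div id="header">Header</div>
<div id="nav"><a href="/">Home</a></div>
<div id="body">
    <p>Foo</p>
    <p>Baz</p>
</div>
<div id="footer">Footer</div>
END

# 1. Pass the page as an HTML string.
my $from_string = main_html_or($html, '');

# 2. Pass an HTML::TreeBuilder tree instead. The tree is modified and
#    destroyed by the call, so build a throwaway tree just for this purpose
#    and do not use $tree afterwards.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $from_tree = main_html_or($tree, '');

print "From string: $from_string\n";
print "From tree:   $from_tree\n";
```

Checking the result with defined, rather than testing it for truth, keeps "nothing found" separate from any other return value. If the parsed tree is still needed after extraction, passing the HTML string is the safer choice.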
Anirvan Chatterjee, <anirvan at cpan.org>
Please report any bugs or feature requests to bug-html-extractmain at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-ExtractMain. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc HTML::ExtractMain
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ExtractMain
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/HTML-ExtractMain
CPAN Ratings
http://cpanratings.perl.org/d/HTML-ExtractMain
Search CPAN
http://search.cpan.org/dist/HTML-ExtractMain/
HTML::Feature
HTML::ExtractContent
The Readability algorithm is ported from Arc90's JavaScript original, built as part of the excellent Readability application, online at http://lab.arc90.com/experiments/readability/, repository at http://code.google.com/p/arc90labs-readability/.
Copyright 2009-2010 Anirvan Chatterjee, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install HTML::ExtractMain, copy and paste the appropriate command into your terminal.
cpanm
cpanm HTML::ExtractMain
CPAN shell
perl -MCPAN -e shell
install HTML::ExtractMain
For more information on module installation, please visit the detailed CPAN module installation guide.