NAME

HTML::ContentExtractor - extract the main content from a web page by analyzing the DOM tree

VERSION

Version 0.01

SYNOPSIS

    use LWP::UserAgent;
    use HTML::ContentExtractor;

    my $extractor = HTML::ContentExtractor->new();
    my $agent     = LWP::UserAgent->new;

    # Fetch the page and let LWP decode the response body
    my $url  = 'http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml';
    my $res  = $agent->get($url);
    my $HTML = $res->decoded_content();

    # Run the extraction, then print the result as HTML and as plain text
    $extractor->extract($url, $HTML);
    print $extractor->as_html();
    print $extractor->as_text();

DESCRIPTION

Web pages often contain clutter (such as ads, unnecessary images, and extraneous links) around the body of an article, which distracts the reader from the actual content. This module reduces that noise and so identifies the content-rich regions of a page.

A web page is first parsed by an HTML parser, which corrects the markup and builds a DOM (Document Object Model) tree. The tree is then traversed depth-first, and noise nodes are identified and removed, leaving the main content. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio is higher than a threshold, where the link/text ratio is the number of links divided by the number of non-linked words; and nodes that contain any string from the predefined spam string list.
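The pruning rules above can be sketched on a toy tree. This is only an illustration under stated assumptions, not the module's actual code: the hash-based node layout, the helper names (count_links_and_words, link_text_ratio, prune), and the choice to treat a container with links but no free words as pure noise are all hypothetical. Only the tag examples and the 0.05 default threshold come from this documentation. The spam string check is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical settings; only the 0.05 default comes from the documentation.
my %IGNORE    = map { $_ => 1 } qw(script style);
my %CONTAINER = map { $_ => 1 } qw(table div);
my $THRESHOLD = 0.05;

# A node is either a plain string (text) or { tag => ..., children => [...] }.
# Returns (number of links, number of words that are not inside a link).
sub count_links_and_words {
    my ($node, $inside_link) = @_;
    if (!ref $node) {
        my @words = grep { length } split /\s+/, $node;
        return (0, $inside_link ? 0 : scalar @words);
    }
    my $is_link = $node->{tag} eq 'a';
    my ($links, $words) = ($is_link ? 1 : 0, 0);
    for my $child (@{ $node->{children} }) {
        my ($l, $w) = count_links_and_words($child, $inside_link || $is_link);
        $links += $l;
        $words += $w;
    }
    return ($links, $words);
}

sub link_text_ratio {
    my ($node) = @_;
    my ($links, $words) = count_links_and_words($node, 0);
    return $links / ($words || 1);    # assumption: avoid dividing by zero
}

# Depth-first pruning: returns the cleaned node, or nothing if it is noise.
sub prune {
    my ($node) = @_;
    return $node unless ref $node;                 # keep text leaves
    return if $IGNORE{ $node->{tag} };             # drop script, style
    return if $CONTAINER{ $node->{tag} }
           && link_text_ratio($node) > $THRESHOLD; # drop link-heavy containers
    my @kept = map { prune($_) } @{ $node->{children} };
    return { tag => $node->{tag}, children => \@kept };
}

my $page = {
    tag      => 'body',
    children => [
        { tag => 'div', children => [              # navigation: links only
            { tag => 'a', children => ['Home'] },
            { tag => 'a', children => ['Sports'] },
            { tag => 'a', children => ['News'] },
        ] },
        { tag => 'div', children => [              # article: text, no links
            { tag => 'p', children => ['The match ended two to one last night'] },
        ] },
        { tag => 'script', children => ['track();'] },
    ],
};

my $clean = prune($page);
print scalar @{ $clean->{children} }, " child kept\n";
```

In this toy page only the article div survives: the navigation div has three links and no non-linked words, and the script element is in the ignore set.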

Note that the input HTML must be encoded in UTF-8, and so must the spam words. This lets the module handle web pages in any language (I have used it to process English, Chinese, and Japanese web pages).
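If a page arrives in some other charset, it can be re-encoded before extraction. The following is a minimal sketch using the core Encode module; the helper name to_utf8 is hypothetical, and the sketch assumes you already know the source charset (for example from the HTTP headers).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode raw bytes from a known charset as UTF-8.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    my $chars = decode($charset, $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $chars);         # character string -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                     # "cafe" with e-acute in ISO-8859-1, 4 bytes
my $utf8   = to_utf8($latin1, 'ISO-8859-1');
printf "%d bytes\n", length $utf8;          # the accented letter takes 2 bytes in UTF-8
```

The same helper could be applied to each entry of the spam words list if those strings come from a file in another encoding.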

$e = HTML::ContentExtractor->new(%options);
    Constructs a new HTML::ContentExtractor object. The optional %options hash can be used to set the options listed below.

$e->table_tags();
$e->table_tags(@tags);
    Gets or sets the table tags array. These tags are treated as container tags.

$e->ignore_tags();
$e->ignore_tags(@tags);
    Gets or sets the ignore tags array. Elements with these tags are removed.

$e->spam_words();
$e->spam_words(@strings);
    Gets or sets the spam words list. Elements that contain any of these strings are removed.

$e->link_text_ratio();
$e->link_text_ratio($ratio);
    Gets or sets the link/text ratio threshold. The default is 0.05.

$e->min_text_len();
$e->min_text_len($len);
    Gets or sets the minimum text length. The default is 20. An element whose text is shorter than this value is removed.

$e->extract($url,$HTML);
    Performs the extraction. Note that the input $HTML must be encoded in UTF-8.

$e->as_html();
    Returns the extraction result in HTML format.

$e->as_text();
    Returns the extraction result in text format.
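As an illustration of the accessors above, the sketch below tunes an extractor before running it. The wrapper name extract_main_text, the sample URL, the sample HTML, and the chosen values are hypothetical; only the method names and their argument shapes come from this documentation. The guard around require is there only so the sketch still runs on a machine where the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical wrapper; returns nothing when the module is not installed.
sub extract_main_text {
    my ($url, $html) = @_;
    return unless eval { require HTML::ContentExtractor; 1 };

    my $e = HTML::ContentExtractor->new();
    $e->link_text_ratio(0.1);                # tolerate more links than the 0.05 default
    $e->min_text_len(40);                    # stricter than the default of 20
    $e->spam_words('All rights reserved');   # set the spam string list
    $e->extract($url, $html);
    return $e->as_text();
}

# ASCII-only sample input, so it is valid UTF-8 as well.
my $html = '<html><body>'
         . '<div><a href="/">Home</a> <a href="/news">News</a></div>'
         . '<div><p>The home side won the match two to one after a late goal '
         . 'in the second half of a very tense game.</p></div>'
         . '</body></html>';

my $text = extract_main_text('http://example.com/article.html', $html);
print defined $text ? $text : "HTML::ContentExtractor is not installed\n";
```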

AUTHOR

Zhang Jun, <jzhang533 at gmail.com>

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

To install HTML::ContentExtractor, copy and paste the appropriate command into your terminal.

cpanm

    cpanm HTML::ContentExtractor

CPAN shell

    perl -MCPAN -e shell
    install HTML::ContentExtractor

For more information on module installation, please visit the detailed CPAN module installation guide.
