HTML::Content::ContentExtractor - Perl module for extracting content from HTML documents.
    use HTML::WordTagRatio::WeightedRatio;
    use HTML::Content::HTMLTokenizer;
    use HTML::Content::ContentExtractor;

    my $tokenizer = new HTML::Content::HTMLTokenizer('TAG','WORD');
    my $ranker = new HTML::WordTagRatio::WeightedRatio();
    my $extractor = new HTML::Content::ContentExtractor($tokenizer,$ranker,"index.html","index.extr");
    $extractor->Extract();
HTML::Content::ContentExtractor attempts to extract the main content from an HTML document. It strips tags, scripts, and boilerplate text by locating the region of the document that has the maximum ratio of words to tags.
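The word-to-tag idea described above can be illustrated with a small self-contained sketch. It is not the module's implementation: the helper names (tokenize, best_region, extract_text), the regex tokenizer, and the scoring rule are all assumptions made for illustration. In place of a true ratio, the sketch scores each word +1 and each tag as a negative penalty, then picks the contiguous run of tokens with the highest total, which favors word-dense regions in a similar spirit.

```perl
use strict;
use warnings;

# Split HTML into [type, text] tokens: 'TAG' for markup, 'WORD' for words.
# A regex tokenizer is a simplification and does not handle every HTML quirk.
sub tokenize {
    my ($html) = @_;
    # Drop script and style elements together with their bodies.
    $html =~ s{<(script|style)\b.*?</\1\s*>}{ }gis;
    my @tokens;
    while ($html =~ /(<[^>]*>)|([^<\s]+)/g) {
        push @tokens, defined $1 ? ['TAG', $1] : ['WORD', $2];
    }
    return @tokens;
}

# Find the contiguous token run with the highest score, where a word
# scores +1 and a tag scores -$tag_penalty (maximum-sum subarray scan).
# Returns (start, end) indices; end is -1 when no word was found.
sub best_region {
    my ($tokens, $tag_penalty) = @_;
    my ($best_sum, $best_start, $best_end) = (0, 0, -1);
    my ($sum, $start) = (0, 0);
    for my $i (0 .. $#$tokens) {
        $sum += $tokens->[$i][0] eq 'WORD' ? 1 : -$tag_penalty;
        if ($sum <= 0) {
            # A non-positive running total cannot help a later region.
            ($sum, $start) = (0, $i + 1);
        }
        elsif ($sum > $best_sum) {
            ($best_sum, $best_start, $best_end) = ($sum, $start, $i);
        }
    }
    return ($best_start, $best_end);
}

# Return the words of the best-scoring region, joined by single spaces.
sub extract_text {
    my ($html, $tag_penalty) = @_;
    $tag_penalty = 1 unless defined $tag_penalty;
    my @tokens = tokenize($html);
    my ($start, $end) = best_region(\@tokens, $tag_penalty);
    return join ' ',
        map  { $_->[1] }
        grep { $_->[0] eq 'WORD' } @tokens[$start .. $end];
}

my $page = '<html><body><div><a href="/">Home</a> <a href="/x">About</a></div>'
         . '<p>The quick brown fox jumps over the lazy dog</p>'
         . '<div><a href="/c">Contact</a></div></body></html>';
print extract_text($page), "\n";
```

On this sample page, the navigation links are surrounded by tags and score poorly, so the paragraph text is selected. Raising the hypothetical $tag_penalty argument makes the sketch stricter about markup inside the chosen region.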
my $extractor = new HTML::Content::ContentExtractor($tokenizer, $ratio, $inputfilename, $extractfilename)
Initializes HTML::Content::ContentExtractor with four arguments: (1) an object that can tokenize HTML, (2) an object that can compute the ratio of words to tags, (3) an input filename, and (4) an output filename.
$extractor->Extract()
Attempts to extract content from the file named by $inputfilename, as passed to the constructor.
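Because the documentation does not say how Extract() reports failure, a caller may want to guard the call. The following is a minimal sketch under stated assumptions: extract_file is a hypothetical helper, not part of the module, and it assumes that any failure inside the module surfaces as a Perl exception. The constructor arguments are taken from the synopsis.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module. Returns undef on success
# or an error message string on failure.
sub extract_file {
    my ($input, $output) = @_;
    return "input file '$input' is not readable" unless -r $input;

    my $ok = eval {
        # Load at run time so a missing installation is reported
        # as an error message instead of aborting the program.
        require HTML::WordTagRatio::WeightedRatio;
        require HTML::Content::HTMLTokenizer;
        require HTML::Content::ContentExtractor;

        my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG', 'WORD');
        my $ranker    = HTML::WordTagRatio::WeightedRatio->new();
        my $extractor = HTML::Content::ContentExtractor->new(
            $tokenizer, $ranker, $input, $output);

        # Assumption: Extract() dies on failure; its return value is
        # not documented, so it is not inspected here.
        $extractor->Extract();
        1;
    };
    return $ok ? undef : "extraction failed: $@";
}

my $error = extract_file('index.html', 'index.extr');
warn "$error\n" if defined $error;
```

Checking the input file before constructing the extractor keeps the most common mistake, a wrong path, separate from errors raised inside the module.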
Jean Tavernier (jj.tavernier@gmail.com)
Copyright 2005 Jean Tavernier. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
ContentExtractorDriver.pl (1).