HTML::WordTagRatio::SmoothedRatio - Default module for determining the ratio of words to tags in a range of tokens in an HTML document.
    use HTML::WordTagRatio::SmoothedRatio;
    use HTML::Content::HTMLTokenizer;
    use HTML::Content::ContentExtractor;

    my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG', 'WORD');

    # Read the whole document into a single string.
    open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
    my $doc = join('', <$fh>);
    close($fh);

    my ($word_count_arr_ref, $tag_count_arr_ref, $token_type_arr_ref, $token_hash_ref) = $tokenizer->Tokenize($doc);

    my $ratio = HTML::WordTagRatio::SmoothedRatio->new();

    # Score the whole document. The end argument is the last array index, not the array itself.
    my $value = $ratio->RangeValue(0, $#$word_count_arr_ref, $word_count_arr_ref, $tag_count_arr_ref);
HTML::WordTagRatio::SmoothedRatio computes a ratio of words to tags for a given range of tokens. In pseudocode, the ratio is
Words/TotalWords/(Tags + 1)/(TotalTags + 1)
my $ratio = new HTML::WordTagRatio::SmoothedRatio()
Initializes HTML::WordTagRatio::SmoothedRatio
my $value = $ratio->RangeValue($start, $end, \@WordCount, \@TagCount)
$value is computed as follows:
($WordCount[$end] - $WordCount[$start])/$WordCount[$#WordCount]/($TagCount[$end] - $TagCount[$start] + 1)/($TagCount[$#TagCount] + 1)
This is the number of words in the range, divided by the total number of words in the document, divided by the number of tags in the range plus one, divided by the total number of tags plus one. Adding one to each tag count keeps the divisors nonzero for ranges that contain no tags.
$WordCount[$i] is the number of word tokens before or at the i-th token in the input HTML document, and $TagCount[$i] is the number of tag tokens before or at the i-th token. Because both arrays are cumulative, each difference counts the tokens after $start up to and including $end.
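The following is a minimal, self-contained sketch of how the cumulative count arrays and the expression above fit together. It does not use the module. The helpers cumulative_counts and range_value_sketch are hypothetical names introduced only for illustration, and the token list is made up. The sketch evaluates the expression exactly as it is written above, so Perl applies the divisions left to right; the module's own implementation may group the terms differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): builds cumulative counts
# from a list of token types, where each entry is 'WORD' or 'TAG'.
sub cumulative_counts {
    my (@types) = @_;
    my (@word_count, @tag_count);
    my ($words, $tags) = (0, 0);
    for my $type (@types) {
        $words++ if $type eq 'WORD';
        $tags++  if $type eq 'TAG';
        push @word_count, $words;    # word tokens before or at this token
        push @tag_count,  $tags;     # tag tokens before or at this token
    }
    return (\@word_count, \@tag_count);
}

# Hypothetical stand-in for RangeValue: evaluates the documented expression
# as written, with the divisions applied left to right.
sub range_value_sketch {
    my ($start, $end, $wc, $tc) = @_;
    my $value = ($wc->[$end] - $wc->[$start]) / $wc->[-1]
              / ($tc->[$end] - $tc->[$start] + 1) / ($tc->[-1] + 1);
    return $value;
}

# Made-up token stream: two tags, four words, two tags.
my @types = qw(TAG TAG WORD WORD WORD WORD TAG TAG);
my ($wc, $tc) = cumulative_counts(@types);

# Range (1, 5] holds 4 words and 0 tags: 4 / 4 / (0 + 1) / (4 + 1)
my $value = range_value_sketch(1, 5, $wc, $tc);
printf "%.4f\n", $value;    # prints 0.2000
```

In this sketch a run of words with no tags scores higher than a range of the same length that also contains tags, since each tag in the range increases the third divisor.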
Jean Tavernier (jj.tavernier@gmail.com)
Copyright 2005 Jean Tavernier. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
ContentExtractorDriver.pl (1), HTML::Content::ContentExtractor (3), HTML::Content::HTMLTokenizer (3), HTML::WordTagRatio::Ratio (3), HTML::WordTagRatio::WeightedRatio (3), HTML::WordTagRatio::RelativeRatio (3), HTML::WordTagRatio::ExponentialRatio (3), HTML::WordTagRatio::NormalizedRatio (3).