
NAME

HTML::WordTagRatio::SmoothedRatio - Default module for determining the ratio of words to tags in a range of tokens in an HTML document.

SYNOPSIS

  use HTML::WordTagRatio::SmoothedRatio;
  use HTML::Content::HTMLTokenizer;
  use HTML::Content::ContentExtractor;
  
  my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG','WORD');
  
  # Read the whole document into a single string.
  open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
  my $doc = join("", <$fh>);
  close($fh);
  
  my ($word_count_arr_ref,$tag_count_arr_ref,$token_type_arr_ref,$token_hash_ref) = $tokenizer->Tokenize($doc);
  
  my $ratio = HTML::WordTagRatio::SmoothedRatio->new();
    
  # Score the whole document: $end is the index of the last token.
  my $value = $ratio->RangeValue(0, $#$word_count_arr_ref, 
                                $word_count_arr_ref, $tag_count_arr_ref);
                                

DESCRIPTION

HTML::WordTagRatio::SmoothedRatio computes a ratio of words to tags for a given range of tokens. In pseudocode, the ratio is

Words/TotalWords/(Tags + 1)/(TotalTags + 1)

Methods

  • my $ratio = HTML::WordTagRatio::SmoothedRatio->new()

    Creates a new HTML::WordTagRatio::SmoothedRatio object.

  • my $value = $ratio->RangeValue($start, $end, \@WordCount, \@TagCount)

    $value is computed as follows:

            ($WordCount[$end] - $WordCount[$start]) / $WordCount[$#WordCount]
                / ($TagCount[$end] - $TagCount[$start] + 1) / ($TagCount[$#TagCount] + 1)
            

    This is the number of words in the range, divided by the total number of words in the document, divided by the number of tags in the range plus one, divided by the total number of tags plus one. The added ones keep the divisors non-zero for ranges that contain no tags.

    $WordCount[$i] is the number of word tokens at or before the i-th token in the input HTML document. $TagCount[$i] is the number of tag tokens at or before the i-th token in the input HTML document.
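
The computation above can be tried without the module. The following is a minimal sketch, not the module's own code: it assumes the expression is evaluated exactly as written (Perl's "/" associates left to right), and the helper names cumulative_counts and smoothed_range_value, the 'TAG'/'WORD' type strings, and the sample token list are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: for each token position, count how many tokens of
# type $wanted occur at or before that position.
sub cumulative_counts {
    my ($types, $wanted) = @_;
    my $seen = 0;
    my @counts;
    for my $type (@$types) {
        $seen++ if $type eq $wanted;
        push @counts, $seen;
    }
    return \@counts;
}

# Hypothetical stand-in for RangeValue: a literal transcription of the
# documented expression. Division is left-associative, so this evaluates as
# ((words_in_range / total_words) / (tags_in_range + 1)) / (total_tags + 1).
sub smoothed_range_value {
    my ($start, $end, $word_count, $tag_count) = @_;
    my $value = ($word_count->[$end] - $word_count->[$start])
              / $word_count->[-1]
              / ($tag_count->[$end] - $tag_count->[$start] + 1)
              / ($tag_count->[-1] + 1);
    return $value;
}

# Made-up token stream: one entry per token, in document order.
my @types = qw(TAG WORD WORD TAG WORD WORD WORD TAG);

my $word_count = cumulative_counts(\@types, 'WORD');   # 0 1 2 2 3 4 5 5
my $tag_count  = cumulative_counts(\@types, 'TAG');    # 1 1 1 2 2 2 2 3

# Whole document: 5 of 5 words, 2 tags in range (+1), 3 tags in total (+1).
my $value = smoothed_range_value(0, $#types, $word_count, $tag_count);
printf "%.4f\n", $value;   # prints 0.0833, i.e. 5/5/3/4
```

Because the counts are cumulative and the value at $start is subtracted, the token at position $start itself is not counted under this reading: the leading tag in the sample is excluded, which is why only two tags fall in the range.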

AUTHOR

Jean Tavernier (jj.tavernier@gmail.com)

COPYRIGHT

Copyright 2005 Jean Tavernier. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

ContentExtractorDriver.pl (1), HTML::Content::ContentExtractor (3), HTML::Content::HTMLTokenizer (3), HTML::WordTagRatio::Ratio (3), HTML::WordTagRatio::WeightedRatio (3), HTML::WordTagRatio::RelativeRatio (3), HTML::WordTagRatio::ExponentialRatio (3), HTML::WordTagRatio::NormalizedRatio (3).