NAME

urhtml_stats - Show complexity metric and other stats for web page

SYNOPSIS

    urhtml_stats [uri]

EXAMPLE

    urhtml_stats http://perl.org

DESCRIPTION

Given a URI, parses the document found there as HTML and prints a complexity metric and other statistics. The complexity metric is the average nesting depth, in elements, of a character, divided by the logarithm of the length of the HTML.

Other statistics follow, formatted as an HTML table. There is a row for each element type, with

  • The maximum nesting depth of that element (this time only taking into account nesting within that particular element).

  • A count of the elements of that kind in the document.

  • The total number of characters in elements of that type. This counts characters in nested elements multiple times. For example, if a page contains a table within a table, characters in the inner table will be counted twice.

  • The average size of elements of this type, in characters.

Here is the first part of the output for http://perl.org.

   http://perl.org
   Complexity = 0.746

   Element Maximum Number of   Size in   Average
           Nesting Elements  Characters   Size
   a             1        56      3634      64
   body          1         1     12171   12171
   div           5        30     33605    1120
   em            1         1        13      13
   h1            1         1        60      60
   h4            1        11       932      84
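
The four per-element columns described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the Marpa::UrHTML implementation. The names `ElementStats`, `element_stats` and `VOID` are hypothetical. The sketch assumes reasonably well-formed HTML and counts only text characters, whereas the real program may count the characters of the HTML source, tags included.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have content or an end tag (assumed list).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class ElementStats(HTMLParser):
    """Collect per-element nesting, count and character totals."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                  # stack of currently open elements
        self.max_nesting = defaultdict(int)  # deepest self-nesting per element
        self.count = defaultdict(int)        # number of elements per type
        self.chars = defaultdict(int)        # text characters per type

    def handle_starttag(self, tag, attrs):
        self.count[tag] += 1
        if tag in VOID:
            # A void element cannot nest within itself.
            self.max_nesting[tag] = max(self.max_nesting[tag], 1)
            return
        self.open_tags.append(tag)
        # Nesting is measured only against open elements of the same type.
        nesting = self.open_tags.count(tag)
        self.max_nesting[tag] = max(self.max_nesting[tag], nesting)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything up to the match.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # An element that is open twice (a table within a table) is
        # credited twice, matching the double counting described above.
        for tag in self.open_tags:
            self.chars[tag] += len(data)

def element_stats(html):
    """Return {element: (max nesting, count, characters, average size)}."""
    parser = ElementStats()
    parser.feed(html)
    parser.close()
    return {
        tag: (parser.max_nesting[tag], parser.count[tag],
              parser.chars[tag], parser.chars[tag] / parser.count[tag])
        for tag in sorted(parser.count)
    }
```

For a table nested inside a table cell, the text of the inner table is credited to the `table` element twice, so the character total for an element type can exceed the size of the page, as the `div` row above shows.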

THE COMPLEXITY METRIC

I originally was tempted to call the complexity metric a "quality metric", but decided that was going too far. Well-designed websites often have low numbers, but high numbers don't mean low quality -- it depends on what the mission is, and how well complexity is being leveraged to serve that mission.

To obtain the complexity metric, the nesting depth of the average character is divided by the logarithm of the length of the HTML. The idea is that as a web page grows, all else being equal, it is reasonable for the nesting depth to grow logarithmically, but no faster.
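
As a rough illustration of that calculation, here is a minimal sketch in Python using only the standard library. It is not the code of this program. The names `DepthCounter`, `complexity` and `VOID` are hypothetical. The sketch makes three assumptions that the text above does not settle: the logarithm is the natural logarithm, only text characters contribute to the average depth, and the input is reasonably well-formed HTML.

```python
import math
from html.parser import HTMLParser

# Elements that never have content or an end tag (assumed list).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class DepthCounter(HTMLParser):
    """Sum the element nesting depth over every text character."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open elements
        self.total_depth = 0  # sum of depths over all text characters
        self.chars = 0        # number of text characters seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.total_depth += self.depth * len(data)
        self.chars += len(data)

def complexity(html):
    """Average character depth divided by log(length of the HTML)."""
    counter = DepthCounter()
    counter.feed(html)
    counter.close()
    if counter.chars == 0 or len(html) < 2:
        return 0.0  # avoid dividing by zero on empty or trivial input
    average_depth = counter.total_depth / counter.chars
    # Assumption: natural logarithm; the base is not stated above.
    return average_depth / math.log(len(html))
```

Because the divisor grows with the logarithm of the page length, a page whose average depth stays fixed while its length grows gets a lower score under this sketch.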

How seriously should you take any of this? I am frankly not sure. The main purpose of this program was not to analyze web pages, but to draw attention to the underlying technology. Speaking of which ...

PURPOSE

This program is a demo of a demo. Its purpose is to show how easy it is to write applications which look at the structure of web pages using Marpa::UrHTML. And the purpose of Marpa::UrHTML is to demonstrate the power of its parse engine, Marpa.

Determining the structure of an HTML document has in the past been considered a very difficult programming task, requiring lots of special-case coding. Marpa::UrHTML was written in a few days, and the resulting grammar and code are natural and straightforward.

At this stage of its development, other parsers have advantages over Marpa::UrHTML. But those parsers need to be perfect as they stand: because their code is an excruciatingly complex set of ad hoc solutions to special cases, they are very hard to understand, and therefore to modify.

As the documentation will show, the HTML parsing logic in Marpa::UrHTML is straightforward and an extremely natural way of expressing the problem. The transparency of Marpa::UrHTML is made possible by Marpa.

AUTHOR

Jeffrey Kegler

BUGS

Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Marpa
    

You can also look for information at:

ACKNOWLEDGMENTS

The starting template for this code was HTML::TokeParser, by Gisle Aas.

LICENSE AND COPYRIGHT

Copyright 2007-2009 Jeffrey Kegler, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.