urhtml_stats - Show complexity metric and other stats for web page
Given a URI, parses it as HTML and prints a complexity metric and other statistics. The complexity metric is the average depth (or nesting level), in elements, of a character, divided by the logarithm of the length of the HTML.
Other statistics follow, formatted as an HTML table. There is a row for each element type, with the following columns:
- The maximum nesting depth of that element type. This counts only nesting within elements of that same type.
- A count of the elements of that type in the document.
- The total number of characters in elements of that type. This counts characters in nested elements multiple times. For example, if a page contains a table within a table, characters in the inner table will be counted twice.
- The average size of elements of that type, in characters.
Here is the first part of the output for http://perl.org:

    Complexity = 0.746

    Element  Maximum  Number of  Size in     Average
             Nesting  Elements   Characters  Size
    a        1        56         3634        64
    body     1        1          12171       12171
    div      5        30         33605       1120
    em       1        1          13          13
    h1       1        1          60          60
    h4       1        11         932         84
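The per-element columns described above can be sketched in a few lines. This is an illustration only: it is written in Python with the standard library's `html.parser`, not with Marpa::UrHTML, and it does not reproduce the program's actual code. The names `ElementStats` and `element_stats` are hypothetical. Two assumptions are made: "size" counts text characters only (the real program may also count markup), and the average is truncated to an integer, which is consistent with the sample rows (3634 / 56 is shown as 64).

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have an end tag (assumption: the HTML5 void element list).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ElementStats(HTMLParser):
    """Hypothetical helper: collects per-element statistics."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []                      # names of currently open elements
        self.count = defaultdict(int)        # number of elements per type
        self.max_nesting = defaultdict(int)  # deepest self-nesting per type
        self.size = defaultdict(int)         # text characters per type

    def handle_starttag(self, tag, attrs):
        self.count[tag] += 1
        if tag in VOID:
            self.max_nesting[tag] = max(self.max_nesting[tag], 1)
            return
        self.stack.append(tag)
        # Nesting is measured only against open elements of the same type.
        self.max_nesting[tag] = max(self.max_nesting[tag],
                                    self.stack.count(tag))

    def handle_endtag(self, tag):
        # Close the nearest matching element, dropping anything left open inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Every open element contains this text, so nested elements of the
        # same type count the same characters more than once.
        for tag in self.stack:
            self.size[tag] += len(data)


def element_stats(html):
    """Return {tag: (max nesting, count, total size, average size)}."""
    parser = ElementStats()
    parser.feed(html)
    parser.close()
    return {
        tag: (parser.max_nesting[tag], parser.count[tag],
              parser.size[tag], parser.size[tag] // parser.count[tag])
        for tag in sorted(parser.count)
    }
```

With a table nested inside a table, the inner table's text is added once for each enclosing table, which is the double counting the column description mentions. Unlike a real HTML parser, this sketch does not infer omitted end tags, so its numbers will differ on pages that rely on them.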
I originally was tempted to call the complexity metric a "quality metric", but decided that was going too far. Well-designed websites often have low numbers, but high numbers don't mean low quality -- it depends on what the mission is, and how well complexity is being leveraged to serve that mission.
To obtain the complexity metric, the nesting depth of the average character is divided by the logarithm of the length of the HTML. The idea is that as a web page grows, all else being equal, it is reasonable for the nesting depth to grow logarithmically, but no faster.
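As a rough illustration of that calculation, here is a minimal sketch in Python using only the standard library. It is not the program's own code and does not use Marpa::UrHTML. The names `DepthCounter` and `complexity` are hypothetical, and several details are assumptions the text does not settle: only text characters are weighted by depth, the length of the HTML is the length of the whole source string, and the logarithm is the natural logarithm.

```python
import math
from html.parser import HTMLParser

# Elements that never have an end tag (assumption: the HTML5 void element list).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class DepthCounter(HTMLParser):
    """Hypothetical helper: sums the nesting depth of every text character."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0     # number of currently open elements
        self.chars = 0     # text characters seen
        self.weighted = 0  # sum over characters of their nesting depth

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.chars += len(data)
        self.weighted += self.depth * len(data)


def complexity(html):
    """Average character depth divided by log(length of the HTML)."""
    counter = DepthCounter()
    counter.feed(html)
    counter.close()
    if counter.chars == 0 or len(html) < 2:
        return 0.0  # avoid dividing by zero on empty or trivial input
    average_depth = counter.weighted / counter.chars
    return average_depth / math.log(len(html))
```

For `<html><body><p>abcd</p></body></html>`, every text character sits at depth 3, so the result is 3 divided by the logarithm of the source length. Because the sketch trusts the tags as written, pages with omitted or mismatched end tags will give different numbers than a parser that repairs them.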
How seriously should you take any of this? I am frankly not sure. The main purpose of this program was not to analyze web pages, but to draw attention to the underlying technology. Speaking of which ...
This program is a demo of a demo. Its purpose is to show how easy it is to write applications which look at the structure of web pages using Marpa::UrHTML. And the purpose of Marpa::UrHTML is to demonstrate the power of its parse engine, Marpa.
Determining the structure of an HTML document has in the past been considered a very difficult programming task, requiring lots of special case coding. Marpa::UrHTML was written in a few days, and the resulting grammar and code are natural and straightforward.
At this stage of its development, other parsers have advantages over Marpa::UrHTML. But they need to be perfect. Because the code in them is an excruciatingly complex set of ad hoc solutions to special cases, other parsers are very hard to understand, and therefore to modify.
As the documentation will show, the HTML parsing logic in Marpa::UrHTML is straightforward and an extremely natural way of expressing the problem. The transparency of Marpa::UrHTML is made possible by Marpa.
Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
You can also look for information at:
AnnoCPAN: Annotated CPAN documentation
RT: CPAN's request tracker
The starting template for this code was HTML::TokeParser, by Gisle Aas.
Copyright 2007-2009 Jeffrey Kegler, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.