urhtml_fmt - Reformat HTML, indented according to structure
urhtml_fmt
urhtml_fmt [uri|file]
urhtml_fmt http://perl.org
Given the URI or the name of a file, writes it to STDOUT reformatted and indented according to the HTML structure. Missing start and end tags are supplied and comments added to indicate this. Text inside <pre> elements is not altered.
urhtml_fmt tries to parse everything that is actually out there on the Web. In fact, urhtml_fmt will assume any file fed to it was intended as HTML, and will produce its best guess of the author's intent.
urhtml_fmt supplies missing start and end tags. urhtml_fmt's parser is extremely liberal in what it accepts. When its liberalization of the standards is not sufficient to make a document into valid HTML, urhtml_fmt will pick characters to treat as noise or "cruft". The parser ignores cruft in determining the structure of the document.
When urhtml_fmt adds a missing start tag, it precedes the new start tag with a comment. When urhtml_fmt adds a missing end tag, it follows the new end tag with a comment. When urhtml_fmt classifies characters as "cruft", it adds a comment to that effect before the "cruft".
<pre> elements receive special treatment: their contents are not reformatted. When missing tags or cruft occur inside a <pre> element, the comments to that effect are placed before the <pre> start tag.
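The comment placement rules above can be illustrated with a minimal Python sketch. It is not urhtml_fmt's implementation: the function name `annotate` and the `kind` labels are hypothetical, and the special handling of <pre> elements is not modelled. Only the comment texts are taken from the example output in this document.

```python
# Comment texts copied from the example output shown in this document.
START_NOTE = "<!-- Following start tag is replacement for a missing one -->"
END_NOTE = "<!-- Preceding end tag is replacement for a missing one -->"
CRUFT_NOTE = "<!-- Next line is cruft -->"


def annotate(kind, text):
    """Return the output lines for one item, with its comment placed per the rules.

    `kind` is a hypothetical label: "missing_start", "missing_end" or "cruft".
    Any other value means the text needs no comment.
    """
    if kind == "missing_start":
        return [START_NOTE, text]  # comment precedes the supplied start tag
    if kind == "missing_end":
        return [text, END_NOTE]    # comment follows the supplied end tag
    if kind == "cruft":
        return [CRUFT_NOTE, text]  # comment precedes the cruft
    return [text]
```

For instance, `annotate("missing_end", "</td>")` yields the end tag followed by its comment, which is the order seen in the example output.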
The argument to urhtml_fmt can be either a URI or a file name. If it starts with alphanumerics followed by a colon, it is treated as a URI. Otherwise it is treated as a file name.
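The rule above can be sketched in a few lines of Python. This is an illustration, not the tool's own code: the name `classify_argument` is hypothetical, and "alphanumerics" is assumed to mean one or more ASCII letters or digits.

```python
import re

# Assumption: "alphanumerics followed by a colon" means one or more
# ASCII letters or digits at the start of the argument, then ":".
URI_PREFIX = re.compile(r"^[A-Za-z0-9]+:")


def classify_argument(arg):
    """Return "uri" if the argument looks like a URI, otherwise "file"."""
    return "uri" if URI_PREFIX.match(arg) else "file"
```

Under this reading, `http://perl.org` is classified as a URI and `index.html` as a file name. A path that begins with a drive letter and a colon would also match the URI pattern, so the rule as stated may not suit such paths.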
Given this input:
<title>Test page<tr>x<head attr="I am cruft"><p>Final graf
urhtml_fmt returns
<!-- Following start tag is replacement for a missing one -->
<html>
  <!-- Following start tag is replacement for a missing one -->
  <head>
    <title>
      Test page
    </title>
    <!-- Preceding end tag is replacement for a missing one -->
  </head>
  <!-- Preceding end tag is replacement for a missing one -->
  <!-- Following start tag is replacement for a missing one -->
  <body>
    <!-- Following start tag is replacement for a missing one -->
    <table>
      <!-- Following start tag is replacement for a missing one -->
      <tbody>
        <tr>
          <!-- Following start tag is replacement for a missing one -->
          <td>
            x
            <!-- Next line is cruft -->
            <head attr="I am cruft">
            <p>
              Final graf
            </p>
            <!-- Preceding end tag is replacement for a missing one -->
          </td>
          <!-- Preceding end tag is replacement for a missing one -->
        </tr>
        <!-- Preceding end tag is replacement for a missing one -->
      </tbody>
      <!-- Preceding end tag is replacement for a missing one -->
    </table>
    <!-- Preceding end tag is replacement for a missing one -->
  </body>
  <!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->
This program is a demo of a demo. Its purpose is to show how easy it is to write applications that look at the structure of web pages using Marpa::UrHTML. The purpose of Marpa::UrHTML, in turn, is to demonstrate the power of its parse engine, Marpa. Marpa::UrHTML was written in a few days, and its logic is a straightforward, natural expression of the structure of HTML.
The starting template for this code was HTML::TokeParser, by Gisle Aas. See also the acknowledgments for Marpa as a whole.
Copyright 2007-2009 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.