urhtml_fmt - Reformat HTML, indented according to structure
urhtml_fmt
urhtml_fmt [uri|file]
urhtml_fmt http://perl.org
Given the URI or the name of a file, writes it to STDOUT reformatted and indented according to the HTML structure. Missing start and end tags are supplied and comments added to indicate this. Text inside <pre> elements is not altered.
urhtml_fmt tries to parse everything that is actually out there on the Web. In fact, urhtml_fmt will assume any file fed to it was intended as HTML, and will produce its best guess of the author's intent.
urhtml_fmt supplies missing start and end tags. urhtml_fmt's parser is extremely liberal in what it accepts. When its liberalization of the standards is not sufficient to make a document into valid HTML, urhtml_fmt will pick characters to treat as noise or "cruft". The parser ignores cruft in determining the structure of the document.
When urhtml_fmt adds a missing start tag, it precedes the new start tag with a comment. When urhtml_fmt adds a missing end tag, it follows the new end tag with a comment. When urhtml_fmt classifies characters as "cruft", it adds a comment to that effect before the "cruft".
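The comment placement rules above can be sketched in a few lines of Python. This is an illustration only, not urhtml_fmt's actual code (which is Perl and built on Marpa::UrHTML). The token tuples and the `annotate` function are hypothetical names invented for this sketch. The comment strings are the ones that appear in the sample output in this document.

```python
# Illustrative sketch only: token format and function name are assumptions,
# not part of urhtml_fmt. Comment texts are copied from the sample output.
FOLLOWING = "<!-- Following start tag is replacement for a missing one -->"
PRECEDING = "<!-- Preceding end tag is replacement for a missing one -->"
CRUFT = "<!-- Next line is cruft -->"


def annotate(tokens):
    """Return output lines for (kind, text, supplied) tokens.

    kind is one of "start", "end", "text", "cruft"; supplied is True
    when the tag was missing from the input and had to be added.
    """
    lines = []
    for kind, text, supplied in tokens:
        if kind == "cruft":
            # The cruft comment goes before the cruft itself.
            lines.extend([CRUFT, text])
        elif kind == "start" and supplied:
            # A supplied start tag is preceded by its comment.
            lines.extend([FOLLOWING, text])
        elif kind == "end" and supplied:
            # A supplied end tag is followed by its comment.
            lines.extend([text, PRECEDING])
        else:
            lines.append(text)
    return lines


head_tokens = [
    ("start", "<head>", True),
    ("start", "<title>", False),
    ("text", "Test page", False),
    ("end", "</title>", True),
    ("end", "</head>", True),
]
print("\n".join(annotate(head_tokens)))
```

The sketch ignores indentation and the special handling of pre elements. It only shows where each kind of comment lands relative to the tag or cruft it describes.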
pre elements receive special treatment. The contents of pre elements are not reformatted. When missing tags or cruft occur inside a pre element, the comments to that effect are placed before the <pre> start tag.
The argument to urhtml_fmt can be either a URI or a file name. If it starts with alphanumerics followed by a colon, it is treated as a URI. Otherwise it is treated as a file name.
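The rule for telling a URI from a file name can be sketched as a single regular expression test. The Python below is an illustration under stated assumptions, not the script's own code. The function name `looks_like_uri` is hypothetical, and "alphanumerics" is assumed to mean one or more ASCII letters or digits.

```python
import re

# Assumption: "alphanumerics followed by a colon" means one or more ASCII
# letters or digits at the very start of the argument, then ":".
_URI_PREFIX = re.compile(r"[A-Za-z0-9]+:")


def looks_like_uri(argument: str) -> bool:
    """Return True if the argument would be treated as a URI under this rule."""
    # re.match anchors at the start of the string, as the rule requires.
    return _URI_PREFIX.match(argument) is not None


for arg in ["http://perl.org", "index.html", "./a:b.html"]:
    kind = "URI" if looks_like_uri(arg) else "file"
    print(f"{arg}: {kind}")
```

Under the rule as stated, a file name that itself begins with alphanumerics and a colon would be classified as a URI. In this sketch, a leading `./` keeps such a name on the file branch.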
Given this input:
<title>Test page<tr>x<head attr="I am cruft"><p>Final graf
urhtml_fmt returns
<!-- Following start tag is replacement for a missing one -->
<html>
  <!-- Following start tag is replacement for a missing one -->
  <head>
    <title>
      Test page
    </title>
    <!-- Preceding end tag is replacement for a missing one -->
  </head>
  <!-- Preceding end tag is replacement for a missing one -->
  <!-- Following start tag is replacement for a missing one -->
  <body>
    <!-- Following start tag is replacement for a missing one -->
    <table>
      <!-- Following start tag is replacement for a missing one -->
      <tbody>
        <tr>
          <!-- Following start tag is replacement for a missing one -->
          <td>
            x
            <!-- Next line is cruft -->
            <head attr="I am cruft">
            <p>
              Final graf
            </p>
            <!-- Preceding end tag is replacement for a missing one -->
          </td>
          <!-- Preceding end tag is replacement for a missing one -->
        </tr>
        <!-- Preceding end tag is replacement for a missing one -->
      </tbody>
      <!-- Preceding end tag is replacement for a missing one -->
    </table>
    <!-- Preceding end tag is replacement for a missing one -->
  </body>
  <!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->
This program is a demo of a demo. Its purpose is to show how easy it is to write applications which look at the structure of web pages using Marpa::UrHTML. And the purpose of Marpa::UrHTML is to demonstrate the power of its parse engine, Marpa. Marpa::UrHTML was written in a few days, and its logic is a straightforward, natural expression of the structure of HTML.
The starting template for this code was HTML::TokeParser, by Gisle Aas. See also the acknowledgments for Marpa as a whole.
Copyright 2007-2009 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.