NAME

untemplate - analyze several HTML documents based on the same template

VERSION

version 0.005

SYNOPSIS

    untemplate [options] HTML1 HTML2 [HTML3] [...]

DESCRIPTION

Takes multiple HTML documents generated from the same template and attempts to extract only the data that was inserted into the original template.

Accepts URLs as well as local files if AnyEvent::Net::Curl::Queued is installed.

OPTIONS

--help

Show this help text.

--[no]color

Enable syntax highlighting for XPath. By default, it is enabled automatically on interactive terminals.

--[no]strict

Strict mode disables grouping by id, class or name attributes. The grouping is enabled by default.

--unmangle=regex

Specify one or more regexes to unmangle id/class attributes. Some CMSs (WordPress, for example) insert unique identifiers into HTML elements, like:

    <body class="post-id-12345">

This tends to break HTML tree analysis. To fix the above case, use --unmangle 'post-id-\d+'. Multiple unmanglers are accepted (--unmangle a --unmangle b).
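To illustrate why such identifiers matter, here is a minimal shell sketch. It assumes that unmangling amounts to discarding the matched text before comparison; how untemplate handles the match internally is not documented here, so treat this as an illustration only. The unmangle helper and the second post id (67890) are hypothetical, and [0-9]+ stands in for \d+ because sed -E does not support \d.

```shell
# Hypothetical helper: strip the per-document token from each input line.
# Assumption: removing the match is a fair model of "unmangling".
unmangle() {
    sed -E 's/post-id-[0-9]+//g'
}

# The same template element as rendered for two different posts.
a=$(printf '%s\n' '<body class="post-id-12345">' | unmangle)
b=$(printf '%s\n' '<body class="post-id-67890">' | unmangle)

# With the token removed, the two elements compare equal again.
if [ "$a" = "$b" ]; then
    echo "elements match after unmangling"
fi
```

Without the substitution the two class attributes differ, which is the kind of mismatch that makes otherwise identical template elements look unrelated.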

EXAMPLES

    untemplate --color http://bash.org/?1839 http://bash.org/?2486 | less -R

CAVEATS

If the HTML documents are not based on the same template, the results will be empty.

Unfortunately, using any kind of document identifier as part of an element's class/id (common practice in WordPress themes) is enough for documents to count as "not the same template". The --unmangle option can work around this.

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.