NAME

HTML::Detergent - Clean the gunk off an HTML document

VERSION

Version 0.06

SYNOPSIS

use HTML::Detergent;

my $scrubber = HTML::Detergent->new($config);

# $input can be a string, GLOB reference, or XML::LibXML::Document

my $doc = $scrubber->process($input, $uri);

DESCRIPTION

HTML::Detergent is for isolating the main content of an HTML page, stripping it of navigation, visual design, and other ancillary content.

The main purpose of this module is to aid in the migration of web content from one content management system to another. It is also useful for preparing HTML resources for automated content inventories.

The module currently has no heuristics for determining the main content of a page. It works instead by assuming prior knowledge of the layout, given in the configuration by an XPath expression that uniquely isolates the container node. That node is then lifted into a new document, along with the contents of the <head>, and returned by the "process" method. To accommodate multiple layouts on a site, the module can be initialized to match multiple XPath expressions. If further processing is necessary, an expression can be associated with an XSLT stylesheet, which is assumed to produce an entire document, thus overriding the default behaviour.
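Because the configuration depends on an expression that selects exactly one container node, it can help to try a candidate expression against a sample page first. The following is a minimal sketch that uses XML::LibXML directly, not HTML::Detergent itself; the sample markup and the expression are hypothetical. The module's own parser may treat namespaces differently from load_html, so an expression that works here may still need adjusting.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical page: a navigation block plus the main content container.
my $html = <<'HTML';
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <div id="content"><p>Hello, world.</p></div>
  </body>
</html>
HTML

# Candidate expression for the "match" parameter (an assumption about the layout).
my $xpath = '//div[@id="content"]';

# load_html is XML::LibXML's lenient HTML parser; recover => 1 tolerates tag soup.
my $dom  = XML::LibXML->load_html(string => $html, recover => 1);
my @hits = $dom->findnodes($xpath);

# A usable expression should select exactly one container node.
printf "%s matched %d node(s)\n", $xpath, scalar @hits;
print $hits[0]->toString, "\n" if @hits == 1;
```

An expression that matches zero nodes or more than one is a sign that it needs tightening before it goes into the configuration.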

After the new document is generated and before it is returned by "process", it is possible to inject <link> and <meta> elements into the <head>. This enables the inclusion of metadata and the re-association of the main content with links that represent aspects of the page which have been removed (e.g. navigation, copyright statement). In addition, if the page's URI is supplied to the "process" method, the <base> element is either added or rewritten to reflect it, and the URI attributes in the body are rewritten relative to the base. Otherwise they are left alone.
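An end-to-end run might look like the sketch below. It uses only the constructor parameters and the "process" signature documented here; the sample markup, the URI, and the link and meta values are hypothetical. The expression uses local-name() so that it matches whether or not the parsed elements carry the XHTML namespace, which is an assumption made for safety rather than a documented requirement.

```perl
use strict;
use warnings;
use HTML::Detergent;

# Hypothetical page with navigation to discard and a content container to keep.
my $html = <<'HTML';
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <div id="content"><p>See the <a href="/about">about page</a>.</p></div>
  </body>
</html>
HTML

my $scrubber = HTML::Detergent->new(
    # local-name() sidesteps the question of whether elements are namespaced.
    match => [ '//*[local-name()="div"][@id="content"]' ],
    # Hypothetical values, injected into the <head> of the result.
    link  => { contents => '/toc', copyright => '/legal' },
    meta  => { generator => 'example-migration' },
);

# Supplying the page's URI lets the module set <base> and rewrite URI attributes.
my $doc = $scrubber->process($html, 'http://example.com/articles/1');
print $doc->toString(1);
```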

The document returned is an XML::LibXML::Document object using the XHTML namespace, http://www.w3.org/1999/xhtml, but does not profess to validate against any particular schema. If DTD declarations (including the empty <!DOCTYPE html> recommended in HTML5) are desired, they can be added on afterward. Likewise, the object can be converted from XML into HTML using "toStringHTML" in XML::LibXML::Document.
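The post-processing described above can be sketched with XML::LibXML alone. The small XHTML tree below is a hypothetical stand-in for the document that "process" returns; createInternalSubset and toStringHTML are XML::LibXML::Document methods.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for the document returned by "process": a tiny XHTML tree.
my $xhtml = <<'XML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page</title></head>
  <body><p>Hello, world.</p></body>
</html>
XML
my $doc = XML::LibXML->load_xml(string => $xhtml);

# Add the empty HTML5 doctype: root name "html", no public or system identifier.
$doc->createInternalSubset('html', undef, undef);

# Serialize either as XML (XHTML) or as HTML.
my $as_xml  = $doc->toString;
my $as_html = $doc->toStringHTML;
print $as_html;
```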

METHODS

new %CONFIG | \%CONFIG | $CONFIG

Initialize the processor with a list of configuration parameters, a HASH reference thereof, or an HTML::Detergent::Config object. The valid parameters are:

match

This is an ARRAY reference of XPath expressions to try against the document, in order of preference. Entries may optionally be two-element ARRAY references themselves, in which case the second element is a URL where an XSLT stylesheet may be found.

match => [ '/some/xpath/expression',
           [ '/other/expr', '/url/of/transform.xsl' ],
         ],

link

This is a HASH reference where the keys correspond to rel attributes and the values to href attributes of <link> elements. If the values are ARRAY references, they will be processed in document order. rel attributes will be sorted lexically. If a callback is supplied instead, it is expected to return a result of the same form.

link => { rel1 => 'href1', rel2 => [ qw(href2 href3) ] },

# or

link => \&_link_cb,
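A callback for this parameter might be sketched as follows. The arguments the module passes to the callback are not documented here, so this hypothetical version ignores them and simply returns a HASH reference in the documented shape. The loop afterwards only illustrates the documented ordering (rel values sorted lexically, ARRAY values in the order given); it is not the module's code.

```perl
use strict;
use warnings;

# Hypothetical callback: ignores its arguments and returns the same shape
# as a static "link" configuration (rel => href, or rel => [ hrefs ]).
sub _link_cb {
    return {
        contents  => '/toc',
        copyright => '/legal/copyright',
        alternate => [ '/feed.atom', '/print' ],
    };
}

# Walk the result the way the text describes: rel sorted, hrefs in order.
my $links = _link_cb();
my @pairs;
for my $rel (sort keys %$links) {
    my $href = $links->{$rel};
    push @pairs, map { [ $rel, $_ ] } (ref $href eq 'ARRAY' ? @$href : $href);
}
print "$_->[0] -> $_->[1]\n" for @pairs;
```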

meta

This is a HASH reference where the keys correspond to name attributes and the values to content attributes of <meta> elements. If the values are ARRAY references, they will be processed in document order. name attributes will be sorted lexically. If a callback is supplied instead, it is expected to return a result of the same form.

meta => { name1 => 'content1',
          name2 => [ qw(content2 content3) ] },

# or

meta => \&_meta_cb,

callback

These callbacks will be passed into the internal XML::LibXSLT processor. See XML::LibXML::InputCallback for details.

callback => [ \&_match_cb, \&_open_cb, \&_read_cb, \&_close_cb ],

# or

callback => $icb, # isa XML::LibXML::InputCallback
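The four callbacks follow the match/open/read/close convention of XML::LibXML::InputCallback. As a minimal sketch, the set below serves a stylesheet from memory instead of the network; the %store hash and the mem: URI are hypothetical, and only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical in-memory store mapping URIs to stylesheet source.
my %store = (
    'mem:/transform.xsl' => '<xsl:stylesheet version="1.0" '
        . 'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>',
);

# match: return true if this set of callbacks handles the URI.
sub _match_cb { my ($uri) = @_; return exists $store{$uri} ? 1 : 0 }

# open: return a handle that is later passed to the read and close callbacks.
sub _open_cb {
    my ($uri) = @_;
    open my $fh, '<', \$store{$uri} or die "cannot open $uri: $!";
    return $fh;
}

# read: return up to $length bytes; an empty string signals end of input.
sub _read_cb {
    my ($fh, $length) = @_;
    my $buffer = '';
    my $got = read($fh, $buffer, $length);
    return $got ? $buffer : '';
}

# close: release the handle.
sub _close_cb { my ($fh) = @_; close $fh; return 1 }

# The ARRAY reference form accepted by the "callback" parameter.
my %config = (
    callback => [ \&_match_cb, \&_open_cb, \&_read_cb, \&_close_cb ],
);
```

With callbacks like these, a match entry such as [ '/other/expr', 'mem:/transform.xsl' ] could in principle resolve its stylesheet without touching the filesystem, assuming the module passes the URL through the callbacks unchanged.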

process $INPUT [, $URI, $CONFIG ]

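Processes $INPUT, which may be a string, GLOB reference, or XML::LibXML::Document object. Returns an XML::LibXML::Document object with the changes described in "DESCRIPTION". If $URI is supplied, it is used to set the <base> element and to rewrite URI attributes in the body, as described there.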

AUTHOR

Dorian Taylor, <dorian at cpan.org>

BUGS

Please report any bugs or feature requests to bug-html-detergent at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Detergent. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc HTML::Detergent

You can also look for information at:

RT: CPAN's request tracker (report bugs here)

http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Detergent

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/HTML-Detergent

CPAN Ratings

http://cpanratings.perl.org/d/HTML-Detergent

Search CPAN

http://search.cpan.org/dist/HTML-Detergent/

LICENSE AND COPYRIGHT

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
