HTML::Inspect - Inspect an HTML document
    my $source    = 'http://example.com/doc';
    my $inspector = HTML::Inspect->new(
        location => $source,
        html_ref => \$html,
    );
    my $classic   = $inspector->collectMetaClassic;
This module extracts information from HTML, using a clean parser (XML::LibXML). The returned structures may need further processing. Please suggest additional extractors.
This module is part of the "Crawl Pipeline". A detailed description of the output of each of the methods below can be found on its web page at https://pipeline.shared-search.eu/extract/
URL normalization is a crucial feature of the output of these methods. You can also use it separately, via the functions in HTML::Inspect::Normalization.
 -Option    --Default
  html_ref    <required>
  location    <required>
A reference to a (possibly troublesome) HTML string. It is passed as a reference to avoid copying large strings.
An absolute URL, as a string or URI instance, which tells where the HTML was found. It is used as the base of relative URLs found in the HTML, unless the HTML contains a <base> element.
The base URI, which is used for relative links in the page. This is the location, unless the HTML contains a <base href> declaration. The base URI is a string representation, in absolute and normalized form.
The URI object that represents the location parameter, which was passed to new() as the default base for relative links.
Collect all <link> relations from the document. The returned HASH maps each relation (the rel attribute, required) to an ARRAY of link elements with that value. The ARRAY elements are HASHes which contain all attributes of the link, all lower-cased. The added href_uri key is a normalized, absolute translation of the href attribute.
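To make the shape of this structure concrete, the sketch below walks a hand-written HASH laid out as described above. The sample data and the names $links and @summary are assumptions made for illustration; they are not output produced by the module.

```perl
use strict;
use warnings;

# Hypothetical data, shaped as described above:
# relation => ARRAY of HASHes with the link attributes plus href_uri.
my $links = {
    stylesheet => [
        { rel => 'stylesheet', href => '/css/main.css',
          href_uri => 'http://example.com/css/main.css' },
    ],
    alternate => [
        { rel => 'alternate', type => 'application/rss+xml', href => 'feed.rss',
          href_uri => 'http://example.com/feed.rss' },
        { rel => 'alternate', hreflang => 'nl', href => '/nl/',
          href_uri => 'http://example.com/nl/' },
    ],
};

# Flatten into "relation: absolute URI" lines, sorted by relation name.
my @summary;
for my $rel (sort keys %$links) {
    for my $link (@{ $links->{$rel} }) {
        push @summary, sprintf('%s: %s', $rel, $link->{href_uri});
    }
}

print "$_\n" for @summary;
```

Reading href_uri rather than href means the caller does not need to resolve relative links again.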
Returns an ARRAY of all kinds of <meta> records, which have a wide variety of fields and may be order dependent.
example:

    [ { http-equiv => 'Content-Type', content => 'text/html; charset=UTF-8' },
      { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
    ]
Returns a HASH reference with all <meta> information of traditional content: the single charset and all http-equiv records, plus the subset of names which are listed on https://www.w3schools.com/tags/tag_meta.asp. People defined far too many names to be useful for everyone.
    { 'http-equiv' => { 'content-type' => 'text/plain' },
      charset      => 'UTF-8',
      name         => { author      => 'John Smith',
                        description => 'The John Smith\'s page.' },
    }
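A minimal sketch of reading such a result follows. The HASH is copied by hand from the example above rather than produced by the module, and the fallbacks assume that any of the three parts may be absent, which the text above does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical result, copied from the example above.
my $classic = {
    'http-equiv' => { 'content-type' => 'text/plain' },
    charset      => 'UTF-8',
    name         => { author      => 'John Smith',
                      description => 'The John Smith\'s page.' },
};

# Assumption: each part may be missing, so fall back to a default.
my $http_equiv = $classic->{'http-equiv'} || {};
my $by_name    = $classic->{name}         || {};

my $charset = defined $classic->{charset} ? $classic->{charset} : 'unknown';
my $type    = defined $http_equiv->{'content-type'}
            ? $http_equiv->{'content-type'} : 'unknown';
my @names   = sort keys %$by_name;

print "charset=$charset type=$type names=@names\n";
```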
Returns a HASH with all <meta> records which have both a name and a content attribute. These are used as key-value pairs for many, many different purposes.
{ author => 'John Smith' , description => 'The John Smith\'s page.'}
The number of references is large (easily a few hundred per HTML page), so you may want to specify a filter. The %filter rules produce a subset of the links found. You can use: http_only (returning only http and https links), mailto_only, maximum_set (returning only the first n links), and matching (returning only links which match a certain regex).
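The effect of these rules can be sketched on a plain list of strings. This is an illustrative re-implementation, not the module's code: filter_links() is a hypothetical helper, and the order in which the rules are applied (maximum_set last) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper which mimics the four filter rules on plain strings.
sub filter_links {
    my ($links, %filter) = @_;
    my @out = @$links;

    @out = grep { m!^https?://!i } @out if $filter{http_only};
    @out = grep { m!^mailto:!i }   @out if $filter{mailto_only};

    if (defined(my $regex = $filter{matching})) {
        @out = grep { $_ =~ $regex } @out;
    }

    # Assumption: the size limit is applied after the other rules.
    if (defined(my $max = $filter{maximum_set})) {
        splice @out, $max if @out > $max;
    }

    return \@out;
}

my @found = (
    'https://example.com/a.html',
    'mailto:info@example.com',
    'http://example.com/b.pdf',
    'ftp://example.com/c.txt',
    'https://example.com/d.html',
);

my $web   = filter_links(\@found, http_only => 1, maximum_set => 2);
my $pages = filter_links(\@found, matching => qr/\.html$/);
my $mail  = filter_links(\@found, mailto_only => 1);

print "web:   @$web\n";
print "pages: @$pages\n";
print "mail:  @$mail\n";
```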
Collects all references from the document. Method collectReferencesFor() is called for a list of known tag/attribute pairs, and the results are returned as a HASH of ARRAYs. The keys of the HASH have the format "$tag_$attribute".
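As a small sketch of that key format: the tag/attribute pairs below are a hypothetical subset, chosen for illustration only, and the empty ARRAYs stand in for the per-pair results.

```perl
use strict;
use warnings;

# Hypothetical subset of tag/attribute pairs.
my @pairs = ( [ a => 'href' ], [ img => 'src' ], [ form => 'action' ] );

my %refs;
for my $pair (@pairs) {
    my ($tag, $attr) = @$pair;

    # The braces matter in real Perl code: "$tag_$attr" would be read
    # as the variable $tag_ followed by $attr.
    $refs{"${tag}_$attr"} = [];    # placeholder for the ARRAY of URIs
}

print join(', ', sort keys %refs), "\n";
```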
Returns an ARRAY of unique normalized URIs, which were found in attribute $attr of tag $tag. For instance, tag image attribute src. The URIs are in their textual order in the document, where only the first encounter is recorded.
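The "first encounter only, in document order" behaviour corresponds to a common Perl idiom, sketched here on hypothetical data. This shows the idea, not necessarily how the module implements it.

```perl
use strict;
use warnings;

# Hypothetical, already normalized URIs, in the order they appear in a page.
my @in_document_order = (
    'http://example.com/logo.png',
    'http://example.com/photo.jpg',
    'http://example.com/logo.png',
    'http://example.com/banner.gif',
    'http://example.com/photo.jpg',
);

# Keep only the first encounter of each URI; grep preserves the order.
my %seen;
my @unique = grep { !$seen{$_}++ } @in_document_order;

print "$_\n" for @unique;
```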
Returns structured OpenGraph information, when available in the HTML.
The logic really understands OpenGraph, and simplifies access to it: facts which may appear multiple times will always be returned as ARRAY.
XML::LibXML, Log::Report
This software is a component of the Crawl Pipeline, https://pipeline.shared-search.eu. Development was made possible with a generous gift by the NLnet Foundation.
Mark Overmeer CPAN ID: MARKOV markov at cpan dot org
Красимир Беров CPAN ID: BEROV berov at cpan dot org https://studio-berov.eu
This is free software, licensed under: The Artistic License 2.0 (GPL Compatible) The full text of the license can be found in the LICENSE file included with this module.
This module is part of HTML-Inspect distribution version 1.00, built on December 08, 2021. Website: http://perl.overmeer.net/CPAN/
Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/