NAME

Web::PageMeta - get page open-graph / meta data

SYNOPSIS

use feature 'say';
use Web::PageMeta;
my $page = Web::PageMeta->new(url => "https://www.apa.at/");
say $page->title;
say $page->image;

Fetch previews and images asynchronously:

use feature 'say';
use Future;
use Web::PageMeta;
my @urls = qw(
    https://www.apa.at/
    http://www.diepresse.at/
    https://metacpan.org/
    https://github.com/
);
my @page_views = map { Web::PageMeta->new( url => $_ ) } @urls;
Future->wait_all( map { $_->fetch_image_data_ft } @page_views )->get;
foreach my $pv (@page_views) {
    say 'title> '.$pv->title;
    say 'img_size> '.length($pv->image_data);
}

# alternatively, instead of Future->wait_all()
use Future::Utils qw( fmap_void );
fmap_void(
    sub { return $_[0]->fetch_image_data_ft },
    foreach    => [@page_views],
    concurrent => 3
)->get;

DESCRIPTION

Gets open-graph (and other) web page meta data. Can be used in both synchronous and asynchronous code.

If a download returns any HTTP status code other than 200, an HTTP::Exception is thrown.
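A minimal sketch of handling such a failure follows. The helper name describe_fetch_error is hypothetical and not part of Web::PageMeta; it only assumes that the thrown object offers a code() method, as HTTP::Exception objects do. The call into Web::PageMeta is left as a comment because it needs network access.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of Web::PageMeta): turn whatever an
# accessor died with into a short, printable message.
sub describe_fetch_error {
    my ($err) = @_;

    # Exception objects that expose code(), such as HTTP::Exception ones.
    if ( blessed($err) && $err->can('code') ) {
        return 'HTTP status ' . $err->code;
    }

    # Anything else, e.g. a plain die() string from a timeout check.
    return "other error: $err";
}

# Possible use with the module:
#   my $page = Web::PageMeta->new( url => 'https://www.apa.at/' );
#   my $title;
#   my $ok = eval { $title = $page->title; 1 };
#   warn describe_fetch_error($@) unless $ok;
```

Wrapping the accessor in eval and returning 1 at the end keeps a legitimately undefined title apart from a failed fetch.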

ACCESSORS

new

Constructor, only "url" is required.

url

HTTP URL to fetch data from.

timeout

In addition to the AnyEvent::HTTP timeout, the elapsed time is also checked while data is being downloaded, and the request dies once it exceeds the limit. Default is 5 minutes.

max_size

Dies when the document or image size is greater than this limit. Default is 100MB.

user_agent

User-Agent header to use for HTTP requests. Default is the one from Chrome 89.0.4389.90.

extra_headers

HashRef with extra HTTP request headers.

Accepts an optional HTTP::Cookies compatible object that must provide a get_cookies() method. If set, HTTP cookie headers are sent with each request.
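The constructor options described so far could be collected as in the sketch below. The helper page_meta_args and all values in it are made up for illustration. The units are assumptions as well: "timeout" is taken to be in seconds and "max_size" in bytes, which is not stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the constructor arguments in one place.
# Units are assumptions (timeout in seconds, max_size in bytes).
sub page_meta_args {
    my ($url) = @_;
    return (
        url           => $url,
        timeout       => 30,                # assumed seconds; default is 5 minutes
        max_size      => 5 * 1024 * 1024,   # assumed bytes; default is 100MB
        user_agent    => 'ExampleBot/1.0',  # made-up value
        extra_headers => { 'Accept-Language' => 'en' },
    );
}

# Possible use with the module:
#   my $page = Web::PageMeta->new( page_meta_args('https://www.apa.at/') );
```

Only "url" is required, so any of the other keys can be dropped to fall back to the defaults.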

title

Returns title of the page.

description

Returns description of the page.

canonical_url

Returns the open-graph URL. If not present, returns "url".

image

Returns image location of the page.

image_data

Returns image binary data of "image" link.

Throws a 404 exception if there is no "image" link.

page_meta

Returns hash ref with all open-graph data.

extra_scraper

Web::Scraper::LibXML object used to fetch the image, title or description from a location other than the default.

use Web::Scraper::LibXML;
use Web::PageMeta;
my $escraper = scraper {
    process_first '.slider .camera_wrap div', 'image' => '@data-src';
};
my $wmeta = Web::PageMeta->new(
    url => 'https://www.meon.eu/',
    extra_scraper => $escraper,
);

page_body_hdr

Returns an array ref with the page [$body, $headers]. Can be useful for post-processing or special/additional data extraction.

Only text/html content-type is accepted for fetching.
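As one illustration of such post-processing, the sketch below reads the character set out of a [$body, $headers] pair. The helper charset_from_body_hdr is hypothetical. It assumes that $headers is a hash ref with lower-cased header names, the way AnyEvent::HTTP delivers them; that shape is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the [$body, $headers] array ref.
# Assumes $headers is a hash ref keyed by lower-cased header names.
sub charset_from_body_hdr {
    my ($body_hdr) = @_;
    my ( $body, $headers ) = @{$body_hdr};

    # Prefer the charset parameter of the Content-Type header.
    my $content_type = $headers->{'content-type'} // '';
    return lc $1 if $content_type =~ /charset\s*=\s*"?([\w.:-]+)"?/i;

    # Fall back to a <meta charset="..."> declaration in the HTML.
    return lc $1 if ( $body // '' ) =~ /<meta\s+charset\s*=\s*["']?([\w.:-]+)/i;

    return;
}

# Possible use with the module:
#   my $charset = charset_from_body_hdr( $page->page_body_hdr );
```

The regular expression on the body is only a rough fallback; a real HTML parser would be more robust.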

fetch_page_meta_ft

Returns a future object for fetching page meta data. See "ASYNC USE". On done, the "page_meta" hash is returned.

fetch_image_data_ft

Returns a future object for fetching image data. See "ASYNC USE". On done, the "image_data" scalar is returned.

fetch_page_body_hdr_ft

Returns a future object for fetching page content and headers. See "ASYNC USE". On done, the "page_body_hdr" array ref is returned.

ASYNC USE

To run multiple page meta data or image HTTP requests in parallel, or to use the module in async programs, call "fetch_page_meta_ft" and "fetch_image_data_ft", which return Future objects. See "SYNOPSIS" or t/02_async.t for sample use.

SEE ALSO

https://ogp.me/

AUTHOR

Jozef Kutej, <jkutej at cpan.org>

LICENSE AND COPYRIGHT

Copyright 2021 jkutej@cpan.org

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.