NAME

WWW::Flatten - Flatten web pages deeply and make them portable

SYNOPSIS

    use strict;
    use warnings;
    use utf8;
    use 5.010;
    use WWW::Flatten;
    
    my $basedir = './github/';
    mkdir($basedir);
    
    my $bot = WWW::Flatten->new(
        basedir => $basedir,
        max_conn => 1,
        max_conn_per_host => 1,
        depth => 3,
        filenames => {
            'https://github.com' => 'index.html',
        },
        is_target => sub {
            my $uri = shift->url;
            
            if ($uri =~ qr{\.(css|png|gif|jpeg|jpg|pdf|js|json)$}i) {
                return 1;
            }
            
            if ($uri->host eq 'assets-cdn.github.com') {
                return 1;
            }
            
            return 0;
        },
        normalize => sub {
            my $uri = shift;
            ...
            return $uri;
        }
    );
    
    $bot->crawl;

DESCRIPTION

WWW::Flatten is a web crawling tool that freezes pages into standalone copies.

This software is considered to be alpha quality and isn't recommended for regular usage.

ATTRIBUTES

depth

Depth limitation. Defaults to 10.

    $bot->depth(10);

filenames

URL-to-filename mapping table. The table grows automatically during crawling, but you can pre-define some entries beforehand.

    $bot->filenames({
        'http://example.com/index.html' => 'index.html',
        'http://example.com/index2.html' => 'index2.html',
    });

basedir

A directory path for output files.

    $bot->basedir('./out');

is_target

Set the condition that indicates whether a job is a flatten target or not.

    $bot->is_target(sub {
        my $job = shift;
        ...
        return 1; # or 0
    });

normalize

A code reference that performs normalization of URLs. The callback takes a Mojo::URL instance.

    $bot->normalize(sub {
        my $url = shift;
        my $modified = ...;
        return $modified;
    });
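
As one possible illustration of such a callback, the sketch below drops the fragment and lower-cases the host. These two rules are assumptions chosen for the example, not the module's default behaviour, and the variable name $normalize is hypothetical.

```perl
use strict;
use warnings;
use Mojo::URL;

# Illustrative normalizer (an assumption, not the module's default).
# It works on a clone so the caller's object is left untouched.
my $normalize = sub {
    my $url = shift->clone;

    # A fragment such as "#top" never changes the fetched resource
    $url->fragment(undef);

    # Host names are case-insensitive, so fold them to lower case
    $url->host(lc $url->host) if defined $url->host;

    return $url;
};

# Mojo::URL objects stringify, so the result can be printed directly
print $normalize->(Mojo::URL->new('http://Example.COM/a?x=1#top')), "\n";
```

A callback like this could be passed as normalize => $normalize to the constructor, or set later with $bot->normalize($normalize).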

asset_name

A code reference that generates asset names. Defaults to a preset generator, asset_number_generator, which generates 6-digit numbers. Another option, asset_hash_generator, generates 6-character hashes.

    $bot->asset_name(WWW::Flatten::asset_hash_generator(6));

max_retry

Maximum number of retry attempts in case the server is unresponsive. Defaults to 3.

METHODS

asset_number_generator

Returns a closure that generates numeric file names and keeps its own internal storage. See also the asset_name attribute.

    $bot->asset_name(WWW::Flatten::asset_number_generator(3));
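
To show what a closure with self-contained storage can look like, here is a minimal sketch. It is not the module's implementation: the function name number_generator, the starting value, and the choice to key on a URL string are all assumptions made for the example, since the argument passed to the real callback is not described here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for asset_number_generator: returns a closure that
# hands out zero-padded sequential names and remembers earlier assignments.
sub number_generator {
    my $digits = shift // 6;
    my %seen;        # key => assigned name, private to this closure
    my $next = 1;    # starting value is an assumption

    return sub {
        my $key = shift;
        # Allocate a new number only the first time a key is seen
        return $seen{$key} //= sprintf('%0*d', $digits, $next++);
    };
}

my $name_for = number_generator(3);
print $name_for->('http://example.com/a.css'), "\n";    # 001
print $name_for->('http://example.com/b.png'), "\n";    # 002
print $name_for->('http://example.com/a.css'), "\n";    # 001 again
```

Because %seen and $next live inside the closure, each generator keeps its own independent numbering.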

asset_hash_generator

Returns a closure that generates hash-based file names and keeps its own internal storage. See also the asset_name attribute. The function automatically avoids name collisions by extending the given length.

If you want the names as short as possible, use the following setting.

    $bot->asset_name(WWW::Flatten::asset_hash_generator(1));
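
The collision-avoidance idea can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the function name hash_generator, the use of MD5, and keying on a URL string are all choices made for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for asset_hash_generator: names are prefixes of an
# MD5 hex digest, lengthened one character at a time while the prefix is
# already taken by a different key.
sub hash_generator {
    my $min_len = shift // 6;
    my %taken;    # name => 1 for every name handed out so far
    my %seen;     # key  => name already assigned to it

    return sub {
        my $key = shift;
        return $seen{$key} if exists $seen{$key};

        my $digest = md5_hex($key);
        my $len    = $min_len;
        # Extend the prefix until it is unused (or the digest runs out)
        $len++ while $len < length($digest)
            && $taken{ substr($digest, 0, $len) };

        my $name = substr($digest, 0, $len);
        $taken{$name} = 1;
        return $seen{$key} = $name;
    };
}

my $name_for = hash_generator(1);
print $name_for->("http://example.com/asset$_.png"), "\n" for 1 .. 5;
```

Starting from a length of 1 keeps names short for small sites, at the cost of longer names once the 16 single-character prefixes are used up.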

init

Initialize the crawler.

get_href

Generate a new href from the old one.

flatten_html

Replace URLs in a Mojo::DOM instance, according to the filenames attribute.

flatten_css

Replace URLs in a CSS string, according to the filenames attribute.
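
The general idea of rewriting url(...) references through a mapping table can be sketched as below. This is a simplified illustration, not the module's implementation: the function name rewrite_css_urls is hypothetical, and the sketch only performs exact lookups of the URL text as written in the stylesheet, without resolving relative URLs against a base.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each url(...) in a CSS string with the local
# file name found in the mapping, leaving unknown URLs untouched.
sub rewrite_css_urls {
    my ($css, $filenames) = @_;

    $css =~ s{url\(\s*(["']?)([^"')\s]+)\1\s*\)}{
        my ($quote, $url) = ($1, $2);
        my $new = exists $filenames->{$url} ? $filenames->{$url} : $url;
        "url($quote$new$quote)";
    }ge;

    return $css;
}

my %filenames = ('http://example.com/bg.png' => '000001.png');
print rewrite_css_urls(
    'body { background: url("http://example.com/bg.png") }',
    \%filenames,
), "\n";
```

The optional quote character is captured and reused so that the rewritten reference keeps the quoting style of the original stylesheet.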

save

Save an HTTP response into a file.

AUTHOR

Sugama Keita, <sugama@jamadam.com>

COPYRIGHT AND LICENSE

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.