NAME

URL::Transform - perform URL transformations in various document types

SYNOPSIS

use URL::Transform;
use FindBin qw($Bin);

my $output;
my $urlt = URL::Transform->new(
    'document_type'      => 'text/html;charset=utf-8',
    'content_encoding'   => 'gzip',
    'output_function'    => sub { $output .= "@_" },
    'transform_function' => sub { return (join '|', @_) },
);
$urlt->parse_file($Bin.'/data/URL-Transform-01.html');

print "and this is the output: ", $output;

DESCRIPTION

URL::Transform is a generic module for transforming URLs in documents. It accepts a callback function that can change every URL it finds.

There are different modules to handle different document types, elements or attributes:

text/html, text/vnd.wap.wml, application/xhtml+xml, application/vnd.wap.xhtml+xml

URL::Transform::using::HTML::Parser, URL::Transform::using::XML::SAX (incomplete; was used only for benchmarking)

text/css

URL::Transform::using::CSS::RegExp

text/html/meta-content

URL::Transform::using::HTML::Meta

application/x-javascript

URL::Transform::using::Remove

By passing the parser option to the URL::Transform->new() constructor you can choose which library is used to parse the document and to run the output and transform functions. Note that elements of a different type embedded in, for example, text/html are transformed by the "default_for($document_type)" modules.

transform_function is called with the following arguments:

transform_function->(
    'tag_name'       => 'img',
    'attribute_name' => 'src',
    'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
);

and must return the (un)modified URL as its return value.

output_function is called with each (already modified) document chunk for output.
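As an illustration of the calling convention described above, here is a minimal sketch of a transform_function. The sub name upgrade_to_https and its rewrite rule are hypothetical and not part of URL::Transform; the sketch also assumes that tag_name and attribute_name are passed for every URL, and falls back to returning the URL unchanged when they are not.

```perl
use strict;
use warnings;

# Hypothetical callback: upgrade plain-http links to https, but only for
# <img src> and <a href>; every other URL is returned untouched.
sub upgrade_to_https {
    my %args = @_;    # keys as documented: tag_name, attribute_name, url

    my $url = $args{url};
    return $url unless defined $url;

    my $target = lc( ( $args{tag_name} // '' ) . ' ' . ( $args{attribute_name} // '' ) );
    return $url unless $target eq 'img src' or $target eq 'a href';

    $url =~ s{^http://}{https://}i;
    return $url;
}

# Calling it directly, the way the module is documented to call it:
print upgrade_to_https(
    tag_name       => 'img',
    attribute_name => 'src',
    url            => 'http://search.cpan.org/s/img/cpan_banner.png',
), "\n";
# prints https://search.cpan.org/s/img/cpan_banner.png
```

Such a sub would be handed to the constructor as 'transform_function' => \&upgrade_to_https.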

PROPERTIES

content_encoding
document_type
parser
transform_function
output_function

parser

For HTML/XML this can be HTML::Parser or XML::SAX.

document_type

text/html - default

transform_function

Function that will be called to make the transformation. It receives the named arguments tag_name, attribute_name and url (see DESCRIPTION) and must return the (un)modified URL.

output_function

Reference to a function that will receive the resulting output. The default one uses print.
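A minimal sketch of a custom output_function that writes to a filehandle instead of STDOUT. The helper name make_file_writer is hypothetical, and an in-memory filehandle stands in for a real file so the example can be run without URL::Transform installed.

```perl
use strict;
use warnings;

# Returns an output_function-style closure that writes every chunk it
# receives to the given filehandle (mirrors the default "print @_").
sub make_file_writer {
    my ($fh) = @_;
    return sub { print {$fh} @_ or die "write failed: $!" };
}

# Demonstration with an in-memory filehandle standing in for a real file.
my $buffer = '';
open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";

my $output_function = make_file_writer($fh);
$output_function->('<a href="/one">');    # chunks arrive in document order
$output_function->('one</a>');
close $fh or die "close failed: $!";

print $buffer, "\n";
```

The closure would be passed to the constructor as 'output_function' => make_file_writer($fh).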

content_encoding

Can be set to gzip or deflate. By default it is undef, so there is no content encoding.

METHODS

new

Object constructor.

Requires transform_function, a CODE ref argument.

The rest of the arguments are optional. Here is the list with defaults:

document_type       => 'text/html;charset=utf-8',
output_function     => sub { print @_ },
parser              => 'HTML::Parser',
content_encoding    => undef,

default_for($document_type)

Returns the default parser for the supplied $document_type.

Can also be used as a setter with an additional argument, the parser name.

If called as an object method it sets the default parser for that object. If called as a module function it sets the default parser for the whole module.

parse_string($string)

Submit document as a string for parsing.

This function must be implemented by helper parsing classes.

parse_chunk($chunk)

Submit a chunk of a document for parsing.

This function should be implemented by helper parsing classes.

can_parse_chunks

Returns true or false depending on whether the parser can parse in chunks.

parse_file($file_name)

Submit a file for parsing.

This function should be implemented by helper parsing classes.

# To simplify things, reformat the %HTML::Tagset::linkElements
# hash so that it is always a hash of hashes.

# Construct a hash of tag names that may have links.

js_attributes

# Construct a hash of all possible JavaScript attribute names

decode_string($string)

Returns a decoded string suitable for parsing. The decoding is chosen according to $self->content_encoding.

Decoding is run automatically for every chunk/string/file.

encode_string($string)

Returns an encoded string. The encoding is chosen according to $self->content_encoding.

NOTE: if you want your content encoded back to $self->content_encoding, you have to call this method in your own code. The arguments to output_function() are always plain text.
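A minimal sketch of re-encoding the collected output, using only the methods documented here. It assumes that encode_string() and decode_string() can be called on a freshly constructed object and reverse each other for gzip; the sample HTML and the example.com URL are made up for illustration.

```perl
use strict;
use warnings;
use URL::Transform;

my $html = '<a href="http://example.com/">example</a>';

my $plain = '';
my $urlt  = URL::Transform->new(
    'content_encoding'   => 'gzip',
    'output_function'    => sub { $plain .= join '', @_ },
    # Leave every URL unchanged; only the encoding round trip matters here.
    'transform_function' => sub { my %args = @_; return $args{url} },
);

# Input is decoded automatically, so it has to be gzip-compressed first.
my $gzipped_input = $urlt->encode_string($html);
$urlt->parse_string($gzipped_input);

# $plain holds uncompressed text at this point; compress it explicitly
# before sending it out with a gzip Content-Encoding.
my $gzipped_output = $urlt->encode_string($plain);
```

Collecting the whole document in $plain before encoding keeps the sketch simple; it is not suited to very large documents.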

get_supported_content_encodings()

Returns a hash reference of supported content encodings.

benchmarks

Benchmark: timing 10000 iterations of HTML::Parser    , XML::LibXML::SAX, XML::SAX::PurePerl...
HTML::Parser      :  3 wallclock secs ( 2.41 usr +  0.04 sys =  2.45 CPU) @ 4081.63/s (n=10000)
XML::LibXML::SAX  : 29 wallclock secs (27.22 usr +  0.11 sys = 27.33 CPU) @ 365.90/s (n=10000)
XML::SAX::PurePerl: 192 wallclock secs (180.62 usr +  0.50 sys = 181.12 CPU) @ 55.21/s (n=10000)

TODO

There are URLs in the PICS meta tag: <meta http-equiv="pics-label" content=" .... See http://www.w3.org/PICS/.

SEE ALSO

HTML::Parser, URL::Transform::using::HTML::Parser

AUTHOR

Jozef Kutej <jkutej at cpan.org>

LICENSE AND COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.