NAME

Search::Tools::XML - methods for playing nice with XML and HTML

SYNOPSIS

 use Search::Tools::XML;

 my $class = 'Search::Tools::XML';

 my $text = 'the "quick brown" fox';

 my $xml = $class->start_tag('foo');

 $xml .= $class->utf8_safe( $text );

 $xml .= $class->end_tag('foo');

 # $xml: <foo>the &#34;quick brown&#34; fox</foo>

 $xml = $class->escape( $xml );

 # $xml: &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;

 $xml = $class->unescape( $xml );

 # $xml: <foo>the "quick brown" fox</foo>

 my $plain = $class->no_html( $xml );

 # $plain eq $text

DESCRIPTION

IMPORTANT: The API for escape() and unescape() has changed as of version 0.16. The text is no longer modified in place, as this was less intuitive.

Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.

NOTE: To get the full UTF-8 character set from chr() you must be using Perl >= 5.8. This affects things like the unescape* methods.

VARIABLES

%HTML_ents

Complete map of all named HTML entities to their decimal values.

METHODS

The following methods may be accessed either as object or class methods.

new

Create a Search::Tools::XML object.

tag_re

Returns a qr// regex for matching an SGML (XML, HTML, etc.) tag.

html_whitespace

Returns a regex for all whitespace characters and HTML whitespace entities.

char2ent_map

Returns a hash reference to the class data mapping chr() values to their numerical entity equivalents.

looks_like_html( string )

Returns true if string appears to have HTML-like markup in it.

Aliases for this method include:

looks_like_xml
looks_like_markup

start_tag( string [, \%attr ] )

end_tag( string )

Returns string as a tag, either start or end. string will be escaped for any non-valid chars using tag_safe().

If \%attr is passed, XML-safe attributes are generated using attr_safe().

singleton( string [, \%attr ] )

Like start_tag() but includes the closing slash.

tag_safe( string )

Create a valid XML tag name, escaping/omitting invalid characters.

Example:

    my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
    # $tag eq '______tag_foo'
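
The kind of substitution involved can be sketched in plain Perl. This is only an illustration that happens to reproduce the example above; sketch_tag_safe() is a hypothetical helper, and the rules it applies (ASCII-only names, underscore as the replacement character) are assumptions, not the module's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools::XML.
sub sketch_tag_safe {
    my ($name) = @_;

    # Assumption: a tag name must start with an ASCII letter or underscore.
    $name =~ s/^[^A-Za-z_]/_/;

    # Assumption: elsewhere only ASCII letters, digits, '_', '.' and '-'
    # are kept; anything else becomes an underscore.
    $name =~ s/[^A-Za-z0-9_.-]/_/g;

    return $name;
}

print sketch_tag_safe('1 * ! tag foo'), "\n";    # ______tag_foo
```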

attr_safe( \%attr )

Returns stringified \%attr as XML attributes.

utf8_safe( string )

Return string with special XML chars and all non-ASCII chars converted to numeric entities.

This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.
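
As a rough illustration of what "converted to numeric entities" means, the following plain-Perl sketch mimics the behavior shown in the SYNOPSIS. sketch_utf8_safe() is a hypothetical helper; that all five special characters become decimal entities is an assumption extrapolated from the double-quote case in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation.
sub sketch_utf8_safe {
    my ($text) = @_;

    # Replace each special XML char, and each char outside the ASCII
    # range, with a decimal entity built from its code point.
    $text =~ s/([<>&'"]|[^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

    return $text;
}

print sketch_utf8_safe('the "quick brown" fox'), "\n";
# the &#34;quick brown&#34; fox

print sketch_utf8_safe("caf\x{e9} <b>"), "\n";
# caf&#233; &#60;b&#62;
```

Because the result already contains ampersands, passing it through escape() afterwards would turn each &#34; into &amp;#34;, which is the double-escaping the warning above refers to.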

escape_utf8

Alias for utf8_safe().

no_html( text [, normalize_whitespace] )

no_html() is a brute-force method for removing all tags and entities from text. A simple regular expression is used, so constructs such as nested comments will probably break. If you really need to reliably filter out the tags and entities from an HTML text, use HTML::Parser or similar.

text is returned with no markup in it.

If normalize_whitespace is true (defaults to false) then all whitespace is normalized away to ASCII space (U+0020). This can be helpful if you have Unicode entities representing line breaks or other layout instructions.
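
To make the brute-force approach and its limits concrete, here is a minimal plain-Perl sketch. sketch_no_html() is a hypothetical helper with an assumed, deliberately naive tag pattern; it does not handle entities and is not the regular expression the module uses.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation.
sub sketch_no_html {
    my ( $text, $normalize_whitespace ) = @_;

    # Naive tag pattern: '<', anything that is not '>', then '>'.
    $text =~ s/<[^>]*>//g;

    # Optionally collapse every run of Unicode whitespace to one space.
    $text =~ s/\p{White_Space}+/ /g if $normalize_whitespace;

    return $text;
}

print sketch_no_html('<foo>the "quick brown" fox</foo>'), "\n";
# the "quick brown" fox

# U+2028 (LINE SEPARATOR) and the newline both become U+0020.
print sketch_no_html( "<p>a\n<b>b</b>\x{2028}c</p>", 1 ), "\n";
# a b c

# A '>' inside a comment defeats the naive pattern:
print sketch_no_html('<!-- a > b -->text'), "\n";
#  b -->text
```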

strip_html

An alias for no_html().

strip_markup

An alias for no_html().

escape( text )

Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their named entity equivalents.

The escaped text is returned.

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

As of version 0.27 escape() is written in C/XS for speed.
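
The mapping can be pictured with a small pure-Perl sketch. sketch_escape() is a hypothetical helper, and the use of &apos; for the single quote is an assumption; the real escape() is implemented in C/XS and may differ in detail.

```perl
use strict;
use warnings;

# Assumed entity names for the five special XML chars.
my %named = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper, NOT the module's implementation.
sub sketch_escape {
    my ($text) = @_;    # works on a copy, so the caller's string is untouched

    # One pass over the string, so an '&' produced by a replacement
    # is never escaped a second time.
    $text =~ s/([&<>"'])/$named{$1}/g;

    return $text;
}

my $original = 'AT&T <"quoted">';
my $escaped  = sketch_escape($original);

print "$escaped\n";     # AT&amp;T &lt;&quot;quoted&quot;&gt;
print "$original\n";    # AT&T <"quoted">
```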

unescape( text )

Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.

NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

unescape_named( text )

Replace all named HTML entities with their chr() equivalents.

Returns modified copy of text.

unescape_decimal( text )

Replace all decimal entities with their chr() equivalents.

Returns modified copy of text.
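
The two steps can be sketched in plain Perl as follows. Both helpers are hypothetical, and %sketch_ents is a tiny stand-in for the full %HTML_ents map; applying the named pass before the decimal pass is an assumption that reproduces the unescape() result shown in the SYNOPSIS.

```perl
use strict;
use warnings;

# Tiny stand-in for %HTML_ents: entity name => decimal code point.
my %sketch_ents = (
    amp  => 38,
    lt   => 60,
    gt   => 62,
    quot => 34,
    apos => 39,
    nbsp => 160,
);

# Hypothetical helpers, NOT the module's implementation.
sub sketch_unescape_named {
    my ($text) = @_;
    # Unknown entity names are left as they are.
    $text =~ s/&([A-Za-z][A-Za-z0-9]*);/
        exists $sketch_ents{$1} ? chr( $sketch_ents{$1} ) : "&$1;"/gex;
    return $text;
}

sub sketch_unescape_decimal {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $escaped = '&lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;';
my $plain   = sketch_unescape_decimal( sketch_unescape_named($escaped) );

print "$plain\n";    # <foo>the "quick brown" fox</foo>
```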

perl_to_xml( ref [, options] )

Similar to the XML::Simple XMLout() feature, perl_to_xml() will take a Perl data structure ref and convert it to XML.

options should be a hashref with the following supported key/value pairs:

root value

The root element. If value is a string, it is used as the tag name. If value is a hashref, two keys are required:

tag

String indicating the element name.

attrs

Hash ref of attribute key/value pairs (see start_tag()).

wrap_array 1|0

If wrap_array is true (the default), arrayref items are wrapped in an additional XML tag, keeping the array items enclosed in a logical set. If wrap_array is false, each item in the array is treated individually. See strip_plural below for the naming convention for arrayref items.

strip_plural 1|0

The strip_plural option interacts with the wrap_array option.

If strip_plural is a true value and not a CODE ref, any trailing s character will be stripped from the enclosing tag name whenever an array of hashrefs is found. Example:

 my $data = {
    values => [
        {   two   => 2,
            three => 3,
        },
        {   four => 4,
            five => 5,
        },
    ],
 };

 my $xml = $utils->perl_to_xml($data, {
    root            => 'data',
    wrap_array      => 1,
    strip_plural    => 1,
 });

 # $xml DOM will look like:

 <data>
  <values>
   <value>
    <three>3</three>
    <two>2</two>
   </value>
   <value>
    <five>5</five>
    <four>4</four>
   </value>
  </values>
 </data>

Stripping the final s will not always produce sensible tag names. Instead, pass a CODE ref that expects one value (the tag name) and returns the tag name to use:

 my $xml = $utils->perl_to_xml($data, {
    root            => 'data',
    wrap_array      => 1,
    strip_plural    => sub {
        my $tag = shift;
        $tag =~ s/foo/BAR/;
        return $tag;
    },
 });
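
A more realistic callback might consult a table of irregular plurals before falling back to suffix rules. The sketch below is self-contained and does not call perl_to_xml(); the %singular table and the suffix rules are illustrative assumptions only, and the fallback is still a heuristic that can produce wrong names.

```perl
use strict;
use warnings;

# Hypothetical lookup table of irregular plurals.
my %singular = (
    children => 'child',
    people   => 'person',
    indices  => 'index',
);

# Callback with the shape described above: it takes the tag name
# and returns the tag name to use for each array item.
my $strip_plural = sub {
    my $tag = shift;
    return $singular{$tag} if exists $singular{$tag};

    # Fall back to simple suffix rules: 'entries' -> 'entry',
    # otherwise drop a single trailing 's'.
    $tag =~ s/ies$/y/ or $tag =~ s/s$//;
    return $tag;
};

print $strip_plural->($_), "\n" for qw( values children entries );
# value
# child
# entry
```

Such a CODE ref would be passed as the strip_plural value, exactly as in the example above.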

escape 1|0

If escape is false, strings within the ref value will not be passed through escape(). Default is true.

perl_to_xml( ref, root_element [, strip_plural ][, do_not_escape] )

This second usage is deprecated and here for backwards compatibility only. Use the named key/value options instead. Readers of your code (including you!) will thank you.

tidy( xmlstring )

Attempts to indent xmlstring correctly to make it more legible.

Returns the xmlstring tidied up.

WARNING This is an experimental feature. It might be really slow or eat your XML. You have been warned.

AUTHOR

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the original development of these modules.

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

COPYRIGHT

Copyright 2006-2009 by Peter Karman.

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, SWISH::HiLiter, Class::XSAccessor, Text::Aspell