NAME

Search::Tools::XML - methods for playing nice with XML and HTML

SYNOPSIS

 use Search::Tools::XML;
 
 my $class = 'Search::Tools::XML';
 
 my $text = 'the "quick brown" fox';
 
 my $xml = $class->start_tag('foo');
 
 $xml .= $class->utf8_safe( $text );
 
 $xml .= $class->end_tag('foo');
 
 # $xml: <foo>the &#34;quick brown&#34; fox</foo>
 
 $xml = $class->escape( $xml );
 
 # $xml: &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;
 
 $xml = $class->unescape( $xml );
 
 # $xml: <foo>the "quick brown" fox</foo>
 
 my $plain = $class->no_html( $xml );
 
 # $plain eq $text
 
 

DESCRIPTION

IMPORTANT: The API for escape() and unescape() changed in version 0.16. The text is no longer modified in place, because in-place modification proved less intuitive; the modified text is returned instead.

Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.

NOTE: To get the full UTF-8 character set from chr() you must use Perl 5.8 or later. This affects methods such as the unescape* family.

VARIABLES

%HTML_ents

Complete map of all named HTML entities to their decimal values.

METHODS

The following methods may be accessed either as object or class methods.

new

Create a Search::Tools::XML object.

tag_re

Returns a qr// regex for matching an SGML (XML, HTML, etc.) tag.

html_whitespace

Returns a regex for all whitespace characters and HTML whitespace entities.

char2ent_map

Returns a hash reference to the class data mapping chr() values to their numerical entity equivalents.

looks_like_html( string )

Returns true if string appears to have HTML-like markup in it.

Aliases for this method include:

looks_like_xml
looks_like_markup
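
As a rough illustration of what "HTML-like markup" can mean, the following pure-Perl sketch tests for anything shaped like a start or end tag. The helper name sketch_looks_like_markup and its pattern are assumptions made for this example; they are not the module's actual code, which may apply a different test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Search::Tools::XML): returns 1 if the
# string contains something shaped like a start or end tag, otherwise 0.
sub sketch_looks_like_markup {
    my ($string) = @_;
    return 0 unless defined $string;
    return $string =~ m{</?[A-Za-z][^<>]*>} ? 1 : 0;
}

print sketch_looks_like_markup('<p>hello</p>')    ? "markup\n" : "plain\n";
print sketch_looks_like_markup('3 < 4 and 5 > 2') ? "markup\n" : "plain\n";
```

The second call prints "plain" because the "<" is not immediately followed by a tag name.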

start_tag( string )

end_tag( string )

Returns string as a tag, either start or end. string will be escaped for any non-valid chars using tag_safe().

tag_safe( string )

Create a valid XML tag name, escaping/omitting invalid characters.

Example:

    my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
    # $tag eq '______tag_foo'
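
For readers curious how such a result can come about, here is a minimal pure-Perl sketch that reproduces the documented example. The helper name sketch_tag_safe and its two substitution rules are assumptions chosen to match that one example; the module's real rules may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation).
sub sketch_tag_safe {
    my ($name) = @_;
    return '_' unless defined $name && length $name;
    # Replace every character outside a conservative tag-name set.
    $name =~ s/[^A-Za-z0-9_.\-]/_/g;
    # A tag name may not start with a digit, dot, or hyphen.
    $name =~ s/^[^A-Za-z_]/_/;
    return $name;
}

print sketch_tag_safe('1 * ! tag foo'), "\n";    # ______tag_foo
```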

utf8_safe( string )

Return string with special XML chars and all non-ASCII chars converted to numeric entities.

This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.
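
The conversion described above can be sketched in pure Perl as follows. The helper name sketch_utf8_safe is hypothetical and the single-regex approach is an assumption; it only mirrors the documented behavior (special XML characters and non-ASCII characters become decimal entities), not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation).
sub sketch_utf8_safe {
    my ($text) = @_;
    # Special XML characters, plus anything outside 7-bit ASCII,
    # become decimal numeric entities.
    $text =~ s/([<>&"']|[^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print sketch_utf8_safe('the "quick brown" fox'), "\n";  # the &#34;quick brown&#34; fox
print sketch_utf8_safe("caf\x{e9}"), "\n";              # caf&#233;
```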

escape_utf8

Alias for utf8_safe().

no_html( text )

no_html() is a brute-force method for removing all tags and entities from text. It uses a simple regular expression, so constructs such as nested comments will probably break it. If you need to reliably filter tags and entities out of an HTML text, use HTML::Parser or similar.

text is returned with no markup in it.
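
To show why a regex-only approach is fragile, here is a minimal sketch that takes the description above literally. The helper name sketch_no_html and both patterns are assumptions for illustration, not the module's actual expression.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation).
sub sketch_no_html {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;             # drop anything shaped like a tag
    $text =~ s/&#?[A-Za-z0-9]+;//g;    # drop named and numeric entities
    return $text;
}

print sketch_no_html('<foo>the "quick brown" fox</foo>'), "\n";  # the "quick brown" fox

# A ">" inside a comment ends the match early and leaves debris behind.
print sketch_no_html('<!-- a > b -->text'), "\n";                # " b -->text"
```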

strip_html

An alias for no_html().

escape( text )

Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their named entity equivalents.

The escaped text is returned.

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

As of version 0.27 escape() is written in C/XS for speed.
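
A pure-Perl sketch of the same idea follows. The helper name sketch_escape and the lookup table are hypothetical; in particular, using &apos; for the apostrophe is an assumption, and the real XS code may emit something else. The example input is taken from the SYNOPSIS.

```perl
use strict;
use warnings;

# Assumed mapping of the five special XML characters to named entities.
my %named = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper (not the module's implementation).
sub sketch_escape {
    my ($text) = @_;    # works on a copy, so the caller's string is untouched
    $text =~ s/([&<>"'])/$named{$1}/g;
    return $text;
}

my $original = '<foo>the &#34;quick brown&#34; fox</foo>';
my $escaped  = sketch_escape($original);

print "$escaped\n";   # &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;
print "$original\n";  # <foo>the &#34;quick brown&#34; fox</foo>
```

The second line of output shows the return-a-copy behavior: the original string is unchanged.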

unescape( text )

Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.

NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

unescape_named( text )

Replace all named HTML entities with their chr() equivalents.

Returns modified copy of text.
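
The lookup involved can be sketched with a small stand-in table. The hash %ents below is a hypothetical four-entry subset in the same name-to-decimal shape described for %HTML_ents, and sketch_unescape_named is not the module's code; leaving unknown names untouched is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical subset of a name => decimal code point table.
my %ents = ( amp => 38, lt => 60, gt => 62, quot => 34 );

# Hypothetical helper (not the module's implementation).
sub sketch_unescape_named {
    my ($text) = @_;
    # Known names become characters; unknown names are left as they were.
    $text =~ s/&([A-Za-z][A-Za-z0-9]*);/exists $ents{$1} ? chr($ents{$1}) : "&$1;"/ge;
    return $text;
}

print sketch_unescape_named('&lt;b&gt; &amp; &bogus;'), "\n";  # <b> & &bogus;
```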

unescape_decimal( text )

Replace all decimal entities with their chr() equivalents.

Returns modified copy of text.
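
A one-substitution sketch of decimal entity resolution, assuming a hypothetical helper named sketch_unescape_decimal (not the module's code):

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation).
sub sketch_unescape_decimal {
    my ($text) = @_;
    # &#NNN; becomes the character with code point NNN.
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

print sketch_unescape_decimal('the &#34;quick brown&#34; fox'), "\n";  # the "quick brown" fox
```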

AUTHOR

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of these modules.

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

You can also look for information at:

COPYRIGHT

Copyright 2006-2009 by Peter Karman.

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, SWISH::HiLiter, Rose::Object, Class::XSAccessor, Text::Aspell