Search::Tools::XML - methods for playing nice with XML and HTML
    use Search::Tools::XML;

    my $class = 'Search::Tools::XML';
    my $text  = 'the "quick brown" fox';

    my $xml = $class->start_tag('foo');
    $xml .= $class->utf8_safe( $text );
    $xml .= $class->end_tag('foo');
    # $xml: <foo>the "quick brown" fox</foo>

    $xml = $class->escape( $xml );
    # $xml: <foo>the &#34;quick brown&#34; fox</foo>

    $xml = $class->unescape( $xml );
    # $xml: <foo>the "quick brown" fox</foo>

    my $plain = $class->no_html( $xml );
    # $plain eq $text
IMPORTANT: The API for escape() and unescape() changed in version 0.16. Both methods now return the converted text instead of modifying their argument in place, because in-place modification proved less intuitive.
Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.
NOTE: To get the full UTF-8 character set from chr() you must be using Perl 5.8 or later. This affects things like the unescape* methods.
Complete map of all named HTML entities to their decimal values.
The following methods may be accessed either as object or class methods.
Create a Search::Tools::XML object.
Returns a qr// regex for matching an SGML (XML, HTML, etc.) tag.
Returns a regex for all whitespace characters and HTML whitespace entities.
Returns a hash reference to the class data mapping chr() values to their numerical entity equivalents.
Returns true if string appears to have HTML-like markup in it.
Aliases for this method include:
Returns string as a start or end tag. Any characters in string that are not valid in a tag name are escaped using tag_safe().
Create a valid XML tag name, escaping/omitting invalid characters.
Example:

    my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
    # $tag eq '______tag_foo'
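To make the idea of "escaping/omitting invalid characters" concrete, here is a minimal pure-Perl sketch. It is not the module's implementation: the name tag_safe_sketch and its two substitution rules are assumptions, chosen only so that the result is meant to match the documented example. Real XML names also allow characters such as hyphens and periods, which this conservative sketch replaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Search::Tools::XML): force a string
# into a conservative ASCII tag name.
sub tag_safe_sketch {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9_]/_/g;    # replace anything but letters, digits, underscore
    $name =~ s/^[0-9]/_/;            # a tag name may not start with a digit
    return $name;
}

# Prints six underscores followed by "tag_foo".
print tag_safe_sketch('1 * ! tag foo'), "\n";
```

Each invalid character maps to exactly one underscore, so the output has the same length as the input.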
Return string with special XML chars and all non-ASCII chars converted to numeric entities.
This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.
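The conversion described for utf8_safe() can be sketched in a few lines of pure Perl. This is an illustration under assumptions, not the module's code: numeric_entities_sketch is a hypothetical name, and the set of "special XML chars" is assumed to be the five listed for escape().

```perl
use strict;
use warnings;

# Hypothetical helper: turn the XML special characters and every
# non-ASCII character into a decimal numeric entity.
sub numeric_entities_sketch {
    my ($text) = @_;
    $text =~ s/([<>&'"]|[^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $out = numeric_entities_sketch(qq{caf\x{e9} "au lait"});
print "$out\n";    # caf&#233; &#34;au lait&#34;
```

Because the result is pure ASCII, it can be embedded in a document of any encoding. Running escape() over it afterwards would also escape the ampersands of the entities just created, which is why the two should not be combined casually.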
Alias for utf8_safe().
no_html() is a brute-force method for removing all tags and entities from text. A simple regular expression is used, so constructs such as nested comments will probably break it. If you really need to reliably filter the tags and entities out of HTML text, use HTML::Parser or similar.
text is returned with no markup in it.
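The trade-off of the regular-expression approach is easy to see in a small pure-Perl sketch. The helper strip_tags_sketch and its two patterns are assumptions for illustration; the actual expression used by no_html() is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: brute-force removal of tags and entities.
sub strip_tags_sketch {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;             # drop anything shaped like a tag
    $text =~ s/&#?[A-Za-z0-9]+;//g;    # drop named and numeric entities
    return $text;
}

# Well-formed input: markup and the entity disappear.
print strip_tags_sketch('<p>Hello &amp; <b>world</b></p>'), "\n";    # "Hello  world"

# A comment containing ">" ends the match too early and leaves debris.
print strip_tags_sketch('<!-- a > b -->text'), "\n";                 # " b -->text"
```

The second call shows the kind of input on which a simple pattern fails and a real parser such as HTML::Parser is the safer choice.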
An alias for no_html().
Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their named entity equivalents.
The escaped text is returned.
IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.
As of version 0.27 escape() is written in C/XS for speed.
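For readers who want to see the conversion spelled out, the following pure-Perl sketch maps the five special characters to named entities and returns a copy, leaving its argument untouched. It is not the C/XS implementation: escape_sketch and the %NAMED table are hypothetical, and the use of &apos; for the apostrophe is an assumption based on the phrase "named entity equivalents".

```perl
use strict;
use warnings;

# Hypothetical lookup table for the five special XML characters.
my %NAMED = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper: escape in a single pass and return the result.
sub escape_sketch {
    my ($text) = @_;                      # $text is a copy of the argument
    $text =~ s/([&<>"'])/$NAMED{$1}/g;    # one pass, so "&" is never escaped twice
    return $text;
}

my $raw     = q{<a href="x">Tom & Jerry's</a>};
my $escaped = escape_sketch($raw);

print "$escaped\n";    # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
print "$raw\n";        # the original string is unchanged
```

Doing the substitution in one pass matters: escaping "&" in a separate, later step would corrupt the entities produced for the other four characters.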
Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.
NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).
Replace all named HTML entities with their chr() equivalents.
Returns modified copy of text.
Replace all decimal entities with their chr() equivalents.
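Resolving named and decimal entities with chr() can be sketched as follows. This is an assumption-laden illustration rather than the module's code: unescape_sketch is a hypothetical name, and %ENT is a tiny stand-in for the complete map of named entities to decimal values.

```perl
use strict;
use warnings;

# Tiny hypothetical subset of a name-to-decimal-value entity map.
my %ENT = ( amp => 38, lt => 60, gt => 62, quot => 34, apos => 39, nbsp => 160 );

# Hypothetical helper: resolve decimal and named entities in one pass.
sub unescape_sketch {
    my ($text) = @_;
    $text =~ s{&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
        defined($1)       ? chr($1)          # decimal entity, e.g. &#233;
      : exists($ENT{$2})  ? chr($ENT{$2})    # known named entity
      :                     "&$2;"           # unknown name: leave as found
    }ge;
    return $text;
}

print unescape_sketch('&lt;b&gt;Tom &amp; Jerry&#39;s&lt;/b&gt;'), "\n";    # <b>Tom & Jerry's</b>
```

Handling both forms in a single substitution means text produced by one rule is never re-read by the other, so a doubly escaped input such as `&amp;#34;` is unwrapped by exactly one level.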
Peter Karman <karman@cpan.org>
Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.
Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of these modules.
Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Search::Tools
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Search-Tools
CPAN Ratings
http://cpanratings.perl.org/d/Search-Tools
Search CPAN
http://search.cpan.org/dist/Search-Tools/
Copyright 2006-2009 by Peter Karman.
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
HTML::HiLiter, SWISH::HiLiter, Rose::Object, Class::XSAccessor, Text::Aspell