NAME

Search::Tools::UTF8 - UTF8 string wrangling

SYNOPSIS

 use Search::Tools::UTF8;
 
 my $str = 'foo bar baz';
 
 print "bad UTF-8 sequence: " . find_bad_utf8($str)
    unless is_valid_utf8($str);
 
 print "bad ascii byte at position " . find_bad_ascii($str)
    unless is_ascii($str);
 
 print "bad latin1 byte at position " . find_bad_latin1($str)
    unless is_latin1($str);
 

DESCRIPTION

Search::Tools::UTF8 supplies common UTF8-related functions.

FUNCTIONS

is_valid_utf8( text )

Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).
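The idea of checking validity independently of Perl's internal flag can be sketched with the core Encode module alone. This is a minimal illustration, not the module's implementation; looks_like_valid_utf8() is a hypothetical name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper for illustration; not part of Search::Tools::UTF8.
sub looks_like_valid_utf8 {
    my ($text) = @_;

    # Reduce a flagged (character) string to octets so both cases are
    # checked the same way. $octets is a copy, so the caller's string
    # is never modified.
    my $octets = Encode::is_utf8($text)
        ? Encode::encode( 'UTF-8', $text )
        : $text;

    # With FB_CROAK, decode() dies on the first malformed sequence.
    my $ok = eval {
        Encode::decode( 'UTF-8', $octets, Encode::FB_CROAK() );
        1;
    };
    return $ok ? 1 : 0;
}

for my $candidate ( "foo bar baz", "\xd9\xa6", "\xff\xfe" ) {
    printf "%s: %s\n", unpack( 'H*', $candidate ),
        looks_like_valid_utf8($candidate) ? 'valid' : 'not valid';
}
```

The last candidate fails because the byte 0xFF can never appear in well-formed UTF-8.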

is_ascii( text )

Returns true (1) if text contains no bytes above 127. Otherwise, returns false (0). Used internally by convert() in Search::Tools::Transliterate to check text prior to transliterating.

is_latin1( text )

Returns true if text lies within the Latin1 charset.

NOTE: Only Latin1 octets that map to a representable character are accepted. Octets in the range \x80 - \x9f are not considered valid Latin1; if any are found in text, is_latin1() returns false.

CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the two interpretations yield different Unicode codepoint(s). Example:

 my $str = "\x{d9}\x{a6}";  # same as \x{666}
 is_valid_utf8($str);       # returns true
 is_latin1($str);           # returns true

Thus is_latin1() (and likewise find_bad_latin1()) are not foolproof. Use them in combination with is_flagged_utf8() to get a better test.
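To see why the same two bytes are ambiguous, the following sketch decodes them under both interpretations using only the core Encode module. It does not call Search::Tools::UTF8 and is meant purely as an illustration.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\x{d9}\x{a6}";

# The same octets decoded under two different assumptions.
my $as_utf8   = Encode::decode( 'UTF-8',      $bytes );
my $as_latin1 = Encode::decode( 'ISO-8859-1', $bytes );

printf "as UTF-8:  %d char(s), first is U+%04X\n",
    length($as_utf8), ord($as_utf8);
printf "as Latin1: %d char(s), first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
```

Read as UTF-8 the bytes form the single character U+0666; read as Latin1 they form two characters, U+00D9 followed by U+00A6.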

is_flagged_utf8( text )

Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().
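Since this function is described as equivalent to Encode::is_utf8(), the flag it reports can be observed with Encode directly. The sketch below shows only that the flag reflects how Perl stores the string, not whether the bytes happen to be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xd9\xa6";                           # raw bytes, flag off
my $chars  = Encode::decode( 'UTF-8', $octets );   # character string, flag on

my $octets_flag = Encode::is_utf8($octets) ? 1 : 0;
my $chars_flag  = Encode::is_utf8($chars)  ? 1 : 0;

printf "octets flagged: %d, chars flagged: %d\n", $octets_flag, $chars_flag;
```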

is_perl_utf8_string( text )

Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().

is_sane_utf8( text [,warnings] )

Tests for doubly encoded text. Returns true if text looks ok. From the Test::utf8 docs:

 Strings that are not utf8 always automatically pass.

Pass a second true param to get diagnostics on stderr.
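For readers unfamiliar with double encoding, the sketch below shows how it typically arises: bytes that are already UTF-8 are mistaken for Latin1 characters and encoded a second time. It uses only the core Encode module and does not show how is_sane_utf8() detects the problem.

```perl
use strict;
use warnings;
use Encode ();

my $char = "\x{e9}";                            # LATIN SMALL LETTER E WITH ACUTE
my $once = Encode::encode( 'UTF-8', $char );    # correct: bytes C3 A9

# The mistake: treat the UTF-8 bytes as Latin1 characters and encode again.
my $twice = Encode::encode( 'UTF-8',
    Encode::decode( 'ISO-8859-1', $once ) );

printf "encoded once:  %s\n", join ' ', unpack( '(H2)*', $once );
printf "encoded twice: %s\n", join ' ', unpack( '(H2)*', $twice );
```

The doubly encoded form is still well-formed UTF-8, which is why a plain validity check does not catch it.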

find_bad_utf8( text )

Returns a string of the bad bytes found in text. This assumes that text is not valid UTF-8, so use it like:

 croak "bad bytes: " . find_bad_utf8($str) 
    unless is_valid_utf8($str);
    

If text is a valid UTF-8 string, returns undef.

find_bad_ascii( text )

Returns position of first non-ASCII byte or -1 if text is all ASCII.
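The return convention can be illustrated with a small regex-based stand-in. sketch_find_bad_ascii() is a hypothetical name, and the zero-based offset is an assumption; the module's own code may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for find_bad_ascii(); zero-based offsets are an
# assumption, and the module's own code may differ.
sub sketch_find_bad_ascii {
    my ($text) = @_;

    # $-[0] holds the offset where the last successful match started.
    return $text =~ /[^\x00-\x7f]/ ? $-[0] : -1;
}

printf "%d\n", sketch_find_bad_ascii("foo bar baz");
printf "%d\n", sketch_find_bad_ascii("caf\xe9 au lait");
```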

find_bad_latin1( text )

Returns position of first non-Latin1 byte or -1 if text is valid Latin1.

find_bad_latin1_report( text )

Returns the position of the first non-Latin1 byte (like find_bad_latin1()) and also carps the decimal and hex values of the bad byte.

to_utf8( text, charset )

Shorthand for running text through appropriate is_*() checks and then converting to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.

Returns undef if for some reason the encoding failed or the result did not pass is_sane_utf8().
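The general shape of such a conversion can be sketched with the core Encode module. sketch_to_utf8() is a hypothetical stand-in, not the module's code: it skips the is_*() checks and the is_sane_utf8() test described above and only shows the decode step and the undef-on-failure convention.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for to_utf8(); the real function runs additional
# checks that are not reproduced here.
sub sketch_to_utf8 {
    my ( $text, $charset ) = @_;

    # Already a character string: nothing to convert.
    return $text if Encode::is_utf8($text);

    # decode() may modify its source when a CHECK value is given,
    # so work on a copy.
    my $copy    = $text;
    my $decoded = eval {
        Encode::decode( $charset, $copy, Encode::FB_CROAK() );
    };

    return $decoded;    # undef if decoding died
}

my $converted = sketch_to_utf8( "caf\xe9", 'ISO-8859-1' );
print defined $converted ? "converted ok\n" : "conversion failed\n";
```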

AUTHOR

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of some of these modules.

Many of the UTF-8 tests come directly from Test::utf8.

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

COPYRIGHT

Copyright 2006-2009 by Peter Karman.

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, SWISH::HiLiter, Rose::Object, Class::XSAccessor, Text::Aspell