NAME

Search::Tools::UTF8 - UTF8 string wrangling

SYNOPSIS

 use Search::Tools::UTF8;
 
 my $str = 'foo bar baz';
 
 print "bad UTF-8 sequence: " . find_bad_utf8($str)
    unless is_valid_utf8($str);
 
 print "bad ascii byte at position " . find_bad_ascii($str)
    unless is_ascii($str);
 
 print "bad latin1 byte at position " . find_bad_latin1($str)
    unless is_latin1($str);

DESCRIPTION

Search::Tools::UTF8 supplies common UTF8-related functions.

FUNCTIONS

byte_length( text )

Returns the number of bytes in text regardless of encoding.

is_valid_utf8( text )

Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).

is_ascii( text )

If text contains no bytes above 127, then returns true (1). Otherwise, returns false (0). Used by convert() internally to check text prior to transliterating.

is_latin1( text )

Returns true if text lies within the Latin1 charset.

NOTE: Only Latin1 octets with a valid representable character are checked. Octets in the range \x80 - \x9f are not considered valid Latin1 and if found in text, is_latin1() will return false.

CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the string doesn't represent the same Unicode codepoint(s). Example:

 my $str = "\x{d9}\x{a6}";  # same as \x{666}
 is_valid_utf8($str);       # returns true
 is_latin1($str);           # returns true

Thus is_latin1() (and likewise find_bad_latin1()) are not foolproof. Use them in combination with is_flagged_utf8() to get a better test.

is_flagged_utf8( text )

Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().

is_perl_utf8_string( text )

Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().

is_sane_utf8( text [,warnings] )

Will test for double-y encoded text. Returns true if text looks ok. From Text::utf8 docs:

 Strings that are not utf8 always automatically pass.

Pass a second true param to get diagnostics on stderr.

find_bad_utf8( text )

Returns string of bad bytes from text. This of course assumes that text is not valid UTF-8, so use it like:

 croak "bad bytes: " . find_bad_utf8($str) 
    unless is_valid_utf8($str);

If text is a valid UTF-8 string, returns undef.

find_bad_ascii( text )

Returns position of first non-ASCII byte or -1 if text is all ASCII.

find_bad_latin1( text )

Returns position of first non-Latin1 byte or -1 if text is valid Latin1.

find_bad_latin1_report( text )

Returns position of first non-Latin1 byte (like find_bad_latin1()) and also carps about what the decimal and hex values of the bad byte are.

to_utf8( text, charset )

Shorthand for running text through appropriate is_*() checks and then converting to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.

Returns undef if for some reason the encoding failed or the result did not pass is_sane_utf8().

looks_like_cp1252( text )

This function tests that there are bytes in text between 0x80 and 0x9f inclusive. Those bytes are used by the Windows-1252 character set and include some of the troublesome characters like curly quotes.

See also fix_cp1252_codepoints_in_utf8() and the Search::Tools::Transliterate convert1252() method.

fix_cp1252_codepoints_in_utf8( text )

The Windows-1252 codepoints between 0x80 and 0x9f may be encoded validly as UTF-8 but the Unicode standard does not map any characters at those codepoints. fix_cp1252_codepoints_in_utf8() converts a UTF-8 encoded string text to map the suspect 1252 codepoints to their correct Unicode representations.

Note that fix_cp1252_codepoints_in_utf8() is different from the fix_latin() function used in Transliterate, which does not differentiate between a Windows-1252 encoded string and a UTF-8 encoded string.

This function will croak if text does not pass is_valid_utf8().

debug_bytes( text )

Iterates over each byte in text, printing byte, hex and decimal values to stderr.

AUTHOR

Peter Karman <karman@cpan.org>

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of some of these modules.

Many of the UTF-8 tests come directly from Test::utf8.

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools
AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Search-Tools
CPAN Ratings

http://cpanratings.perl.org/d/Search-Tools
Search CPAN

http://search.cpan.org/dist/Search-Tools/

COPYRIGHT

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)