NAME

Search::Tools::UTF8 - UTF8 string wrangling

SYNOPSIS

 use Search::Tools::UTF8;
 
 my $str = 'foo bar baz';
 
 print "bad UTF-8 sequence: " . find_bad_utf8($str)
    unless is_valid_utf8($str);
 
 print "bad ascii byte at position " . find_bad_ascii($str)
    unless is_ascii($str);
 
 print "bad latin1 byte at position " . find_bad_latin1($str)
    unless is_latin1($str);

DESCRIPTION

Search::Tools::UTF8 supplies common UTF8-related functions.

FUNCTIONS

is_valid_utf8( text )

Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).

is_ascii( text )

If text contains no bytes above 127, then returns true (1). Otherwise, returns false (0). Used by convert() internally to check text prior to transliterating.

is_latin1( text )

Returns true if text lies within the Latin1 charset.

NOTE: Only Latin1 octets with a valid representable character are checked. Octets in the range \x80 - \x9f are not considered valid Latin1 and if found in text, is_latin1() will return false.

CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the string doesn't represent the same Unicode codepoint(s). Example:

 my $str = "\x{d9}\x{a6}";  # same as \x{666}
 is_valid_utf8($str);       # returns true
 is_latin1($str);           # returns true

Thus is_latin1() (and likewise find_bad_latin1()) are not foolproof. Use them in combination with is_flagged_utf8() to get a better test.

is_flagged_utf8( text )

Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().

is_perl_utf8_string( text )

Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().

is_sane_utf8( text [,warnings] )

Will test for double-y encoded text. Returns true if text looks ok. From Text::utf8 docs:

 Strings that are not utf8 always automatically pass.

Pass a second true param to get diagnostics on stderr.

find_bad_utf8( text )

Returns string of bad bytes from text. This of course assumes that text is not valid UTF-8, so use it like:

 croak "bad bytes: " . find_bad_utf8($str) 
    unless is_valid_utf8($str);

If text is a valid UTF-8 string, returns undef.

find_bad_ascii( text )

Returns position of first non-ASCII byte or -1 if text is all ASCII.

find_bad_latin1( text )

Returns position of first non-Latin1 byte or -1 if text is valid Latin1.

find_bad_latin1_report( text )

Returns position of first non-Latin1 byte (like find_bad_latin1()) and also carps about what the decimal and hex values of the bad byte are.

to_utf8( text, charset )

Shorthand for running text through appropriate is_*() checks and then converting to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.

Returns undef if for some reason the encoding failed or the result did not pass is_sane_utf8().

BUGS

AUTHOR

Peter Karman perl@peknet.com

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of this module.

Many of the UTF-8 tests come directly from Test::utf8.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)

NAME

SYNOPSIS

DESCRIPTION

FUNCTIONS

is_valid_utf8( text )

is_ascii( text )

is_latin1( text )

is_flagged_utf8( text )

is_perl_utf8_string( text )

is_sane_utf8( text [,warnings] )

find_bad_utf8( text )

find_bad_ascii( text )

find_bad_latin1( text )

find_bad_latin1_report( text )

to_utf8( text, charset )

BUGS

AUTHOR

COPYRIGHT

SEE ALSO

NAME

SYNOPSIS

DESCRIPTION

FUNCTIONS

is_valid_utf8( text )

is_ascii( text )

is_latin1( text )

is_flagged_utf8( text )

is_perl_utf8_string( text )

is_sane_utf8( text [,warnings] )

find_bad_utf8( text )

find_bad_ascii( text )

find_bad_latin1( text )

find_bad_latin1_report( text )

to_utf8( text, charset )

BUGS

AUTHOR

COPYRIGHT

SEE ALSO

Module Install Instructions