NAME

Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form.

SYNOPSIS

    use Unicode::UTF8 qw[decode_utf8 encode_utf8];
    
    $string = decode_utf8($octets);
    $octets = encode_utf8($string);

DESCRIPTION

This module provides functions to encode and decode the UTF-8 encoding form as defined by Unicode and ISO/IEC 10646:2011.

FUNCTIONS

decode_utf8($octets)

Returns a decoded representation of $octets in UTF-8 encoding as a character string.

Issues a warning if $octets contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
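The decoding behaviour described above can be sketched as follows. This is a minimal example, not taken from the module's own documentation; the sample octets and the use of a $SIG{__WARN__} handler to collect warnings are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Collect warnings in an array instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Well-formed input: the octets E2 82 AC encode U+20AC (EURO SIGN).
my $euro = decode_utf8("\xE2\x82\xAC");

# Ill-formed input: F1 80 80 is a truncated four-octet sequence,
# so a warning is expected here.
my $bad = decode_utf8("\xF1\x80\x80");

printf "decoded U+%04X (%d character)\n", ord($euro), length($euro);
printf "warnings collected: %d\n", scalar @warnings;
```

Collecting warnings this way keeps the example quiet; in real code the warning would normally be left to reach STDERR or a logger.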

encode_utf8($string)

Returns an encoded representation of $string in UTF-8 encoding as an octet string.

Issues a warning if $string contains code points which can't be interchanged or represented in UTF-8 encoding form.
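As a small illustration of encoding, the sketch below turns a one-character string into its UTF-8 octets and prints them in hexadecimal. The sample character and the formatting code are assumptions made for this example only.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

my $string = "\x{263A}";             # one character: WHITE SMILING FACE
my $octets = encode_utf8($string);   # its UTF-8 form is E2 98 BA

# Show each octet as two hexadecimal digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;

printf "%d character -> %d octets: %s\n",
    length($string), length($octets), $hex;
```

Note that length() counts characters for $string but octets for $octets, which is why the two values differ.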

EXPORTS

None by default. All functions can be exported using the :all tag or individually.
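For instance, importing with the :all tag might look like the sketch below, which round-trips a short octet string. The sample data is an assumption made for illustration.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[:all];   # import every exportable function

my $octets = "\xC3\xA5";             # UTF-8 octets for U+00E5
my $string = decode_utf8($octets);   # one character
my $again  = encode_utf8($string);   # back to the original two octets

print "round trip ok\n" if $again eq $octets;
```

Importing individually, as in the SYNOPSIS, keeps the namespace smaller when only one direction is needed.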

DIAGNOSTICS

Can't decode a wide character string

(F) The octets argument contains a wide character (a code point greater than U+00FF).

Can't decode ill-formed UTF-8 octet sequence <%s> in position %u

(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.

Can't interchange noncharacter code point U+%.4X at position %u

(W utf8) Noncharacters are permanently reserved for internal use and should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hexadecimal) and the values U+FDD0..U+FDEF.

Can't represent surrogate code point U+%.4X at position %u in UTF-8 encoding form

(W utf8) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.

Can't represent super code point \x{%X} at position %u in UTF-8 encoding form

(W utf8) Encountered a code point greater than U+10FFFF, which lies in Perl's extended codespace.

Can't decode ill-formed UTF-X octet sequence <%s> in position %u

(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
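Because the (W utf8) diagnostics belong to Perl's utf8 warning category, they can presumably be promoted to exceptions with the standard warnings pragma. The sketch below assumes that the module honours the caller's lexical warning settings; treat it as an illustration rather than documented behaviour for every version.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

my $string = eval {
    # Assumption: making the utf8 category FATAL in this scope turns
    # the (W utf8) diagnostics into exceptions.
    use warnings FATAL => 'utf8';
    decode_utf8("\xF1\x80\x80");   # ill-formed: truncated sequence
};
my $error = $@;

if (!defined $string) {
    print "decode failed: $error";
}
```

Limiting the FATAL setting to the eval block keeps the stricter behaviour local to the call that needs it.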

COMPARISON

Comparison between Encode's UTF-8 (E) and Unicode::UTF8 (U).

Noncharacters

Unicode::UTF8 recognizes all noncharacters regardless of perl version. Encode only recognizes U+FFFF on perl versions prior to 5.14.

Replacement U+FFFD

Unicode::UTF8 implements Unicode's recommended practice for using U+FFFD.

"\xF1\x80\x80":

U) "\x{FFFD}"

E) "\x{FFFD}\x{FFFD}\x{FFFD}"
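The Unicode::UTF8 result above can be checked with a short script. This is a sketch: the no-op warning handler is an assumption added only to keep the output clean, since decoding this input also issues a warning.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Discard warnings for this demonstration only.
local $SIG{__WARN__} = sub { };

# F1 80 80 is a single truncated sequence, so it is replaced as a unit.
my $string = decode_utf8("\xF1\x80\x80");

# List the resulting code points in U+XXXX notation.
my $points = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;

printf "%d character(s): %s\n", length($string), $points;
```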

Diagnostics

"\xF1\x80\x80":

U) Can't decode ill-formed UTF-8 octet sequence <F1 80 80> in position 0 [...]

E) utf8 "\xF1" does not map to Unicode [...]

"\xEF\xBF\xBF":

U) Can't interchange noncharacter code point U+FFFF in position 0 [...]

E) utf8 "\xFFFF" does not map to Unicode [...]

"\x{D800}":

U) Can't represent surrogate code point U+D800 at position 0 in UTF-8 encoding form [...]

E) "\x{d800}" does not map to utf8 [...]

"\x{110000}":

U) Can't represent super code point \x{110000} at position 0 in UTF-8 encoding form [...]

E) "\x{110000}" does not map to utf8 [...]

"\x{FFFF}":

U) Can't interchange noncharacter code point U+FFFF at position 0 [...]

E) "\x{ffff}" does not map to utf8 [...]

Taint mode

Unicode::UTF8 preserves taintedness; Encode does not.

    $tainted_string = decode_utf8($tainted_octets);
    $tainted_octets = encode_utf8($tainted_string);

Performance

Unicode::UTF8 is roughly 500% to 2000% faster than Encode. See the benchmark script: https://github.com/chansen/p5-unicode-utf8/blob/master/benchmarks/bench.pl

SUPPORT

Bugs / Feature Requests

Please report any bugs or feature requests by email to bug-unicode-utf8 at rt.cpan.org, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.

Source Code

This is open source software. The code repository is available for public review and contribution under the terms of the license.

http://github.com/chansen/p5-unicode-utf8

    git clone http://github.com/chansen/p5-unicode-utf8

AUTHOR

Christian Hansen chansen@cpan.org

COPYRIGHT

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

