Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form.
    use Unicode::UTF8 qw[decode_utf8 encode_utf8];

    $string = decode_utf8($octets);
    $octets = encode_utf8($string);
This module provides functions to encode and decode UTF-8 encoding form as defined by Unicode and ISO/IEC 10646:2011.
decode_utf8($octets) returns a decoded representation of $octets, interpreted as UTF-8, as a character string.
Issues a warning if $octets contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
encode_utf8($string) returns an encoded representation of $string in UTF-8 encoding as an octet string.
Issues a warning if $string contains code points which can't be interchanged or represented in UTF-8 encoding form.
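A minimal round-trip sketch of the two functions described above. The sample string and the variable names are illustrative only; the escapes are used so the example does not depend on the source file's own encoding.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8 encode_utf8];

# Illustrative character string: "café ☺" written with escapes (6 characters).
my $string = "caf\x{E9} \x{263A}";

# Character string -> octet string. U+00E9 takes 2 octets, U+263A takes 3.
my $octets = encode_utf8($string);
printf "%d characters, %d octets\n", length $string, length $octets;

# Octet string -> character string; well-formed input should round-trip.
my $decoded = decode_utf8($octets);
print "round trip ok\n" if $decoded eq $string;
```

The length of the character string counts code points, while the length of the octet string counts bytes, which is why the two numbers differ.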
Nothing is exported by default. All functions can be exported using the :all tag or individually.
(F) Wide character in octets.
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
(W utf8) Noncharacters are permanently reserved for internal use and should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n ranges from 0 to 10 hexadecimal, that is, planes 0 through 16) and the values U+FDD0..U+FDEF.
(W utf8) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
(W utf8) Super code points are code points greater than U+10FFFF. They exist only in Perl's extended codespace.
(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
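The code point categories named in the diagnostics above can be sketched in pure Perl. The helper classify_code_point is hypothetical and is not part of Unicode::UTF8; it only restates the ranges given in the text.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Unicode::UTF8): classify a code point
# using the ranges listed in the diagnostics.
sub classify_code_point {
    my ($cp) = @_;
    return 'super'        if $cp > 0x10FFFF;                  # beyond Unicode's codespace
    return 'surrogate'    if $cp >= 0xD800 && $cp <= 0xDFFF;  # reserved for UTF-16
    return 'noncharacter' if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 'noncharacter' if ($cp & 0xFFFE) == 0xFFFE;        # U+nFFFE or U+nFFFF
    return 'ok';
}

for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x10FFFE, 0x110000) {
    printf "U+%04X: %s\n", $cp, classify_code_point($cp);
}
```

The mask test works because the low 16 bits of U+nFFFE and U+nFFFF differ only in the last bit, so one comparison covers both values in every plane.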
Comparison between Encode's UTF-8 (E) and Unicode::UTF8 (U).
Unicode::UTF8 recognizes all noncharacters regardless of perl version. Encode only recognizes U+FFFF on perl versions below 5.14.
Unicode::UTF8 implements Unicode's recommended practice for using U+FFFD.
"\xF1\x80\x80":
U) "\x{FFFD}"
E) "\x{FFFD}\x{FFFD}\x{FFFD}"
U) Can't decode ill-formed UTF-8 octet sequence <F1 80 80> in position 0 [...]
E) utf8 "\xF1" does not map to Unicode [...]
"\xEF\xBF\xBF"
U) Can't interchange noncharacter code point U+FFFF in position 0 [...]
E) utf8 "\xFFFF" does not map to Unicode [...]
"\x{D800}"
U) Can't represent surrogate code point U+D800 at position 0 in UTF-8 encoding form [...]
E) "\x{d800}" does not map to utf8 [...]
"\x{110000}"
U) Can't represent super code point \x{110000} at position 0 in UTF-8 encoding form [...]
E) "\x{110000}" does not map to utf8 [...]
"\x{FFFF}"
U) Can't interchange noncharacter code point U+FFFF at position 0 [...]
E) "\x{ffff}" does not map to utf8 [...]
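The first comparison can be reproduced with a short sketch. It assumes the warning is delivered through Perl's ordinary warn mechanism, so that a local $SIG{__WARN__} handler sees it; the handler and the @warnings array are illustrative and not part of the module.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Collect warnings instead of printing them to STDERR (assumption: the
# module's utf8 warnings reach $SIG{__WARN__} like any other warning).
my @warnings;
my $string = do {
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    decode_utf8("\xF1\x80\x80");    # truncated four-octet sequence
};

# The whole ill-formed subsequence is one maximal subpart, so the
# documented result is a single U+FFFD rather than three.
printf "decoded to %d character(s)\n", length $string;
print  "replacement is U+FFFD\n" if $string eq "\x{FFFD}";
print  "warning: $_" for @warnings;
```

Capturing the warning this way keeps the replacement behaviour and the diagnostic separate, which is convenient when logging malformed input.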
Unicode::UTF8 preserves taintedness, Encode does not.
$tainted_string = decode_utf8($tainted_octets); $tainted_octets = encode_utf8($tainted_string);
Unicode::UTF8 is roughly 500% to 2000% faster than Encode. The benchmark script is available at https://github.com/chansen/p5-unicode-utf8/blob/master/benchmarks/bench.pl
Please report any bugs or feature requests by email to bug-unicode-utf8 at rt.cpan.org, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.
This is open source software. The code repository is available for public review and contribution under the terms of the license.
http://github.com/chansen/p5-unicode-utf8
git clone http://github.com/chansen/p5-unicode-utf8
Christian Hansen chansen@cpan.org
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install Unicode::UTF8, copy and paste the appropriate command into your terminal.
cpanm
cpanm Unicode::UTF8
CPAN shell
perl -MCPAN -e shell
install Unicode::UTF8
For more information on module installation, please visit the detailed CPAN module installation guide.