Unicode::UTF8 - Decoding and encoding of UTF-8 encoding form
    use Unicode::UTF8 qw[decode_utf8 encode_utf8];

    $string = decode_utf8($octets);
    $octets = encode_utf8($string);
This module provides functions to encode and decode the UTF-8 encoding form as specified by Unicode and ISO/IEC 10646:2011.
Returns a decoded representation of $octets in UTF-8 encoding as a character string.
Issues a warning using warnings category utf8 if $octets contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
$fallback is an optional CODE reference that customizes error handling. By default, any ill-formed UTF-8 sequence or encoded code point which can't be interchanged is replaced with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($octets, $is_usv);
$fallback is invoked with two arguments, $octets and $is_usv. $octets is a sequence of one or more octets containing the maximal subpart of the ill-formed subsequence or the encoded code point which can't be interchanged. $is_usv is a boolean indicating whether or not $octets represents an encoded Unicode scalar value. $fallback must return a character string consisting of zero or more Unicode characters.
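As an illustration of the decoding callback described above, here is a minimal sketch that replaces each offending sequence with a bracketed hex dump instead of U+FFFD. The tag format and the variable names are hypothetical, chosen only for this example. The scoped `no warnings 'utf8'` is a precaution, on the assumption (not stated above) that a warning may still be issued when a fallback is supplied.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Hypothetical fallback: show the rejected octets as hex rather than U+FFFD.
my $fallback = sub {
    my ($octets, $is_usv) = @_;
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
    # $is_usv is true when the octets encode a Unicode scalar value that
    # merely can't be interchanged, false when they are ill-formed.
    return $is_usv ? "[noninterchangeable: $hex]" : "[ill-formed: $hex]";
};

# \xE5 is a lead octet of a three-octet sequence that is never completed.
my $octets = "foo\xE5bar";

my $string = do {
    no warnings 'utf8';    # precaution only, see the note above
    decode_utf8($octets, $fallback);
};

print "$string\n";
```

Returning an empty string from the callback would drop the offending octets altogether, since the callback may return zero or more characters.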
Returns an encoded representation of $string in UTF-8 encoding as an octet string.
Issues a warning using warnings category utf8 if $string contains code points which can't be interchanged or represented in UTF-8 encoding form.
$fallback is an optional CODE reference that customizes error handling. By default, any code point which can't be interchanged or represented in UTF-8 encoding form is replaced with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($codepoint, $is_usv);
$fallback is invoked with two arguments, $codepoint and $is_usv. $codepoint is an unsigned integer containing the code point which can't be interchanged or represented in UTF-8 encoding form. $is_usv is a boolean indicating whether or not $codepoint is a Unicode scalar value. $fallback must return a character string consisting of zero or more Unicode characters.
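The encoding callback can be sketched the same way. The example below is a minimal illustration, assuming a string that contains a lone surrogate code point; the `U+XXXX` replacement text is a hypothetical choice, not something the module prescribes. Warnings in the utf8 category are switched off for the whole example so that building the test string stays quiet on older perls.

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep the example quiet while handling a surrogate
use Unicode::UTF8 qw[encode_utf8];

# Hypothetical fallback: spell out the rejected code point in ASCII.
my $fallback = sub {
    my ($codepoint, $is_usv) = @_;
    return sprintf 'U+%04X', $codepoint;
};

# U+D800 is a surrogate code point, which is not a Unicode scalar value.
my $string = 'a' . chr(0xD800) . 'b';

my $octets = encode_utf8($string, $fallback);

print "$octets\n";
```

Because the callback returns a character string, its result is itself encoded before it is added to the output; an ASCII-only return value keeps that step trivial.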
None by default. All functions can be exported using the :all tag or individually.
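For example, a short round trip using the :all tag might look like the sketch below. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[:all];    # or individually: qw[decode_utf8 encode_utf8]

my $string    = "smiley: \x{263A}";     # character string, 9 characters
my $octets    = encode_utf8($string);   # octet string, U+263A becomes E2 98 BA
my $roundtrip = decode_utf8($octets);   # back to a character string

printf "characters: %d, octets: %d\n", length $string, length $octets;
print "round trip ok\n" if $roundtrip eq $string;
```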
(F) Wide character in octets.
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
(W utf8, nonchar) Noncharacters are permanently reserved for internal use and should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n ranges from 0 to 10 hexadecimal, i.e. 16 decimal) and the values U+FDD0..U+FDEF.
(W utf8, surrogate) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
(W utf8, non_unicode) Code points greater than U+10FFFF lie outside the Unicode codespace and exist only in Perl's extended codespace.
(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
Please note that the sub-categories nonchar, surrogate and non_unicode of the utf8 warnings category are only available on Perl 5.14 or greater. See perllexwarn for available categories and hierarchies.
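Since the diagnostics use Perl's standard warnings categories, they can be managed with the usual lexical warnings tools. The following is a minimal sketch, assuming the module honours the caller's lexical warnings scope as the category annotations above suggest. It collects the warnings for one call through a local `$SIG{__WARN__}` handler and silences the utf8 category for another.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# \xE5 is a lead octet of a three-octet sequence that is never completed.
my $octets = "foo\xE5bar";

# Collect warnings instead of letting them reach STDERR.
my @warnings;
my $string = do {
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    decode_utf8($octets);
};

# With the default error handling, the ill-formed octet is replaced by U+FFFD.
printf "replaced with U+FFFD: %s\n",
    $string eq "foo\x{FFFD}bar" ? 'yes' : 'no';
printf "warnings collected: %d\n", scalar @warnings;

# Lexically silence the whole utf8 category for a single call.
my $quiet = do {
    no warnings 'utf8';
    decode_utf8($octets);
};
```

On Perl 5.14 or greater the same pattern should apply to a single sub-category, for instance `no warnings 'surrogate'`; on older perls naming a sub-category is an error because the category does not exist.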
Here is a summary of features for comparison with Encode's UTF-8 implementation:
simple API which makes use of Perl's standard warning categories.
recognizes all noncharacters regardless of perl version
implements Unicode's recommended practice for using U+FFFD
good diagnostics in warnings messages
detects and reports inconsistency in perl's internal encoding (UTF-X)
preserves taintedness of decoded $octets or encoded $string
better performance, roughly 600% to 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%, see benchmarks directory in git repository)
It's the author's belief that this UTF-8 implementation is conformant with the Unicode Standard Version 6.0. Any deviations from the Unicode Standard are to be considered a bug.
Please report any bugs by email to bug-unicode-utf8 at rt.cpan.org, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.
This is open source software. The code repository is available for public review and contribution under the terms of the license.
http://github.com/chansen/p5-unicode-utf8
git clone http://github.com/chansen/p5-unicode-utf8
Christian Hansen chansen@cpan.org
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.