Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form
use Unicode::UTF8 qw[decode_utf8 encode_utf8];

use warnings FATAL => 'utf8'; # Promote decoding/encoding warnings into exceptions

$string = decode_utf8($octets);
$octets = encode_utf8($string);
This module provides functions to encode and decode UTF-8 encoding form as specified by Unicode and ISO/IEC 10646:2011.
Returns a decoded representation of $octets in UTF-8 encoding as a character string.
$string = decode_utf8($octets);
$string = decode_utf8($octets, $fallback);
Issues a warning using warnings category utf8 if $octets contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
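Because the warning belongs to the standard utf8 category, it can be promoted to an exception in a lexical scope and caught with eval. The following is a minimal sketch under that assumption; the helper name strict_decode is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Hypothetical helper: returns ($string, undef) on success
# or (undef, $error) when the input is rejected.
sub strict_decode {
    my ($octets) = @_;
    # Lexically scoped: utf8 warnings become fatal inside this sub only.
    use warnings FATAL => 'utf8';
    my $string = eval { decode_utf8($octets) };
    return defined $string ? ($string, undef) : (undef, $@);
}

my ($ok,  $err1) = strict_decode("caf\xC3\xA9"); # well-formed UTF-8
my ($bad, $err2) = strict_decode("caf\xC3");     # lead octet without continuation
```

Here $ok holds the decoded character string, while $bad is undefined and $err2 carries the diagnostic text.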
$fallback is an optional CODE reference that customizes error handling. The default error-handling mechanism is to replace any ill-formed UTF-8 sequences or encoded code points which can't be interchanged with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($octets, $is_usv, $position);
$fallback is invoked with three arguments: $octets, $is_usv and $position. $octets is a sequence of one or more octets containing the maximal subpart of the ill-formed subsequence or encoded code point which can't be interchanged. $is_usv is a boolean indicating whether or not $octets represents an encoded Unicode scalar value. $position is an unsigned integer containing the zero-based octet position at which the error occurred within the octets provided to decode_utf8(). $fallback must return a character string consisting of zero or more Unicode scalar values. Unicode scalar values consist of code points in the range U+0000..U+D7FF and U+E000..U+10FFFF.
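As an illustration of the calling convention described above, the sketch below substitutes a visible hexadecimal placeholder for each offending octet sequence and records where it occurred. The @errors log and the placeholder format are assumptions made for this example only.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

my @errors; # hypothetical log of [position, hex octets]

my $fallback = sub {
    my ($octets, $is_usv, $position) = @_;
    my $hex = uc unpack 'H*', $octets;  # e.g. "\xFF" becomes "FF"
    push @errors, [$position, $hex];
    return "<$hex>";                    # ASCII only, so all scalar values
};

# 0xFF can never appear in well-formed UTF-8.
my $string = decode_utf8("a\xFFb", $fallback);
```

With this fallback the decoded string keeps the surrounding text and shows the rejected octet as `<FF>`.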
Returns an encoded representation of $string in UTF-8 encoding as an octet string.
$octets = encode_utf8($string);
$octets = encode_utf8($string, $fallback);
Issues a warning using warnings category utf8 if $string contains code points which can't be interchanged or represented in UTF-8 encoding form.
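A short sketch of the default behavior when no fallback is supplied: a surrogate code point cannot be represented in UTF-8, so it is expected to come out as the three-octet encoding of U+FFFD. The local $SIG{__WARN__} handler is only there to collect any warnings instead of printing them.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

my @warnings; # collected warning messages, if any

my $octets = do {
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    # chr(0xD800) is a surrogate code point, not a Unicode scalar value.
    encode_utf8("a" . chr(0xD800) . "b");
};

# U+FFFD encodes as the octets EF BF BD.
my $hex = uc unpack 'H*', $octets;
```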
$fallback is an optional CODE reference that customizes error handling. The default error-handling mechanism is to replace any code points which can't be interchanged or represented in UTF-8 encoding form with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($codepoint, $is_usv, $position);
$fallback is invoked with three arguments: $codepoint, $is_usv and $position. $codepoint is an unsigned integer containing the code point which can't be interchanged or represented in UTF-8 encoding form. $is_usv is a boolean indicating whether or not $codepoint is a Unicode scalar value. $position is an unsigned integer containing the zero-based character position at which the error occurred within the string provided to encode_utf8(). $fallback must return a character string consisting of zero or more Unicode scalar values. Unicode scalar values consist of code points in the range U+0000..U+D7FF and U+E000..U+10FFFF.
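To illustrate the encoding fallback, the sketch below writes an unrepresentable code point as a Perl-style escape in plain ASCII rather than letting it become U+FFFD. The escape format is an arbitrary choice for this example.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

my $fallback = sub {
    my ($codepoint, $is_usv, $position) = @_;
    # Return ASCII text such as \x{110000}; every character is a scalar value.
    return sprintf '\\x{%X}', $codepoint;
};

# 0x110000 lies beyond U+10FFFF, in Perl's extended codespace.
my $octets = encode_utf8("a" . chr(0x110000) . "b", $fallback);
```

The resulting octet string contains the literal characters `\x{110000}` between "a" and "b".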
None by default. All functions can be exported using the :all tag or individually.
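For example, importing both functions through the :all tag and round-tripping a string might look like the following sketch (the sample text is arbitrary):

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[:all];

my $string = "sm\x{F6}rg\x{E5}sbord \x{263A}"; # character string
my $octets = encode_utf8($string);             # octet string
my $round  = decode_utf8($octets);             # back to characters
```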
(F) Wide character in octets.
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
(W utf8, nonchar) Noncharacters are permanently reserved for internal use and should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hexadecimal, that is 16 decimal) and the values U+FDD0..U+FDEF.
(W utf8, surrogate) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
(W utf8, non_unicode) Code points greater than U+10FFFF, which exist only in Perl's extended codespace.
(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
Please note that the utf8 warning sub-categories nonchar, surrogate and non_unicode are only available on Perl 5.14 or greater. See perllexwarn for available categories and hierarchies.
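The code point ranges named in the diagnostics above can be summarized in a small pure-Perl classifier. This is only a sketch of the ranges as stated here, not the module's implementation; the function name classify_code_point is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: maps a code point to the warning sub-category
# it would fall under, or 'scalar' for an interchangeable scalar value.
sub classify_code_point {
    my ($cp) = @_;
    return 'non_unicode' if $cp > 0x10FFFF;
    return 'surrogate'   if $cp >= 0xD800 && $cp <= 0xDFFF;
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    return 'nonchar'     if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                         || ($cp & 0xFFFE) == 0xFFFE;
    return 'scalar';
}

my @samples = map { classify_code_point($_) }
    0x41, 0xD800, 0xFFFE, 0x10FFFF, 0x110000;
```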
Here is a summary of features for comparison with Encode's UTF-8 implementation:
Simple API which makes use of Perl's standard warning categories.
Recognizes all noncharacters regardless of perl version
Implements Unicode's recommended practice for using U+FFFD.
Good diagnostics in warning/exception messages
Detects and reports inconsistency in Perl's internal representation of wide characters (UTF-X)
Preserves taintedness of decoded $octets or encoded $string
Better performance ~ 600% - 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%, see benchmarks directory in git repository)
It's the author's belief that this UTF-8 implementation is conformant with the Unicode Standard Version 6.0. Any deviations from the Unicode Standard are to be considered bugs.
Please report any bugs by email to bug-unicode-utf8 at rt.cpan.org, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.
This is open source software. The code repository is available for public review and contribution under the terms of the license.
http://github.com/chansen/p5-unicode-utf8
git clone http://github.com/chansen/p5-unicode-utf8
Christian Hansen chansen@cpan.org
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.