Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form.
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
$string = decode_utf8($octets);
$octets = encode_utf8($string);
This module provides functions to encode and decode UTF-8 encoding form as specified by Unicode and ISO/IEC 10646:2011.
Here is a summary of features for comparison with Encode:
simple API which integrates with warnings category utf8
recognizes all noncharacters regardless of perl version
implements Unicode's recommended practice for using U+FFFD
helpful diagnostics in warnings messages
detects and reports inconsistency in perl's internal encoding (UTF-X)
preserves taintedness of decoded $octets or encoded $string
better performance, roughly 600% to 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%; see the benchmarks directory in the git repository)
decode_utf8 returns a decoded representation of $octets in UTF-8 encoding as a character string.
Issues a warning using warnings category utf8 if $octets contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
$fallback is an optional CODE reference that provides a custom error-handling mechanism. The default mechanism replaces any ill-formed UTF-8 sequence, or any encoded code point which can't be interchanged, with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($octets, $is_usv);
$fallback is invoked with two arguments, $octets and $is_usv. $octets contains a sequence of one or more octets: either the maximal subpart of the ill-formed subsequence, or an encoded code point which can't be interchanged. $is_usv is a boolean indicating whether or not $octets represents an encoded Unicode scalar value. $fallback must return a character string consisting of zero or more characters.
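As a minimal sketch of the decoding behaviour described above, the following compares the default replacement with a custom $fallback. The Latin-1 style mapping and the variable names are illustrative assumptions, not something the module prescribes; the utf8 warnings category is disabled locally so the example stays quiet.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# "caf\xE9" is Latin-1 text, so the trailing \xE9 is an ill-formed UTF-8 sequence.
my $octets = "caf\xE9";

my ($default, $latin1);
{
    # Silence the utf8 warnings category for this demonstration only.
    no warnings 'utf8';

    # Default handling: the ill-formed octet is replaced with U+FFFD.
    $default = decode_utf8($octets);

    # Custom handling (illustrative): map each offending octet to the
    # code point with the same numeric value, as Latin-1 would.
    $latin1 = decode_utf8($octets, sub {
        my ($bad_octets, $is_usv) = @_;
        return join '', map { chr } unpack 'C*', $bad_octets;
    });
}

# %vX prints the code points in hexadecimal, joined by dots.
printf "default:  %vX\n", $default;
printf "fallback: %vX\n", $latin1;
```

Because $fallback may return zero characters, returning an empty string is one way to drop ill-formed input silently instead of substituting for it.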
encode_utf8 returns an encoded representation of $string in UTF-8 encoding as an octet string.
Issues a warning using warnings category utf8 if $string contains code points which can't be interchanged or represented in UTF-8 encoding form.
$fallback is an optional CODE reference that provides a custom error-handling mechanism. The default mechanism replaces any code point which can't be interchanged or represented in UTF-8 encoding form with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($codepoint, $is_usv);
$fallback is invoked with two arguments, $codepoint and $is_usv. $codepoint is an unsigned integer containing the code point which can't be interchanged or represented in UTF-8 encoding form. $is_usv is a boolean indicating whether or not $codepoint is a Unicode scalar value. $fallback must return a character string consisting of zero or more characters.
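The encoding side can be sketched the same way. This example, offered under the assumption that a lone surrogate is a convenient code point which can't be interchanged, encodes a string once with the default handling and once with a $fallback that substitutes a visible escape. The `<U+XXXX>` escape format is purely illustrative.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

my ($default, $escaped);
{
    # Silence the utf8 warnings category for this demonstration only.
    no warnings 'utf8';

    # A surrogate code point (U+D800) cannot be interchanged.
    my $string = "a" . chr(0xD800) . "b";

    # Default handling: the surrogate is replaced with U+FFFD (EF BF BD).
    $default = encode_utf8($string);

    # Custom handling (illustrative): substitute a readable escape.
    $escaped = encode_utf8($string, sub {
        my ($codepoint, $is_usv) = @_;
        return sprintf '<U+%04X>', $codepoint;
    });
}

# Show the resulting octets as hexadecimal, and the escaped form as text.
print "default: ", unpack('H*', $default), "\n";
print "escaped: $escaped\n";
```

The string returned by $fallback is a character string, so it is itself encoded before it is placed in the result.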
Nothing is exported by default. All functions can be exported using the :all tag or individually.
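A short sketch of importing with the :all tag, followed by a round trip through both functions; the sample text is an arbitrary assumption chosen to include two-octet and three-octet characters.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[:all];   # or name functions individually: qw[decode_utf8 encode_utf8]

# 13 characters: ten ASCII, two Latin-1 range letters, one U+263A.
my $string = "sm\x{F6}rg\x{E5}sbord \x{263A}";

my $octets    = encode_utf8($string);   # character string -> octet string
my $roundtrip = decode_utf8($octets);   # octet string -> character string

printf "%d characters, %d octets\n", length($string), length($octets);
print "round trip ok\n" if $roundtrip eq $string;
```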
(F) Wide character in octets.
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
(W utf8) Noncharacters are permanently reserved for internal use and should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n ranges from 0 to 10 in hexadecimal, that is 0x0 to 0x10) and the values U+FDD0..U+FDEF.
(W utf8) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
(W utf8) Code points greater than U+10FFFF belong to Perl's extended codespace and lie outside the Unicode codespace.
(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
Please report any bugs or feature requests by email to bug-unicode-utf8 at rt.cpan.org, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.
This is open source software. The code repository is available for public review and contribution under the terms of the license.
http://github.com/chansen/p5-unicode-utf8
git clone http://github.com/chansen/p5-unicode-utf8
Christian Hansen chansen@cpan.org
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.