NAME

Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form

SYNOPSIS

use Unicode::UTF8 qw[decode_utf8 encode_utf8 read_utf8];

use warnings FATAL => 'utf8'; # fatalize encoding glitches
$string = decode_utf8($octets);
$octets = encode_utf8($string);

$count = read_utf8($fh, $buf, $length);

DESCRIPTION

This module provides functions to encode and decode UTF-8 encoding form as specified by Unicode and ISO/IEC 10646:2011.

FUNCTIONS

decode_utf8

$string = decode_utf8($octets);
$string = decode_utf8($octets, $fallback);

Returns a decoded representation of $octets in UTF-8 encoding as a character string.

$fallback is an optional CODE reference that customizes error handling. By default, any ill-formed UTF-8 sequence is replaced with REPLACEMENT CHARACTER (U+FFFD).

$string = $fallback->($octets, $is_usv, $position);

$fallback is invoked with three arguments: $octets, $is_usv and $position. $octets is a sequence of one or more octets containing either the maximal subpart of the ill-formed subsequence or an encoded code point which can't be interchanged. $is_usv is a boolean indicating whether or not $octets represents an encoded Unicode scalar value. $position is an unsigned integer containing the zero-based octet position at which the error occurred within the octets provided to decode_utf8(). $fallback must return a character string consisting of zero or more Unicode scalar values. Unicode scalar values consist of code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF.
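As an illustration of the fallback contract described above, here is a minimal sketch. The escaping policy and the names $escape_octets and @positions are hypothetical choices for this example, not part of the module; only decode_utf8 and its three fallback arguments come from the documentation.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Latin-1 "cafe!" with an e-acute is not well-formed UTF-8: 0xE9 starts a
# three-octet sequence but is followed by "!" instead of a continuation octet.
my $octets = "caf\xE9!";

my @positions;    # octet positions reported to the fallback

# Hypothetical policy: render each offending octet as a visible \xHH escape.
# The return value is plain ASCII, so it consists only of Unicode scalar values.
my $escape_octets = sub {
    my ($bad, $is_usv, $position) = @_;
    push @positions, $position;
    return join '', map { sprintf('\\x%02X', ord($_)) } split //, $bad;
};

my $string = decode_utf8($octets, $escape_octets);
print "$string\n";
```

Returning an empty string from the fallback would drop the offending octets instead, and calling die inside it is one way to turn a decoding error into an exception.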

encode_utf8

$octets = encode_utf8($string);
$octets = encode_utf8($string, $fallback);

Returns an encoded representation of $string in UTF-8 encoding as an octet string.

$fallback is an optional CODE reference that customizes error handling. By default, any code point which can't be interchanged or represented in UTF-8 encoding form is replaced with REPLACEMENT CHARACTER (U+FFFD).

$string = $fallback->($codepoint, $is_usv, $position);

$fallback is invoked with three arguments: $codepoint, $is_usv and $position. $codepoint is an unsigned integer containing the code point which can't be interchanged or represented in UTF-8 encoding form. $is_usv is a boolean indicating whether or not $codepoint is a Unicode scalar value. $position is an unsigned integer containing the zero-based character position at which the error occurred within the string provided to encode_utf8(). $fallback must return a character string consisting of zero or more Unicode scalar values. Unicode scalar values consist of code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF.
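The encoding fallback can be sketched the same way. The substitution format and the names $describe and @seen below are assumptions made for illustration; the input uses a lone surrogate because the DIAGNOSTICS section lists surrogates as code points that can't be represented.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

# A string containing a lone surrogate (U+D800) between two ASCII letters.
my $string = 'a' . chr(0xD800) . 'b';

my @seen;    # [code point, character position] pairs reported to the fallback

# Hypothetical policy: substitute a readable ASCII tag for the code point.
my $describe = sub {
    my ($codepoint, $is_usv, $position) = @_;
    push @seen, [$codepoint, $position];
    return sprintf('[U+%04X]', $codepoint);
};

my $octets = encode_utf8($string, $describe);
print "$octets\n";
```

Here $position counts characters, not octets, which is the opposite of the decode_utf8 fallback.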

read_utf8

$count = read_utf8($fh, $buf, $length);
$count = read_utf8($fh, $buf, $length, $offset);

Reads up to $length UTF-8 encoded characters (code points) from the file handle $fh, decoding and validating them in place, and stores the result in $buf. Returns the number of characters actually read, 0 at end of file, or undef on a read error (with $! set).

Because read_utf8 reads and validates the octets directly, there is no need to apply a PerlIO encoding layer (such as :encoding(UTF-8) or :utf8) to $fh. The handle should be a plain byte handle; the bytes are validated and decoded by read_utf8 itself.

If $offset is specified, the read data is written into $buf starting at that character offset, preserving the existing content before it. A negative $offset counts back from the end of $buf. If the offset is past the end of the string, $buf is zero-filled up to the offset first.

Ill-formed and truncated input is not fatal: each maximal ill-formed subpart is replaced with the Unicode replacement character U+FFFD and a warning is emitted in the utf8 warnings category. The returned count includes the substituted code points.
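A minimal sketch of a chunked read loop follows, assuming the return values and $offset behaviour described above. The temporary file, the chunk size of 4 and the variable names are hypothetical; the handle is opened with :raw so that no decoding layer is applied.

```perl
use strict;
use warnings;
use File::Temp qw[tempfile];
use Unicode::UTF8 qw[read_utf8];

# Create a small sample file containing UTF-8 octets (two 2-octet characters).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "na\xC3\xAFve caf\xC3\xA9\n";
close $out or die "close: $!";

# Plain byte handle: read_utf8 does the validation and decoding itself.
open my $fh, '<:raw', $path or die "open $path: $!";

my $text = '';
while (1) {
    # Read up to 4 characters and append them at the current end of $text.
    my $count = read_utf8($fh, $text, 4, length $text);
    die "read_utf8 failed: $!" unless defined $count;    # undef: read error
    last if $count == 0;                                  # 0: end of file
}
close $fh;

printf "read %d characters\n", length $text;
```

Checking for undef before testing for 0 keeps a read error from being mistaken for end of file.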

Tied file handles are not supported.

Since version 0.71.

slurp_utf8

$string = slurp_utf8($filename);

Reads the entire file named $filename and returns its contents decoded from UTF-8 as a character string. The file is read using unbuffered (:unix) IO, and the octets are validated and decoded directly.

There is no upper bound on the amount read; the whole file is loaded into memory. When handling untrusted or potentially large files, check the size first (for example with -s $filename) before calling slurp_utf8.
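The size check suggested above could be wrapped as follows. This is a sketch only: slurp_utf8_limited and its $max_bytes parameter are hypothetical names, not part of the module, and the sample file exists purely for demonstration.

```perl
use strict;
use warnings;
use File::Temp qw[tempfile];
use Unicode::UTF8 qw[slurp_utf8];

# Hypothetical helper: refuse files larger than $max_bytes before slurping.
# The file could still grow between the check and the read, so this is a
# guard against accidents rather than a hard guarantee.
sub slurp_utf8_limited {
    my ($filename, $max_bytes) = @_;
    my $size = -s $filename;
    die "Can't stat $filename: $!\n" unless defined $size;
    die "$filename is $size bytes, limit is $max_bytes\n" if $size > $max_bytes;
    return slurp_utf8($filename);
}

# Sample file containing UTF-8 octets.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "sm\xC3\xB6rg\xC3\xA5sbord\n";
close $out or die "close: $!";

my $contents = slurp_utf8_limited($path, 1024);
printf "slurped %d characters\n", length $contents;
```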

Ill-formed and truncated input is not fatal: each maximal ill-formed subpart is replaced with the Unicode replacement character U+FFFD and a warning is emitted in the utf8 warnings category.

Throws an exception if the file cannot be opened or read.

Since version 0.73.

valid_utf8

$boolean = valid_utf8($octets);

Returns a boolean indicating whether or not the given $octets consist of well-formed UTF-8 sequences.
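A short sketch of how valid_utf8 might be used to screen input before decoding. The sample sequences are standard cases from the definition of well-formed UTF-8; the %samples hash and its labels are just for illustration.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[valid_utf8];

my %samples = (
    'euro sign'         => "\xE2\x82\xAC",    # U+20AC, well-formed
    'overlong slash'    => "\xC0\xAF",        # non-shortest form, ill-formed
    'encoded surrogate' => "\xED\xA0\x80",    # U+D800, ill-formed
    'truncated'         => "\xE2\x82",        # missing final octet, ill-formed
);

for my $label (sort keys %samples) {
    printf "%-18s %s\n", $label,
        valid_utf8($samples{$label}) ? 'well-formed' : 'ill-formed';
}
```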

Since version 0.60.

EXPORTS

None by default. All functions can be exported using the :all tag or individually.

DIAGNOSTICS

Can't decode a wide character string: (F) Wide character in octets.
Can't validate a wide character string: (F) Wide character in octets.
Can't decode ill-formed UTF-8 octet sequence <%s> in position %u: (W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
Can't represent surrogate code point U+%X in position %u: (W utf8, surrogate) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
Can't represent super code point \x{%X} in position %u: (W utf8, non_unicode) Encountered a code point greater than U+10FFFF, which lies in Perl's extended codespace.
Can't decode ill-formed UTF-X octet sequence <%s> in position %u: (F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.

The subcategories surrogate and non_unicode are only available on Perl 5.14 or later. See perllexwarn for available categories and hierarchies.
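Because the (W utf8) diagnostics are ordinary warnings, they can be promoted to exceptions in a lexical scope, as the SYNOPSIS hints. The following is a minimal sketch under that assumption; the input octets are made up for the example.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# 0xFF can never appear in well-formed UTF-8.
my $octets = "abc\xFFdef";

my $string = eval {
    # Inside this block, warnings in the utf8 category become fatal.
    use warnings FATAL => 'utf8';
    decode_utf8($octets);
};
my $error = $@;

if (defined $string) {
    print "decoded: $string\n";
}
else {
    print "rejected: $error";
}
```

Outside such a scope the same call would only warn and substitute U+FFFD, which is the default behaviour described under decode_utf8.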

COMPARISON

Here is a summary of features for comparison with Encode's UTF-8 implementation:

Simple API which makes use of Perl's standard warning categories.
Implements Unicode's recommended practice for using U+FFFD.
Better diagnostics in warning messages.
Detects and reports inconsistency in Perl's internal representation of wide characters (UTF-X).
Preserves taintedness of decoded $octets or encoded $string.
Better performance, roughly 600% to 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%; see the benchmarks directory in the git repository).

CONFORMANCE

It's the author's belief that this UTF-8 implementation is conformant with the Unicode Standard Version 6.0. Any deviation from the Unicode Standard is to be considered a bug.

SUPPORT

BUGS

Please report any bugs through the web interface at https://github.com/chansen/p5-unicode-utf8/issues. You will be automatically notified of any progress on the request by the system.

SOURCE CODE

This is open source software. The code repository is available for public review and contribution under the terms of the license.

http://github.com/chansen/p5-unicode-utf8

git clone http://github.com/chansen/p5-unicode-utf8

AUTHOR

Christian Hansen chansen@cpan.org

COPYRIGHT

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

