NAME

Unicode::Transform - conversion among Unicode Transformation Formats (UTFs)

SYNOPSIS

    use Unicode::Transform;

    $unicode_string = utf16be_to_unicode($utf16be_string);
    $utf16le_string = unicode_to_utf16le($unicode_string);
    $utf8_string    = utf32be_to_utf8   ($utf32be_string);

DESCRIPTION

This module provides functions that convert a string from one Unicode Transformation Format (UTF) to another.

Conversion Between UTFs

(Exporting: use Unicode::Transform qw(:conv);)

Function names

A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME, in that order.

SRC_UTF_NAME (the UTF that the source string is in) and DST_UTF_NAME (the UTF that the return value is in) must each be one of the following names, which are the UTF names lowercased with hyphens removed:

    unicode    (for Perl's internal strings; see perlunicode)
    utf16le    (for UTF-16LE)
    utf16be    (for UTF-16BE)
    utf32le    (for UTF-32LE)
    utf32be    (for UTF-32BE)
    utf8       (for UTF-8)
    utf8mod    (for UTF-8-Mod)
    utfcp1047  (for CP-1047-oriented UTF-EBCDIC)

In all, 64 (i.e. 8 times 8) functions are available, for example utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
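
The naming rule above can be sketched in plain Perl. This snippet only builds the 64 names from the list of UTF names and does not load the module; the variable names @utf_names and @function_names are illustrative and not part of Unicode::Transform.

```perl
use strict;
use warnings;

# The eight UTF names listed above.
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Compose SRC_UTF_NAME . '_to_' . DST_UTF_NAME for every pair,
# including pairs where source and destination are the same.
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

print scalar(@function_names), " names, e.g. $function_names[0]\n";
```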

Parameters

If the first parameter is a reference, it is taken as CALLBACK, which is used to cope with illegal code points and partial octets. A reference is never allowed as STRING.

STRING is the source string. If CALLBACK is given, STRING is the second parameter; otherwise it is the first. Currently, only code references are allowed as CALLBACK.

If CALLBACK is omitted, illegal code points and partial octets are deleted.

Illegal code points comprise surrogate code points [0xD800..0xDFFF] and out-of-range code points [0x110000 and greater].

Partial octets are octets that do not represent any code point. They include a leading octet in UTF-8 that lacks its following octets, like "\xC2", and the last octet of a UTF-16BE or UTF-16LE string that has an odd number of octets.

If CALLBACK is specified, the code reference is called whenever an illegal code point or a partial octet appears. The first parameter passed to CALLBACK is the unsigned integer value of the code point or octet; if the value is less than 256, it is a partial octet.

The return value of CALLBACK is inserted in place of the illegal code point or partial octet.

(You can call die or croak in CALLBACK if you want to trap an ill-formed source.)
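
The three ways of handling an ill-formed source described above might look like the following. This is a minimal sketch based only on the behaviour documented here, not verified output; the replacement text formats and the variable names are arbitrary choices made for illustration.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv);

# "\xC2" is a UTF-8 leading octet; "Z" is not a valid following octet,
# so "\xC2" is a partial octet in this source string.
my $source = "A\xC2Z";

# No CALLBACK: per the description above, the partial octet is deleted.
my $deleted = utf8_to_utf8($source);

# CALLBACK as the first parameter: its return value is inserted instead.
# A value below 256 means a partial octet rather than a code point.
my $marked = utf8_to_utf8(
    sub {
        my ($value) = @_;
        return $value < 256 ? sprintf('[octet %02X]', $value)
                            : sprintf('[U+%04X]', $value);
    },
    $source,
);

# CALLBACK that dies, to reject an ill-formed source outright.
my $accepted = eval {
    utf8_to_utf8(sub { die sprintf("ill-formed input: 0x%X\n", $_[0]) }, $source);
    1;
};
print "rejected: $@" unless $accepted;
```

The callback here returns plain ASCII and the destination is UTF-8, which sidesteps the question of how a return value should be encoded for other destination formats.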

Conversion from Code Point to String

(Exporting: use Unicode::Transform qw(:chr);)

Returns the character represented by CODEPOINT as a string in the corresponding Unicode transformation format. CODEPOINT can be in the range 0..0x7FFF_FFFF. A string is returned even if CODEPOINT is a surrogate code point [0xD800..0xDFFF].

chr_utf16le() and chr_utf16be() return undef when CODEPOINT is out of range [i.e. 0x110000 and greater].

chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
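
As an illustration of the chr_* functions, the sketch below encodes one code point in three formats and dumps the resulting octets in hexadecimal. The choice of U+3042 and the hex-dump formatting are assumptions made for the example.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:chr);

# U+3042 (HIRAGANA LETTER A) encoded in several formats.
my $cp = 0x3042;
for my $pair (
    [ 'UTF-8'    => chr_utf8($cp)    ],
    [ 'UTF-16BE' => chr_utf16be($cp) ],
    [ 'UTF-32LE' => chr_utf32le($cp) ],
) {
    my ($label, $octets) = @$pair;
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
    printf "%-8s %s\n", $label, $hex;
}

# UTF-16 cannot represent 0x110000 and greater, so undef is documented.
my $too_big = chr_utf16be(0x110000);
print "chr_utf16be(0x110000) is ", (defined($too_big) ? "defined" : "undef"), "\n";
```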

Numeric Value of the First Character

(Exporting: use Unicode::Transform qw(:ord);)

Returns the unsigned integer value of the first character of STRING. Returns undef if STRING is empty or begins with a partial octet.

STRING may begin with a surrogate code point [0xD800..0xDFFF] or an out-of-range code point [0x110000 and greater].

ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
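
A short sketch of the ord_* functions, assuming the argument is an octet string in the named format as the description states. The sample strings are chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:ord);

# The octets E3 81 82 are the UTF-8 form of U+3042; only the first
# character of the string is examined.
my $first = ord_utf8("\xE3\x81\x82" . "xyz");
printf "U+%04X\n", $first if defined $first;

# An empty string, or one beginning with a partial octet, is documented
# to give undef, so test with defined() before using the value.
for my $string ("", "\xC2") {
    my $value = ord_utf8($string);
    print defined($value) ? "defined\n" : "undef\n";
}
```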

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

  http://homepage1.nifty.com/nomenclator/perl/

  Copyright(C) 2002-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perlunicode
UTF-EBCDIC (and UTF-8-Mod)

http://www.unicode.org/reports/tr16