NAME

ShiftJIS::CP932::MapUTF - transcode between Microsoft CP932 and Unicode

SYNOPSIS

    use ShiftJIS::CP932::MapUTF qw(:all);

    $utf8_string  = cp932_to_utf8($cp932_string);
    $cp932_string = utf8_to_cp932($utf8_string);

DESCRIPTION

The table of Microsoft Windows CodePage 932 (CP-932) comprises 7915 characters:

    JIS X 0201 single-byte characters (191 characters),
    JIS X 0208 double-byte characters (6879 characters),
    NEC special characters (83 characters, row 13),
    NEC-selected IBM extended characters (374 characters, rows 89..92),
    and IBM extended characters (388 characters, rows 115..119).

This table includes duplicates that do not round-trip. They stem from the vendor-defined characters of NEC and IBM. For example, two characters are mapped to U+2252 in Unicode: 0x81e0 (a JIS X 0208 character) and 0x8790 (an NEC special character).

In effect, the 7915 characters in CP-932 map to only 7517 characters in Unicode, leaving 398 non-round-trip mappings.
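The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch that restates the counts quoted in this section; the hash keys and variable names are illustrative and are not part of the module.

```perl
use strict;
use warnings;

# Character counts for each part of CP-932, as listed in this section.
my %count = (
    'JIS X 0201 single-byte'    => 191,
    'JIS X 0208 double-byte'    => 6879,
    'NEC special (row 13)'      => 83,
    'NEC-selected IBM extended' => 374,
    'IBM extended'              => 388,
);

my $total = 0;
$total += $_ for values %count;

# 398 CP-932 characters share a Unicode code point with another character.
my $non_round_trip = 398;
my $unicode_total  = $total - $non_round_trip;

printf "CP-932: %d, Unicode: %d\n", $total, $unicode_total;
# prints "CP-932: 7915, Unicode: 7517"
```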

This module provides functions that convert correctly from CP-932 to Unicode and vice versa.

Transcoding from CP-932 to Unicode

If the first parameter is a reference, it is taken as SJIS_CALLBACK, which is used for coping with CP-932 characters unmapped to Unicode. (A reference is never allowed as STRING.)

If SJIS_CALLBACK is given, STRING is the second parameter; otherwise the first.

If SJIS_CALLBACK is not specified, CP-932 characters unmapped to Unicode are silently deleted and partial bytes are skipped one byte at a time (as if a coderef that always returns the null string, sub {''}, were passed as SJIS_CALLBACK).

Currently, only coderefs are allowed as SJIS_CALLBACK. A string returned from SJIS_CALLBACK is inserted in place of the unmapped character.

A coderef as SJIS_CALLBACK is called with one or more arguments. If the unmapped character is a partial double-byte character (i.e. a one-byte string consisting of only a leading byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If the unmapped character is not partial, the first argument is a defined string representing a character.

By default, a partial double-byte character may appear only at the end of STRING, never at the beginning or in the middle (see also 't' of SJIS_OPTION).

Example

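    my $sjis_callback = sub {
        my ($char, $byte) = @_;
        # $char is defined for a whole unmapped character;
        # function() stands for your own substitution routine.
        return function($char) if defined $char;
        # Otherwise $byte holds the partial leading byte.
        die sprintf "found partial byte 0x%02x", $byte;
    };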

In the example above, $char may be, for example, one of "\x80", "\x82\xf2", "\xfc\xfc", "\xff".

The return value of SJIS_CALLBACK must be legal in the target format. E.g. never pass to cp932_to_utf16be() a callback that returns UTF-8. That is, you should prepare a separate SJIS_CALLBACK for each UTF.

SJIS_OPTION may be specified after STRING. Options can be combined, like 'tg' and 'gst' (the order does not matter).
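As a minimal sketch of preparing one callback per target format, the following substitutes U+FFFD REPLACEMENT CHARACTER for an unmapped byte, pre-encoded once for UTF-8 and once for UTF-16BE. The sample string and the variable names are illustrative assumptions; only the two conversion functions come from this module.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 cp932_to_utf16be);

# U+FFFD REPLACEMENT CHARACTER, encoded for each target format.
my $subst_utf8    = "\xEF\xBF\xBD";
my $subst_utf16be = "\xFF\xFD";

# 0x80 is undefined in CP-932 unless the 's' option is given.
my $cp932 = "A\x80B";

# Each callback returns bytes that are legal in its own target format.
my $utf8    = cp932_to_utf8   (sub { $subst_utf8 },    $cp932);
my $utf16be = cp932_to_utf16be(sub { $subst_utf16be }, $cp932);

print unpack('H*', $utf8),    "\n";
print unpack('H*', $utf16be), "\n";
```

Swapping the two callbacks would splice UTF-8 bytes into a UTF-16BE string (or the reverse) and produce a malformed result, which is the situation the paragraph above warns against.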

    'g'    add mappings of Gaiji (user defined characters)
           [0xF040 to 0xF9FC (rows 95 to 114) in CP-932]
           to Unicode's PUA [0xE000 to 0xE757] (1880 characters).

    's'    add mappings of undefined Single-byte characters:
           0x80 => U+0080,  0xA0 => U+F8F0,
           0xFD => U+F8F1,  0xFE => U+F8F2,  0xFF => U+F8F3.

    't'    check the Trailing byte range [0x40..0x7E, 0x80..0xFC].
           E.g. "\x81\x39" is regarded as an undefined double-byte character
           by default; with 't', it is a partial character byte 0x81
           followed by a single-byte character "\x39".

cp932_to_utf8([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-8.
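The following sketch exercises the 'g', 's' and 't' options with cp932_to_utf8(). The input strings are illustrative; the expectations in the comments are read off the option table and have not been confirmed against every release of the module. The logging callback ($log, @calls) is a hypothetical helper used only to show which arguments the callback receives.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8);

# Without 'g', the Gaiji character 0xF040 is unmapped and silently deleted.
my $plain = cp932_to_utf8("\xF0\x40");

# With 'g', it is mapped into the PUA range starting at U+E000.
my $gaiji = cp932_to_utf8("\xF0\x40", 'g');

# With 's', the undefined single byte 0x80 is mapped to U+0080.
my $single = cp932_to_utf8("\x80", 's');

# Record the callback arguments for "\x81\x39" without and with 't'.
my @calls;
my $log = sub { push @calls, [@_]; return '' };
my $default_result  = cp932_to_utf8($log, "\x81\x39");
my $trailing_result = cp932_to_utf8($log, "\x81\x39", 't');

# By default the callback should see the two-byte string "\x81\x39";
# with 't' it should see (undef, 0x81), and "9" is converted normally.
print unpack('H*', $gaiji), "\n";
```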

cp932_to_unicode([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to Unicode (Perl's internal format, flagged with SVf_UTF8; see perlunicode).

This function is available only on Perl 5.6.1 or later, and only in the XS edition.

cp932_to_utf16le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-16LE.

cp932_to_utf16be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-16BE.

cp932_to_utf32le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-32LE.

cp932_to_utf32be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-32BE.

Transcoding from Unicode to CP-932

Any duplicates are converted according to Microsoft PRB Q170559. E.g. U+2252 is converted to "\x81\xE0", not to "\x87\x90".
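A short round trip makes the asymmetry visible. This sketch converts both CP-932 encodings of U+2252 to UTF-8 and back; given the rule above, both are expected to come back as 0x81E0. The loop and variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 utf8_to_cp932);

my @results;
for my $cp932 ("\x81\xE0", "\x87\x90") {
    my $utf8 = cp932_to_utf8($cp932);   # both map to U+2252
    my $back = utf8_to_cp932($utf8);    # duplicates collapse to one form
    push @results, $back;
    printf "%s -> %s -> %s\n", map { unpack 'H*', $_ } $cp932, $utf8, $back;
}
```

The NEC special character 0x8790 therefore does not survive a round trip through Unicode; it returns as the JIS X 0208 character 0x81E0.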

If the first parameter is a reference, it is taken as UNICODE_CALLBACK, which is used for coping with Unicode characters unmapped to CP-932. (A reference is never allowed as STRING.)

If UNICODE_CALLBACK is given, STRING is the second parameter; otherwise the first.

If UNICODE_CALLBACK is not specified, Unicode characters unmapped to CP-932 are silently deleted and partial bytes are skipped one byte at a time (as if a coderef that always returns the null string, sub {''}, were passed as UNICODE_CALLBACK).

Currently, only coderefs are allowed as UNICODE_CALLBACK. A string returned from the coderef is inserted in place of the unmapped character.

A coderef as UNICODE_CALLBACK is called with one or more arguments. If the unmapped character is a partial character (an illegal byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is an unsigned integer representing a Unicode code point.

For example, the following converts characters unmapped to CP-932 to numeric character references for HTML 4.01.

    sub toHexNCR {
        my ($char, $byte) = @_;
        return sprintf("&#x%x;", $char) if defined $char;
        die sprintf "illegal byte 0x%02x was found", $byte;
    }

    $cp932 = utf8_to_cp932   (\&toHexNCR, $utf8_string);
    $cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
    $cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);

The return value of UNICODE_CALLBACK must be legal in CP-932.

UNICODE_OPTION may be specified after STRING. Options can be combined, like 'fg' and 'gsf' (the order does not matter).

    'g'    add mappings of Gaiji (user defined characters)
           [0xF040 to 0xF9FC (rows 95 to 114) in CP-932]
           from Unicode's PUA [0xE000 to 0xE757] (1880 characters).

    's'    add mappings of undefined Single-byte characters:
           U+0080 => 0x80,  U+F8F0 => 0xA0,
           U+F8F1 => 0xFD,  U+F8F2 => 0xFE,  U+F8F3 => 0xFF.

    'f'    add some Fallback mappings from Unicode to CP-932.
           The characters additionally mapped are
           some characters in latin-1 region [U+00A0..U+00FF], and
           HIRAGANA LETTER VU [U+3094, to KATAKANA LETTER VU, 0x8394].

utf8_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-8 to CP-932.
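As a sketch of the 'f' option, the following converts HIRAGANA LETTER VU (U+3094) with and without the fallback mapping. The expectations in the comments follow the option table; the variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(utf8_to_cp932);

# U+3094 HIRAGANA LETTER VU as UTF-8 bytes.
my $vu = "\xE3\x82\x94";

# Without 'f' the character has no CP-932 mapping and is silently deleted.
my $strict = utf8_to_cp932($vu);

# With 'f' it falls back to KATAKANA LETTER VU, 0x8394.
my $fallback = utf8_to_cp932($vu, 'f');

printf "strict=[%s] fallback=[%s]\n",
    unpack('H*', $strict), unpack('H*', $fallback);
```

A callback and an option can be given together, e.g. utf8_to_cp932(\&toHexNCR, $vu, 'f'), in which case the callback is consulted only for characters that remain unmapped after the fallback mappings are applied.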

unicode_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts Unicode to CP-932.

The Unicode string is in Perl's internal format (see perlunicode). If it is not flagged with SVf_UTF8, it is upgraded as an ISO 8859-1 (latin1) string.

This function is available only on Perl 5.6.1 or later, and only in the XS edition.

utf16_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16 (with or without a BOM) to CP-932.
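A minimal sketch of BOM handling, assuming the BOM selects the byte order and is consumed rather than converted. The input is U+3042 HIRAGANA LETTER A in both byte orders; the behaviour for input without a BOM is deliberately not shown here. Variable names are illustrative.

```perl
use strict;
use warnings;
# utf16_to_cp932 is exported on request only.
use ShiftJIS::CP932::MapUTF qw(utf16_to_cp932);

# U+3042 in each byte order, preceded by the matching BOM.
my $with_be_bom = "\xFE\xFF" . "\x30\x42";
my $with_le_bom = "\xFF\xFE" . "\x42\x30";

my $from_be = utf16_to_cp932($with_be_bom);
my $from_le = utf16_to_cp932($with_le_bom);

# Both calls are expected to yield the same CP-932 bytes.
print unpack('H*', $from_be), "\n";
print unpack('H*', $from_le), "\n";
```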

utf16le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16LE to CP-932.

utf16be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16BE to CP-932.

utf32_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32 (with or without a BOM) to CP-932.

utf32le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32LE to CP-932.

utf32be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32BE to CP-932.

Export

By default:

    cp932_to_utf8     utf8_to_cp932
    cp932_to_utf16le  utf16le_to_cp932
    cp932_to_utf16be  utf16be_to_cp932

    cp932_to_unicode  unicode_to_cp932 (only for XS)

On request:

    cp932_to_utf32le  utf32le_to_cp932
    cp932_to_utf32be  utf32be_to_cp932
                      utf16_to_cp932 [*]
                      utf32_to_cp932 [*]

[*] Their counterparts cp932_to_utf16() and cp932_to_utf32() are not implemented yet. They need more investigation into the return values from SJIS_CALLBACK (concatenation requires recognizing and coping with a BOM).
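Functions listed under "On request" must be named in the import list (or pulled in with the :all tag shown in the SYNOPSIS). A small sketch with the UTF-32BE pair, using an illustrative sample character:

```perl
use strict;
use warnings;
# The UTF-32 functions are not exported by default, so name them explicitly.
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf32be utf32be_to_cp932);

my $cp932   = "\x82\xA0";                  # HIRAGANA LETTER A in CP-932
my $utf32be = cp932_to_utf32be($cp932);    # one code point, four bytes
my $back    = utf32be_to_cp932($utf32be);

print unpack('H*', $utf32be), "\n";        # prints "00003042"
```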

CAVEAT

The pure-Perl edition of this module does not understand logically wide characters (see perlunicode). Use utf8::decode/utf8::encode (see utf8) on Perl 5.7 or later if necessary.
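One way to follow this advice is to encode a character string to UTF-8 octets before calling utf8_to_cp932(), and to decode the octets returned by cp932_to_utf8() afterwards. The sketch below assumes Perl 5.8 or later for the built-in utf8:: functions; the sample text and variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(utf8_to_cp932 cp932_to_utf8);

# A character string holding one wide character, U+3042.
my $text = "\x{3042}";

# Encode to UTF-8 octets before handing the string to the module.
my $octets = $text;
utf8::encode($octets);                  # now "\xE3\x81\x82"
my $cp932 = utf8_to_cp932($octets);

# Going the other way, decode the returned UTF-8 octets.
my $back = cp932_to_utf8($cp932);
utf8::decode($back);                    # a character string again

print $back eq $text ? "round trip ok\n" : "round trip failed\n";
```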

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Microsoft PRB, Article ID: Q170559

Conversion Problem Between Shift-JIS and Unicode

cp932 to Unicode table

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

http://www.microsoft.com/globaldev/reference/dbcs/932.htm