ShiftJIS::CP932::MapUTF - conversion between Microsoft Windows CP-932 and Unicode
    use ShiftJIS::CP932::MapUTF qw(:all);

    $utf8_string  = cp932_to_utf8($cp932_string);
    $cp932_string = utf8_to_cp932($utf8_string);
The table of Microsoft Windows CodePage 932 (CP-932) comprises 7915 characters:
JIS X 0201 single-byte graphic characters (159 characters), JIS X 0211 single-byte control characters (32 characters), JIS X 0208 double-byte graphic characters (6879 characters), NEC special characters (83 characters, row 13), NEC-selected IBM extended characters (374 characters, rows 89..92), and IBM extended characters (388 characters, rows 115..119).
This table includes duplicates that do not round-trip. These duplicates come from the characters defined by the vendors NEC and IBM. For example, two characters are mapped to U+2252 in Unicode: 0x81e0 (a JIS X 0208 character) and 0x8790 (an NEC special character).
As a result, the 7915 characters in CP-932 are mapped to only 7517 characters in Unicode, which leaves 398 non-round-trip mappings.
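The following is a minimal sketch of why such a duplicate cannot round-trip. It uses only the two code points quoted above and does not call the module; the two hashes are illustrative stand-ins for the real tables, and the reverse entry follows the duplicate rule stated for the Unicode-to-CP-932 functions (U+2252 goes to 0x81e0). The last lines recheck the character counts given in the text.

```perl
use strict;
use warnings;

# Two CP-932 code points from the text that share one Unicode code point.
my %cp932_to_ucs = (
    "\x81\xe0" => 0x2252,   # JIS X 0208 character
    "\x87\x90" => 0x2252,   # NEC special character (duplicate)
);

# The reverse direction can keep only one of them.
my %ucs_to_cp932 = (
    0x2252 => "\x81\xe0",
);

for my $sjis (sort keys %cp932_to_ucs) {
    my $ucs  = $cp932_to_ucs{$sjis};
    my $back = $ucs_to_cp932{$ucs};
    printf "0x%s -> U+%04X -> 0x%s (%s)\n",
        unpack('H*', $sjis), $ucs, unpack('H*', $back),
        $back eq $sjis ? 'round trip' : 'no round trip';
}

# The per-group counts quoted in the text add up to the table size.
my $total = 159 + 32 + 6879 + 83 + 374 + 388;
printf "%d CP-932 characters, %d without a round trip\n",
    $total, $total - 7517;
```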
This module provides some functions to map properly from CP-932 to Unicode, and vice versa.
If the first parameter is a reference, it is used as SJIS_CALLBACK to cope with CP-932 characters that are not mapped to Unicode. (A reference is never allowed as STRING.)
If SJIS_CALLBACK is given, the second parameter is used as STRING; otherwise the first.
If SJIS_CALLBACK is not specified, CP-932 characters unmapped to Unicode are silently deleted and partial bytes are skipped by one byte.
Currently, only coderefs are used as SJIS_CALLBACK. A string returned from SJIS_CALLBACK is inserted in place of unmapped characters.
A coderef given as SJIS_CALLBACK is called with one or more arguments. If the unmapped character is a partial double-byte character (i.e. a leading byte without its trailing byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is a defined string representing the character.
Example
    my $sjis_callback = sub {
        my ($char, $byte) = @_;
        return function($char) if defined $char;
        die sprintf "found partial byte 0x%02x", $byte;
    };
In the example above, $char may be one of "\x80", "\x82\xf2", "\xfc\xfc", "\xff".
The return value of SJIS_CALLBACK must be legal in the target format. For example, never pass a callback that returns UTF-8 to cp932_to_utf16be(). In other words, you should prepare a separate SJIS_CALLBACK for each UTF.
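One way to keep a callback's return value legal for each target format is sketched below. It builds one callback per format, each returning U+FFFD (REPLACEMENT CHARACTER) encoded for that format. The helper make_sjis_callback() and the choice of U+FFFD are assumptions made for illustration; neither is part of the module.

```perl
use strict;
use warnings;

# U+FFFD REPLACEMENT CHARACTER, encoded for each target format.
my %replacement = (
    utf8    => "\xEF\xBF\xBD",
    utf16le => pack('v', 0xFFFD),
    utf16be => pack('n', 0xFFFD),
    utf32le => pack('V', 0xFFFD),
    utf32be => pack('N', 0xFFFD),
);

# make_sjis_callback() is a hypothetical helper, not part of the module.
sub make_sjis_callback {
    my ($format) = @_;
    my $bytes = $replacement{$format};
    die "unknown target format: $format" unless defined $bytes;
    # The same replacement is returned for unmapped characters
    # and for partial bytes.
    return sub { return $bytes };
}

my $cb_utf8    = make_sjis_callback('utf8');
my $cb_utf16be = make_sjis_callback('utf16be');

# Intended use (requires the module):
#   my $utf8    = cp932_to_utf8($cb_utf8, $cp932_string);
#   my $utf16be = cp932_to_utf16be($cb_utf16be, $cp932_string);
```

A callback that should fail on partial bytes instead can test whether its first argument is defined, as in the example above.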
SJIS_OPTION may be specified after STRING. Option letters can be combined, as in 'tg' and 'gst' (the order does not matter).
'g' adds mapping of CP-932 gaiji (user-defined characters) [0xF040 to 0xF9FC (rows 95 to 114)] to Unicode's PUA [0xE000 to 0xE757] (1880 characters).
's' adds mapping of CP-932 undefined single-byte characters: 0x80 => U+0080, 0xA0 => U+F8F0, 0xFD => U+F8F1, 0xFE => U+F8F2, 0xFF => U+F8F3.
't' checks trailing byte ranges [0x40..0x7E, 0x80..0xFC]. I.e. "\x81\x39" is treated as an undefined double-byte character by default; with 't', it is a partial byte 0x81 followed by a single-byte character "\x39".
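The ranges above can be checked with a little arithmetic. The sketch below does not call the module: is_trail_byte() and gaiji_to_pua() are hypothetical helpers, and the assumption that gaiji cells map to the PUA in linear code order is inferred from the two end points and the count of 1880 (10 leading bytes times 188 trailing bytes).

```perl
use strict;
use warnings;

# True if $byte lies in the trailing-byte ranges named for the 't' option.
sub is_trail_byte {
    my ($byte) = @_;
    my $ok = ($byte >= 0x40 && $byte <= 0x7E)
          || ($byte >= 0x80 && $byte <= 0xFC);
    return $ok ? 1 : 0;
}

# Map a gaiji code (0xF040..0xF9FC) to the PUA by counting cells in
# code order (assumed). Returns nothing outside the gaiji area.
sub gaiji_to_pua {
    my ($code) = @_;
    my ($lead, $trail) = ($code >> 8, $code & 0xFF);
    return unless $lead >= 0xF0 && $lead <= 0xF9 && is_trail_byte($trail);
    my $cell = $trail - ($trail >= 0x80 ? 0x41 : 0x40);   # 0..187, no 0x7F
    return 0xE000 + ($lead - 0xF0) * 188 + $cell;
}

printf "U+%04X\n", gaiji_to_pua(0xF040);   # prints U+E000
printf "U+%04X\n", gaiji_to_pua(0xF9FC);   # prints U+E757

# "\x81\x39": 0x39 is not a trailing byte, so with 't' 0x81 is partial.
print is_trail_byte(0x39) ? "trailing byte\n" : "not a trailing byte\n";
```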
cp932_to_utf8([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to UTF-8.
cp932_to_unicode([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to Unicode (Perl's internal format, flagged with SVf_UTF8; see perlunicode).
This function is provided only for Perl 5.6.1 or later, and via XS.
cp932_to_utf16le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to UTF-16LE.
cp932_to_utf16be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to UTF-16BE.
cp932_to_utf32le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to UTF-32LE.
cp932_to_utf32be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
Converts CP-932 to UTF-32BE.
Any duplicates are converted according to Microsoft PRB Q170559. E.g. U+2252 is converted to \x81\xe0, not to \x87\x90.
If the first parameter is a reference, it is used as UNICODE_CALLBACK to cope with Unicode characters that are not mapped to CP-932. (A reference is never allowed as STRING.)
If UNICODE_CALLBACK is given, the second parameter is used as STRING; otherwise the first.
If UNICODE_CALLBACK is not specified, Unicode characters unmapped to CP-932 are silently deleted and partial bytes are skipped by one byte.
Currently, only coderefs are used as UNICODE_CALLBACK. A string returned from the coderef is inserted in place of unmapped characters.
A coderef given as UNICODE_CALLBACK is called with one or more arguments. If the unmapped character is a partial character (an illegal byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is an unsigned integer representing a Unicode code point.
For example, characters unmapped to CP-932 are converted to numerical character references for HTML 4.01.
    sub toHexNCR {
        my ($char, $byte) = @_;
        return sprintf("&#x%x;", $char) if defined $char;
        die sprintf "illegal byte 0x%02x was found", $byte;
    }

    $cp932 = utf8_to_cp932   (\&toHexNCR, $utf8_string);
    $cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
    $cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);
The return value of UNICODE_CALLBACK must be legal in CP-932.
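As a further illustration, here is a sketch of a callback whose return value is trivially legal in CP-932: it substitutes an ASCII "?" and records which code points had no mapping. The variable names and the idea of collecting the code points in an array are assumptions for illustration only.

```perl
use strict;
use warnings;

my @unmapped;   # code points seen by the callback

# Record each unmapped code point and substitute "?", which is plain
# ASCII and therefore legal in CP-932.
my $unicode_callback = sub {
    my ($char, $byte) = @_;
    if (defined $char) {
        push @unmapped, $char;
        return '?';
    }
    die sprintf "illegal byte 0x%02x was found", $byte;
};

# Intended use (requires the module):
#   my $cp932 = utf8_to_cp932($unicode_callback, $utf8_string);
#   warn sprintf "no CP-932 mapping for U+%04X\n", $_ for @unmapped;
```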
UNICODE_OPTION may be specified after STRING. Option letters can be combined, as in 'fg' and 'gsf' (the order does not matter).
'g' add mapping of CP-932 gaiji (user defined characters) [0xF040 to 0xF9FC (rows 95 to 114)] from Unicode's PUA [0xE000 to 0xE757] (1880 characters).
's' add mapping of CP-932 undefined single-byte characters: U+0080 => 0x80, U+F8F0 => 0xA0, U+F8F1 => 0xFD, U+F8F2 => 0xFE, U+F8F3 => 0xFF.
'f' add some fallback mappings from Unicode to CP-932.
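The reverse direction of the 'g' and 's' mappings can be sketched in the same arithmetic way, without calling the module. The helper pua_to_gaiji() is hypothetical, and it assumes that PUA code points are assigned to gaiji cells in linear code order, which is consistent with the end points and the count given above.

```perl
use strict;
use warnings;

# Reverse single-byte table, copied from the 's' option above.
my %ucs_to_single = (
    0x0080 => 0x80, 0xF8F0 => 0xA0,
    0xF8F1 => 0xFD, 0xF8F2 => 0xFE, 0xF8F3 => 0xFF,
);

# Map a PUA code point (U+E000..U+E757) to a two-byte gaiji string.
# Returns nothing outside that range. Linear order is an assumption.
sub pua_to_gaiji {
    my ($ucs) = @_;
    return unless $ucs >= 0xE000 && $ucs <= 0xE757;
    my $index = $ucs - 0xE000;
    my $lead  = 0xF0 + int($index / 188);   # 188 trailing bytes per lead
    my $trail = 0x40 + $index % 188;
    $trail++ if $trail >= 0x7F;             # 0x7F is not a trailing byte
    return pack('CC', $lead, $trail);
}

printf "U+E000 -> 0x%s\n", unpack('H*', pua_to_gaiji(0xE000));   # 0xf040
printf "U+E757 -> 0x%s\n", unpack('H*', pua_to_gaiji(0xE757));   # 0xf9fc
printf "U+F8F0 -> 0x%02x\n", $ucs_to_single{0xF8F0};             # 0xa0
```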
utf8_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-8 to CP-932.
unicode_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts Unicode to CP-932.
This Unicode is in Perl's internal format (see perlunicode). If STRING is not flagged with SVf_UTF8, it is upgraded as an ISO 8859-1 string.
utf16_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-16 (with or without BOM) to CP-932.
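To show what "with or without BOM" involves, the sketch below detects a UTF-16 byte order mark and strips it. split_utf16_bom() is a hypothetical helper, not part of the module, and treating a missing BOM as big-endian is an assumption made here; this text does not say which byte order the module picks in that case.

```perl
use strict;
use warnings;

# split_utf16_bom() is a hypothetical helper, not part of the module.
# Returns the byte order ('be' or 'le') and the string without its BOM.
sub split_utf16_bom {
    my ($string) = @_;
    return ('be', substr($string, 2)) if $string =~ /\A\xFE\xFF/;
    return ('le', substr($string, 2)) if $string =~ /\A\xFF\xFE/;
    return ('be', $string);   # assumption: no BOM means big-endian
}

# U+3042 in UTF-16LE, preceded by a little-endian BOM.
my ($order, $body) = split_utf16_bom("\xFF\xFE\x42\x30");
print "byte order: $order, ", length($body), " bytes of text\n";

# Intended use (requires the module):
#   my $cp932 = $order eq 'le' ? utf16le_to_cp932($body)
#                              : utf16be_to_cp932($body);
```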
utf16le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-16LE to CP-932.
utf16be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-16BE to CP-932.
utf32_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-32 (with or without BOM) to CP-932.
utf32le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-32LE to CP-932.
utf32be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
Converts UTF-32BE to CP-932.
By default:
    cp932_to_utf8     utf8_to_cp932
    cp932_to_utf16le  utf16le_to_cp932
    cp932_to_utf16be  utf16be_to_cp932
    cp932_to_unicode  unicode_to_cp932  (only for XS)
On request:
    cp932_to_utf32le  utf32le_to_cp932
    cp932_to_utf32be  utf32be_to_cp932
    utf16_to_cp932 [*]
    utf32_to_cp932 [*]
[*] Their counterparts cp932_to_utf16() and cp932_to_utf32() are not implemented yet. They need more investigation on return values from SJIS_CALLBACK (concatenation requires recognizing and coping with a BOM).
The pure Perl edition of this module does not understand logically wide characters (see perlunicode). If necessary, use utf8::decode/utf8::encode (see utf8.pm) on Perl 5.7 or later.
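A minimal sketch of that workaround follows: a character string holding a wide character is encoded to UTF-8 bytes before it would be handed to a byte-oriented function such as utf8_to_cp932(), and decoded again afterwards. Only the core utf8:: functions are called; the variable names are illustrative.

```perl
use strict;
use warnings;

# A character string holding one wide character,
# U+3042 HIRAGANA LETTER A.
my $text = "\x{3042}";

# Encode a copy in place into UTF-8 bytes.
my $bytes = $text;
utf8::encode($bytes);
printf "%d character -> %d bytes (%s)\n",
    length($text), length($bytes), unpack('H*', $bytes);

# Going the other way, turn UTF-8 bytes back into a character string.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";

# Intended use (requires the module):
#   my $cp932 = utf8_to_cp932($bytes);
```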
SADAHIRO, Tomoyuki
SADAHIRO@cpan.org
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Conversion Problem Between Shift-JIS and Unicode
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://www.microsoft.com/typography/unicode/932.txt (dead link)
http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-932-2000.ucm