NAME

Unicode::Japanese - Japanese Character Encoding Handler

SYNOPSIS

use Unicode::Japanese;

# convert utf8 -> sjis

print Unicode::Japanese->new($str)->sjis;

# convert sjis -> utf8

print Unicode::Japanese->new($str,'sjis')->get;

# convert sjis (imode_EMOJI) -> utf8

print Unicode::Japanese->new($str,'sjis-imode')->get;

# convert ZENKAKU (utf8) -> HANKAKU (utf8)

print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION

Module for conversion among Japanese character encodings.

FEATURES

  • The instance stores internal strings in UTF-8.

  • Supports both XS and Non-XS. Use XS for high performance, or Non-XS for ease of installation (just copy Japanese.pm).

  • Supports conversion between ZENKAKU and HANKAKU.

  • Safely handles "EMOJI" of mobile phones (DoCoMo i-mode, ASTEL dot-i and J-PHONE J-Sky) by mapping them onto the Unicode Private Use Area.

  • Supports converting an "EMOJI" to the equivalent image in another mobile phone standard, in either direction.

  • Treats Shift_JIS (SJIS) as MS-CP932. (Shift_JIS on MS-Windows (MS-SJIS/MS-CP932) differs from generic Shift_JIS encodings.)

  • When converting Unicode to SJIS (and EUC-JP/JIS), characters that cannot be converted to SJIS (except "EMOJI") are escaped in "&#dddd;" format. "EMOJI" in the Unicode Private Use Area are converted to '?'. When converting from Unicode to SJIS for mobile phones, any character not supported by the target standard is converted to '?'.

  • On perl-5.8.0 and later, the utf-8 flag is set properly. The utf8() method returns a utf-8 `bytes' string and the getu() method returns a utf-8 `char' string.

    The get() method returns a utf-8 `bytes' string in the current release. The behavior of get() may change in the future.

    The sjis(), jis(), utf8(), etc. methods return bytes strings. The new, set, and getcode methods accept either utf-8 `char' strings or `bytes' strings as input.

METHODS

$s = Unicode::Japanese->new($str [, $icode [, $encode]])

Creates a new instance of Unicode::Japanese.

If arguments are specified, they are passed through to the set method.

$s->set($str [, $icode [, $encode]])
$str: string
$icode: character encoding, may be omitted (default = 'utf8')
$encode: ASCII encoding, may be omitted.

Sets a string in the instance. If '$icode' is omitted, the string is treated as UTF-8.

To specify an encoding, choose from the following: 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le', 'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode', 'sjis-doti', 'sjis-jsky'.

When 'sjis-imode' or 'sjis-doti' is specified, '&#dddd;' sequences are converted to "EMOJI".

Encoding detection is not automatic: to have the getcode() method called on the input, you MUST specify 'auto'.

For ASCII encoding, only 'base64' may be specified. With it, the string will be decoded before storing.

To decode binary, specify 'binary' as the encoding.
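
As a rough illustration of how $icode and $encode combine, the sketch below stores a Shift_JIS string that arrives base64-encoded. The sample bytes and variable names are made up for the example, and MIME::Base64 is used only to build the input.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);
use Unicode::Japanese;

# Hypothetical input: HIRAGANA LETTER A (U+3042) in Shift_JIS, base64-encoded.
my $sjis_bytes = "\x82\xa0";
my $b64        = encode_base64($sjis_bytes, '');   # '' suppresses the trailing newline

# $icode = 'sjis', $encode = 'base64': base64 is decoded first, then the
# result is read as SJIS and stored in the instance.
my $s = Unicode::Japanese->new;
$s->set($b64, 'sjis', 'base64');

# Show the stored string as UTF-8 bytes in hex.
my $utf8 = $s->utf8;
print unpack('H*', $utf8), "\n";
```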

$str = $s->get
$str: string (UTF-8)

Gets the string in UTF-8.

Returns a `bytes' string in the current release; this behavior may change in the future.

To be explicit, prefer the utf8() method for a `bytes' string or the getu() method for a `character' string.

$str = $s->getu
$str: string (UTF-8)

Gets the string in UTF-8.

On perl-5.8.0 and later, the return value has the utf-8 flag set.
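
The difference between the `bytes' string from utf8() and the `char' string from getu() can be seen with a small sketch like the following. The sample string is arbitrary; utf8::is_utf8 is the Perl built-in that reports the utf-8 flag.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# HIRAGANA LETTER A (U+3042), given as UTF-8 bytes.
my $s = Unicode::Japanese->new("\xe3\x81\x82");

my $bytes = $s->utf8;   # `bytes' string: one element per UTF-8 byte
my $chars = $s->getu;   # `char' string: one element per character

printf "utf8: length=%d flag=%s\n",
    length($bytes), (utf8::is_utf8($bytes) ? 'on' : 'off');
printf "getu: length=%d flag=%s\n",
    length($chars), (utf8::is_utf8($chars) ? 'on' : 'off');
```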

$code = $s->getcode($str)
$str: string
$code: character encoding name

Detects the character encoding of $str.

Note: This method detects the encoding of $str, NOT of the string stored in the instance.

Character encodings are distinguished by the following algorithm:

(In case of PurePerl)

  1. If BOM of UTF-32 is found, the encoding is utf32.

  2. If BOM of UTF-16 is found, the encoding is utf16.

  3. If it is in proper UTF-32BE, the encoding is utf32-be.

  4. If it is in proper UTF-32LE, the encoding is utf32-le.

  5. If it contains no NON-ASCII characters, the encoding is ascii. (Control codes other than escape sequences are treated as ASCII.)

  6. If it includes ISO-2022-JP(JIS) escape sequences, the encoding is jis.

  7. If it includes "J-PHONE EMOJI", the encoding is sjis-jsky.

  8. If it is in proper EUC-JP, the encoding is euc.

  9. If it is in proper SJIS, the encoding is sjis.

  10. If it is in proper SJIS and "EMOJI" of i-mode, the encoding is sjis-imode.

  11. If it is in proper SJIS and "EMOJI" of dot-i, the encoding is sjis-doti.

  12. If it is in proper UTF-8, the encoding is utf8.

  13. If none of the above is true, the encoding is unknown.

(In case of XS)

  1. If BOM of UTF-32 is found, the encoding is utf32.

  2. If BOM of UTF-16 is found, the encoding is utf16.

  3. The string is run through a state transition check to see which of the encodings listed below it is valid for.

    ascii / euc-jp / sjis / jis / utf8 / utf32-be / utf32-le / sjis-jsky / sjis-imode / sjis-doti

  4. The listed order below is applied for a final determination.

    utf32-be / utf32-le / ascii / jis / euc-jp / sjis / sjis-jsky / sjis-imode / sjis-doti / utf8

  5. If none of the above is true, the encoding is unknown.
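
Steps 1 and 2 of both lists (the BOM checks) can be sketched in plain Perl as follows. This is only an illustration of the ordering, not the module's code; sniff_bom is a hypothetical helper name. UTF-32 has to be tested first because its little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'utf32', 'utf16', or undef when no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    # Test UTF-32 before UTF-16, since FF FE 00 00 also starts with FF FE.
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xfe\xff|\xff\xfe\x00\x00)/;
    return 'utf16' if $bytes =~ /\A(?:\xfe\xff|\xff\xfe)/;
    return;
}

my @samples = (
    "\xff\xfe\x00\x00\x42\x30\x00\x00",   # UTF-32LE BOM + U+3042
    "\xfe\xff\x30\x42",                   # UTF-16BE BOM + U+3042
    "hello",                              # no BOM
);
for my $sample (@samples) {
    my $kind = sniff_bom($sample);
    print defined $kind ? $kind : 'none', "\n";
}
```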

Regarding the algorithm, pay attention to the following:

  • UTF-8 is occasionally detected as SJIS.

  • Can NOT detect UCS2 automatically.

  • Can detect UTF-16 only when the string has BOM.

  • Can detect "EMOJI" only when it is stored in binary form, not in "&#dddd;" format. (If it is stored only in "&#dddd;" format, getcode() returns an incorrect result and the "EMOJI" will be corrupted.)

Because XS and PurePerl use different algorithms, their detection results may differ. For example, an SJIS string containing escape characters is detected as SJIS on PurePerl, but cannot be detected as SJIS on XS, because the XS algorithm cannot distinguish such a string from SJIS-Jsky. On XS, strings containing escape characters are likewise excluded from detection as EUC-JP.
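
A minimal usage sketch of getcode() and of the 'auto' input encoding follows. The sample strings are invented for the example, and, as noted above, the result for a given string may differ between XS and PurePerl, so the detected names are printed rather than assumed.

```perl
use strict;
use warnings;
use Unicode::Japanese;

my $s = Unicode::Japanese->new;

# getcode() inspects its argument, not the string stored in $s.
my %sample = (
    'plain ASCII'       => "hello",
    'UTF-16 with BOM'   => "\xfe\xff\x30\x42",
    'JIS (ISO-2022-JP)' => "\e\$B\$\"\e(B",     # ESC $ B, 0x24 0x22, ESC ( B
);
for my $label (sort keys %sample) {
    printf "%-18s => %s\n", $label, $s->getcode($sample{$label});
}

# Passing 'auto' as $icode makes new()/set() run the same detection on the input.
my $auto = Unicode::Japanese->new($sample{'JIS (ISO-2022-JP)'}, 'auto');
print unpack('H*', $auto->utf8), "\n";
```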

$str = $s->conv($ocode, $encode)
$ocode: output character encoding (Choose from 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'binary')
$encode: ASCII encoding, may be omitted.
$str: string

Gets a string converted to $ocode.

For ASCII encoding, only 'base64' may be specified. With it, the string encoded in base64 will be returned.

On perl-5.8.0 and later, the return value is a bytes string without the utf-8 flag.
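
For instance, conv() might be used as below to get the same string as raw SJIS bytes and as base64 text. The input is an arbitrary sample, and the hex dump is only there to make the bytes visible.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# HIRAGANA LETTER A (U+3042), given as UTF-8 bytes.
my $s = Unicode::Japanese->new("\xe3\x81\x82");

my $sjis = $s->conv('sjis');             # raw SJIS bytes
my $b64  = $s->conv('sjis', 'base64');   # the same bytes, base64-encoded

print unpack('H*', $sjis), "\n";
print $b64, "\n";
```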

$s->tag2bin

Replaces each "&#dddd;" substring in the string with the character it represents.
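
A short sketch of tag2bin, assuming decimal references for two hiragana letters as input (12354 is U+3042 and 12356 is U+3044):

```perl
use strict;
use warnings;
use Unicode::Japanese;

# ASCII text holding two decimal character references.
my $s = Unicode::Japanese->new("&#12354;&#12356;");

# Replace the references in place with the characters themselves.
$s->tag2bin;

# Show the stored string as UTF-8 bytes in hex.
my $utf8 = $s->utf8;
print unpack('H*', $utf8), "\n";
```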

$s->z2h

Converts ZENKAKU to HANKAKU.

$s->h2z

Converts HANKAKU to ZENKAKU.

$s->hira2kata

Converts HIRAGANA to KATAKANA.

$s->kata2hira

Converts KATAKANA to HIRAGANA.
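
The four conversions above might be exercised as in this sketch. The sample strings are written as UTF-8 byte escapes so that the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# FULLWIDTH LATIN CAPITAL LETTERS A, B, C (U+FF21..U+FF23) as UTF-8 bytes.
my $wide   = Unicode::Japanese->new("\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3");
my $narrow = $wide->z2h->get;            # ZENKAKU -> HANKAKU
print $narrow, "\n";

# HIRAGANA LETTER A (U+3042) as UTF-8 bytes.
my $kana = Unicode::Japanese->new("\xe3\x81\x82");
$kana->hira2kata;                        # HIRAGANA -> KATAKANA
my $kata = $kana->utf8;
print unpack('H*', $kata), "\n";         # KATAKANA LETTER A is U+30A2

$kana->kata2hira;                        # and back to HIRAGANA
my $hira = $kana->utf8;
print unpack('H*', $hira), "\n";
```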

$str = $s->jis

$str: string (JIS)

Gets the string converted to ISO-2022-JP(JIS).

$str = $s->euc

$str: string (EUC-JP)

Gets the string converted to EUC-JP.

$str = $s->utf8

$str: `bytes' string (UTF-8)

Gets the string converted to UTF-8.

On perl-5.8.0 and later, the return value is a bytes string without the utf-8 flag.

$str = $s->ucs2

$str: string (UCS2)

Gets the string converted to UCS2.

$str = $s->ucs4

$str: string (UCS4)

Gets the string converted to UCS4.

$str = $s->utf16

$str: string (UTF-16)

Gets the string converted to UTF-16(big-endian). BOM is not added.
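
Since no BOM is added, a caller that needs one has to prepend it. A minimal sketch, assuming the consumer expects a big-endian BOM (FE FF):

```perl
use strict;
use warnings;
use Unicode::Japanese;

# HIRAGANA LETTER A (U+3042), given as UTF-8 bytes.
my $s = Unicode::Japanese->new("\xe3\x81\x82");

my $utf16    = $s->utf16;              # big-endian, no BOM
my $with_bom = "\xfe\xff" . $utf16;    # prepend the big-endian BOM by hand

print unpack('H*', $utf16), "\n";
print unpack('H*', $with_bom), "\n";
```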

$str = $s->sjis

$str: string (SJIS)

Gets the string converted to Shift_JIS(MS-SJIS/MS-CP932).

$str = $s->sjis_imode

$str: string (SJIS/imode_EMOJI)

Gets the string converted to SJIS for i-mode. As of VERSION 0.15, this method is an alias of sjis_imode2.

$str = $s->sjis_imode1

$str: string (SJIS/imode_EMOJI)

Gets the string converted to SJIS for i-mode. $str includes only basic pictographs, without extended pictographs.

$str = $s->sjis_imode2

$str: string (SJIS/imode_EMOJI)

Gets the string converted to SJIS for i-mode. $str includes both basic and extended pictographs.

$str = $s->sjis_doti

$str: string (SJIS/dot-i_EMOJI)

Gets the string converted to SJIS for dot-i.

$str = $s->sjis_jsky

$str: string (SJIS/J-SKY_EMOJI)

Gets the string converted to SJIS for J-SKY. As of VERSION 0.15, this method is an alias of sjis_jsky2.

$str = $s->sjis_jsky1

$str: string (SJIS/J-SKY_EMOJI)

Gets the string converted to SJIS for J-SKY. $str includes pictographs from Page 1 to Page 3.

$str = $s->sjis_jsky2

$str: string (SJIS/J-SKY_EMOJI)

Gets the string converted to SJIS for J-SKY. $str includes pictographs from Page 1 to Page 6.

@str = $s->strcut($len)
$len: number of characters
@str: strings

Splits the string by length($len).

On perl-5.8.0 and later, each element in return array is with utf-8 flag.

$len = $s->strlen

$len: `visual width' of the string

Gets the length of the string. This method is offered as a substitute for Perl's built-in length(). ZENKAKU characters are counted as having a length of 2, regardless of whether the encoding is SJIS or UTF-8.
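
To show how strlen differs from the built-in length(), the sketch below measures a made-up string of three ASCII letters followed by three hiragana letters in three ways.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# "abc" followed by HIRAGANA A, I, U (U+3042, U+3044, U+3046) as UTF-8 bytes.
my $s = Unicode::Japanese->new("abc\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86");

my $n_bytes = length($s->utf8);   # UTF-8 bytes: 3 + 3 * 3
my $n_chars = length($s->getu);   # characters:  3 + 3
my $width   = $s->strlen;         # visual width: 3 * 1 + 3 * 2

print "bytes=$n_bytes chars=$n_chars width=$width\n";
```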

$s->join_csv(@values);

@values: data array

Converts the array to a string in CSV format, then stores it in the instance. A newline ("\n") is appended to the end of the string.

@values = $s->split_csv;

@values: data array

Splits the string, treating it as CSV format. Each newline ("\n") is removed before the split.

On perl-5.8.0 and later, the utf-8 flag of the return values depends on the icode given to the set method. If $s contains binary, the return values are bytes too. If $s contains any other string, the return values have the utf-8 flag set.

DESCRIPTION OF UNICODE MAPPING

SJIS

Mapped as MS-CP932. Mapping table in the following URL is used.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

If a character cannot be mapped to SJIS from Unicode, it will be converted to &#dddd; format.

Also, any unmapped character will be converted into "?" when converting to SJIS for mobile phones.
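
As an illustration of the "&#dddd;" escaping, the sketch below converts a string containing EURO SIGN (U+20AC), which is assumed here to have no entry in the CP932 table. The exact output is printed rather than asserted.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# EURO SIGN (U+20AC) as UTF-8 bytes, followed by ASCII "100".
my $s = Unicode::Japanese->new("\xe2\x82\xac100");

# Characters with no SJIS mapping should come back as a "&#dddd;" reference,
# while the ASCII part passes through unchanged.
my $sjis = $s->sjis;
print $sjis, "\n";
```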

EUC-JP/JIS

Converted to SJIS and then mapped to Unicode. Any non-SJIS character in the string will not be mapped correctly.

DoCoMo i-mode

The portion of F800 - F9FF that contains "EMOJI" is mapped to U+0FF800 - U+0FF9FF.

ASTEL dot-i

The portion of F000 - F4FF that contains "EMOJI" is mapped to U+0FF000 - U+0FF4FF.

J-PHONE J-SKY

"J-SKY EMOJI" are encoded as follows: the escape sequence "\e\$" (\x1b\x24), the first byte, the second byte, and "\x0f". A run of sequential "EMOJI" with identical first bytes may be compressed by listing only the second bytes.

4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first and second bytes together as one EMOJI character.

Unicode::Japanese will compress "J-SKY_EMOJI" automatically when the first bytes of a sequence of "EMOJI" are identical.
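
The three ranges above can be turned into a small worked calculation. The sketch assumes that each source range maps onto its target range with offsets preserved, which the ranges suggest but the text does not state; emoji_to_codepoint and @RANGES are hypothetical names, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical table built from the ranges above: [first, last, target base].
my @RANGES = (
    [0xF800, 0xF9FF, 0x0FF800],   # DoCoMo i-mode
    [0xF000, 0xF4FF, 0x0FF000],   # ASTEL dot-i
    [0x4500, 0x47FF, 0x0FFB00],   # J-PHONE J-SKY (first byte << 8 | second byte)
);

# Assumption: the offset within a source range is kept in the target range.
sub emoji_to_codepoint {
    my ($code) = @_;
    for my $range (@RANGES) {
        my ($first, $last, $base) = @$range;
        return $base + ($code - $first) if $code >= $first && $code <= $last;
    }
    return;   # not an EMOJI code covered by the table
}

printf "U+%06X\n", emoji_to_codepoint(0xF89F);   # i-mode: 0x0FF800 + 0x9F
printf "U+%06X\n", emoji_to_codepoint(0x4621);   # J-SKY:  0x0FFB00 + 0x121
```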

PurePerl mode

   use Unicode::Japanese qw(PurePerl);

If the module is loaded with the 'PurePerl' keyword, it runs in Non-XS mode.

BUGS

  • EUC-JP and JIS strings cannot be converted correctly when they include non-SJIS characters, because they are converted to SJIS before being converted to UTF-8.

  • Some characters of CP932 that are not in standard Shift_JIS (e.g. not in Joyo Kanji) will not be detected or converted.

    When a string includes such non-standard Shift_JIS characters, it will not be detected as SJIS. Also, getcode() and all conversion methods will not work correctly.

  • When using XS, character encoding detection of EUC-JP and SJIS (including all EMOJI) strings fails when they include "\e". Also, getcode() and all conversion methods will not work.

  • The Japanese.pm file will be corrupted if sent via FTP in ASCII mode, as it contains trailing binary data.

AUTHOR INFORMATION

Copyright 2001-2004 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Bug reports and comments to: mikage@cpan.org. Thank you.

CREDITS

Thanks very much to:

NAKAYAMA Nao

SUGIURA Tatsuki & Debian JP Project