Unicode::Japanese - Japanese Character Encoding Handler
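use Unicode::Japanese;
use Unicode::Japanese qw(unijp);
# convert utf8 -> sjis
print Unicode::Japanese->new($str)->sjis;
print unijp($str)->sjis; # same as above.
# convert sjis -> utf8
print Unicode::Japanese->new($str,'sjis')->get;
# convert sjis (imode_EMOJI) -> utf8
print Unicode::Japanese->new($str,'sjis-imode')->get;
# convert ZENKAKU (utf8) -> HANKAKU (utf8)
print Unicode::Japanese->new($str)->z2h->get;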
Module for conversion among Japanese character encodings.
The instance stores internal strings in UTF-8.
Supports both XS and non-XS operation. Use XS for high performance, or non-XS for ease of use (installation only requires copying Japanese.pm).
Supports conversion between ZENKAKU and HANKAKU.
Safely handles mobile phone "EMOJI" (DoCoMo i-mode, ASTEL dot-i and J-PHONE J-Sky) by mapping them to the Unicode Private Use Area.
Supports mutual conversion of "EMOJI" that share the same image between the different mobile phone standards.
Treats Shift_JIS (SJIS) as MS-CP932. (Shift_JIS on MS-Windows (MS-SJIS/MS-CP932) differs from generic Shift_JIS encodings.)
When converting Unicode to SJIS (and EUC-JP/JIS), characters that cannot be converted to SJIS (except "EMOJI") are escaped in "&#dddd;" format. "EMOJI" in the Unicode Private Use Area become '?'. When converting strings from Unicode to SJIS for mobile phones, any character outside the phone's standard becomes '?'.
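The "&#dddd;" escaping rule can be illustrated with a few lines of core Perl. This is only a sketch of the idea, not the module's implementation: escape_unmappable and the ASCII-only predicate are hypothetical names invented for the demo, and the real module decides mappability from its CP932 tables.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character that the caller-supplied
# predicate reports as unmappable with a decimal reference "&#dddd;".
sub escape_unmappable {
    my ($text, $is_mappable) = @_;
    $text =~ s/(.)/ $is_mappable->($1) ? $1 : sprintf('&#%d;', ord $1) /gse;
    return $text;
}

# Assumption for the demo only: pretend the target encoding holds ASCII alone.
my $ascii_only = sub { ord($_[0]) < 0x80 };

my $escaped = escape_unmappable("caf\x{e9} \x{20ac}", $ascii_only);
print "$escaped\n";    # prints "caf&#233; &#8364;"
```

The number inside the reference is the decimal Unicode code point, so U+00E9 becomes "&#233;".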
On perl-5.8.0 and later, the UTF-8 flag is set properly. The utf8() method returns a UTF-8 `bytes' string and the getu() method returns a UTF-8 `char' string.
In the current release the get() method returns a UTF-8 `bytes' string. The behavior of get() may change in the future.
The sjis(), jis(), utf8(), etc. methods return `bytes' strings. The new(), set() and getcode() methods accept input regardless of whether it is UTF-8-flagged or a `bytes' string.
Creates a new instance of Unicode::Japanese.
If arguments are specified, they are passed through to the set() method.
unijp(...) is the same as Unicode::Japanese->new(...).
Sets a string in the instance. If '$icode' is omitted, the string is considered to be UTF-8.
To specify an encoding, choose one of the following: 'auto', 'utf8', 'ucs2', 'ucs4', 'utf16-be', 'utf16-le', 'utf16', 'utf32-be', 'utf32-le', 'utf32', 'jis', 'euc', 'euc-jp', 'sjis', 'cp932', 'sjis-imode', 'sjis-imode1', 'sjis-imode2', 'sjis-doti', 'sjis-doti1', 'sjis-jsky', 'sjis-jsky1', 'sjis-jsky2', 'jis-jsky', 'jis-jsky1', 'jis-jsky2', 'jis-au', 'jis-au1', 'jis-au2', 'sjis-au', 'sjis-au1', 'sjis-au2', 'sjis-icon-au', 'sjis-icon-au1', 'sjis-icon-au2', 'euc-icon-au', 'euc-icon-au1', 'euc-icon-au2', 'jis-icon-au', 'jis-icon-au1', 'jis-icon-au2', 'utf8-icon-au', 'utf8-icon-au1', 'utf8-icon-au2', 'ascii', 'binary'.
For automatic encoding detection you MUST specify 'auto', which makes the getcode() method be called automatically.
For the ASCII encoding, only 'base64' may be specified. With it, the string is base64-decoded before being stored.
To decode binary, specify 'binary' as the encoding.
When 'sjis-imode' or 'sjis-doti' is specified, "&#dddd;" is converted to "EMOJI".
In some cases character encoding detection is misled, because more than one encoding uses the same code points.
sjis is returned if a string is valid as both sjis and utf8, and sjis-au is returned if a string is valid as both sjis-au and sjis-doti.
Gets a string with UTF-8.
In the current release this returns a `bytes' string, but that behavior will change.
Prefer the utf8() method when you want a `bytes' string, or the getu() method when you want a `character' string.
With getu(), on perl-5.8.0 and later, the return value has the UTF-8 flag set.
Detects the character encoding of $str.
Note: this method detects the encoding of $str, NOT that of the string stored in the instance.
Character encodings are distinguished by the following algorithm:
(In case of PurePerl)
If BOM of UTF-32 is found, the encoding is utf32.
If BOM of UTF-16 is found, the encoding is utf16.
If it is in proper UTF-32BE, the encoding is utf32-be.
If it is in proper UTF-32LE, the encoding is utf32-le.
If there are no non-ASCII characters, the encoding is ascii. (Control codes other than escape sequences are counted as ASCII.)
If it includes ISO-2022-JP(JIS) escape sequences, the encoding is jis.
If it includes "J-PHONE EMOJI", the encoding is sjis-jsky.
If it is in proper EUC-JP, the encoding is euc.
If it is in proper SJIS, the encoding is sjis.
If it is in proper SJIS and "EMOJI" of au, the encoding is sjis-au.
If it is in proper SJIS and "EMOJI" of i-mode, the encoding is sjis-imode.
If it is in proper SJIS and "EMOJI" of dot-i, the encoding is sjis-doti.
If it is in proper UTF-8, the encoding is utf8.
If none of the above is true, the encoding is unknown.
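As a rough illustration of how such an ordered chain of checks behaves, the sketch below covers only the steps that need no encoding tables (BOMs, ascii, jis escapes, utf8). It is an assumption-laden toy, not the module's code: guess_simple is a hypothetical name, and the "proper UTF-32BE/LE", EUC-JP, SJIS and "EMOJI" steps are deliberately left out, so it reports utf8 or unknown for strings the real getcode() might classify differently.

```perl
use strict;
use warnings;

# Hypothetical sketch: input is a byte string, checks run in the listed order.
sub guess_simple {
    my ($bytes) = @_;

    # A UTF-32 BOM is tested first because the UTF-32LE BOM starts with the UTF-16LE BOM.
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # No ESC and no high-bit bytes: plain ASCII (other control codes allowed).
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    # ISO-2022-JP escape sequences: ESC $ @, ESC $ B, ESC ( B, ESC ( J, ESC ( I.
    return 'jis' if $bytes =~ /\x1B(?:\$[\@B]|\([BJI])/;

    # utf8::decode returns false when the bytes are not well-formed UTF-8.
    my $copy = $bytes;
    return 'utf8' if utf8::decode($copy);

    return 'unknown';
}

print guess_simple("hello"), "\n";           # prints "ascii"
print guess_simple("\xE3\x81\x82"), "\n";    # prints "utf8"
```

Because the first matching step wins, the order of the checks matters as much as the checks themselves.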
(In case of XS)
The string is checked by state transition to see which of the encodings listed below it is valid for.
ascii / euc-jp / sjis / jis / utf8 / utf32-be / utf32-le / sjis-jsky / sjis-imode / sjis-au / sjis-doti
When more than one encoding matches, the order listed below decides the final result.
utf32-be / utf32-le / ascii / jis / euc-jp / sjis / sjis-jsky / sjis-imode / sjis-au / sjis-doti / utf8
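The final determination can be pictured as a lookup in that priority list. The following is a minimal sketch under the assumption that the candidate encodings have already been found by some other means; pick_by_priority is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Priority order copied from the list above.
my @PRIORITY = qw(utf32-be utf32-le ascii jis euc-jp sjis
                  sjis-jsky sjis-imode sjis-au sjis-doti utf8);

# Hypothetical helper: given the encodings a string is valid for,
# return the one that comes first in the priority order.
sub pick_by_priority {
    my %valid = map { $_ => 1 } @_;
    for my $enc (@PRIORITY) {
        return $enc if $valid{$enc};
    }
    return 'unknown';
}

print pick_by_priority(qw(utf8 sjis)), "\n";            # prints "sjis"
print pick_by_priority(qw(sjis-doti sjis-au)), "\n";    # prints "sjis-au"
```

This is why a string valid as both sjis and utf8 is reported as sjis.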
Regarding the algorithm, pay attention to the following:
UTF-8 is occasionally detected as SJIS.
UCS2 can NOT be detected automatically.
UTF-16 can be detected only when the string has a BOM.
"EMOJI" can be detected only when stored in binary, not in "&#dddd;" format. (If it is stored only in "&#dddd;" format, getcode() returns an incorrect result, and the "EMOJI" will be corrupted.)
Because XS and PurePerl use different algorithms, their detection results may differ. An SJIS string that contains escape characters is detected as SJIS by PurePerl, but cannot be detected as SJIS by XS, because the XS algorithm cannot distinguish SJIS from SJIS-Jsky in that case. The same exclusion of strings with escape characters from detection on XS is supposed to apply to EUC-JP.
This function returns all acceptable character encodings.
This function returns a copy of the contained string in the $ocode encoding.
A number at the end of an encoding name indicates the emoji set version. A larger number is a newer set, and a name without a number is the same as the newest set. Generally you can use the names without digits.
Gets a string converted to $ocode.
For ASCII encoding, only 'base64' may be specified. With it, the string encoded in base64 will be returned.
On perl-5.8.0 and later, the return value does not have the UTF-8 flag and is a `bytes' string.
Replaces the substrings "&#dddd;" in the string with the binary characters they represent.
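The effect of that replacement can be sketched with a single substitution in core Perl. refs_to_chars is a hypothetical stand-in written for this illustration; it assumes decimal references only and says nothing about how the module itself treats "EMOJI" code points.

```perl
use strict;
use warnings;

# Hypothetical stand-in: turn each decimal reference "&#dddd;" into the
# character with that code point.
sub refs_to_chars {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $out = refs_to_chars('A&#12354;B');
print join(' ', map { sprintf 'U+%04X', ord } split //, $out), "\n";
# prints "U+0041 U+3042 U+0042"
```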
Converts ZENKAKU to HANKAKU.
Converts HANKAKU to ZENKAKU.
Converts HIRAGANA to KATAKANA.
Converts KATAKANA to HIRAGANA.
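The HIRAGANA and KATAKANA blocks of Unicode are laid out in parallel, which is what makes this kind of conversion simple. A minimal core-Perl sketch follows, assuming character (not byte) strings and covering only U+3041..U+3096 and U+30A1..U+30F6; to_katakana and to_hiragana are hypothetical names, and the module's own coverage of edge cases may differ.

```perl
use strict;
use warnings;

# Both ranges hold 0x56 characters, so tr/// maps them one to one
# (a fixed shift of 0x60 between the two blocks).
sub to_katakana {
    my ($s) = @_;
    $s =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $s;
}

sub to_hiragana {
    my ($s) = @_;
    $s =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
    return $s;
}

# U+3042 (HIRAGANA A) becomes U+30A2 (KATAKANA A).
printf "U+%04X\n", ord to_katakana("\x{3042}");    # prints "U+30A2"
```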
$str: string (JIS)
Gets the string converted to ISO-2022-JP(JIS).
$str: string (EUC-JP)
Gets the string converted to EUC-JP.
$str: `bytes' string (UTF-8)
Gets the string converted to UTF-8.
$str: string (UCS2)
Gets the string converted to UCS2.
$str: string (UCS4)
Gets the string converted to UCS4.
$str: string (UTF-16)
Gets the string converted to UTF-16(big-endian). BOM is not added.
$str: string (SJIS)
Gets the string converted to Shift_JIS(MS-SJIS/MS-CP932).
$str: string (SJIS/imode_EMOJI)
Gets the string converted to SJIS for i-mode. This method is an alias of sjis_imode2.
Gets the string converted to SJIS for i-mode. $str includes only basic pictographs, without extended pictographs.
Gets the string converted to SJIS for i-mode. $str includes both basic and extended pictographs.
$str: string (SJIS/dot-i_EMOJI)
Gets the string converted to SJIS for dot-i.
$str: string (SJIS/J-SKY_EMOJI)
Gets the string converted to SJIS for J-SKY. This method is an alias of sjis_jsky2 as of VERSION 0.15.
Gets the string converted to SJIS for J-SKY. $str includes Page 1 through Page 3.
Gets the string converted to SJIS for J-SKY. $str includes Page 1 through Page 6.
$str: string (SJIS/AU-ICON-TAG)
Gets the string converted to SJIS for au.
Splits the string by length($len).
On perl-5.8.0 and later, each element of the returned array has the UTF-8 flag set.
$len: `visual width' of the string
Gets the length of the string. This method is offered as a substitute for Perl's built-in length(). ZENKAKU characters are assumed to have a length of 2, regardless of whether the encoding is SJIS or UTF-8.
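The difference from the built-in length() can be sketched as follows. This is a simplified illustration under a stated assumption, not the module's rule set: ASCII and HANKAKU katakana (U+FF61..U+FF9F) count as one column and every other character counts as two. visual_width is a hypothetical name and expects a character string.

```perl
use strict;
use warnings;

# Hypothetical helper: sum per-character column widths.
sub visual_width {
    my ($chars) = @_;
    my $width = 0;
    for my $ch (split //, $chars) {
        my $cp = ord $ch;
        my $narrow = $cp < 0x80 || ($cp >= 0xFF61 && $cp <= 0xFF9F);
        $width += $narrow ? 1 : 2;
    }
    return $width;
}

print visual_width("abc"), "\n";                  # prints 3
print visual_width("\x{3042}\x{3044}a"), "\n";    # prints 5 (length() would give 3)
print visual_width("\x{FF71}"), "\n";             # prints 1
```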
@values: data array
Converts the array to a string in CSV format and stores it in the instance. A newline ("\n") is also added at the end of the string.
Splits the string, treating it as CSV. Each newline ("\n") is removed before the split.
On perl-5.8.0 and later, the UTF-8 flag of the return value depends on the icode given to the set method. If $s contains binary, the return value is bytes too. If $s contains any string, the return value has the UTF-8 flag set.
Translation proceeds as follows.
SJIS is mapped as MS-CP932. Mapping table in the following URL is used.
If a character cannot be mapped from Unicode to SJIS, it is converted to "&#dddd;" format. Pictographs are converted to "?".
Also, any unmapped character is converted to "?" when converting to SJIS for mobile phones.
EUC-JP and JIS strings are converted to SJIS and then mapped to Unicode. Any non-SJIS character in the string will not be mapped correctly.
For DoCoMo i-mode, the "EMOJI" range F800 - F9FF is mapped to U+0FF800 - U+0FF9FF.
For ASTEL dot-i, the "EMOJI" range F000 - F4FF is mapped to U+0FF000 - U+0FF4FF.
"J-SKY EMOJI" are laid out as follows: the "\e\$" (\x1b\x24) escape sequence, the first byte, the second byte and "\x0f". A run of consecutive "EMOJI" with identical first bytes may be compressed by listing only the second bytes.
4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first and second bytes together as one EMOJI character.
Unicode::Japanese compresses "J-SKY_EMOJI" automatically when the first bytes of a sequence of "EMOJI" are identical.
For au, the "EMOJI" portion is mapped to U+0FF500 - U+0FF6FF.
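The ranges quoted above imply simple offsets, assuming the mapping is linear inside each range (the text gives only the endpoints, so this is an assumption). The sketch below does the arithmetic for the three ranges whose source codes are stated; emoji_to_pua is a hypothetical helper, not a function of the module.

```perl
use strict;
use warnings;

# Offsets derived from the quoted ranges:
#   F800-F9FF and F000-F4FF -> add 0x0F0000
#   4500-47FF               -> add 0x0FB600 (0x0FFB00 - 0x4500)
sub emoji_to_pua {
    my ($code) = @_;
    return $code + 0x0F0000
        if ($code >= 0xF800 && $code <= 0xF9FF)
        || ($code >= 0xF000 && $code <= 0xF4FF);
    return $code + 0x0FB600 if $code >= 0x4500 && $code <= 0x47FF;
    return;    # not in any of the ranges listed above
}

printf "U+%06X\n", emoji_to_pua(0xF89F);    # prints "U+0FF89F"
printf "U+%06X\n", emoji_to_pua(0x4521);    # prints "U+0FFB21"
```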
use Unicode::Japanese qw(PurePerl);
If the module is loaded with the 'PurePerl' keyword, it runs in non-XS mode.
EUC-JP and JIS strings cannot be converted correctly when they include non-SJIS characters, because they are converted to SJIS before being converted to UTF-8.
When using XS, character encoding detection fails for EUC-JP and SJIS (including all EMOJI variants) strings that contain "\e". Also, getcode() and all conversion methods will not work on such strings.
The Japanese.pm file will be corrupted if it is transferred in FTP ASCII mode, because it has trailing binary data.
Copyright 2001-2007 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Bug reports and comments to: email@example.com. Thank you.
Or, report any bugs or feature requests to bug-unicode-japanese at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-Japanese. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
You can also look for information at:
AnnoCPAN: Annotated CPAN documentation
RT: CPAN's request tracker
Thanks very much to:
SUGIURA Tatsuki & Debian JP Project