
NAME

Lingua::JA::Moji - Handle many kinds of Japanese characters

SYNOPSIS

Convert various types of Japanese characters into one another.

    use Lingua::JA::Moji qw/kana2romaji romaji2kana/;
    use utf8;
    my $romaji = kana2romaji ('あいうえお');
    # $romaji is now 'aiueo'.
    my $kana = romaji2kana ($romaji);
    # $kana is now 'アイウエオ'.

DESCRIPTION

This module provides functions to convert different written forms of Japanese into one another.

All the functions in this module assume the use of Unicode encoding. All input and output strings must be encoded using UTF-8.

ROMANIZATION

These functions convert Japanese letters to and from romanized forms.

kana2romaji -- Convert kana to romaji

    use Lingua::JA::Moji 'kana2romaji';

    $romaji = kana2romaji ("うれしいこども");
    # Now $romaji = 'uresîkodomo'

Convert kana to a romanized form.

An optional second argument, a hash reference, controls the style of conversion.

    use utf8;
    $romaji = kana2romaji ("しんぶん", {style => "hepburn"});
    # $romaji = "shimbun"

The possible options are

style

The style of romanization. The default form of romanization is "Nippon-shiki". See http://www.sljfaq.org/afaq/nippon-shiki.html. The user can set the conversion style to "hepburn" or "passport" or "kunrei". See http://www.sljfaq.org/afaq/kana-roman.html.

use_m

If this is set to any "true" value, syllabic ns (ん) which come before "b" or "p" sounds, such as the first "n" in "shinbun" (しんぶん, newspaper) will be converted into "m" rather than "n".

ve_type

ve_type controls how long vowels are written. The default is to use circumflexes to represent long vowels. If you set "ve_type" => "macron", then it uses macrons (the Hepburn system). If you set "ve_type" => "passport", then it uses "oh" to write long "o" vowels. If you set "ve_type" => "none", then it does not use "h".
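As a rough illustration of the options above, the following sketch passes each documented "ve_type" value, and then "use_m", to kana2romaji. The sample words are arbitrary choices, and no output is shown because the exact spelling may depend on the installed version of the module.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji 'kana2romaji';

binmode STDOUT, ':encoding(UTF-8)';

# Sample word containing long vowels (an arbitrary choice for illustration).
my $word = 'とうきょう';

# Try each long-vowel style named in the documentation above.
my %by_ve_type;
for my $ve_type (qw/macron passport none/) {
    $by_ve_type{$ve_type} = kana2romaji ($word, {ve_type => $ve_type});
    print "$ve_type: $by_ve_type{$ve_type}\n";
}

# Ask for "m" rather than "n" before "b" and "p" sounds.
my $with_m = kana2romaji ('しんぶん', {use_m => 1});
print "use_m: $with_m\n";
```

Comparing the printed lines side by side is a quick way to check which combination of options matches the romanization you need.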

romaji2kana -- Convert romaji to kana

    use Lingua::JA::Moji 'romaji2kana';

    $kana = romaji2kana ('yamaguti');
    # Now $kana = 'ヤマグチ'

Convert romanized Japanese to kana. The conversion is highly liberal and will attempt to convert any romanization it sees into kana. To convert romanized Japanese into hiragana, use "romaji2hiragana".

The second argument to the function contains options in the form of a hash reference:

    $kana = romaji2kana ($romaji, {wapuro => 1});

Use the option wapuro => 1 to convert long vowels into the equivalent kana rather than chouon.

romaji2hiragana -- Convert romaji to hiragana

    use Lingua::JA::Moji 'romaji2hiragana';

    $hiragana = romaji2hiragana ('babubo');
    # Now $hiragana = 'ばぶぼ'

Convert romanized Japanese into hiragana. This takes the same options as "romaji2kana". It also switches on the "wapuro" option, which writes long vowels with a kana rather than a chouon (long vowel marker).

romaji_styles

    use Lingua::JA::Moji 'romaji_styles';

    my @styles = romaji_styles ();
    # Returns a true value
    romaji_styles ("hepburn");
    # Returns the undefined value
    romaji_styles ("frogs");

Given an argument, return whether it is a legitimate style of romanization.

Without an argument, return a list of the possible styles. Each element of the list is a hash reference containing "abbrev", a short name, and "full_name", the full name of the style.
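The two calling forms might be combined as in the sketch below, which lists the available styles and then validates a style name before use. The variable $wanted stands in for hypothetical user input and is not part of the module.

```perl
use strict;
use warnings;
use Lingua::JA::Moji 'romaji_styles';

binmode STDOUT, ':encoding(UTF-8)';

# Without an argument: one hash per style, with "abbrev" and "full_name".
my @styles = romaji_styles ();
for my $style (@styles) {
    print "$style->{abbrev}: $style->{full_name}\n";
}

# With an argument: true if the name is a known style.
# $wanted is a hypothetical value, for example one read from a config file.
my $wanted = 'hepburn';
my $style_ok = romaji_styles ($wanted) ? 1 : 0;
print "'$wanted' is ", ($style_ok ? "" : "not "), "a known style\n";
```

Checking the name first avoids passing an unknown "style" value to kana2romaji.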

is_voiced

    use Lingua::JA::Moji 'is_voiced';

    if (is_voiced ('が')) {
         print "が is voiced.\n";
    }

Given a kana or romaji input, is_voiced returns a true value if the sound is a voiced sound like a, za, ga, etc. and the undefined value if not.

is_romaji

    use Lingua::JA::Moji 'is_romaji';

    # The following line returns "undef"
    is_romaji ("abcdefg");
    # The following line returns a defined value
    is_romaji ("atarimae");

Detect whether a string of alphabetical characters, which may also include characters with macrons or circumflexes, "looks like" romanized Japanese. If the test is successful, returns the romaji in a canonical form.

This works by converting the string to kana and seeing whether it converts cleanly.

normalize_romaji

    use Lingua::JA::Moji 'normalize_romaji';

    $normalized = normalize_romaji ('tsumuji');

normalize_romaji converts romanized Japanese to a canonical form, which is based on the Nippon-shiki romanization, but without representing long vowels using a circumflex. In the canonical form, sokuon (っ) characters are converted into the string "xtu".

If there is kana in the input string, this will also be converted to romaji.

normalize_romaji is for comparing two Japanese words which may be represented in different ways, for example in different romanization systems, to see if they refer to the same word despite the difference in writing. It does not provide a standardized or officially-sanctioned form of romanization.
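A comparison of the kind described above could look like the following sketch. The helper same_word is a hypothetical name introduced here, not a function of the module, and the verdicts are printed rather than asserted because they depend on the module's conversion tables.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji 'normalize_romaji';

binmode STDOUT, ':encoding(UTF-8)';

# same_word is a hypothetical helper, not part of Lingua::JA::Moji.
# Two spellings count as the same word if they normalize identically.
sub same_word
{
    my ($first, $second) = @_;
    return normalize_romaji ($first) eq normalize_romaji ($second);
}

# A Hepburn spelling against a Nippon-shiki one, then against the kana.
for my $pair (['tsumuji', 'tumuzi'], ['tsumuji', 'つむじ']) {
    my ($first, $second) = @$pair;
    my $verdict = same_word ($first, $second) ? 'same' : 'different';
    print "$first / $second: $verdict\n";
}
```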

KANA

hira2kata -- Convert hiragana to katakana

    use Lingua::JA::Moji 'hira2kata';

    $katakana = hira2kata ('ひらがな');
    # Now $katakana = 'ヒラガナ'

hira2kata converts hiragana into katakana. If the input is a list, it converts each element of the list and, in list context, returns a list of the converted inputs; otherwise it returns a concatenation of the strings.

    my @katakana = hira2kata (@hiragana);

This does not convert chouon signs.
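The list and scalar behaviour described above can be pictured with a small stand-alone sketch. This is not the module's code: sketch_hira2kata is a hypothetical stand-in that relies on the fact that the hiragana ぁ to ん and the katakana ァ to ン occupy parallel ranges in Unicode, and it joins the strings with nothing between them in scalar context, which is an assumption about what "concatenation" means here.

```perl
use strict;
use warnings;
use utf8;

# sketch_hira2kata is a hypothetical stand-in, not the module's code.
sub sketch_hira2kata
{
    my @input = @_;
    for (@input) {
        # Hiragana U+3041 to U+3093 sit exactly 0x60 below
        # katakana U+30A1 to U+30F3, so tr maps one range to the other.
        tr/ぁ-ん/ァ-ン/;
    }
    # List context: one converted string per input.
    # Scalar context: the converted strings joined together.
    return wantarray ? @input : join ('', @input);
}

my @list = sketch_hira2kata ('ひら', 'がな');
my $joined = sketch_hira2kata ('ひら', 'がな');
```

Because the chouon sign (ー) lies outside the translated range, a sketch like this leaves it unchanged.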

kata2hira -- Convert katakana to hiragana

    use Lingua::JA::Moji 'kata2hira';

    $hiragana = kata2hira ('カキクケコ');
    # Now $hiragana = 'かきくけこ'

kata2hira converts full-width katakana into hiragana. If the input is a list, it converts each element of the list and, in list context, returns a list of the converted inputs; otherwise it returns a concatenation of the strings.

    my @hiragana = kata2hira (@katakana);

This function does not convert chouon signs into long vowels. It also does not convert half-width katakana into hiragana.

InHankakuKatakana

    use Lingua::JA::Moji 'InHankakuKatakana';

    use utf8;
    if ('ｱ' =~ /\p{InHankakuKatakana}/) {
        print "ｱ is half-width katakana\n";
    }

InHankakuKatakana is a character class for use in regular expressions with \p which can validate halfwidth katakana.
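For readers unfamiliar with \p and named subroutines, Perl lets a program define its own character property as a subroutine whose name begins with "In" or "Is" and which returns a list of hexadecimal code point ranges. The sketch below shows that mechanism with a hypothetical property, InSketchHankaku. Its range (U+FF66 to U+FF9F) is an assumption made for illustration and may differ from what the module's InHankakuKatakana covers.

```perl
use strict;
use warnings;

# InSketchHankaku is a hypothetical property, not exported by the module.
# Assumed range: halfwidth wo (U+FF66) to the halfwidth semi-voiced
# sound mark (U+FF9F).
sub InSketchHankaku
{
    return "FF66\tFF9F\n";
}

my $half = "\x{FF71}\x{FF72}\x{FF73}";    # halfwidth a, i, u
my $full = "\x{30A2}\x{30A4}\x{30A6}";    # full-width a, i, u

# Anchoring with ^ and $ validates the whole string, not just one character.
my $half_matches = $half =~ /^\p{InSketchHankaku}+$/ ? 1 : 0;
my $full_matches = $full =~ /^\p{InSketchHankaku}+$/ ? 1 : 0;
```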

kana2hw -- Convert kana to halfwidth katakana

    use Lingua::JA::Moji 'kana2hw';

    $half_width = kana2hw ('あいウカキぎょう。');
    # Now $half_width = 'ｱｲｳｶｷｷﾞｮｳ｡'

kana2hw converts hiragana, katakana, and fullwidth Japanese punctuation to halfwidth katakana and halfwidth punctuation. Its function is similar to the Emacs command japanese-hankaku-region. For the opposite function, see hw2katakana.

hw2katakana -- Convert halfwidth katakana to katakana

    use Lingua::JA::Moji 'hw2katakana';

    $full_width = hw2katakana ('ｱｲｳｶｷｷﾞｮｳ｡');
    # Now $full_width = 'アイウカキギョウ。'

hw2katakana converts halfwidth katakana and Japanese punctuation to fullwidth katakana and punctuation. Its function is similar to the Emacs command japanese-zenkaku-region. For the opposite function, see kana2hw.

is_kana

    use Lingua::JA::Moji 'is_kana';

    

This function returns a true value if its argument is a string of kana, or an undefined value if not. The input cannot contain punctuation or the long vowel symbol (chouonpu).

is_hiragana

    use Lingua::JA::Moji 'is_hiragana';

    

This function returns a true value if its argument is a string of hiragana, and an undefined value if not. The entire string from beginning to end must be hiragana for this to return true. The string cannot include punctuation marks or the long vowel symbol (chouonpu).
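A short sketch of how is_kana and is_hiragana might be used together follows. The sample strings and the %report hash are illustrative choices, not part of the module.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw/is_kana is_hiragana/;

binmode STDOUT, ':encoding(UTF-8)';

# Classify a hiragana string, a katakana string, and a romaji string.
my %report;
for my $text ('ひらがな', 'カタカナ', 'kana') {
    $report{$text} = {
        kana => is_kana ($text) ? 'yes' : 'no',
        hiragana => is_hiragana ($text) ? 'yes' : 'no',
    };
    print "$text: kana=$report{$text}{kana} ",
        "hiragana=$report{$text}{hiragana}\n";
}
```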

kana2katakana -- Convert kana to katakana

    use Lingua::JA::Moji 'kana2katakana';

    

Convert any of katakana, halfwidth katakana, circled katakana and hiragana to full width katakana.
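A minimal usage sketch, assuming the behaviour described above: the mixed input string is an arbitrary example combining hiragana, full-width katakana, and a circled katakana.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji 'kana2katakana';

binmode STDOUT, ':encoding(UTF-8)';

# Mixed input: hiragana, full-width katakana, and circled katakana.
my $mixed = 'あいウエ㋔';
my $katakana = kana2katakana ($mixed);
print "$katakana\n";
```

Normalizing to a single kana form like this can be useful before sorting or comparing user input.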

WIDE ASCII

Almost every website in Japan requires users to input numbers and letters using "half width" characters. These functions make that requirement unnecessary by converting between the wide and ordinary forms automatically.

InWideAscii

    use Lingua::JA::Moji 'InWideAscii';

    use utf8;
    if ('Ａ' =~ /\p{InWideAscii}/) {
        print "Ａ is wide ascii\n";
    }

This is a character class for use with \p which matches "wide ASCII" characters (全角英数字).

wide2ascii -- Convert wide ASCII characters to printable ASCII characters

    use Lingua::JA::Moji 'wide2ascii';

    $ascii = wide2ascii ('ａｂＣＥ０１９');
    # Now $ascii = 'abCE019'

Convert the "wide ASCII" used in Japan (fullwidth ASCII, 全角英数字) into usual ASCII symbols (半角英数字).

ascii2wide -- Convert printable ASCII characters to wide ASCII characters

    use Lingua::JA::Moji 'ascii2wide';

    $wide = ascii2wide ('abCE019');
    # Now $wide = 'ａｂＣＥ０１９'

Convert usual ASCII symbols (半角英数字) into the "wide ASCII" used in Japan (fullwidth ASCII, 全角英数字).
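The relationship between the two character sets can be sketched without the module: in Unicode, the fullwidth forms U+FF01 to U+FF5E mirror the printable ASCII characters U+0021 to U+007E in the same order. The helpers below are hypothetical stand-ins that use only this fact. They are not the module's implementation, and they ignore cases such as the space character, which wide2ascii and ascii2wide may handle differently.

```perl
use strict;
use warnings;

# Both helpers are hypothetical sketches, not the module's functions.
# Fullwidth forms U+FF01 to U+FF5E mirror printable ASCII U+0021 to U+007E.
sub sketch_wide2ascii
{
    my ($text) = @_;
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{21}-\x{7E}/;
    return $text;
}

sub sketch_ascii2wide
{
    my ($text) = @_;
    $text =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    return $text;
}

# Round trip: ASCII to fullwidth and back again.
my $wide = sketch_ascii2wide ('abCE019');
my $ascii = sketch_wide2ascii ($wide);
```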

kana_order

    use Lingua::JA::Moji 'kana_order';

    $kana_order = kana_order ();

Returns an array reference containing an ordering of the kana.

OTHER TYPES OF LETTERING

kana2morse -- Convert kana to Japanese morse code (wabun code)

    use Lingua::JA::Moji 'kana2morse';

    $morse = kana2morse ('しょっちゅう');
    # Now $morse = '--.-. -- .--. ..-. -..-- ..-'

Convert Japanese kana into Morse code. Note that Japanese Morse code has no way of representing small kana characters, so converting to Morse code and back again turns しょっちゅう into シヨツチユウ.

morse2kana -- Convert Japanese morse code (wabun code) to kana

    use Lingua::JA::Moji 'morse2kana';

    $kana = morse2kana ('--.-. -- .--. ..-. -..-- ..-');
    # Now $kana = 'シヨツチユウ'

Convert Japanese Morse code into kana. Each Morse code element must be separated by whitespace from the next one.

bugs

This has not been extensively tested.

kana2braille -- Convert kana to Japanese braille

    use Lingua::JA::Moji 'kana2braille';

    

Converts kana into the equivalent Japanese braille (tenji) forms.

bugs

This has not been extensively tested. This is not an adequate Japanese braille convertor. Creating Japanese braille requires breaking Japanese sentences up into individual words, but this does not attempt to do that. People who are interested in building a Perl braille convertor could start here.

braille2kana -- Convert Japanese braille to kana

    use Lingua::JA::Moji 'braille2kana';

    

Converts Japanese braille (tenji) into the equivalent katakana.

kana2circled -- Convert kana to circled katakana

    use Lingua::JA::Moji 'kana2circled';

    $circled = kana2circled ('あいうえお');
    # Now $circled = '㋐㋑㋒㋓㋔'

This function converts kana into the "circled katakana" of Unicode, which have code points from 32D0 to 32FE. See also "circled2kana".

Note that there is no circled form of the ン kana, so this is left untouched.

circled2kana -- Convert circled katakana to kana

    use Lingua::JA::Moji 'circled2kana';

    $kana = circled2kana ('㋐㋑㋒㋓㋔');
    # Now $kana = 'アイウエオ'

This function converts the "circled katakana" of Unicode into full-width katakana. See also "kana2circled".

KANJI

new2old_kanji -- Convert Modern kanji to Pre-1949 kanji

    use Lingua::JA::Moji 'new2old_kanji';

    $old = new2old_kanji ('三国 連太郎');
    # Now $old = '三國 連太郎'

Convert new-style (post-1949) kanji (Chinese characters) into old-style (pre-1949) kanji.

bugs

The list of characters in this convertor may not contain every pair of old/new kanji.

It will not correctly convert 弁 since this has three different equivalents in the old system.

old2new_kanji -- Convert Pre-1949 kanji to Modern kanji

    use Lingua::JA::Moji 'old2new_kanji';

    $new = old2new_kanji ('櫻井');
    # Now $new = '桜井'

Convert old-style (pre-1949) kanji (Chinese characters) into new-style (post-1949) kanji.

CYRILLIZATION

This is an experimental cyrillization of kana based on the information in a Wikipedia article, http://en.wikipedia.org/wiki/Cyrillization_of_Japanese. The module author does not know anything about cyrillization of kana, so any assistance in correcting this is very welcome.

kana2cyrillic -- Convert kana to the Cyrillic (Russian) alphabet

    use Lingua::JA::Moji 'kana2cyrillic';

    $cyril = kana2cyrillic ('シンブン');
    # Now $cyril = 'симбун'

cyrillic2katakana -- Convert the Cyrillic (Russian) alphabet to katakana

    use Lingua::JA::Moji 'cyrillic2katakana';

    $kana = cyrillic2katakana ('симбун');
    # Now $kana = 'シンブン'

HANGUL (KOREAN LETTERS)

kana2hangul

    use Lingua::JA::Moji 'kana2hangul';

    $hangul = kana2hangul ('すごわざ');
    # Now $hangul = '스고와자'

bugs

Doesn't deal with ん
May be incorrect

This is based on a list found on the internet at http://kajiritate-no-hangul.com/kana.html. There is currently no proof of correctness.

SUPPORT

Mailing list

There is a mailing list for this module and Convert::Moji at http://groups.google.com/group/perl-moji.

DIAGNOSTICS

bugs

romaji to/from kana conversion

There are some bugs with romaji to kana conversion and vice-versa.

SEE ALSO

Other Perl modules on CPAN include

Japanese kana/romanization

Data::Validate::Japanese

Several of the ideas for this module came from this one. It contains validators for kanji and kana.

Lingua::JA::Kana

This is where several of the ideas for this module came from. It contains convertors for hiragana, half width and full width katakana, and romaji. The romaji conversion is less complete than this module but more compact and probably much faster.

Lingua::JA::Romanize::Japanese

Romanization of Japanese. The module also includes romanization of kanji via the kakasi kanji to romaji convertor, and other functions.

Lingua::JA::Romaji::Valid

Validate romanized Japanese.

Lingua::JA::Hepburn::Passport

Lingua::JA::Fold

Full/half width conversion, collation of Japanese text.

Books

Parts of this module are covered in the book "Perl CPAN Module Guide" by Naoki Tomita (in Japanese), ISBN 978-4862671080, published by WEB+DB PRESS plus, April 2011.

EXPORT

This module exports its functions only on request. To export all the functions in the module,

    use Lingua::JA::Moji ':all';

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENSE

Copyright 2008-2011 Ben Bullock, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

ACKNOWLEDGEMENTS

Thanks to Naoki Tomita for various assistance (see http://groups.google.com/group/perl-moji/browse_thread/thread/10a42c35f7c22ebc).