NAME

Lingua::JA::Moji - Handle many kinds of Japanese characters

SYNOPSIS

Convert various types of Japanese characters into one another.

    use Lingua::JA::Moji qw/kana2romaji romaji2kana/;
    use utf8;
    my $romaji = kana2romaji ('あいうえお');
    # $romaji is now 'aiueo'.
    my $kana = romaji2kana ($romaji);
    # $kana is now 'アイウエオ'.

DESCRIPTION

This module provides functions to convert different written forms of Japanese into one another. It converts between romanized Japanese, hiragana, and katakana. It also includes a number of unusual encodings such as Japanese braille and Morse code, as well as conversions between Japanese and Cyrillic and Hangul. It also handles conversion between the Chinese characters (kanji) used before and after the character reforms of 1949, as well as the various bracketed and circled forms of kana and kanji.

All the functions in this module assume the use of Unicode. All input and output strings are Perl character strings, that is, text in Perl's internal "utf8" format rather than undecoded bytes.
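
As a minimal sketch of what this means in practice, the following uses only the core Encode module to turn UTF-8 bytes, as they might be read from a file or socket, into a Perl character string before conversion, and back into bytes for output. The variable names are illustrative and not part of this module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five hiragana characters (a, i, u, e, o) as raw UTF-8 bytes.
my $bytes = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86\xE3\x81\x88\xE3\x81\x8A";

# Decode to a character string. This is the form to pass to the
# functions of this module.
my $chars = decode ('UTF-8', $bytes);

# Encode again before writing the result out as bytes.
my $out = encode ('UTF-8', $chars);

printf "%d bytes, %d characters\n", length ($bytes), length ($chars);
# Prints "15 bytes, 5 characters"
```

String literals in source code need "use utf8;" for the same reason, as in the synopsis above.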

The module loads its data conversion files on demand, so the more obscure conversions should not add to memory use unless they are actually called.

This module does not handle the conversion of kanji words into kana, or kana into kanji.

ROMANIZATION

These functions convert Japanese letters to and from romanized forms.

kana2romaji -- Convert kana to romaji

    use Lingua::JA::Moji 'kana2romaji';

    $romaji = kana2romaji ("うれしいこども");
    # Now $romaji = 'uresîkodomo'

Convert kana to a romanized form.

An optional second argument, a hash reference, controls the style of conversion.

    use utf8;
    $romaji = kana2romaji ("しんぶん", {style => "hepburn"});
    # $romaji = "shimbun"

The options are

style

The style of romanization. The default style is "Nippon-shiki". The other possible values are "hepburn", "passport", and "kunrei". If "hepburn" is selected, the following option use_m is set to a true value, and ve_type is set to "macron".

use_m

If this is true, syllabic ns (ん) which come before "b" or "p" sounds, such as the first "n" in "shinbun" (しんぶん, newspaper) will be converted into "m" rather than "n".

ve_type

The ve_type option controls how long vowels are written. The default is to use circumflexes to represent long vowels.

undef: A circumflex is used.
macron: A macron is used.
passport: "Oh" is used to write long "o" vowels, and other long vowels are ignored.
none: Long vowels are not indicated.
wapuro: Chouon marks become hyphens, and おう becomes ou.
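
The effect of the use_m option can be sketched on already-romanized text with a single substitution. This is an illustration only, not the module's implementation; the helper name n_to_m is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper showing the effect of use_m on romanized output.
sub n_to_m
{
    my ($romaji) = @_;
    # Replace "n" with "m" only when "b" or "p" follows it.
    $romaji =~ s/n(?=[bp])/m/g;
    return $romaji;
}

my $with_m = n_to_m ('shinbun');
# Now $with_m = 'shimbun'
```

Only a syllabic n can stand directly before "b" or "p" in romanized Japanese, so the lookahead does not touch the syllables na, ni, nu, ne, no.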

romaji2kana -- Convert romaji to kana

    use Lingua::JA::Moji 'romaji2kana';

    $kana = romaji2kana ('yamaguti');
    # Now $kana = 'ヤマグチ'

Convert romanized Japanese to katakana. The conversion is highly liberal and will attempt to convert any romanization it sees into katakana. It is based on the behaviour of the Microsoft IME (input method editor). To convert romanized Japanese into hiragana, use "romaji2hiragana".

An optional second argument to the function contains options in the form of a hash reference:

    $kana = romaji2kana ($romaji, {wapuro => 1});

Use the option wapuro => 1 to convert long vowels into the equivalent kana rather than "chouon".

romaji2hiragana -- Convert romaji to hiragana

    use Lingua::JA::Moji 'romaji2hiragana';

    $hiragana = romaji2hiragana ('babubo');
    # Now $hiragana = 'ばぶぼ'

Convert romanized Japanese into hiragana. This takes the same options as "romaji2kana". It also switches on the "wapuro" option, which writes long vowels with a kana rather than a "chouon".

romaji_styles

    use Lingua::JA::Moji 'romaji_styles';

    my @styles = romaji_styles ();
    # Returns a true value
    romaji_styles ("hepburn");
    # Returns the undefined value
    romaji_styles ("frogs");

Given an argument, this returns a true value if it is a known style of romanization, and the undefined value if not.

Without an argument, it returns a list of possible styles, as an array of hash references, with each hash reference containing the short name under the key "abbrev" and the full name under the key "full_name".

is_voiced

    use Lingua::JA::Moji 'is_voiced';

    if (is_voiced ('が')) {
         print "が is voiced.\n";
    }

Given a kana or romaji input, is_voiced returns a true value if the sound is a voiced sound like a, za, ga, etc. and the undefined value if not.

is_romaji

    use Lingua::JA::Moji 'is_romaji';

    # The following line returns "undef"
    is_romaji ("abcdefg");
    # The following line returns a defined value
    is_romaji ('loyehye');
    # The following line returns a defined value
    is_romaji ("atarimae");

This detects whether a string of alphabetical characters, which may also include characters with macrons or circumflexes, "looks like" romanized Japanese. If the test is successful, it returns a true value, and if the test is unsuccessful, it returns a false value. If the string is empty, it returns a false value.

This works by converting the string to kana via "romaji2kana" and seeing if it converts cleanly or not.

is_romaji_strict

    use Lingua::JA::Moji 'is_romaji_strict';

    # The following line returns "undef"
    is_romaji_strict ("abcdefg");
    # The following line returns "undef"
    is_romaji_strict ('loyehye');
    # The following line returns a defined value
    is_romaji_strict ("atarimae");

This test is much stricter than "is_romaji". It insists that the word does not contain constructions which may be valid as inputs to an IME, but which do not look like Japanese words.

normalize_romaji

    use Lingua::JA::Moji 'normalize_romaji';

    $normalized = normalize_romaji ('tsumuji');

normalize_romaji converts romanized Japanese to a canonical form, which is based on the Nippon-shiki romanization, but without representing long vowels using a circumflex. In the canonical form, sokuon (っ) characters are converted into the string "xtu". If there is kana in the input string, this will also be converted to romaji.

normalize_romaji is for comparing two Japanese words which may be represented in different ways, for example in different romanization systems, to see if they refer to the same word despite the difference in writing. It does not provide a standardized or officially-sanctioned form of romanization.

KANA

These functions convert one form of kana into another.

hira2kata -- Convert hiragana to katakana

    use Lingua::JA::Moji 'hira2kata';

    $katakana = hira2kata ('ひらがな');
    # Now $katakana = 'ヒラガナ'

hira2kata converts hiragana into katakana. The input may be a single string or a list of strings. If the input is a list, it converts each element of the list, and in list context it returns a list of the converted inputs. In scalar context it returns a concatenation of the strings.

    my @katakana = hira2kata (@hiragana);

This does not convert "chouon" signs.
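
The list and scalar context behaviour described above can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's implementation: the name my_hira2kata is hypothetical, and the sketch relies only on the fact that the hiragana U+3041 to U+3096 sit exactly 0x60 code points below the katakana U+30A1 to U+30F6.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for hira2kata, for illustration only.
sub my_hira2kata
{
    # Copy the arguments so that the caller's strings are not modified.
    my (@input) = @_;
    for (@input) {
        # Shift each hiragana character to the matching katakana.
        tr/ぁ-ゖ/ァ-ヶ/;
    }
    # List context: one output per input. Scalar context: concatenation.
    return wantarray ? @input : join ('', @input);
}

my @list = my_hira2kata ('ひら', 'がな');
# Now @list = ('ヒラ', 'ガナ')
my $joined = my_hira2kata ('ひら', 'がな');
# Now $joined = 'ヒラガナ'
```

The chouon sign (U+30FC) lies outside the translated range, so this sketch also leaves it unchanged.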

kata2hira -- Convert katakana to hiragana

    use Lingua::JA::Moji 'kata2hira';

    $hiragana = kata2hira ('カキクケコ');
    # Now $hiragana = 'かきくけこ'

kata2hira converts full-width katakana into hiragana. If the input is a list, it converts each element of the list, and in list context, returns a list of the converted inputs, otherwise it returns a concatenation of the strings.

    my @hiragana = kata2hira (@katakana);

This function does not convert "chouon" signs into long vowels. It also does not convert half-width katakana into hiragana.

kana2katakana -- Convert kana to katakana

    use Lingua::JA::Moji 'kana2katakana';

This converts any of katakana, "halfwidth katakana", circled katakana and hiragana to full width katakana.

kana_to_large

    use Lingua::JA::Moji 'kana_to_large';

    $large = kana_to_large ('ぁあぃい');
    # Now $large = 'ああいい'

Convert small-sized kana such as 「ぁ」 into full-sized kana such as 「あ」.

InHankakuKatakana

    use Lingua::JA::Moji 'InHankakuKatakana';

    use utf8;
    if ('ｱ' =~ /\p{InHankakuKatakana}/) {
        print "ｱ is half-width katakana\n";
    }

InHankakuKatakana is a character class for use in regular expressions with \p which can validate "halfwidth katakana".
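
For readers unfamiliar with how a name like this becomes usable with \p, the following sketch defines a user-defined property in the way Perl documents in perlunicode: a subroutine whose name begins with "In" or "Is" and which returns a list of code point ranges. The property name InMyHalfwidthKana and the range chosen are assumptions made for the sketch, not the definition used by this module.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical property, for illustration only.
sub InMyHalfwidthKana
{
    # One range per line: start and end code points in hexadecimal,
    # separated by a tab. This covers the halfwidth katakana letters
    # and sound marks, U+FF66 to U+FF9F.
    return "FF66\tFF9F\n";
}

# Halfwidth katakana matches the property.
my $half = ('ｱ' =~ /^\p{InMyHalfwidthKana}+$/) ? 1 : 0;
# Fullwidth katakana does not.
my $full = ('ア' =~ /^\p{InMyHalfwidthKana}+$/) ? 1 : 0;
# Now $half = 1 and $full = 0
```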

kana2hw -- Convert kana to halfwidth katakana

    use Lingua::JA::Moji 'kana2hw';

    $half_width = kana2hw ('あいウカキぎょう。');
    # Now $half_width = 'ｱｲｳｶｷｷﾞｮｳ｡'

kana2hw converts hiragana, katakana, and fullwidth Japanese punctuation to "halfwidth katakana" and halfwidth punctuation. Its function is similar to the Emacs command japanese-hankaku-region. For the opposite function, see hw2katakana. See also "katakana2hw" for a function which only converts katakana.

hw2katakana -- Convert halfwidth katakana to katakana

    use Lingua::JA::Moji 'hw2katakana';

    $full_width = hw2katakana ('ｱｲｳｶｷｷﾞｮｳ｡');
    # Now $full_width = 'アイウカキギョウ。'

hw2katakana converts "halfwidth katakana" and halfwidth Japanese punctuation to fullwidth katakana and fullwidth punctuation. Its function is similar to the Emacs command japanese-zenkaku-region. For the opposite function, see kana2hw.

katakana2hw -- Convert katakana to halfwidth katakana

    use Lingua::JA::Moji 'katakana2hw';

    $hw = katakana2hw ("あいうえおアイウエオ");
    # Now $hw = 'あいうえおｱｲｳｴｵ'

This converts katakana to "halfwidth katakana", leaving hiragana unchanged. See also "kana2hw".

is_kana

    use Lingua::JA::Moji 'is_kana';

This function returns a true value if its argument is a string of kana, or an undefined value if not. The input cannot contain punctuation or "chouon".

is_hiragana

    use Lingua::JA::Moji 'is_hiragana';

This function returns a true value if its argument is a string of hiragana, and an undefined value if not. The entire string from beginning to end must be hiragana for this to return true. The string cannot include punctuation marks or "chouon".

kana_order

    use Lingua::JA::Moji 'kana_order';

    $kana_order = kana_order ();

Returns an array reference containing an ordering of the kana. This is useful for looping over the kana or sorting.

katakana2syllable

    use Lingua::JA::Moji 'katakana2syllable';

    $syllables = katakana2syllable ('ソーシャルブックマークサービス');

This breaks the given katakana string into syllables rather than into single characters. Broken up character by character, the start of the string above would be 'ソ', 'ー', 'シ', 'ャ', 'ル'. This function breaks the string up into meaningful syllables, so that $syllables begins 'ソー', 'シャ', 'ル'.

InKana

    use Lingua::JA::Moji 'InKana';

    $is_kana = ('あいうえお' =~ /^\p{InKana}+$/);
    # Now $is_kana = '1'

A character class for use in regular expressions which matches all kana characters. This class catches meaningful combinations of hiragana, katakana, halfwidth katakana, circled katakana, and katakana combined words.

This is a combination of the existing Perl character classes Katakana, InKatakana, and InHiragana, minus unassigned characters, plus the "halfwidth katakana prolonged sound mark" (U+FF70) <ｰ> (chouon), the "halfwidth katakana voiced sound mark" (U+FF9E) <ﾞ> (dakuten) and the "halfwidth katakana semivoiced sound mark" (U+FF9F) <ﾟ> (handakuten). It is somewhat like the following:

    qr/\p{Katakana}|\p{InKatakana}|\p{InHiragana}|ｰ|ﾞ|ﾟ/

except that the unassigned points which are matched by \p{Katakana} are not matched.

WIDE ASCII FUNCTIONS

Functions for handling "wide ASCII".

InWideAscii

    use Lingua::JA::Moji 'InWideAscii';

    use utf8;
    if ('Ａ' =~ /\p{InWideAscii}/) {
        print "Ａ is wide ascii\n";
    }

This is a character class for use with \p which matches "wide ASCII".

wide2ascii -- Convert wide ASCII characters to printable ASCII characters

    use Lingua::JA::Moji 'wide2ascii';

    $ascii = wide2ascii ('ａｂＣＥ０１９');
    # Now $ascii = 'abCE019'

Convert "wide ASCII" into ASCII.

ascii2wide -- Convert printable ASCII characters to wide ASCII characters

    use Lingua::JA::Moji 'ascii2wide';

    $wide = ascii2wide ('abCE019');
    # Now $wide = 'ａｂＣＥ０１９'

Convert ASCII into "wide ASCII".
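
The relationship between the two character sets can be sketched without this module. In Unicode, the fullwidth forms U+FF01 to U+FF5E mirror the printable ASCII characters U+0021 to U+007E in the same order, so a single transliteration maps one range onto the other. The helper names below are hypothetical and this is not the module's implementation. In particular, the sketch leaves spaces untouched, and whether the module's functions do the same is not stated here.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helpers, for illustration only.
sub my_wide2ascii
{
    my ($text) = @_;
    # Map each fullwidth form onto the ASCII character it mirrors.
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{21}-\x{7E}/;
    return $text;
}

sub my_ascii2wide
{
    my ($text) = @_;
    # The reverse mapping, from ASCII to the fullwidth forms.
    $text =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    return $text;
}

my $ascii = my_wide2ascii ('ａｂＣＥ０１９');
# Now $ascii = 'abCE019'
my $wide = my_ascii2wide ('abCE019');
# Now $wide = 'ａｂＣＥ０１９'
```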

OTHER TYPES OF LETTERING

kana2morse -- Convert kana to Japanese morse code (wabun code)

    use Lingua::JA::Moji 'kana2morse';

    $morse = kana2morse ('しょっちゅう');
    # Now $morse = '--.-. -- .--. ..-. -..-- ..-'

Convert Japanese kana into Morse code. Japanese Morse code does not have any way of representing small kana characters, so converting to Morse code and back again turns しょっちゅう into シヨツチユウ.

morse2kana -- Convert Japanese morse code (wabun code) to kana

    use Lingua::JA::Moji 'morse2kana';

    $kana = morse2kana ('--.-. -- .--. ..-. -..-- ..-');
    # Now $kana = 'シヨツチユウ'

Convert Japanese Morse code into kana. Each Morse code element must be separated by whitespace from the next one.

Bugs

This has not been extensively tested.

kana2braille -- Convert kana to Japanese braille

    use Lingua::JA::Moji 'kana2braille';

This converts kana into the equivalent Japanese braille (tenji) forms.

Bugs

This has not been extensively tested. This is not an adequate Japanese braille converter. Creating Japanese braille requires breaking Japanese sentences up into individual words, but this function does not attempt to do that. People who are interested in building a Perl braille converter could start here.

braille2kana -- Convert Japanese braille to kana

    use Lingua::JA::Moji 'braille2kana';

Converts Japanese braille (tenji) into the equivalent katakana.

kana2circled -- Convert kana to circled katakana

    use Lingua::JA::Moji 'kana2circled';

    $circled = kana2circled ('あいうえお');
    # Now $circled = '㋐㋑㋒㋓㋔'

This function converts kana into the "circled katakana" of Unicode, which have code points from U+32D0 to U+32FE. See also "circled2kana".

There is no circled form of the ン kana, so this is left untouched.

circled2kana -- Convert circled katakana to kana

    use Lingua::JA::Moji 'circled2kana';

    $kana = circled2kana ('㋐㋑㋒㋓㋔');
    # Now $kana = 'アイウエオ'

This function converts the "circled katakana" of Unicode into full-width katakana. See also "kana2circled".

KANJI

new2old_kanji -- Convert Modern kanji to Pre-1949 kanji

    use Lingua::JA::Moji 'new2old_kanji';

    $old = new2old_kanji ('三国 連太郎');
    # Now $old = '三國 連太郎'

Convert new-style (post-1949) kanji (Chinese characters) into old-style (pre-1949) kanji.

Bugs

The list of characters in this converter may not contain every pair of old/new kanji.

It will not correctly convert 弁 since this has three different equivalents in the old system.

old2new_kanji -- Convert Pre-1949 kanji to Modern kanji

    use Lingua::JA::Moji 'old2new_kanji';

    $new = old2new_kanji ('櫻井');
    # Now $new = '桜井'

Convert old-style (pre-1949) kanji (Chinese characters) into new-style (post-1949) kanji.

circled2kanji

    use Lingua::JA::Moji 'circled2kanji';

    $kanji = circled2kanji ('㊯');
    # Now $kanji = '協'

Convert the circled forms of kanji into their uncircled equivalents.

kanji2circled

    use Lingua::JA::Moji 'kanji2circled';

    $kanji = kanji2circled ('協嬉');
    # Now $kanji = '㊯嬉'

Convert the usual forms of kanji into circled equivalents, if they exist. Note that only a limited number of kanji have circled forms.

bracketed2kanji

    use Lingua::JA::Moji 'bracketed2kanji';

    $kanji = bracketed2kanji ('㈱');
    # Now $kanji = '株'

Convert the bracketed forms of kanji into their unbracketed equivalents.

kanji2bracketed

    use Lingua::JA::Moji 'kanji2bracketed';

    $kanji = kanji2bracketed ('株');
    # Now $kanji = '㈱'

Convert the unbracketed forms of kanji into their bracketed equivalents, if they exist.

CYRILLIZATION

This is an experimental cyrillization of kana based on the information in a Wikipedia article, http://en.wikipedia.org/wiki/Cyrillization_of_Japanese. The module author does not know anything about cyrillization of kana, so any assistance in correcting this is very welcome.

kana2cyrillic -- Convert kana to the Cyrillic (Russian) alphabet

    use Lingua::JA::Moji 'kana2cyrillic';

    $cyril = kana2cyrillic ('シンブン');
    # Now $cyril = 'симбун'

cyrillic2katakana -- Convert the Cyrillic (Russian) alphabet to katakana

    use Lingua::JA::Moji 'cyrillic2katakana';

    $kana = cyrillic2katakana ('симбун');
    # Now $kana = 'シンブン'

HANGUL (KOREAN LETTERS)

kana2hangul

    use Lingua::JA::Moji 'kana2hangul';

    $hangul = kana2hangul ('すごわざ');
    # Now $hangul = '스고와자'

Bugs

Doesn't deal with syllabic n (ん).

May be incorrect. This is based on a list found on the internet at http://kajiritate-no-hangul.com/kana.html, and its correctness has not been verified.

SUPPORT

Mailing list

There is a mailing list for this module and Convert::Moji at http://groups.google.com/group/perl-moji.

Other modules

NOTES

chouon

The long vowel marker, "ー", or chōon, which is used in Japanese katakana to indicate a lengthened vowel.

wide ASCII

Wide ASCII, fullwidth ASCII, or zenkaku eisūji (全角英数字), is a legacy of bitmapped fonts which has survived into the present day. "Wide ASCII" characters were originally special bitmapped font characters created to be the same size as one kanji or kana character. The name for normal ASCII characters in Japanese is hankaku eisūji (半角英数字), literally "half width English letters and numerals".

Halfwidth katakana

Halfwidth katakana, or hankaku katakana (半角かたかな), is a legacy representation of katakana based on an eight-bit encoding. See http://www.sljfaq.org/afaq/half-width-katakana.html for full details.

EXPORT

This module exports its functions only on request. To export all the functions in the module, use the ":all" tag:

    use Lingua::JA::Moji ':all';

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

ACKNOWLEDGEMENTS

Thanks to Naoki Tomita and David Steinbrunner for fixes.

