Ben Bullock

NAME

Data::Kanji::Kanjidic - parse the "kanjidic" kanji data file

SYNOPSIS

    use Data::Kanji::Kanjidic 'parse_kanjidic';
    my $kanji = parse_kanjidic ('/home/ben/data/edrdg/kanjidic');
    for my $k (keys %$kanji) {
        print "$k has radical number $kanji->{$k}{radical}.\n";
    }

(This example is included as example.pl in the distribution.)

VERSION

This documents Data::Kanji::Kanjidic version 0.16 corresponding to git commit fae7e516e66801df9d011eb61ac69dfca1ff55dd released on Fri Oct 5 10:09:51 2018 +0900.

DESCRIPTION

This extracts the data from the old-format kanjidic kanji dictionary file. See "About Kanjidic" if you are not familiar with this data file.

This module's basic function is to read the kanjidic file into memory and create a data structure from it. Parsing Kanjidic takes a second or two. Here the Kanjidic file is the as-downloaded text file in the old format, rather than the new-format XML file.

FUNCTIONS

parse_kanjidic

    use utf8;
    use Data::Kanji::Kanjidic 'parse_kanjidic';
    my $kanjidic = parse_kanjidic ('/home/ben/data/edrdg/kanjidic');
    print "@{$kanjidic->{猫}{english}}\n";
    # This prints out "cat".
    

(This example is included as parse-kanjidic.pl in the distribution.)

The input is the name of the file where Kanjidic may be found. The return value is a hash reference. The keys of this hash reference are kanji, encoded as Unicode. Each value of the hash reference is the entry for the kanji in its key, and represents one line of Kanjidic. Each entry is itself a hash reference, with the keys described in "parse_entry".

This function assumes that the kanjidic file is encoded using the EUC-JP encoding.
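As a rough illustration of what the EUC-JP assumption means, the following sketch decodes the two EUC-JP bytes of a single kanji into a Perl character, using the core Encode module. The helper name eucjp_to_unicode is invented for this example and is not part of Data::Kanji::Kanjidic; the module's own decoding may well be done differently.

```perl
use strict;
use warnings;
use Encode 'decode';

# Hypothetical helper, not part of Data::Kanji::Kanjidic: turn a string
# of EUC-JP bytes into a string of Perl characters.
sub eucjp_to_unicode
{
    my ($bytes) = @_;
    return decode ('EUC-JP', $bytes);
}

# 0xC7 0xAD are the EUC-JP bytes for the kanji meaning "cat".
my $char = eucjp_to_unicode ("\xC7\xAD");
printf "U+%04X\n", ord ($char);
# This prints out "U+732B".
```

A file in another encoding, such as UTF-8, would need to be converted before being passed to "parse_kanjidic".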

parse_entry

    my %values = parse_entry ($line);

Parse one line of Kanjidic. The input is one line from Kanjidic, encoded as Unicode. The return value is a hash containing each field from the line.

The possible keys and values of the returned hash are as follows. Values are scalars unless otherwise mentioned.

kanji

The kanji itself (the same as the key).

jiscode

The JIS code for the kanji in hexadecimal. This is a two-byte number which identifies the kanji in the JIS X 0208 encoding scheme. In each line of Kanjidic, the JIS code is the second field, after the kanji itself and before the Unicode code point.
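The two bytes of a JIS X 0208 code correspond to a row (ku) and cell (ten) number, each offset by 0x20. The following sketch splits a "jiscode" value into those two numbers. The function name jis_to_kuten is made up for this example and is not provided by this module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a four-digit hexadecimal JIS X 0208 code
# into its row (ku) and cell (ten) numbers.
sub jis_to_kuten
{
    my ($jiscode) = @_;
    my $n = hex ($jiscode);
    # The high byte holds the row and the low byte holds the cell.
    my $ku = ($n >> 8) - 0x20;
    my $ten = ($n & 0xFF) - 0x20;
    return ($ku, $ten);
}

my ($ku, $ten) = jis_to_kuten ('472D');
print "ku $ku, ten $ten\n";
# This prints out "ku 39, ten 13".
```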

B

Bushu (radical as defined by the Nelson kanji dictionary).

C

Classic radical (the usual radical, where this is different from the Nelson radical).

DA

The index numbers used in the 2011 edition of the Kanji & Kana book, by Spahn & Hadamitzky. This may take multiple values, so the value is an array reference.

DB

Japanese for Busy People textbook numbers.

DC

The index numbers used in "The Kanji Way to Japanese Language Power" by Dale Crowley.

DF

"Japanese Kanji Flashcards", by Max Hodges and Tomoko Okazaki.

DG

The index numbers used in the "Kodansha Compact Kanji Guide".

DH

The index numbers used in the 3rd edition of "A Guide To Reading and Writing Japanese" edited by Kenneth Henshall et al.

DJ

The index numbers used in "Kanji in Context" by Nishiguchi and Kono.

DK

The index numbers used by Jack Halpern in his Kanji Learners Dictionary.

DL

The index numbers used in the 2013 edition of Halpern's Kanji Learners Dictionary.

DM

The index numbers from the French-language version of "Remembering the kanji".

DN

The index number used in "Remembering The Kanji, 6th Edition" by James Heisig.

DO

The index numbers used in P.G. O'Neill's Essential Kanji.

DP

The index numbers used by Jack Halpern in his Kodansha Kanji Dictionary (2013), which is the revised version of the "New Japanese-English Kanji Dictionary" of 1990.

DR

The codes developed by Father Joseph De Roo, and published in his book "2001 Kanji" (Bonjinsha).

DS

The index numbers used in the early editions of "A Guide To Reading and Writing Japanese" edited by Florence Sakade.

DT

The index numbers used in the Tuttle Kanji Cards, compiled by Alexander Kask.

E

The numbers used in Kenneth Henshall's kanji book.

F

Frequency of kanji.

The following example program prints a list of kanji from most to least frequently used.

    use Data::Kanji::Kanjidic 'parse_kanjidic';
    my $kanji = parse_kanjidic ('/home/ben/data/edrdg/kanjidic');
    my @sorted;
    for my $k (keys %$kanji) {
        if ($kanji->{$k}->{F}) {
            push @sorted, $kanji->{$k};
        }
    }
    @sorted = sort {$a->{F} <=> $b->{F}} @sorted;
    for (@sorted) {
        print "$_->{kanji}: $_->{F}\n";
    }
    

(This example is included as frequency.pl in the distribution.)

G

The year of elementary school in which this kanji is taught.

This field is also used by kanjidic to give information on whether the kanji is part of the Joyo or Jinmeiyo Kanji sets. If the grade is between 1 and 8, the kanji is part of the Joyo Kanji. If the grade is 9 or 10, then the kanji is not part of the Joyo kanji, but it is part of the Jinmeiyo Kanji.

See also "grade".
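The rule above for reading the "G" field can be written as a small function. This is only a sketch of the rule as described here; the name kanji_set is hypothetical and is not exported by this module.

```perl
use strict;
use warnings;

# Hypothetical helper: name the official kanji list implied by a "G"
# value, following the rule described in this documentation.
sub kanji_set
{
    my ($grade) = @_;
    # Kanji with no "G" field belong to neither list.
    return unless defined $grade;
    return 'Joyo' if $grade >= 1 && $grade <= 8;
    return 'Jinmeiyo' if $grade == 9 || $grade == 10;
    return;
}

for my $grade (1, 8, 9, 10) {
    print "G$grade: ", kanji_set ($grade), "\n";
}
```

With an entry from "parse_kanjidic", it could be called as kanji_set ($kanji->{$k}{G}).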

H

The number in Jack Halpern's dictionary.

I

The Spahn-Hadamitzky book number.

IN

The Spahn-Hadamitzky kanji-kana book number.

J

Japanese proficiency test level.

K

The index in the Gakken Kanji Dictionary (A New Dictionary of Kanji Usage).

L

Code from "Remembering the Kanji" by James Heisig.

MN

Morohashi index number.

MP

Morohashi volume/page.

N

Nelson code from original Nelson dictionary.

O

The numbers used in P.G. O'Neill's "Japanese Names". This may take multiple values, so the value is an array reference.

P

SKIP code.

Q

Four-corner code. This may take multiple values, so the value is an array reference.

S

Stroke count. This may take multiple values, so the value is an array reference.

T

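Special marker. In Kanjidic, "T1" indicates that the readings which follow are name readings (nanori), and "T2" indicates that the reading is a radical name.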

U

Unicode code point as a hexadecimal number.
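Since the "U" value is a hexadecimal string rather than a character, it has to be converted before use. A minimal sketch using only Perl built-ins follows; the helper name codepoint_to_char is invented for illustration.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: convert a hexadecimal code point string, such as
# the value of the "U" field, into the character itself.
sub codepoint_to_char
{
    my ($u) = @_;
    return chr (hex ($u));
}

my $char = codepoint_to_char ('732b');
print "U+732B is $char\n";
```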

V

Nelson code from the "New Nelson" dictionary. This may take multiple values, so the value is an array reference.

W

Korean pronunciation. This may take multiple values, so the value is an array reference.

The following example program prints a list of Korean pronunciations, romanised. This example also requires Lingua::KO::Munja.

    use Data::Kanji::Kanjidic 'parse_kanjidic';
    use Lingua::KO::Munja ':all';
    my $kanji = parse_kanjidic ($ARGV[0]);
    for my $k (sort keys %$kanji) {
        my $w = $kanji->{$k}->{W};
        if ($w) {
            my @h = map {'"' . hangul2roman ($_) . '"'} @$w;
            print "$k is Korean ", join (", ", @h), "\n";
        }
    }

(This example is included as korean.pl in the distribution.)

X

Cross reference.

XDR

De Roo cross-reference. This may take multiple values, so the value is an array reference.

XH

Cross-reference. This may take multiple values, so the value is an array reference.

XI

Cross-reference.

XJ

Cross-reference. This may take multiple values, so the value is an array reference.

XN

Nelson cross-reference. This may take multiple values, so the value is an array reference.

XO

Cross-reference.

Y

Pinyin pronunciation. This may take multiple values, so the value is an array reference.

ZBP

SKIP misclassification by both stroke count and position. This may take multiple values, so the value is an array reference.

ZPP

SKIP misclassification by position. This may take multiple values, so the value is an array reference.

ZRP

SKIP classification disagreement. This may take multiple values, so the value is an array reference.

ZSP

SKIP misclassification by stroke count. This may take multiple values, so the value is an array reference.

radical

This is the Kangxi radical of the kanji. This overrides Kanjidic's preference for the Nelson radical. In other words, this is the same as the "B" field for most kanji, but if a "C" field exists, this is the value of the C field rather than the B field.
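The choice between the two fields can be pictured as follows. This is a sketch of the rule as described above, not the module's actual code; the function name radical_of and the two sample entries are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: prefer the classical radical "C" when it exists,
# otherwise fall back to the Nelson radical "B".
sub radical_of
{
    my ($entry) = @_;
    return defined $entry->{C} ? $entry->{C} : $entry->{B};
}

# Made-up entries containing only the fields relevant here.
my %nelson_only = (B => 94);
my %both = (B => 3, C => 4);
print "Radical: ", radical_of (\%nelson_only), "\n";
print "Radical: ", radical_of (\%both), "\n";
```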

kokuji

This has a true value (1) if the character is marked as a "kokuji" in Kanjidic. See Which kanji were created in Japan? for more on kokuji.

english

This contains an array reference to the English-language meanings given in Kanjidic. It may be undefined, if there are no English-language meanings listed.

    # The following "joke" program converts English into kanji.
    
    # Call it with two arguments, first the location of kanjidic, and
    # second a file of English text to "corrupt":
    #
    # ./english-to-kanji.pl /where/is/kanjidic english-text-file
    
    use Data::Kanji::Kanjidic 'parse_kanjidic';
    use Convert::Moji 'make_regex';
    my $kanji = parse_kanjidic ($ARGV[0]);
    my %english;
    for my $k (keys %$kanji) {
        my $english = $kanji->{$k}->{english};
        if ($english) {
            for (@$english) {
                push @{$english{$_}}, $k;
            }
        }
    }
    my $re = make_regex (keys %english);
    open my $in, "<", $ARGV[1] or die $!;
    while (<$in>) {
        s/\b($re)\b/$english{$1}[int rand (@{$english{$1}})]/ge;
        print;
    }
    

(This example is included as english-to-kanji.pl in the distribution.)

Given input like this,

    This is an example of the use of "english-to-kanji.pl", a program which
    converts English words into kanji. This may or may not be regarded as a
    good idea. What do you think?

it outputs this:

    This is an 鑒 之 彼 使 之 "english負to負kanji.pl", a program 孰
    converts 英 辭 into kanji. This 得 将 得 無 跨 regarded as a
    臧 見. What 致 尓 憶?

onyomi

This is an array reference which contains the on'yomi of the kanji. It may be undefined, if no on'yomi readings are listed. The on'yomi readings are in katakana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.

The following example prints a list of kanji which have the same on'yomi:

    use Data::Kanji::Kanjidic 'parse_kanjidic';
    use utf8;
    my $kanji = parse_kanjidic ($ARGV[0]);
    my %all_onyomi;
    for my $k (keys %$kanji) {
        my $onyomi = $kanji->{$k}->{onyomi};
        if ($onyomi) {
            for my $o (@$onyomi) {
                push @{$all_onyomi{$o}}, $k;
            }
        }
    }
    for my $o (sort keys %all_onyomi) {
        if (@{$all_onyomi{$o}} > 1) {
            print "Same onyomi 「$o」 for 「@{$all_onyomi{$o}}」!\n";
        }
    }

(This example is included as onyomi-same.pl in the distribution.)

kunyomi

This is an array reference which contains the kun'yomi of the kanji. It may be undefined, if no kun'yomi readings are listed. The kun'yomi readings are in hiragana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.

nanori

This is an array reference which contains nanori (名乗り) readings of the character, which are readings of the kanji used in names. It may be undefined, if no nanori readings are listed. The nanori readings are in hiragana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.

morohashi

This is a hash reference containing data on the kanji's location in the Morohashi 'Dai Kan-Wa Jiten' kanji dictionary. The hash reference has the following keys.

volume

The volume number of the character.

page

The page number of the character.

index

The index number of the character.

If there is no information, this remains unset.

For example, to print all the existing values,

    use Data::Kanji::Kanjidic 'parse_kanjidic';
    my $kanji = parse_kanjidic ("/home/ben/data/edrdg/kanjidic");
    for my $k (sort keys %$kanji) {
        my $mo = $kanji->{$k}->{morohashi};
        if ($mo) {
            print "$k: volume $mo->{volume}, page $mo->{page}, index $mo->{index}.\n";
        }
    }
    

(This example is included as morohashi.pl in the distribution.)

For detailed explanations of these codes, see the kanjidic documentation, which is linked to under "About Kanjidic".
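Because some fields hold a scalar, some hold an array reference, and any of them may be missing, code which handles fields generically may find it convenient to normalise them into a list. The following is a sketch under that assumption; field_values is a hypothetical helper, and the sample entry uses made-up values rather than real Kanjidic data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the values of a field as a list, whether
# the field is a scalar, an array reference, or absent.
sub field_values
{
    my ($entry, $field) = @_;
    my $value = $entry->{$field};
    return () unless defined $value;
    return ref ($value) eq 'ARRAY' ? @$value : ($value);
}

# A made-up entry, for illustration only.
my %entry = (F => 100, S => [11, 12], english => ['cat']);
for my $field (qw/F S english nanori/) {
    my @values = field_values (\%entry, $field);
    print "$field: @values\n";
}
```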

kanjidic_order

    my @order = kanjidic_order ($k);

Here $k is a hash reference such as the return value of "parse_kanjidic". This returns a list of the keys of $k sorted by their JIS code number, which is the ordering used by the Kanjidic file itself.
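Sorting by JIS code means comparing the "jiscode" values as hexadecimal numbers. The sketch below shows one way this could be done on a hand-built fragment containing only the "jiscode" field. The function jis_order is a stand-in written for this example, and is not claimed to match the module's implementation.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for kanjidic_order, for illustration only.
sub jis_order
{
    my ($kanjidic) = @_;
    my @order = sort {
        hex ($kanjidic->{$a}{jiscode}) <=> hex ($kanjidic->{$b}{jiscode})
    } keys %$kanjidic;
    return @order;
}

# A hand-built fragment with only the "jiscode" field filled in.
my %fragment = (
    '猫' => {jiscode => '472D'},
    '亜' => {jiscode => '3021'},
    '一' => {jiscode => '306C'},
);
print join (' ', jis_order (\%fragment)), "\n";
# This prints out "亜 一 猫".
```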

kanji_dictionary_order

    my @sorted = sort {kanji_dictionary_order ($kanjidic_ref, $a, $b)} @kanji;

This is a comparison function which puts kanji in the order they would be found in a Japanese kanji dictionary. Elements are sorted by the value of the "radical" field, then by the first stroke count value (the first entry of the "S" field) if they both have the same radical. Elements with the same stroke count and radical are finally sorted in order of their JIS code value.

This also adds a new field "kanji_id" to each element of $kanjidic_ref so that the order can be reconstructed when referring to elements.

See How is a kanji dictionary used? for more on kanji dictionary ordering. See What are kanji radicals? for more on kanji radicals.
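The three-level ordering described for "kanji_dictionary_order" can be sketched as a chained comparison. This is an illustration of the ordering only: the function dictionary_cmp is hypothetical, it does not add the "kanji_id" field, and the entries are made-up values keyed by placeholder names rather than real kanji.

```perl
use strict;
use warnings;

# Hypothetical comparison: radical first, then first stroke count,
# then JIS code as a hexadecimal number.
sub dictionary_cmp
{
    my ($kanjidic, $x, $y) = @_;
    my $ex = $kanjidic->{$x};
    my $ey = $kanjidic->{$y};
    return $ex->{radical} <=> $ey->{radical}
        || $ex->{S}[0] <=> $ey->{S}[0]
        || hex ($ex->{jiscode}) <=> hex ($ey->{jiscode});
}

# Made-up entries, for illustration only.
my %made_up = (
    k1 => {radical => 94, S => [11], jiscode => '472D'},
    k2 => {radical => 9,  S => [4],  jiscode => '3021'},
    k3 => {radical => 94, S => [5],  jiscode => '3022'},
    k4 => {radical => 94, S => [5],  jiscode => '3021'},
);
my @sorted = sort {dictionary_cmp (\%made_up, $a, $b)} keys %made_up;
print "@sorted\n";
# This prints out "k2 k4 k3 k1".
```

Each later comparison is only reached when the earlier ones return zero, which is what makes the later fields act as tie-breakers.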

stroke_radical_jis_order

    @list = sort { stroke_radical_jis_order ($kanjidic, $a, $b) } @list;

This is a comparison function which sorts kanji $a and $b according to stroke count, the "S" field. If the stroke count is identical, it sorts them according to "radical". If both the stroke count and radical are the same, it sorts them according to "jiscode".

grade_stroke_order

This is like "kanji_dictionary_order" and "stroke_radical_jis_order", except it sorts the kanji by school grade, then by stroke number, then by JIS number.

This function is used to make this List of kanji by elementary school grade.

grade

    my $grade2 = grade ($kanjidic_ref, 2);

Given a school grade such as 2 above, and the return value of "parse_kanjidic", $kanjidic_ref, return an array reference containing a list of all of the kanji from that grade. See How is Japanese writing taught to Japanese children? for more details on the Japanese education system.

The following example prints a list of the kanji from each school grade to standard output:

    use Data::Kanji::Kanjidic qw/parse_kanjidic grade/;
    my $kanjidic = parse_kanjidic ('/home/ben/data/edrdg/kanjidic');
    for my $grade (1..6) {
        my $list = grade ($kanjidic, $grade);
        print "Grade $grade:\n\n";
        my $count = 0;
        for (sort @$list) {
            print "$_ ";
            $count++;
            if ($count % 20 == 0) {
                print "\n";
            }
        }
        print "\n";
    }
    

(This example is included as grades.pl in the distribution.)
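Conceptually, selecting the kanji of one grade amounts to filtering the entries on their "G" field. The following sketch shows that idea on made-up entries with placeholder keys. The function kanji_in_grade is hypothetical, and the real "grade" function may differ in details such as the order of the returned list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for grade, for illustration only: collect the
# keys whose "G" field equals the requested grade.
sub kanji_in_grade
{
    my ($kanjidic, $grade) = @_;
    my @found = grep {
        my $g = $kanjidic->{$_}{G};
        defined $g && $g == $grade;
    } keys %$kanjidic;
    return [sort @found];
}

# Made-up entries; k4 has no "G" field at all.
my %made_up = (k1 => {G => 1}, k2 => {G => 2}, k3 => {G => 2}, k4 => {});
my $grade2 = kanji_in_grade (\%made_up, 2);
print "@$grade2\n";
# This prints out "k2 k3".
```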

SEE ALSO

Other Perl modules

Lingua::JP::Kanjidic

This module parses an old version of kanjidic.

About Kanjidic

Kanjidic is a product of the Electronic Dictionary Research and Development Group (EDRDG), headed by Professor J.W. Breen, formerly of Monash University, Australia.

Kanjidic is currently supplied in two formats, a text format with the kanji encoded in the EUC-JP encoding, and an XML format with the same kanji data encoded in Unicode. This module parses the older text format of kanjidic.

Documentation

The following links point to the documentation of Kanjidic.

Summary description of kanjidic
Full documentation

Download

Download the kanjidic file at one of the following sites:

ftp://ftp.edrdg.org/pub/Nihongo/00INDEX.html
http://ftp.monash.edu.au/pub/nihongo/00INDEX.html

Licence

Kanjidic's licence terms are explained at http://www.edrdg.org/edrdg/licence.html.

EXPORTS

Nothing is exported by default. All the functions and variables exported by the module may be exported by using the export tag ":all":

    use Data::Kanji::Kanjidic ':all';

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2011-2018 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.