Lingua::JA::Sort::JIS - compares and sorts strings encoded in UTF-8
    use Lingua::JA::Sort::JIS qw(jsort);
    @result = jsort(@not_sorted);
This module provides some functions to compare and sort strings encoded in UTF-8 using the collation of Japanese character strings.
This module is an implementation of JIS X 4061-1996 and the collation rules are based on that standard.
The following criteria are considered in order until the collation order is determined. By default, Levels 1 to 4 are applied and Level 5 is ignored (as JIS does).
At Level 1, a character class that appears earlier in the following list is smaller.
Space characters, Symbols and Punctuations, Digits, Greek Letters, Cyrillic Letters, Latin letters, Kana letters, ( Kanji ideographs ), and Geta mark.
Within a class, alphabetic letters are collated alphabetically; kana letters are collated AIUEO-betically (in the Gozyuon order).
For Kanji, see Kanji Classes.
Other characters are collated as defined.
Characters not defined as a collation element are ignored during collation.
NB: In particular, almost all alphabetic letters with a diacritical mark are NOT defined in this implementation, except Latin vowels with macron or circumflex, because they are not used in Japanese contexts.
At Level 2, among the Latin vowels, the order is as shown in the following list.
One without a diacritical mark, one with a macron, then one with a circumflex.
In kana, the order is as shown in the following list.
A voiceless kana, the voiced, then the semi-voiced (if it exists). (e.g. Ka before Ga; Ha before Ba before Pa)
At Level 3, a lowercase Latin letter is less than the corresponding capital letter.
In kana, the order is: a replaced PROLONGED SOUND MARK (U+30FC); a small kana; a replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE); then a normal kana.
For example, the following are in ascending order: Katakana A + PROLONGED SOUND MARK, Katakana A + Small Katakana A, Katakana A + ITERATION MARK, Katakana A + Katakana A. (see NOTE about the replacement)
At Level 4, hiragana is less than katakana.
At Level 5, a character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.
NB: According to the JIS standard, Level 5 should be ignored.
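The phrase "considered in order until the collation order is determined" can be pictured with a small pure-Perl sketch. It does not use this module: the weight table, make_key() and compare_keys() are hypothetical names and values invented for illustration, and the module's actual key format may differ. Each character carries one weight per level, and a higher level is consulted only when all lower levels tie.

```perl
use strict;
use warnings;

# Hypothetical weights, invented for this sketch only (not the module's table).
# Each collation element maps to [level1, level2, level3, level4, level5].
my %weight = (
    'p' => [ 16, 0, 0, 0, 0 ],
    'P' => [ 16, 0, 1, 0, 0 ],
    'e' => [  5, 0, 0, 0, 0 ],
    'r' => [ 18, 0, 0, 0, 0 ],
    'l' => [ 12, 0, 0, 0, 0 ],
);

# Build a key: one weight array per defined character; undefined ones are skipped.
sub make_key {
    my ($str) = @_;
    return [ map { $weight{$_} } grep { exists $weight{$_} } split //, $str ];
}

# Compare two keys level by level, stopping at the first level that differs.
sub compare_keys {
    my ($ka, $kb, $max_level) = @_;
    $max_level = 4 unless defined $max_level;
    for my $level (0 .. $max_level - 1) {
        my @wa = map { $_->[$level] } @$ka;
        my @wb = map { $_->[$level] } @$kb;
        my $n = @wa < @wb ? scalar(@wa) : scalar(@wb);
        for my $i (0 .. $n - 1) {
            my $c = $wa[$i] <=> $wb[$i];
            return $c if $c;
        }
        my $c = scalar(@wa) <=> scalar(@wb);
        return $c if $c;
    }
    return 0;
}

print compare_keys(make_key('perl'), make_key('Perl')), "\n";     # levels 1 to 4
print compare_keys(make_key('perl'), make_key('Perl'), 2), "\n";  # levels 1 and 2 only
```

With these made-up weights the two strings differ only at the third level, so the first call prints -1 and the second prints 0.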
There are three kanji classes:
Class 1, the minimum kanji class: comprises five kanji-like characters, i.e. U+3003, U+4EDD, U+3005, U+3006, U+3007 (collated in the JIS order as shown). Any kanji except U+4EDD is ignored on collation.
Class 2, the basic kanji class: comprises JIS levels 1 and 2 kanji in addition to the minimum kanji class. Sorted in the JIS order. Any kanji except those defined by JIS X 0208 is ignored on collation.
Class 3, the extended kanji class: comprises all the CJK Unified Ideographs in addition to the minimum kanji class. Sorted in the Unicode order.
$jis = Lingua::JA::Sort::JIS->new()
$jis = Lingua::JA::Sort::JIS->new(LEVEL)
$jis = Lingua::JA::Sort::JIS->new(LEVEL, KANJI CLASS)
$jis = Lingua::JA::Sort::JIS->new(CODE REF, LEVEL, KANJI CLASS)
Constructs an instance.
The collation level is specified as a number between 1 and 5. If omitted, level 4 is applied. The kanji class is specified as a number between 1 and 3. If omitted, class 2 is applied.
If a coderef is specified as the first argument, strings given to a collating method are converted by the coderef before making collating keys.
For example, if you want to ignore PROLONGED SOUND MARK ("\xE3\x83\xBC" in UTF-8) on collation,
    use Lingua::JA::Sort::JIS;

    $jis = Lingua::JA::Sort::JIS->new( sub {
        my $str = shift;
        $str =~ s/\xE3\x83\xBC//g;
        $str;
    } );

    @sorted = $jis->jsort(@strings); # utf-8 encoded
If you want to collate strings encoded in EUC-JP, give the constructor a coderef converting EUC-JP to UTF-8.
    use Lingua::JA::Sort::JIS;

    $euc = Lingua::JA::Sort::JIS->new( sub {
        some_convertor_from_eucjp_to_utf8($_[0])
    } );

    @sorted_euc_jp_strings = $euc->jsort(@euc_jp_strings);
$jis->jsort(LIST)
Sorts a list of strings in the UTF-8 encoding.
$jis->jcmp($a, $b)
Japanese collation version of the cmp operator. It returns 1 ($a is greater than $b), 0 ($a is equal to $b) or -1 ($a is less than $b).
jsort(LIST)
jsort(CODE REF, LIST)
Sorts a list of strings in the UTF-8 encoding. The default values are used for the collation level and the kanji class, so jsort() without any object is identical to bsort().
For example, if you want to collate strings encoded in Shift_JIS, do as follows.
    use Lingua::JA::Sort::JIS qw(jsort);

    $sjis_to_utf8 = \&some_convertor_from_shiftjis_to_utf8;
    @sorted = jsort $sjis_to_utf8, @not_sorted;
msort(LIST)
msort(CODE REF, LIST)
Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 1, m: minimum).
bsort(LIST)
bsort(CODE REF, LIST)
Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 2, b: basic).
xsort(LIST)
xsort(CODE REF, LIST)
Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 3, x: extended).
fsort(LIST)
fsort(CODE REF, LIST)
Sorts a list of strings in the UTF-8 encoding (the collation level is 5 and the kanji class is 2, f: fullwidth).
jcmp([ CODE REF ], $a, $b, [ LEVEL, KANJI CLASS ])
The LEVEL (collation level) is specified as a number between 1 and 5. If omitted, level 4 is applied.
The KANJI CLASS (kanji class) is specified as a number between 1 and 3. If omitted, class 2 is applied.
If CODE REF is specified as the first argument, strings given to a collating method are converted by the coderef before making collating keys.
The CODE REF, LEVEL and the KANJI CLASS can be omitted if not necessary.
e.g. jcmp("perl", "Perl") returns -1 and jcmp("perl", "Perl", 2) returns 0, since "perl" is tertiary and quaternary less than "Perl", and secondary equal to it.
karr([ CODE REF ], STRING, [ KANJI CLASS ] )
kcmp(KEY ARRAY, KEY ARRAY, [ LEVEL ])
These functions allow you to do the Schwartzian transform.
karr() makes KEY ARRAY from STRING.
kcmp() returns 1 (the first KEY ARRAY is greater than the second KEY ARRAY), 0 (the first KEY ARRAY is equal to the second KEY ARRAY) or -1 (the first KEY ARRAY is less than the second KEY ARRAY).
The following example sorts by "yomi-hyoki" collation, in which "yomi" (the pronunciation) is used as the first sorting key and "hyoki" (the spelling) is used as the second sorting key.
    use Lingua::JA::Sort::JIS;

    $jis = Lingua::JA::Sort::JIS->new();

    foreach (ysort(@data)) {
        print "@$_\n";
    }

    sub ysort {
        map  { $_->[0] }
        sort { $jis->kcmp($a->[1], $b->[1]) || $jis->kcmp($a->[2], $b->[2]) }
        map  { [ $_, $jis->karr($_->[1]), $jis->karr($_->[0]) ] } @_;
    }

Or, using the functional interface:

    use Lingua::JA::Sort::JIS qw(kcmp karr);

    foreach (ysort(@data)) {
        print "@$_\n";
    }

    sub ysort {
        map  { $_->[0] }
        sort { kcmp($a->[1], $b->[1]) || kcmp($a->[2], $b->[2]) }
        map  { [ $_, karr($_->[1]), karr($_->[0]) ] } @_;
    }
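The example does not show what @data looks like. Judging from the indices it uses, each record appears to be an array reference of the form [ hyoki, yomi ]; that shape is an assumption here. The following self-contained sketch runs the same decorate, sort, undecorate pattern on such records, with toy_karr() and toy_kcmp() as hypothetical stand-ins that compare plain code points. They are not the module's functions and do not implement JIS collation.

```perl
use strict;
use warnings;

# Assumed record shape: [ hyoki, yomi ] (romanized here to stay ASCII-only).
my @data = (
    [ 'Tokyo-B', 'toukyou' ],
    [ 'Kyoto',   'kyouto'  ],
    [ 'Tokyo-A', 'toukyou' ],
);

# Stand-in for karr(): the key is an array ref of code points.
sub toy_karr {
    my ($str) = @_;
    return [ map { ord($_) } split //, $str ];
}

# Stand-in for kcmp(): compare two keys element by element, then by length.
sub toy_kcmp {
    my ($ka, $kb) = @_;
    my $n = @$ka < @$kb ? scalar(@$ka) : scalar(@$kb);
    for my $i (0 .. $n - 1) {
        my $c = $ka->[$i] <=> $kb->[$i];
        return $c if $c;
    }
    return scalar(@$ka) <=> scalar(@$kb);
}

# Decorate each record with its two keys, sort by yomi then hyoki, undecorate.
sub toy_ysort {
    return
        map  { $_->[0] }
        sort { toy_kcmp($a->[1], $b->[1]) || toy_kcmp($a->[2], $b->[2]) }
        map  { [ $_, toy_karr($_->[1]), toy_karr($_->[0]) ] } @_;
}

print "@$_\n" for toy_ysort(@data);
```

The keys are computed once per record rather than once per comparison, which is the point of the transform; the two records sharing a yomi are ordered by their hyoki.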
getorder()
In list context, it returns the collation element hash; otherwise, it returns a reference to that hash.
In the collation element hash, each key is a collation element string and each value is an anonymous array with 5 elements.
You can manipulate the collation element hash as follows.
    my $order = getorder();

    # delete 'X' from the collation element hash
    delete $order->{'X'};

    # swap the collation order between 'b' and 'B';
    @$order{'B', 'b'} = @$order{'b', 'B'};

    # add a new collation element HIRAGANA LETTER VU;
    my $hira_vu = "\xE3\x82\x94";
    my $kata_vu = "\xE3\x83\xB4";
    $order->{$hira_vu} = [ @{ $order->{$kata_vu} } ];
    -- $order->{$hira_vu}[3];
    # HIRAGANA VU to be quaternary lesser than KATAKANA VU.
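The HIRAGANA VU entry above is built from a fresh anonymous array rather than assigned from the KATAKANA VU entry directly. A short pure-Perl sketch shows why that matters. The hash, its keys and its weights are hypothetical values invented for illustration, not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical weights for illustration; the real values live in the module.
my %order = ( KATA_VU => [ 30, 1, 3, 2, 0 ] );

# Copy: a new anonymous array holding the same five values.
$order{HIRA_VU} = [ @{ $order{KATA_VU} } ];
--$order{HIRA_VU}[3];    # only the copy changes at the fourth position

# Alias: both keys refer to the very same array.
$order{ALIAS_VU} = $order{KATA_VU};

print "kata: @{ $order{KATA_VU} }\n";
print "hira: @{ $order{HIRA_VU} }\n";
```

Had the decrement been applied to the aliased entry instead, the katakana weights would have changed too, and the two letters would still compare equal at the fourth level.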
    RFC1345  UCS
    [*5]     U+309D  HIRAGANA ITERATION MARK
    [+5]     U+309E  HIRAGANA VOICED ITERATION MARK
    [-6]     U+30FC  KATAKANA-HIRAGANA PROLONGED SOUND MARK
    [*6]     U+30FD  KATAKANA ITERATION MARK
    [+6]     U+30FE  KATAKANA VOICED ITERATION MARK
To represent Japanese characters, RFC 1345 Mnemonic characters enclosed by brackets are used below.
These characters, if replaced, are secondary equal to the replacing kana, but not tertiary equal to it.
The PROLONGED SOUND MARK is replaced by the normal vowel or nasal katakana corresponding to the preceding kana, if one exists.
e.g.
    [Ka][-6] to [Ka][A6]
    [bi][-6] to [bi][I6]
    [Pi][YU][-6] to [Pi][YU][U6]
    [N6][-6] to [N6][N6]
The ITERATION MARKs (VOICELESS) are replaced by the normal kana corresponding to the preceding kana, if one exists.
e.g.
    [Ka][*6] to [Ka][Ka]
    [Do][*5] to [Do][to]
    [n5][*5] to [n5][n5]
    [Pu][*6] to [Pu][Hu]
    [Pi][YU][*6] to [Pi][YU][Yu]
The VOICED ITERATION MARKs are replaced by the voiced kana corresponding to the preceding kana, if one exists.
e.g.
    [ha][+5] to [ha][ba]
    [Pu][+5] to [Pu][bu]
    [Ko][+6] to [Ko][Go]
    [U6][+6] to [U6][Vu]
Otherwise, no replacement occurs, in particular when these marks follow any character other than kana.
The characters not replaced are primary greater than any kana (see "Collate.txt").
e.g.
    CJK followed by PROLONGED SOUND MARK
    DIGIT followed by ITERATION
    [A6][+6] ([A6] has no voiced variant)
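The PROLONGED SOUND MARK rule can be sketched in pure Perl as a lookup on the preceding token. This is an illustration only, not the module's code: it works on RFC 1345 mnemonics written without brackets, the function name replace_prolonged() is hypothetical, and the table covers just the kana from the examples in this section, whereas a real table would cover every kana.

```perl
use strict;
use warnings;

# Lookup limited to the kana used in the examples of this section.
# Keys and values are RFC 1345 mnemonics without the brackets.
my %prolonged_as = (
    'Ka' => 'A6',
    'bi' => 'I6',
    'YU' => 'U6',
    'N6' => 'N6',
);

# Replace each PROLONGED SOUND MARK ('-6') according to the token before it.
# A mark that follows anything not in the table is left as it is.
sub replace_prolonged {
    my @tokens = @_;
    my @out;
    for my $i (0 .. $#tokens) {
        my $tok  = $tokens[$i];
        my $prev = $i > 0 ? $tokens[ $i - 1 ] : undef;
        if ( $tok eq '-6' && defined $prev && exists $prolonged_as{$prev} ) {
            push @out, $prolonged_as{$prev};
        }
        else {
            push @out, $tok;
        }
    }
    return @out;
}

print join( ' ', replace_prolonged(qw(Pi YU -6)) ), "\n";   # mark after a kana
print join( ' ', replace_prolonged(qw(1 -6)) ), "\n";       # mark after a digit
```

The first call prints "Pi YU U6"; the second prints "1 -6", since the mark follows a non-kana token and is therefore not replaced.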
For example, the Japanese string [Pa][-6][Ru] (the spelling of Perl in Japanese) has three collation elements: KATAKANA PA, PROLONGED SOUND MARK replaced by KATAKANA A, and KATAKANA RU.
[Pa][-6][Ru] is converted to [Pa][A6][Ru] by replacement. It is:
    primary equal to [ha][a5][ru];
    secondary equal to [pa][a5][ru], greater than [ha][a5][ru];
    tertiary equal to [pa][-6][ru], less than [Pa][A6][Ru];
    quaternary greater than [pa][-6][ru].
[according to the article 6.2, JIS X 4061]

(1) charset: UTF-8.
(2) No limit of the number of characters in the string considered to collate.
(3) No character class is added.
(4) The following characters are added as collation elements.
    IDEOGRAPHIC SPACE in the space class.
    ACUTE ACCENT, GRAVE ACCENT, DIAERESIS, CIRCUMFLEX ACCENT, MACRON, HORIZONTAL BAR, EN DASH, TILDE, PARALLEL TO in the class of descriptive symbols.
    APOSTROPHE, QUOTATION MARK in the class of parentheses.
    HYPHEN-MINUS in the class of mathematical symbols.
(5) Collation of Latin alphabets with macron and with circumflex is supported.
(6) Selected kanji class:
    the minimum kanji class (Five kanji-like chars).
    the basic kanji class (Levels 1 and 2 kanji, JIS).
    the extended kanji class (CJK Unified Ideographs).
THIS MODULE IS OLD AND THEREFORE IS NOT AWARE OF PERL'S UNICODE ENCODING.
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001, 2007. SADAHIRO Tomoyuki. Japan. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
JIS X 4061 [Collation of Japanese character strings]
JIS X 0208 [7-bits and 8-bits double byte coded Kanji sets for information interchange]
JIS X 0221 [Information technology - Universal Multiple-Octet Coded Character Set (UCS) - part 1 : Architecture and Basic Multilingual Plane]. That is translated from ISO/IEC 10646-1 and introduced into JIS.
Japanese Standards Association (access to JIS) http://www.jsa.or.jp/
RFC 1345 [Character Mnemonics & Character Sets]