NAME

Lingua::JA::Sort::JIS - compares and sorts strings encoded in UTF-8

SYNOPSIS

use Lingua::JA::Sort::JIS qw(jsort);
@result = jsort(@not_sorted);

DESCRIPTION

This module provides functions that compare and sort UTF-8 encoded strings according to the collation rules for Japanese character strings.

It implements JIS X 4061-1996, and its collation rules are based on that standard.

Collation Levels

The following criteria are considered in order until the collation order is determined. By default, levels 1 to 4 are applied and level 5 is ignored (as the JIS standard specifies).

Level 1: alphabetic ordering.

A character class that appears earlier in the following list is smaller (sorts first).

Space characters, Symbols and Punctuation, Digits, Greek Letters,
Cyrillic Letters, Latin letters, Kana letters, ( Kanji ideographs ),
and Geta mark.

Within a class, alphabetic letters are collated alphabetically; kana letters are collated in AIUEO order (the Gozyuon order).

For Kanji, see Kanji Classes.

Other characters are collated as defined.

Characters not defined as collation elements are skipped in collation.

NB: In particular, almost all letters with diacritical marks are NOT defined in this implementation, except Latin vowels with a macron or circumflex, because they are not used in Japanese contexts.

Level 2: diacritic ordering.

For Latin vowels, the order is as follows.

A vowel without a diacritical mark, then with a macron, then with a circumflex.

For kana, the order is as follows.

A voiceless kana, then the voiced kana, then the semi-voiced kana (if it exists).
 (e.g. Ka before Ga; Ha before Ba before Pa)

Level 3: case ordering.

A lowercase Latin letter is less than the corresponding capital.

For kana, the order is as follows.

replaced PROLONGED SOUND MARK (U+30FC);
Small kana;
replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE);
then normal kana.

For example, the following are in ascending order: Katakana A + PROLONGED SOUND MARK, Katakana A + Small Katakana A, Katakana A + ITERATION MARK, Katakana A + Katakana A (see NOTE about the replacement).

Level 4: script ordering.

Hiragana is less than katakana.

Level 5: width ordering.

A character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.

NB: According to the JIS standard, level 5 should be ignored.
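The "considered in order until the collation order is determined" rule can be pictured with a toy comparator. This is only a sketch: the weight table and the name toy_cmp are invented for illustration, cover Latin letters only, and are not the module's internal data or API. Each character carries one weight per level, a level is consulted only when all earlier levels compare equal, and characters without weights are skipped.

```perl
use strict;
use warnings;

# Invented weights for this sketch: [level 1, level 2, level 3] per character.
# Level 3 follows the rule "a lowercase Latin letter is less than its capital".
my %weight;
for my $i (0 .. 25) {
    $weight{ chr(ord('a') + $i) } = [ $i + 1, 0, 0 ];
    $weight{ chr(ord('A') + $i) } = [ $i + 1, 0, 1 ];
}

# Compare two strings level by level. Stop at the first level that differs,
# or after $max_level levels (default: all three toy levels).
sub toy_cmp {
    my ($x, $y, $max_level) = @_;
    $max_level = 3 unless defined $max_level;
    for my $level (0 .. $max_level - 1) {
        # Characters without a weight are skipped, like undefined elements.
        my @kx = map { $weight{$_}[$level] } grep { exists $weight{$_} } split //, $x;
        my @ky = map { $weight{$_}[$level] } grep { exists $weight{$_} } split //, $y;
        my $n = @kx < @ky ? @kx : @ky;          # length of the shorter key list
        for my $i (0 .. $n - 1) {
            my $c = $kx[$i] <=> $ky[$i];
            return $c if $c;
        }
        my $c = @kx <=> @ky;                    # a shorter key list sorts first
        return $c if $c;
    }
    return 0;
}

print toy_cmp('perl', 'Perl'),    "\n";   # -1: first difference is at level 3
print toy_cmp('perl', 'Perl', 2), "\n";   #  0: equal through level 2
print toy_cmp('perl', 'perk'),    "\n";   #  1: differs at level 1
```

Limiting the comparison to fewer levels makes more strings compare equal, which is the effect of passing a smaller LEVEL to the real functions.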

Kanji Classes

There are three kanji classes:

Class 1: the 'saisho' (minimum) kanji class

It comprises five kanji-like characters, i.e. U+3003, U+4EDD, U+3005, U+3006, U+3007 (collated in the JIS order as shown). Any kanji other than U+4EDD is ignored in collation.

Class 2: the 'kihon' (basic) kanji class

It comprises JIS levels 1 and 2 kanji in addition to the minimum kanji class, sorted in the JIS order. Any kanji other than those defined by JIS X 0208 is ignored in collation.

Class 3: the 'kakucho' (extended) kanji class

It comprises all the CJK Unified Ideographs in addition to the minimum kanji class, sorted in the Unicode order.

Methods (OOP)

$jis = Lingua::JA::Sort::JIS->new()
$jis = Lingua::JA::Sort::JIS->new(LEVEL)
$jis = Lingua::JA::Sort::JIS->new(LEVEL, KANJI CLASS)
$jis = Lingua::JA::Sort::JIS->new(CODE REF, LEVEL, KANJI CLASS)

Constructs an instance.

The collation level is specified as a number between 1 and 5. If omitted, level 4 is applied. The kanji class is specified as a number between 1 and 3. If omitted, class 2 is applied.

If a coderef is specified as the first argument, strings given to a collating method are converted by the coderef before making collating keys.

For example, to ignore the PROLONGED SOUND MARK ("\xE3\x83\xBC" in UTF-8) in collation:

use Lingua::JA::Sort::JIS;

$jis = Lingua::JA::Sort::JIS->new(
   sub { my $str = shift; $str =~ s/\xE3\x83\xBC//g; $str; }
);

@sorted = $jis->jsort(@strings); # utf-8 encoded

If you want to collate strings encoded in EUC-JP, give the constructor a coderef converting EUC-JP to UTF-8.

use Lingua::JA::Sort::JIS;
$euc = Lingua::JA::Sort::JIS->new(
   sub { some_convertor_from_eucjp_to_utf8($_[0]) }
);

@sorted_euc_jp_strings = $euc->jsort(@euc_jp_strings);
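One way to write such a converter is with the core Encode module. The sketch below is an assumption about how some_convertor_from_eucjp_to_utf8 might be filled in; the name eucjp_to_utf8 is hypothetical and not part of this module. It takes EUC-JP bytes and returns UTF-8 bytes, which is the form the collating methods expect.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Lingua::JA::Sort::JIS):
# takes an EUC-JP byte string and returns the same text as a UTF-8 byte string.
sub eucjp_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('euc-jp', $bytes));
}

# HIRAGANA LETTER A is "\xA4\xA2" in EUC-JP and "\xE3\x81\x82" in UTF-8.
print eucjp_to_utf8("\xA4\xA2") eq "\xE3\x81\x82" ? "ok\n" : "not ok\n";
```

A reference to the helper, \&eucjp_to_utf8, could then be given to the constructor in place of the anonymous sub shown above.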

$jis->jsort(LIST)

Sorts a list of strings in the UTF-8 encoding.

$jis->jcmp($a, $b)

Japanese Collation version of the cmp operator. It returns 1 ($a is greater than $b), 0 ($a is equal to $b) or -1 ($a is less than $b).

Functions (not-OOP)

jsort(LIST)
jsort(CODE REF, LIST)

Sorts a list of strings in the UTF-8 encoding. The default collation level and kanji class are used, so jsort() called without an object is identical to bsort().

If a coderef is specified as the first argument, strings given to a collating method are converted by the coderef before making collating keys.

For example, if you want to collate strings encoded in Shift_JIS, do as follows.

use Lingua::JA::Sort::JIS qw(jsort);

$sjis_to_utf8 = \&some_convertor_from_shiftjis_to_utf8;
@sorted = jsort $sjis_to_utf8, @not_sorted;

msort(LIST)
msort(CODE REF, LIST)

Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 1, m: minimum).

bsort(LIST)
bsort(CODE REF, LIST)

Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 2, b: basic).

xsort(LIST)
xsort(CODE REF, LIST)

Sorts a list of strings in the UTF-8 encoding (the collation level is 4 and the kanji class is 3, x: extended).

fsort(LIST)
fsort(CODE REF, LIST)

Sorts a list of strings in the UTF-8 encoding (the collation level is 5 and the kanji class is 2, f: fullwidth).

jcmp( [ CODEREF ], $a, $b, [ LEVEL, KANJI CLASS ])

Japanese Collation version of the cmp operator. It returns 1 ($a is greater than $b), 0 ($a is equal to $b) or -1 ($a is less than $b).

The LEVEL (collation level) is specified as a number between 1 and 5. If omitted, level 4 is applied.

The KANJI CLASS (kanji class) is specified as a number between 1 and 3. If omitted, class 2 is applied.

If CODE REF is specified as the first argument, strings given to a collating method are converted by the coderef before making collating keys.

The CODE REF, LEVEL and the KANJI CLASS can be omitted if not necessary.

e.g. jcmp("perl", "Perl") returns -1 and jcmp("perl", "Perl", 2) returns 0, since "perl" is less than "Perl" at the tertiary and quaternary levels but equal to it at the secondary level.

Advanced Matters

karr([ CODE REF ], STRING, [ KANJI CLASS ] )
kcmp(KEY ARRAY, KEY ARRAY, [ LEVEL ])

These functions allow you to do the Schwartzian transform.

karr() makes KEY ARRAY from STRING.

kcmp() returns 1 (the first KEY ARRAY is greater than the second KEY ARRAY), 0 (the first KEY ARRAY is equal to the second KEY ARRAY) or -1 (the first KEY ARRAY is less than the second KEY ARRAY).

The CODE REF, LEVEL and the KANJI CLASS can be omitted if not necessary.

The following example sorts by "yomi-hyoki" collation, in which "yomi" (the pronunciation) is used as the first sorting key and "hyoki" (the spelling) is used as the second sorting key.

by OOP

use Lingua::JA::Sort::JIS;

$jis = Lingua::JA::Sort::JIS->new();

foreach(ysort(@data)){
  print "@$_\n";
}

sub ysort {
  map { $_->[0] }
  sort{
    $jis->kcmp($a->[1], $b->[1]) ||
    $jis->kcmp($a->[2], $b->[2])
  }
  map { [$_, $jis->karr($_->[1]),
             $jis->karr($_->[0]) ] } @_;
}

by not-OOP

use Lingua::JA::Sort::JIS qw(kcmp karr);

foreach(ysort(@data)){
  print "@$_\n";
}

sub ysort {
  map { $_->[0] }
  sort{ kcmp($a->[1], $b->[1]) ||
        kcmp($a->[2], $b->[2]) }
  map { [$_, karr($_->[1]), karr($_->[0]) ] } @_;
}

getorder()

In list context, it returns the collation element hash; otherwise, it returns a reference to that hash.

In the collation element hash, each key is a collation element string and each value is an anonymous array with five elements.

You can manipulate the collation element hash as follows.

my $order = getorder();

# delete 'X' from the collation element hash
delete $order->{'X'};

# swap the collation order between 'b' and 'B';
@$order{'B', 'b'} = @$order{'b', 'B'};

# add a new collation element HIRAGANA LETTER VU;

my $hira_vu = "\xE3\x82\x94";
my $kata_vu = "\xE3\x83\xB4";

$order->{$hira_vu} = [ @{ $order->{$kata_vu} } ];
# make HIRAGANA VU quaternary less than KATAKANA VU
$order->{$hira_vu}[3]--;

NOTE

Replacement of PROLONGED SOUND MARK and ITERATION MARKs

        RFC1345 UCS
        [*5]    U+309D  HIRAGANA ITERATION MARK
        [+5]    U+309E  HIRAGANA VOICED ITERATION MARK
        [-6]    U+30FC  KATAKANA-HIRAGANA PROLONGED SOUND MARK
        [*6]    U+30FD  KATAKANA ITERATION MARK
        [+6]    U+30FE  KATAKANA VOICED ITERATION MARK

To represent Japanese characters, RFC 1345 Mnemonic characters enclosed by brackets are used below.

These characters, if replaced, are equal to the replacing kana at the secondary level but not at the tertiary level.

KATAKANA-HIRAGANA PROLONGED SOUND MARK

The PROLONGED SOUND MARK is replaced by the normal vowel or nasal katakana corresponding to the preceding kana, if one exists.

  eg.	[Ka][-6] to [Ka][A6]
	[bi][-6] to [bi][I6]
	[Pi][YU][-6] to [Pi][YU][U6]
	[N6][-6] to [N6][N6]
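The substitution in these examples can be sketched as follows. This is an illustration only, not the module's code: the name expand_prolonged and the four-entry table are hypothetical, the sketch works on Perl character strings rather than UTF-8 bytes, and it ignores the tertiary difference that the real collation keeps between a replaced mark and a normal kana.

```perl
use strict;
use warnings;

my $MARK = "\x{30FC}";   # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# Partial, hand-made table (an assumption, enough for the examples above):
# preceding kana => katakana that replaces a following mark.
my %replacement = (
    "\x{30AB}" => "\x{30A2}",   # KATAKANA KA       => KATAKANA A
    "\x{3073}" => "\x{30A4}",   # HIRAGANA BI       => KATAKANA I
    "\x{30E5}" => "\x{30A6}",   # KATAKANA SMALL YU => KATAKANA U
    "\x{30F3}" => "\x{30F3}",   # KATAKANA N        => KATAKANA N
);

# Replace each mark that directly follows a kana listed in the table.
# A mark after any other character is left as it is.
sub expand_prolonged {
    my ($str) = @_;
    my ($prev, $out) = (undef, '');
    for my $ch (split //, $str) {
        if ($ch eq $MARK && defined $prev && exists $replacement{$prev}) {
            $out .= $replacement{$prev};
        } else {
            $out .= $ch;
        }
        $prev = $ch;   # always the original character, never the replacement
    }
    return $out;
}

# KATAKANA PI, KATAKANA SMALL YU, PROLONGED SOUND MARK
my $expanded = expand_prolonged("\x{30D4}\x{30E5}\x{30FC}");
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $expanded), "\n";
# prints "U+30D4 U+30E5 U+30A6"
```

A complete table would need an entry for every kana that has a corresponding vowel or nasal katakana.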

HIRAGANA- and KATAKANA ITERATION MARKs

The ITERATION MARKs (VOICELESS) are replaced by the normal kana corresponding to the preceding kana, if one exists.

  eg.	[Ka][*6] to [Ka][Ka]
	[Do][*5] to [Do][to]
	[n5][*5] to [n5][n5]
	[Pu][*6] to [Pu][Hu]
	[Pi][YU][*6] to [Pi][YU][Yu]

HIRAGANA- and KATAKANA VOICED ITERATION MARKs

The VOICED ITERATION MARKs are replaced by the voiced kana corresponding to the preceding kana, if one exists.

  eg.	[ha][+5] to [ha][ba]
	[Pu][+5] to [Pu][bu]
	[Ko][+6] to [Ko][Go]
	[U6][+6] to [U6][Vu]

Cases of no replacement

Otherwise, no replacement occurs, in particular when these marks follow any character other than kana.

Marks that are not replaced are greater than any kana at the primary level (see "Collate.txt").

  eg.	CJK followed by PROLONGED SOUND MARK
	DIGIT followed by ITERATION
	[A6][+6] ([A6] has no voiced variant)

Example

For example, the Japanese string [Pa][-6][Ru] (the spelling of Perl in Japanese) has three collation elements: KATAKANA PA, PROLONGED SOUND MARK replaced by KATAKANA A, and KATAKANA RU.

   [Pa][-6][Ru] is converted to [Pa][A6][Ru] by replacement.
                primary equal to [ha][a5][ru].
                secondary equal to [pa][a5][ru], greater than [ha][a5][ru].
                tertiary equal to [pa][-6][ru], less than [Pa][A6][Ru].
                quaternary greater than [pa][-6][ru].

About this implementation

                         [according to the article 6.2, JIS X 4061]

(1) charset: UTF-8.

(2) No limit on the number of characters in a string considered
    for collation.

(3) No character class is added.

(4) The following characters are added as collation elements.

    IDEOGRAPHIC SPACE in the space class.

    ACUTE ACCENT, GRAVE ACCENT, DIAERESIS, CIRCUMFLEX ACCENT,
    MACRON, HORIZONTAL BAR, EN DASH, TILDE, PARALLEL TO
    in the class of descriptive symbols.

    APOSTROPHE, QUOTATION MARK in the class of parentheses.

    HYPHEN-MINUS in the class of mathematical symbols.

(5) Collation of Latin alphabets with macron and with circumflex
    is supported.

(6) Selected kanji class:
     the minimum kanji class (Five kanji-like chars).
     the basic kanji class (Levels 1 and 2 kanji, JIS).
     the extended kanji class (CJK Unified Ideographs).

CAVEAT

THIS MODULE IS OLD AND THEREFORE IS NOT AWARE OF PERL'S UNICODE ENCODING.

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

Copyright(C) 2001, 2007. SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

SEE ALSO

  • JIS X 4061 [Collation of Japanese character strings]

  • JIS X 0208 [7-bit and 8-bit double byte coded Kanji sets for information interchange]

  • JIS X 0221 [Information technology - Universal Multiple-Octet Coded Character Set (UCS) - part 1 : Architecture and Basic Multilingual Plane]. It is a translation of ISO/IEC 10646-1 adopted as a JIS standard.

  • Japanese Standards Association (access to JIS) http://www.jsa.or.jp/

  • RFC 1345 [Character Mnemonics & Character Sets]