NAME
Lingua::ZH::MMSEG - Mandarin Chinese segmentation
SYNOPSIS
    #!/usr/bin/perl
    use utf8;
    use Lingua::ZH::MMSEG;

    my $zh_string = "現代漢語的複合動詞可分三個結構語意關係來探討";

    my @phrases = mmseg($zh_string);    # use the MMSEG algorithm
    @phrases    = fmm($zh_string);      # use the Forward Maximum Matching algorithm

    while (<>) {
        chomp;
        push @phrases, mmseg;           # mmseg and fmm parse $_ by default
    }

    # you can get a phrase's frequency by calling word_freq
    print "$_ ", word_freq($_), "\n" for @phrases;
DESCRIPTION
A problem in computational analysis of Chinese text is that there are no word boundaries in conventionally printed text. Since the word is such a fundamental linguistic unit, it is necessary to identify words in Chinese text so that higher-level analyses can be performed.
Lingua::ZH::MMSEG implements the MMSEG algorithm originally developed by Chih-Hao Tsai. The whole module is written in pure Perl, and its phrase library comes from 新酷音, forked from OpenFoundry.
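To give a feel for the kind of disambiguation MMSEG performs, here is a minimal sketch. It is not this module's code. It assumes the commonly described form of the algorithm: look ahead over chunks of up to three words, prefer the chunk with the greatest total length, then the largest average word length, then the smallest variance of word lengths, and commit only the first word of the winner. The fourth rule (morphemic freedom of one-character words) is omitted. The dictionary and the names `toy_mmseg`, `candidates`, `chunks`, and `score` are hypothetical.

```perl
use strict;
use warnings;
use utf8;
use List::Util qw(sum);

# Hypothetical toy dictionary; the real module uses the 新酷音 phrase library.
my %dict    = map { $_ => 1 } qw(研究 研究生 生命 起源);
my $max_len = 3;    # longest entry in %dict, in characters

# Candidate words at $pos: any single character, plus dictionary entries.
sub candidates {
    my ($text, $pos) = @_;
    my @words;
    for my $len (1 .. $max_len) {
        last if $pos + $len > length($text);
        my $w = substr($text, $pos, $len);
        push @words, $w if $len == 1 || $dict{$w};
    }
    return @words;
}

# All chunks of up to $depth consecutive words starting at $pos.
sub chunks {
    my ($text, $pos, $depth) = @_;
    return ([]) if $depth == 0 || $pos >= length($text);
    my @result;
    for my $w (candidates($text, $pos)) {
        push @result, [ $w, @$_ ] for chunks($text, $pos + length($w), $depth - 1);
    }
    return @result;
}

# Score a chunk: total length, average word length, negated variance.
# Larger is better for every component.
sub score {
    my @len   = map { length } @{ $_[0] };
    my $total = sum(@len);
    my $avg   = $total / @len;
    my $var   = sum(map { ($_ - $avg) ** 2 } @len) / @len;
    return ($total, $avg, -$var);
}

sub toy_mmseg {
    my ($text) = @_;
    my ($pos, @out) = (0);
    while ($pos < length($text)) {
        my ($best, @best_score);
        for my $chunk (chunks($text, $pos, 3)) {
            my @s   = score($chunk);
            my $cmp = 1;                # the first chunk always becomes the best
            if ($best) {
                $cmp = 0;
                $cmp ||= $s[$_] <=> $best_score[$_] for 0 .. 2;
            }
            ($best, @best_score) = ($chunk, @s) if $cmp > 0;
        }
        push @out, $best->[0];          # commit only the first word of the best chunk
        $pos += length($best->[0]);
    }
    return @out;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' / ', toy_mmseg("研究生命起源")), "\n";    # 研究 / 生命 / 起源
```

With this toy dictionary, the chunks 研究/生命/起源 and 研究生/命/起源 tie on total length and on average word length; the variance rule breaks the tie in favour of the evenly sized words.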
INSTALL
To install this module, run:

    cpanm Lingua::ZH::MMSEG

If you don't have cpanm,
FUNCTIONS
mmseg
    @phrases = mmseg($zh_string);
    @phrases = mmseg;    # use $_ automatically

mmseg converts a Mandarin Chinese string into a sequence of phrases using the MMSEG algorithm. If the input string contains any English, each contiguous run of ASCII characters is treated as a single phrase. For example:

    $_ = "這裡有中文Today is Wednesday.這邊又有中文 I go to school on Friday.";
    print "$_\n" for mmseg;

This prints:

    這裡有
    中文
    Today is Wednesday.
    這邊
    又有
    中文
    I go to school on Friday.

ASCII characters are recognized by the pattern /[ -~]+/, which matches runs of printable ASCII from space to tilde.
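The effect of that pattern can be seen in isolation. The sketch below does not use the module; it only shows how a capturing split on /[ -~]+/ separates ASCII runs from the surrounding Chinese text. The helper name `split_ascii_runs` is hypothetical, and how the module applies the pattern internally is not shown here.

```perl
use strict;
use warnings;
use utf8;

# Split a string into alternating non-ASCII and printable-ASCII runs.
# The capture group makes split return the ASCII runs as well;
# grep drops the empty leading field produced when the text starts with ASCII.
sub split_ascii_runs {
    my ($text) = @_;
    return grep { length } split /([ -~]+)/, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print "[$_]\n" for split_ascii_runs("這裡有中文Today is Wednesday.這邊又有中文");
```

Because the character class starts at the space character, spaces and punctuation stay inside the ASCII run, which is why a whole English sentence comes out as one phrase.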
fmm (Forward Maximum Matching)
    @phrases = fmm($zh_string);
    @phrases = fmm;    # use $_ automatically

fmm uses forward maximum matching (the so-called longest-match principle) to convert a Mandarin Chinese string into a sequence of phrases. It handles ASCII strings by the same rule as mmseg. The advantage of fmm is its lower complexity compared to mmseg; the disadvantage is that it cannot resolve ambiguity when there are multiple ways to segment a string.
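The longest-match principle and its weakness can be illustrated with a minimal sketch. This is not the module's implementation: the four-word dictionary and the name `toy_fmm` are hypothetical, and ASCII handling is left out.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical toy dictionary; the real module uses the 新酷音 phrase library.
my %dict = map { $_ => 1 } qw(研究 研究生 生命 起源);

# Length of the longest dictionary entry, in characters.
my $max_len = 0;
for my $w (keys %dict) {
    $max_len = length($w) if length($w) > $max_len;
}

# Forward maximum matching: at each position take the longest dictionary
# word that starts there, falling back to a single character.
sub toy_fmm {
    my ($text) = @_;
    my @out;
    my $pos = 0;
    while ($pos < length($text)) {
        my $remaining = length($text) - $pos;
        my $len = $max_len < $remaining ? $max_len : $remaining;
        $len-- while $len > 1 && !$dict{ substr($text, $pos, $len) };
        push @out, substr($text, $pos, $len);
        $pos += $len;
    }
    return @out;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' / ', toy_fmm("研究生命起源")), "\n";    # 研究生 / 命 / 起源
```

Greedy matching grabs 研究生 first and leaves 命 stranded, even though 研究 / 生命 / 起源 is the more plausible reading. A single left-to-right pass with no lookahead cannot resolve this kind of ambiguity.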
word_freq
    $freq = word_freq($phrase);
    $freq = word_freq;    # use $_ automatically

word_freq returns the phrase frequency defined in 新酷音.
AUTHOR
Felix Ren-Chyan Chern (dryman) <idryman@gmail.com>