02 Jan 2012 08:38:55 UTC
- Distribution: Lingua-ZH-MMSEG
- Module version: 0.4005
- License: open_source
- Perl: v5.8.6
NAME
Lingua::ZH::MMSEG - Mandarin Chinese segmentation
SYNOPSIS
    #!/usr/bin/perl
    use utf8;
    use Lingua::ZH::MMSEG;

    my $zh_string = "現代漢語的複合動詞可分三個結構語意關係來探討";

    my @phrases = mmseg($zh_string);   # use the MMSEG algorithm
    @phrases    = fmm($zh_string);     # use the Forward Maximum Matching algorithm

    # mmseg and fmm parse $_ automatically when called without arguments
    while (<>) {
        chomp;
        push @phrases, mmseg;
    }

    # you can get the phrase frequency by calling word_freq
    print $_, word_freq($_) for @phrases;
DESCRIPTION
A problem in computational analysis of Chinese text is that there are no word boundaries in conventionally printed text. Since the word is such a fundamental linguistic unit, it is necessary to identify words in Chinese text so that higher-level analyses can be performed.
Lingua::ZH::MMSEG implements the MMSEG algorithm originally developed by Chih-Hao Tsai. The whole module is rewritten in pure Perl, and the phrase library is 新酷音, forked from OpenFoundry.
INSTALL
To install this module, run:
    cpanm Lingua::ZH::MMSEG
If you don't have cpanm, install it first:
    curl -LO http://bit.ly/cpanm
    chmod +x cpanm
    sudo cp cpanm /usr/local/bin
FUNCTIONS
mmseg
    @phrases = mmseg($zh_string);
    @phrases = mmseg;              # use $_ automatically
mmseg converts a Mandarin Chinese string into a sequence of phrases using the MMSEG algorithm. If the input string contains English text, each contiguous run of ASCII characters is parsed as a single phrase. For example:
    $_ = "這裡有中文Today is Wednesday.這邊又有中文 I go to school on Friday.";
    print "$_\n" for mmseg;
This prints:
    這裡有
    中文
    Today is Wednesday.
    這邊
    又有
    中文
    I go to school on Friday.
ASCII characters are recognized by the pattern /[ -~]+/.
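To see what that pattern captures, here is a minimal sketch that splits a string into printable-ASCII runs and everything else. The helper name split_ascii_runs is hypothetical and is not part of Lingua::ZH::MMSEG; the module's internal handling may well differ.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Lingua::ZH::MMSEG): split a string into
# alternating runs of printable ASCII (space through tilde) and other text.
sub split_ascii_runs {
    my ($text) = @_;
    # The capture group makes split keep the ASCII runs in its result;
    # grep drops any empty strings produced at the edges.
    return grep { length } split /([ -~]+)/, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
my $input = "這裡有中文Today is Wednesday.這邊又有中文 I go to school on Friday.";
print "[$_]\n" for split_ascii_runs($input);
```

Because the character class starts at the space character, spaces and punctuation stay inside the ASCII run, which is why a whole English sentence comes out as one phrase.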
fmm (Forward Maximum Matching)
    @phrases = fmm($zh_string);
    @phrases = fmm;                # use $_ automatically
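As a rough illustration of the longest-match principle behind forward maximum matching, the sketch below greedily takes the longest dictionary entry at each position. The function toy_fmm, its $max_len parameter, and the four-entry dictionary are all assumptions made for this example; the real fmm uses the 新酷音 phrase library and may be implemented differently.

```perl
use strict;
use warnings;
use utf8;

# Toy dictionary for illustration only (an assumption, not the module's data).
my %TOY_DICT = map { $_ => 1 } qw(研究 研究生 生命 起源);

# Hypothetical helper: greedy longest-match segmentation.
sub toy_fmm {
    my ($text, $dict, $max_len) = @_;
    my @phrases;
    my $pos = 0;
    while ($pos < length($text)) {
        my $remaining = length($text) - $pos;
        my $len = $max_len < $remaining ? $max_len : $remaining;
        # Shrink the candidate until it is in the dictionary; a single
        # character is always accepted as a fallback.
        $len-- while $len > 1 && !$dict->{ substr($text, $pos, $len) };
        push @phrases, substr($text, $pos, $len);
        $pos += $len;
    }
    return @phrases;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' / ', toy_fmm('研究生命的起源', \%TOY_DICT, 3)), "\n";
# prints: 研究生 / 命 / 的 / 起源
```

The greedy choice of 研究生 at the start prevents the more natural reading 研究 / 生命 / 的 / 起源, which is the kind of ambiguity a pure longest-match approach cannot resolve.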
fmm uses forward maximum matching (the so-called longest-match principle) to convert a Mandarin Chinese string into a sequence of phrases. It handles ASCII strings with the same rule as mmseg. The advantage of fmm is its lower complexity compared to mmseg; the disadvantage is that it cannot resolve ambiguity when there are multiple ways to segment a string.
word_freq
    $freq = word_freq($phrase);
    $freq = word_freq;             # use $_ automatically
word_freq returns the phrase frequency defined in 新酷音.
AUTHOR
Felix Ren-Chyan Chern (dryman)
<idryman@gmail.com>
LICENSE AND COPYRIGHT