Text::Util::Chinese - A collection of subroutines for processing Chinese Text
The subroutines provided by this module are for processing Chinese text. Conventionally, all input strings are assumed to be wide-characters. No `decode_utf8` or `utf8::decode` were done in this module. Users of this module should deal with input-decoding first before passing values to these subroutines.
Given the fact that corpus files are usually large, it may be a good idea to avoid slurping the entire input stream. Conventionally, subroutines in this modules accept "input iterator" as its way to receive a small piece of corpus at a time. The "input iterator" is a CodeRef that returns a string every time it is called, or undef if there are nothing more to be processed. Here's a trivial example to open a file as an input iterator:
sub open_as_iterator { my ($path) = @_ open my $fh, '<', $path; return sub { my $line = <$fh>; return undef unless defined($line); return decode_utf8($line); } } my $input_iter = open_as_iterator("/data/corpus.txt");
This $input_iter can be then passed as arguments to different subroutines.
$input_iter
Although in the rest of this document, `Iter` is used as a Type notation for iterators. It is the same as a CODE reference.
This extracts words from Chinese text. A word in Chinese text is a token with N charaters. These N characters is often used together in the input and therefore should be a meaningful unit.
The input parameter is a iterator -- a subroutine that must return a string of Chinese text each time it is invoked. Or, when the input is exhausted, it must return undef. For example:
open my $fh, '<', 'book.txt'; my $word_iter = word_iterator( sub { my $x = <$fh>; return decode_utf8 $x; });
The type of return value is Iter (CODE ref).
This does the same thing as word_iterator, but retruns the exhausted list instead of iterator.
word_iterator
For example:
open my $fh, '<', 'book.txt'; my $words = extract_words( sub { my $x = <$fh>; return decode_utf8 $x; });
The type of return value is ArrayRef[Str].
It is likely that this subroutine returns an empty ArrayRef with no contents. It is only useful when the volume of input is a leats a few thousands of characters. The more, the better.
This subroutine extract meaningful tokens that are prefix or suffix of input.
The 2nd argument $opts is a HashRef with parameters threshold and lengths. threshold should be an Int, lengths should be an ArrayRef[Int] and that constraints the lengths of prefixes and suffixes to be extracted.
$opts
threshold
lengths
The default value for threshold is 9, while the default value for lengths is [2,3]
[2,3]
Similar to presuf_iterator, but returns a ArrayRef[Str] instead.
presuf_iterator
This subroutine split input into sentences. It takes an text iterator, and returns another one.
This subroutine split input into smallelr phrases. It takes an text iterator, and returns another one.
This subroutine split text into tokens, where each token is the same writing script.
This subroutine does a naive test on the input $text and returns true if $text looks like it is written in Simplified Chinese.
$text
Kang-min Liu <gugod@gugod.org>
Unlicense https://unlicense.org/
To install Text::Util::Chinese, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Text::Util::Chinese
CPAN shell
perl -MCPAN -e shell install Text::Util::Chinese
For more information on module installation, please visit the detailed CPAN module installation guide.