NAME
Lingua::ZH::Toke - Chinese Tokenizer on steroids
SYNOPSIS
use Lingua::ZH::Toke; # add 'utf8' to use unicode strings
# Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );
# Easy tokenization via array dereferencing
print $token->[0] # Fragment - 那人卻在
->[2] # Phrase - 卻在
->[0] # Character - 卻
->[0] # Pronounciation - ㄑㄩㄝˋ
->[2]; # Phonetic - ㄝ
# Magic histogram via hash dereferencing
print $token->{'那人卻在'}; # 1 - One such fragment there
print $token->{'意興闌珊'}; # 1 - One such phrase there
print $token->{'發意興闌'}; # undef - That's not a phrase
print $token->{'珊'}; # 2 - Two such characters there
print $token->{'ㄧˋ'}; # 2 - Two such pronunciations: 益意
print $token->{'ㄨ'}; # 3 - Three such phonetics: 那火處
# Iteration over fragments
while (my $fragment = <$token>) {
    # Iteration over phrases
    while (my $phrase = <$fragment>) {
        # ...
    }
}
DESCRIPTION
This module puts a thin wrapper around Lingua::ZH::TaBE by blessing references to TaBE's objects into their English counterparts.
Besides offering more readable class names, this module also offers various overloaded methods for tokenization; please see "SYNOPSIS" for the three major ones.
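The overloading technique itself can be illustrated without TaBE installed. The following is a minimal sketch, not this module's implementation: ToyToken is a hypothetical class that overloads array dereferencing, hash dereferencing and the <> iterator using Perl's core overload pragma.

```perl
use strict;
use warnings;

# ToyToken is a hypothetical stand-in, not part of Lingua::ZH::Toke.
package ToyToken;

# The object is a blessed scalar ref holding a plain hash ref, so the
# '@{}' and '%{}' handlers can reach the state without recursing.
use overload
    '@{}' => sub { ${ $_[0] }->{parts} },    # $obj->[N]: N-th part
    '%{}' => sub { ${ $_[0] }->{count} },    # $obj->{X}: occurrences of X
    '<>'  => sub {                           # <$obj>: next part, then undef
        my $state = ${ $_[0] };
        if ( $state->{pos} < @{ $state->{parts} } ) {
            return $state->{parts}[ $state->{pos}++ ];
        }
        $state->{pos} = 0;                   # rewind for the next loop
        return undef;
    };

sub new {
    my ( $class, @parts ) = @_;
    my %count;
    $count{$_}++ for @parts;
    my $state = { parts => [@parts], count => \%count, pos => 0 };
    return bless \$state, $class;
}

package main;

my $token = ToyToken->new(qw(a b a));
print $token->[1], "\n";    # array dereference: prints "b"
print $token->{a}, "\n";    # hash dereference: prints "2"
while ( defined( my $part = <$token> ) ) {
    print "part: $part\n";  # iterates over a, b, a
}
```

Keeping the state behind a scalar reference is one common way to overload both '@{}' and '%{}' on the same object; other layouts are possible.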
Since Lingua::ZH::TaBE is a Big5-oriented module, we also provide a simple utf8 layer around it; if you have Perl version 5.6.1 or later, just use this:
use utf8;
use Lingua::ZH::Toke 'utf8';
With the utf8 flag set, all Toke objects will stringify to Unicode strings, and constructors will take either Unicode strings or Big5-encoded bytestrings.
Note that on Perl 5.6.x, Encode::compat is needed for the utf8 feature to work.
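To make the difference between the two input forms concrete, here is a small sketch using the core Encode module on Perl 5.8 or later. It does not show how this module performs the conversion internally; it only shows the same four characters as a Unicode string and as a Big5 bytestring, assuming Encode's 'big5' alias is available as in a standard Perl installation.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Four characters (U+90A3 U+4EBA U+537B U+5728) written with escapes,
# so this file itself can stay plain ASCII.
my $unicode = "\x{90A3}\x{4EBA}\x{537B}\x{5728}";

# The same text as a Big5 bytestring: two bytes per character.
my $big5 = encode( 'big5', $unicode );

# Decoding the bytes gives back the original Unicode string.
my $round_trip = decode( 'big5', $big5 );

printf "%d characters, %d Big5 bytes\n", length $unicode, length $big5;
print $round_trip eq $unicode ? "round trip ok\n" : "round trip failed\n";
```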
METHODS
The constructor methods correspond to the six object levels: ->Sentence, ->Fragment, ->Phrase, ->Character, ->Pronounciation and ->Phonetic. Each of them takes one string argument, representing the string to be tokenized.
The ->new method is an alias to ->Sentence.
All object methods, except ->new, are passed to the underlying Lingua::ZH::TaBE object.
CAVEATS
This module does not care about efficiency or memory consumption yet, hence it's likely to fail miserably if you demand either of them. Patches welcome.
As the name suggests, the chosen interface is very bizarre. Use it at the risk of your own sanity.
SEE ALSO
Lingua::ZH::TaBE, Encode::compat, Encode
AUTHORS
Autrijus Tang <autrijus@autrijus.org>
COPYRIGHT
Copyright 2003 by Autrijus Tang <autrijus@autrijus.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html