NAME

Lingua::ZH::TaBE - Chinese processing via libtabe

SYNOPSIS

    use Lingua::ZH::TaBE;

    my $tabe = Lingua::ZH::TaBE->new(
        tsi_db => '/usr/local/share/tabe/tsiyin/tsi.db'
    );

    # Phrase splitter
    my @words = $tabe->split(
        "當我們在電腦中處理中文資訊時，相信其中最惱人的".
        "狀況之一，莫過於想打的字打不出來了。"
    );

    # Chaining various components
    print $tabe->Chu("道可道，非常道。")    # sentence
        ->chunks->[2]       # 非常道        # chunk
        ->tsis->[0]         # 非常          # phrase
        ->zhis->[1]         # 常            # character
        ->yins->[0]         # ㄔㄤˊ         # pronunciation
        ->zuyins->[0];      # ㄔ            # phonetic symbol

DESCRIPTION

This module is a Perl interface to the TaBE (Taiwan and Big5 Encoding) library, a unified interface and library for dealing with Chinese words, phrases, sentences, and phonetic symbols. It is intended to serve as a foundation for Chinese text processing.

Lingua::ZH::TaBE provides an object-oriented interface (preferred), as well as a procedural interface consisting of all C functions in tabe.h.

Object-Oriented Interface

Lingua::ZH::TaBE

new( tsi_db => $file )

Creates a LibTaBE handle and opens databases.

split( $string [, $method] )

Splits the text in $string and returns a list of strings representing the words obtained. You may specify Complex or Backward as $method to use an alternate segmentation algorithm.
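If your program holds text as Perl character strings (for example, decoded from UTF-8), a small wrapper can handle the conversion around split(). The sketch below is not part of Lingua::ZH::TaBE: split_text is a hypothetical helper, and it assumes that split() takes and returns Big5 byte strings, which the library's name suggests but this documentation does not state explicitly.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not part of Lingua::ZH::TaBE.
# Assumption: split() takes and returns Big5 byte strings.
sub split_text {
    my ($tabe, $text, $method) = @_;

    # Perl character string -> Big5 bytes for libtabe
    my $big5 = encode('big5-eten', $text);

    # Pass $method only when given, so the default algorithm applies otherwise
    my @words = defined $method
        ? $tabe->split($big5, $method)
        : $tabe->split($big5);

    # Big5 bytes -> Perl character strings
    return map { decode('big5-eten', $_) } @words;
}

# Usage, given a handle created with new() as in the SYNOPSIS:
#   my @default  = split_text($tabe, $text);
#   my @complex  = split_text($tabe, $text, 'Complex');
#   my @backward = split_text($tabe, $text, 'Backward');
```

Comparing the three result lists on the same input is a quick way to see where the segmentation algorithms disagree.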

Chu(), Chunk(), Tsi(), Zhi(), Yin(), ZuYin()

Constructors for the various levels of objects, each taking one argument for initialization.
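These constructors let you enter the object hierarchy at any level instead of starting from a whole sentence. As a rough sketch, the hypothetical helper below (first_zuyin, not part of the module) starts from a single character via Zhi() and walks down to its first phonetic symbol, using the same array-reference access that the SYNOPSIS uses. It assumes that Zhi() accepts a Big5 byte string and that the resulting ZuYin object stringifies to Big5 bytes, since the SYNOPSIS prints one directly.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not part of Lingua::ZH::TaBE.
# Assumptions: Zhi() takes a Big5 byte string, and a ZuYin object
# stringifies to Big5 bytes (the SYNOPSIS prints one directly).
sub first_zuyin {
    my ($tabe, $char) = @_;

    # Character object built from the Big5 form of $char
    my $zhi = $tabe->Zhi(encode('big5-eten', $char));

    # First pronunciation, if the character has any
    my $yins = $zhi->yins or return;
    my $yin  = $yins->[0];
    return unless defined $yin;

    # First phonetic symbol of that pronunciation
    my $zuyins = $yin->zuyins or return;
    my $sym    = $zuyins->[0];
    return unless defined $sym;

    return decode('big5-eten', "$sym");
}

# Usage, given a handle created with new() as in the SYNOPSIS:
#   my $symbol = first_zuyin($tabe, $char);
```

The helper returns nothing when a lookup comes back empty, so callers can tell "no reading found" apart from a real symbol.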

Lingua::ZH::TaBE::Chu

chunks()

Lingua::ZH::TaBE::Chunk

tsis([$method])

Lingua::ZH::TaBE::Tsi

zhis()

yins()

Lingua::ZH::TaBE::Zhi

yins()

ToZhi()

ToZhiCode()

IsBig5Code()

ToPackedBig5Code()

LookupRefCount()

Lingua::ZH::TaBE::Yin

zuyins()

zhis()

ToYin()

ToZuYinSymbolSequence()

Lingua::ZH::TaBE::ZuYin

yin()

zhi()

Procedural Interface

  struct TsiDB       *TsiDBOpen(int type, const char *db_name, int flags);
  int                 TsiInfoLookupPossibleTsiYin(struct TsiDB *tsidb,
                                                    struct TsiInfo *tsi);
  struct TsiYinDB    *TsiYinDBOpen(int type, const char *db_name,
                                     int flags);
  int                 ChuInfoToChunkInfo(struct ChuInfo *chu);
  int                 ChunkSegmentationSimplex(struct TsiDB *tsidb,
                                                 struct ChunkInfo *chunk);
  int                 ChunkSegmentationComplex(struct TsiDB *tsidb,
                                                 struct ChunkInfo *chunk);
  int                 ChunkSegmentationBackward(struct TsiDB *tsidb,
                                                  struct ChunkInfo *chunk);
  int                 TsiInfoLookupZhiYin(struct TsiDB *tsidb,
                                            struct TsiInfo *z);
  ZhiStr              YinLookupZhiList(Yin yin);
  ZuYinSymbolSequence YinToZuYinSymbolSequence(Yin yin);
  Yin                 ZuYinSymbolSequenceToYin(ZuYinSymbolSequence str);
  const Zhi           ZuYinIndexToZuYinSymbol(ZuYinIndex idx);
  ZuYinIndex          ZuYinSymbolToZuYinIndex(ZuYinSymbol sym);
  ZuYinIndex          ZozyKeyToZuYinIndex(int key);
  int                 ZhiIsBig5Code(Zhi zhi);
  ZhiCode             ZhiToZhiCode(Zhi zhi);
  Zhi                 ZhiCodeToZhi(ZhiCode code);
  int                 ZhiCodeToPackedBig5Code(ZhiCode code);
  unsigned long int   ZhiCodeLookupRefCount(ZhiCode code);

CAVEATS

The TsiYin family of functions is still incomplete.

SEE ALSO

ftp://xcin.linux.org.tw/pub/xcin/libtabe/devel/

http://libtabe.sourceforge.net/

AUTHORS

Autrijus Tang <autrijus@autrijus.org>

COPYRIGHT

Copyright 2003 by Autrijus Tang <autrijus@autrijus.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
