The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary

SYNOPSIS

  use Lingua::ZH::CCDICT;

  my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );

DESCRIPTION

This module provides a Perl interface to the CCDICT dictionary created by Thomas Chin. This dictionary is indexed by Unicode character number (traditional character), and contains information about these characters.

CCDICT is released under a Creative Commons Attribution License (version 2.5).

The dictionary contains the following information, though not all information is avaialable for every character.

  • Radical number

    The number of the radical. Always available.

  • Index

    The total number of strokes minus the number of strokes in the radical. Always available.

  • Total stroke count

    The total number of strokes in the character.

  • Cangjie

    The Cangjie Chinese input system code.

  • Four Corner

    The Four Corner Chinese input system code.

In addition, the dictionary contains English definitions (often multiple definitions), and romanizations for the character in different languages and systems. The romanizations available include Pinjim for Hakka, Jyutping for Cantonese and (Hanyu) Pinyin for Mandarin.

DICTIONARY BUGS

The CCDICT dictionary is distributed by Thomas Chin in a simple but non-standard, textual ASCII-only format. I've tried to work around errors or ambiguities in the dictionary data, although there are probably still oddities lurking. Please send bug reports to me so I can figure out whether the error is in my code or the dictionary itself.

STORAGE

This module is capable of parsing the CCDICT format file, and can also store the data in other formats (just Berkeley DB fo rnow).

Each storage system is implemented via a module in the Lingua::ZH::CCDICT::Storage::* class hierarchy. All of these modules are subclasses of Lingua::ZH::CCDICT class, and implement its methods for searching the dictionary.

In addition some storage classes may offer additional methods.

Storage Subclasses

The following storage subclasses are available:

  • Lingua::ZH::CCDICT::Storage::InMemory

    This class stores the entire parsed dictionary in memory.

  • Lingua::ZH::CCDICT::Storage::BerkeleyDB

    This class can convert the CCDICT source file to a set of BerkeleyDB files.

USAGE

This module allows you to look up information in the dictionary based on a number of keys. These include the Unicode character (as a character, not its number), stroke count, radical number, and any of the various romanization systems.

METHODS

This class provides the following methods.

Lingua::ZH::CCDICT->new(...)

This method always takes at least one parameter, "storage". This indicates what storage subclass to use. The current options are "InMemory" and "BerkeleyDB".

Any other parameters given will be passed to the appropriate subclass's new() method.

$dict->parse_source_file($filename)

If you don't specify a file, then it will use the data file distributed with this module. This is probably what you want, unless you have a local copy of the dictionary that you want to work with. Note that the dictionary format has changes a fair bit between versions, so this probably won't work with much older or newer versions of the CCDICT data.

This method is what does the real work of creating a dictionary. Note that if you are not using the InMemory storage subclass, you only need to parse the source file once, and then you can reuse the stored data.

MATCH METHODS

When doing a lookup based on the romanization of a character, the tone is indicated with a number at the end of the syllable, as opposed to using the Unicode character combining the latin letter with the diacritic.

In addition, lookups based on a Pinyin romanization should use the u-with-umlaut character (character 252 in Unicode) rather than two "u" characters.

The return value for any lookup will be an object in a Lingua::ZH::CCDICT::ResultSet subclass.

Result sets always return matches in ascending Unicode character order.

$ccdict->match_unicode(@chars)

This method matches on one or more Unicode characters. Unicode characters should be given as Perl characters (i.e. chr(0x7D20)), not as a number.

This dictionary index uses traditional Chinese characters. Simplified character lookups will not work (but you could use Encode::HanConvert to convert simple to traditional first).

$ccdict->match_radical(@numbers)

Given a set of numbers, this method returns those characters containing the specified radical(s).

$ccdict->match_index(@numbers)

Given a set of numbers, this method returns those characters containing the specified index(es).

$ccdict->match_stroke_count(@numbers)

Given a set of numbers, this method returns those characters containing the specified number(s) of strokes.

$ccdict->match_cangjie(@codes)

Given a set of Cangjie codes, this method returns the character(s) for those code(s).

$ccdict->match_four_corner(@codes)

Given a set of Four Corner codes, this method returns the character(s) for those code(s).

$ccdict->match_pinjim(@romanizations)

$ccdict->match_jyutping(@romanizations)

$ccdict->match_pinyin(@romanizations)

$ccdict->all_characters()

Returns a result set containing all of the characters in the dictionary.

$ccdict->entry_count()

Returns the number of entries in the dictionary

ENVIRONMENT VARIABLES

There are several environment variables you can set to change this module's behavior.

  • CCDICT_DEBUG_SOURCE

    Causes a warning when bad data is enountered in the ccdict dictionary source. This is primarily useful if you want to find bugs in the dictionary itself.

  • CCDICT_VERBOSE

    Tells the module to give you progress reports when parsing the source file. These are sent to STDERR.

AUTHOR

David Rolsky <autarch@urth.org>

COPYRIGHT

Copyright (c) 2002-2007 David Rolsky. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

CCDICT is copyright (c) 1995-2006 Thomas Chin.

SEE ALSO

Lingua::ZH::CEDICT - for converting between Chinese and English.

Encode::HanConvert - for converting between simplified and traditional characters in various character sets.

http://www.chinalanguage.com/dictionaries/CCDICT/ - the home of the CCDICT dictionary.