CEDict::Pinyin - Validates pinyin strings


    # This is from the test case provided with this module:

    use Test::More;
    use CEDict::Pinyin;

    my @good = ("ji2 - rui4 cheng2", "xi'an", "dian4 nao3, yuyan2", "kongzi");
    my @bad  = ("123", "not pinyin", "gu1 gu1 fsck4 fu3");
    my $py   = CEDict::Pinyin->new;

    for (@good) {
      $py->setSource($_);   # load the current string before validating it
      ok($py->isPinyin, "correctly validated good pinyin");
      print "pinyin: " . $py->getSource . "\n";
      print "parts: " . join(', ', @{$py->getParts}) . "\n";
    }

    for (@bad) {
      $py->setSource($_);   # load the current string before validating it
      ok(!$py->isPinyin, "correctly invalidated bad pinyin");
      print "pinyin: " . $py->getSource . "\n";
      print "parts: " . join(', ', @{$py->getParts}) . "\n";
    }

    done_testing;


This class helps you validate and parse pinyin. Currently the pinyin must follow certain formatting rules before this class's validation method considers it "valid". Every valid pinyin syllable must be expressed in characters within the 7-bit ASCII range, so validation fails on a string like "nán nǚ lǎo shào". Tones should instead be written as numbers after the letters of each syllable: instead of the string above, use "nan2 nv lao3 shao4". Accepting accented characters that represent the tone of a syllable is a feature I hope to add in a future version of this module.

The parser first looks at the entire string you pass it to see whether it is even worth parsing. The regular expression used is shown below.

    /^[A-Za-z]+[A-Za-z1-5,'\- ]*$/

If the pinyin doesn't match this regex, isPinyin returns false without parsing the string any further. In other words, if you want to use this module to validate pinyin that is not in exactly the format just described, you need to clean up your pinyin strings a little first.
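To make the pre-check concrete, here is a minimal sketch in plain Perl that applies only the regular expression quoted above. The helper name passes_precheck is made up for this illustration and is not part of CEDict::Pinyin. The sketch does not reproduce the module's syllable-level validation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CEDict::Pinyin): applies only the
# pre-check regex quoted above, not the syllable-level validation.
sub passes_precheck {
    my ($text) = @_;
    return $text =~ /^[A-Za-z]+[A-Za-z1-5,'\- ]*$/ ? 1 : 0;
}

my @samples = (
    "nan2 nv lao3 shao4",                       # tone numbers, ASCII only
    "n\x{e1}n n\x{1da} l\x{1ce}o sh\x{e0}o",    # the accented form of the same phrase
    "123",                                      # does not start with a letter
    "not pinyin",                               # English, but letters and spaces only
);

binmode(STDOUT, ':encoding(UTF-8)');
for my $text (@samples) {
    my $verdict = passes_precheck($text) ? "passes pre-check" : "rejected by pre-check";
    print "$text: $verdict\n";
}
```

Note that "not pinyin" gets past the pre-check, since it contains only letters and a space. The pre-check only filters out strings that are obviously not worth parsing; telling English apart from pinyin is left to the syllable-by-syllable parsing that follows.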

Again, hopefully future versions of this class will be more flexible in what they accept as valid pinyin. However, we want to be sure that what we are looking at is really pinyin and not English words, since this module was originally written in part to distinguish a pinyin string from English. I would like to keep this idea in future versions, so if you update the class with your own code, please keep that in mind.



Creates a new CEDict::Pinyin object. SCALAR should be a string containing the pinyin you want to work with. If SCALAR is omitted, it can be set later using the setSource method.


Sets the source string to work with. Currently only the isPinyin method accesses this attribute. Returns the previously set pinyin string.


Returns the string of pinyin using diacritic marks instead of numbers to represent tone information. For example, if the source pinyin is "nan2 nv lao3 shao4" then $obj->diacritic will return "nán nǚ lǎo shào".


Returns the individually parsed elements of the pinyin source string.


Returns the pinyin source string that was supplied earlier either via the constructor or $obj->setSource.

$obj->isPinyin or $obj->isPinyin(ARRAYREF)

Validates the pinyin supplied to the constructor or to $obj->setSource(SCALAR). If an ARRAYREF is supplied as an argument, adds each syllable of the parsed pinyin to the array. If a syllable is considered invalid then the method stops parsing and immediately returns false. Returns true otherwise.
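As a rough usage sketch of the ARRAYREF form, the snippet below collects the parsed syllables into a caller-supplied array. It assumes CEDict::Pinyin is installed. The helper name syllables_of is invented for this example, and the exact form of the syllables placed in the array is not shown here because it depends on the module.

```perl
use strict;
use warnings;
use CEDict::Pinyin;

# Hypothetical helper (not part of the module): returns an array reference
# holding the parsed syllables, or undef if the string is rejected.
sub syllables_of {
    my ($text) = @_;
    my $py = CEDict::Pinyin->new($text);
    my @syllables;
    return $py->isPinyin(\@syllables) ? \@syllables : undef;
}

for my $text ("dian4 nao3, yuyan2", "123") {
    my $parts = syllables_of($text);
    if ($parts) {
        print "$text => ", join(' | ', @$parts), "\n";
    }
    else {
        print "$text => not valid pinyin\n";
    }
}
```

Because parsing stops at the first invalid syllable, the array may already hold the syllables parsed before that point when the method returns false. The helper discards the array in that case so callers never see a partial result.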


Takes a string containing pinyin and returns a regular expression that can be used with the MySQL database (so far only tested against the 5.1 series). Accepts an asterisk ("*") as a wildcard. Note that the isPinyin method will return false when validating such a string, so if you plan on first validating the pinyin then generating the regex, make sure you are validating the string without the asterisks ($string =~ s/\*//g).
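The validate-then-generate order described above can be sketched as follows. This is plain Perl and only demonstrates the stripping step together with the pre-check regex from the description. The helper name strip_wildcards is made up for the illustration, and the call that generates the MySQL regex is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CEDict::Pinyin): returns a copy of the
# string with every "*" removed, leaving the caller's string untouched.
sub strip_wildcards {
    my ($text) = @_;
    (my $plain = $text) =~ s/\*//g;
    return $plain;
}

my $query = "ni3 hao*";               # wildcard form, kept for regex generation
my $plain = strip_wildcards($query);  # "ni3 hao", the form to validate

# The pre-check regex has no "*" in its character class, so the wildcard
# form cannot get past it, while the stripped form can.
my $precheck = qr/^[A-Za-z]+[A-Za-z1-5,'\- ]*$/;
print "wildcard form passes pre-check: ", ($query =~ $precheck ? "yes" : "no"), "\n";
print "stripped form passes pre-check: ", ($plain =~ $precheck ? "yes" : "no"), "\n";
```

Working on a copy matters here: the stripped string goes to isPinyin, while the original string, asterisks included, is the one to hand to the regex generator afterwards.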


Christopher Davaz


Version 0.02004 (Mar 01 2010)


Copyright (c) 2009 Christopher Davaz. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.