Lingua::ZH::HanConvert - convert between Traditional and Simplified Chinese characters
    #!perl -lw
    use Lingua::ZH::HanConvert qw(simple trad);
    use utf8;

    my $t = "國";      # Traditional symbol for "country", unicode 22283
    # or: my $t = v22283;
    print simple($t);  # Simplified "country", 国 (unicode 22269)

    my $s = "鱼";      # Simplified symbol for "fish", unicode 40060
    # or: my $s = v40060;
    print trad($s);    # Traditional "fish", 魚 (unicode 39770)
In the 1950s, the Chinese government simplified over 2000 Chinese characters. Taiwan and Hong Kong still use the traditional characters. The simplified characters are hard to read if you only know the traditional ones, and vice versa. This module attempts to convert Chinese text between the two forms, using character-by-character transliteration.
Note that this module only handles text in the Unicode UTF-8 character set. If you need to convert between the Big5 and GB character sets, then please look at Text::Iconv, or use the HanConvert Perl script which comes with this module.
simple takes a string, converts any traditional Chinese characters (such as 國, unicode U+570B, meaning "country") to the corresponding simplified characters (like 国, unicode U+56FD, also meaning "country"), and returns the result. Characters which are not traditional Chinese do not change.
trad does the reverse; it converts any simplified Chinese characters to the corresponding traditional characters. Characters which are not simplified Chinese do not change.
If a simplified character has two or more corresponding traditional characters, then it will be replaced by all of them, enclosed in square brackets. To use different characters instead of the square brackets, give them as the second and third arguments to trad. The same applies where a traditional character has two or more corresponding simplified forms, but this happens much more rarely.
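The bracketing behaviour can be illustrated with a small self-contained sketch. It does not use this module's code or data: the sub sketch_trad and the two-entry table are hypothetical, and the exact layout of the candidates inside the brackets (here simply concatenated) is an assumption.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical table for illustration only; the module's real tables are
# far larger. Each simplified character maps to its traditional candidates.
my %to_trad = (
    "鱼" => ["魚"],          # one candidate: replaced directly
    "面" => ["面", "麵"],    # two candidates: all are emitted, bracketed
);

# Convert character by character. Characters without a table entry pass
# through unchanged. $open and $close default to square brackets.
sub sketch_trad {
    my ($text, $open, $close) = @_;
    $open  = "[" unless defined $open;
    $close = "]" unless defined $close;
    my $out = "";
    for my $ch (split //, $text) {
        my $cands = $to_trad{$ch};
        if    (!$cands)      { $out .= $ch }
        elsif (@$cands == 1) { $out .= $cands->[0] }
        else                 { $out .= $open . join("", @$cands) . $close }
    }
    return $out;
}

binmode STDOUT, ":encoding(UTF-8)";
print sketch_trad("鱼面"), "\n";            # prints 魚[面麵]
print sketch_trad("鱼面", "<", ">"), "\n";  # prints 魚<面麵>
```

Passing the delimiters as the second and third arguments mirrors the calling convention described above for trad.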
There may be mistakes in the transliterations. A number of data sources were used to build the transliteration tables, including dictionaries and the Unicode consortium's Unihan database, but some mappings may be incorrect or missing.
Some characters which are simplified forms are also traditional forms. For example, 面, unicode U+9762, is the simplified form of 麵, unicode U+9EB5, meaning "noodles"; but it is also the character for "face" in both traditional and simplified writing. Most character mapping lists say that simplified 面 (U+9762) can correspond to traditional 麵 (U+9EB5), but do not mention that simplified 面 (U+9762) can map to traditional 面 (U+9762); common sense makes this obvious to a human who comes across this character in a text, but not to a computer program. To provide this module with that extra information, it has been assumed that any simplified form which appears in the Big5 character set is also a traditional form. In some cases, this assumption may be incorrect.
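One way to apply the Big5 assumption is sketched below. This is not how the module's tables were actually built (the source does not say): the subs in_big5 and trad_candidates are hypothetical, and the "big5-eten" table from the core Encode module stands in for "the Big5 character set".

```perl
use strict;
use warnings;
use utf8;
use Encode ();

# True if $char can be encoded with Encode's "big5-eten" table.
sub in_big5 {
    my ($char) = @_;
    my $copy = $char;   # encode() may modify its input when a CHECK is given
    return eval {
        Encode::encode("big5-eten", $copy, Encode::FB_CROAK());
        1;
    } ? 1 : 0;
}

# Hypothetical helper: given a simplified character and the traditional
# forms listed for it in some mapping table, add the character itself as a
# candidate when it also exists in Big5.
sub trad_candidates {
    my ($simp, @listed) = @_;
    my @cands = @listed;
    unshift @cands, $simp
        if in_big5($simp) && !grep { $_ eq $simp } @listed;
    return @cands;
}

binmode STDOUT, ":encoding(UTF-8)";
print join(" ", trad_candidates("面", "麵")), "\n";
```

With this rule, 面 gains itself as a traditional candidate alongside 麵, which is the extra information the paragraph above describes.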
The transliteration mappings could be improved. Ideally, I'd like to see the module performing intelligent transliteration of ambiguous characters based on context, if suitable data sources were available. See http://www.basistech.com/articles/C2C.html for a discussion of transliteration issues.
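As a rough illustration of what context-based transliteration could look like, the sketch below consults a phrase table before falling back to single characters. The module does not do this: the sub sketch_trad_in_context and both tables are hypothetical, and real use would need far larger data.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical data, for illustration only.
my %phrase_to_trad = ( "面条" => "麵條" );    # "noodles"
my %char_to_trad   = (
    "面" => "面",    # default reading: "face"
    "条" => "條",
    "见" => "見",
);

# Longer phrases are tried first so that they win over shorter ones.
my @phrases = sort { length($b) <=> length($a) } keys %phrase_to_trad;

sub sketch_trad_in_context {
    my ($text) = @_;
    my $out = "";
    my $i   = 0;
    CHAR: while ($i < length $text) {
        # Prefer a known phrase starting at this position.
        for my $phrase (@phrases) {
            if (substr($text, $i, length $phrase) eq $phrase) {
                $out .= $phrase_to_trad{$phrase};
                $i   += length $phrase;
                next CHAR;
            }
        }
        # Otherwise convert a single character, or pass it through.
        my $ch = substr($text, $i, 1);
        $out .= exists $char_to_trad{$ch} ? $char_to_trad{$ch} : $ch;
        $i++;
    }
    return $out;
}

binmode STDOUT, ":encoding(UTF-8)";
print sketch_trad_in_context("吃面条"), "\n";  # prints 吃麵條
print sketch_trad_in_context("见面"), "\n";    # prints 見面
```

Here 面 becomes 麵 only inside the phrase 面条, and stays 面 elsewhere, so no brackets are needed for the cases the phrase table covers.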
Some differences in styles of Chinese writing are not related to simplified characters. For instance, the mainland Chinese word for "computer" differs from the word used in Taiwan. Colloquial Cantonese writing is different from Mandarin writing, and everyday Cantonese text such as "佢係唔係我㗎" ("is it mine?") contains characters and phrases which may be unfamiliar to a Mandarin-speaking reader. These issues are beyond the scope of this module; analogously, a module which converted American English spelling into British English spelling would not change the word "gasoline" into the word "petrol".
The characters in this documentation may not display correctly unless the program you are reading it with is unicode-aware.
If you just want to convert some text, you might want to use trad2simp and simp2trad, the Perl scripts which come with this module.
Much of the data used by this module is taken from the Unicode consortium's Unihan database, available from ftp://ftp.unicode.org. Thanks to them for compiling the data and making it freely available.
David Chan <email@example.com>
Copyright (C) 2001, David Chan. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.