NAME
Convert::Translit, transliterate, build_substitutes - Perl module for string conversion among numerous character sets
SYNOPSIS
use Convert::Translit;
$translator = new Convert::Translit($result_chset);
$translator = new Convert::Translit($orig_chset, $result_chset);
$translator = new Convert::Translit($orig_chset, $result_chset, $verbose);
$result_st = $translator->transliterate($orig_st);
$result_st = Convert::Translit::transliterate($orig_st);
build_substitutes Convert::Translit();
Convert::Translit::build_substitutes();
DESCRIPTION
This module converts strings among 8-bit character sets defined by IETF RFC 1345 (about 128 sets). The RFC document is included so you can look up character set names and aliases; the module also reads it when composing conversion maps. Functions and constructors that fail return undef.
Export_OK Functions:
- transliterate()
-
returns the $result_chset equivalent of an argument string in $orig_chset, transliterating with the map composed by new().
- build_substitutes()
-
rebuilds the file "substitutes", which contains character definitions and the approximate substitutions used when a character in $orig_chset isn't defined in $result_chset. For example, "Latin capital A" may be substituted for "Latin capital A with ogonek". Rebuilding this file takes a long time, but you should never need to do it. Its only source of information is the file "rfc1345".
Object methods:
- new()
-
creates a new object for converting from $orig_chset to $result_chset, these being names (or aliases) of 8-bit character sets defined in RFC 1345. If only one argument is given, $orig_chset is assumed to be "ascii". If three arguments are given, the third is a verbosity flag. Verbose output lists approximate substitutions and other compromises.
- transliterate()
-
is the same as the function of that name.
- build_substitutes()
-
is the same as the function of that name.
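Because failing constructors and functions return undef, callers may want to check each step. The following is a minimal sketch of that pattern, not part of the module: the helper name latin2_to_ascii is hypothetical, and the module is loaded inside eval so the script still runs where Convert::Translit is not installed.

```perl
use strict;
use warnings;

# Try to load the module; $loaded is false if it is not installed.
my $loaded = eval { require Convert::Translit; 1 };

# Hypothetical helper (not provided by the module): returns the converted
# string, or undef if the module or the conversion map is unavailable.
sub latin2_to_ascii {
    my ($text) = @_;
    return undef unless $loaded;

    # new() returns undef on failure, e.g. for an unknown character set name.
    my $translator = Convert::Translit->new("Latin2", "ascii");
    return undef unless defined $translator;

    return $translator->transliterate($text);
}

my $result = latin2_to_ascii("Dzien dobry");
print defined $result ? "converted: $result\n" : "conversion unavailable\n";
```

Checking the object before calling transliterate() avoids calling a method on undef, which would otherwise abort the script.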
FILES
Convert/Translit/rfc1345 (IETF RFC 1345, June 1992)
Convert/Translit/substitutes
METHODOLOGY
Only one-to-one character mapping is done, so characters with diacritics (like A-ogonek) are never converted to (letter character, diacritic character) pairs; instead they are simplified. If no approximate substitute is available, then an unrelated substitute is chosen, preferably one with the same code value. Undefined $orig_chset characters are translated to a chosen indicator character. Transliteration is not guaranteed to be reversible when substitutions were required. An $orig_chset defined as 7-bit is assumed to be repeated to make an 8-bit set (in the style of "extended ascii"); no such adjustment is made for $result_chset. The few mistakes in the RFC document are corrected in the module.
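The mapping rules above can be illustrated with a small, self-contained sketch. It is not the module's actual code: the three tables are toy data invented for the example (the real module derives its tables from the "rfc1345" and "substitutes" files), the '?' indicator character is an assumption, and the unrelated substitute is simplified to always reuse the original code value.

```perl
use strict;
use warnings;

# Toy tables, assumed for illustration only.
my %orig_chset = (      # code value => character name in the original set
    0x41 => 'LATIN CAPITAL LETTER A',
    0xA1 => 'LATIN CAPITAL LETTER A WITH OGONEK',
    0xA3 => 'LATIN CAPITAL LETTER L WITH STROKE',
    0xB0 => 'DEGREE SIGN',
);
my %result_chset = (    # character name => code value in the result set
    'LATIN CAPITAL LETTER A' => 0x41,
    'LATIN CAPITAL LETTER L' => 0x4C,
);
my %substitute = (      # character name => approximate substitute name
    'LATIN CAPITAL LETTER A WITH OGONEK' => 'LATIN CAPITAL LETTER A',
    'LATIN CAPITAL LETTER L WITH STROKE' => 'LATIN CAPITAL LETTER L',
);
my $INDICATOR = ord '?';    # assumed marker for undefined source characters

# Build a 256-entry map: each source code value gets exactly one result code.
sub build_map {
    my @map;
    for my $code (0 .. 255) {
        my $name = $orig_chset{$code};
        if (!defined $name) {               # undefined in the original set
            $map[$code] = $INDICATOR;
            next;
        }
        my %seen;                           # guards against substitution cycles
        while (!exists $result_chset{$name}
            && exists $substitute{$name}
            && !$seen{$name}++) {
            $name = $substitute{$name};     # simplify, e.g. drop the ogonek
        }
        # No usable substitute: fall back to the same code value.
        $map[$code] = exists $result_chset{$name} ? $result_chset{$name} : $code;
    }
    return \@map;
}

# Apply the map one character at a time, so the length never changes.
sub translit {
    my ($map, $text) = @_;
    return join '', map { chr $map->[ord $_] } split //, $text;
}

my $map = build_map();
my $out = translit($map, "\x41\xA1\xA3\xB0\x7A");
```

In this sketch both 0x41 and 0xA1 end up as "A", so converting back cannot restore the ogonek; this is the kind of information loss that makes a round trip unreliable once substitutions have been made.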
EXAMPLES
Convert Russian language text from IBM to ASCII encoding:
$xxx = new Convert::Translit("EBCDIC-Cyrillic", "Cyrillic");
$ascii_cyr_st = $xxx->transliterate($ibm_cyr_st);
Convert from plain ASCII (default $orig_chset) to Latin2 (Central European):
$yyy = new Convert::Translit("Latin2");
$cnt_eur_st = $yyy->transliterate($ascii_st);
Since plain ASCII is a subset of Latin2, nothing is lost in transliteration.
But going the other direction requires numerous simplifications:
$zzz = new Convert::Translit("Latin2", "ascii");
$ascii_st = $zzz->transliterate($cnt_eur_st);
Back to Latin2 again, although the earlier substitutions probably mean that ($again ne $cnt_eur_st):
$again = $yyy->transliterate($ascii_st);
The example.pl script converts a Polish language phrase from Latin2 to EBCDIC-US.
PORTABILITY
Requires Perl version 5. Developed with MacPerl on Macintosh 68040 OS 7.6.1. Tested on Sun Unix 4.1.3.
AUTHOR
Genji Schmeder <genji@community.net>
Enjoy in good health.
Cieszcie sie dobrym zdrowiem.
Que gozen con salud.
Benutze es heilsam gern!
Genki dewa, yorokobi nasai.
COPYRIGHT
Version 1.03 dated 5 November 1997. Copyright (c) 1997 Genji Schmeder. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
ACKNOWLEDGEMENTS
Chris Leach, author of EBCDIC.pm
Keld Simonsen, author of RFC 1345