MARC::Charset - A module for doing MARC-8/UTF8 translation
    use MARC::Charset;

    ## create a MARC::Charset object
    my $charset = MARC::Charset->new();

    ## a string containing the Ansel value for a copyright symbol
    my $ansel = chr(0xC3) . ' copyright 1969';

    ## the same string, but now encoded in UTF8!
    my $utf8 = $charset->to_utf8($ansel);
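To make the byte-level effect of that conversion concrete, here is a minimal sketch that does not use MARC::Charset at all. The helper name ansel_to_unicode and the three-entry table are assumptions made for illustration; the full Ansel table in the LC specification is much larger, and combining characters are not handled.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative subset of the Ansel (Extended Latin) to Unicode mapping.
# Only three spacing characters are listed here.
my %ANSEL_TO_UNICODE = (
    0xA1 => 0x0141,    # LATIN CAPITAL LETTER L WITH STROKE
    0xB1 => 0x0142,    # LATIN SMALL LETTER L WITH STROKE
    0xC3 => 0x00A9,    # COPYRIGHT SIGN
);

# Hypothetical helper, not part of MARC::Charset: map each input byte
# to a Unicode character, leaving ASCII bytes unchanged.
sub ansel_to_unicode {
    my ($bytes) = @_;
    my $out = '';
    for my $byte ( unpack 'C*', $bytes ) {
        if ( $byte < 0x80 ) {
            $out .= chr $byte;                       # ASCII passes through
        }
        elsif ( exists $ANSEL_TO_UNICODE{$byte} ) {
            $out .= chr $ANSEL_TO_UNICODE{$byte};    # known Ansel character
        }
        else {
            $out .= chr 0xFFFD;                      # not in this small table
        }
    }
    return $out;
}

my $ansel = chr(0xC3) . ' copyright 1969';
my $text  = ansel_to_unicode($ansel);    # Perl character string
my $utf8  = encode( 'UTF-8', $text );    # UTF-8 encoded bytes
```

The single Ansel byte 0xC3 becomes the two UTF-8 bytes 0xC2 0xA9, while the ASCII remainder of the string is unchanged.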
MARC::Charset is a package that allows you to easily convert between the MARC-8 character encodings and Unicode (UTF-8). The Library of Congress maintains some essential mapping tables and information about the MARC-8 and Unicode environments at:
http://www.loc.gov/marc/specifications/spechome.html
MARC::Charset is essentially a Perl implementation of the specifications found at LC, and supports the following character sets:
Latin (Basic/Extended + Greek Symbols, Subscripts and Superscripts)
Hebrew
Cyrillic (Basic + Extended)
Arabic (Basic + Extended)
Greek
East Asian Characters
Includes 13,478 "han" characters, Japanese Hiragana and Katakana (172 characters), Korean Hangul (2,028 characters), East Asian Punctuation Marks (25 characters), "Component Input Method" Characters (35 characters)
The constructor, which returns a MARC::Charset object. If you like, you can pass in the default G0 and G1 character sets using the g0 and g1 parameters; if you don't, ASCII/Ansel will be assumed.
    ## for standard character sets: ASCII and Ansel
    my $cs = MARC::Charset->new();

    ## or if you want to specify Arabic Basic + Extended as the G0/G1 character
    ## sets.
    my $cs = MARC::Charset->new(
        g0 => MARC::Charset::ArabicBasic->new(),
        g1 => MARC::Charset::ArabicExtended->new()
    );
If you would like diagnostics turned on, pass in the diagnostics parameter and set it to a value that evaluates to true (e.g. 1).
    my $cs = MARC::Charset->new( diagnostics => 1 );
Pass to_utf8() a string of MARC-8 encoded characters and get back a string of UTF-8 characters. to_utf8() handles escape sequences within the string that change the working character sets to Greek, Hebrew, Arabic (Basic + Extended) and Cyrillic (Basic + Extended), but not 32 bit East Asian (see TODO).
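As a rough illustration of what handling escape sequences involves, the following sketch splits a MARC-8 byte string into segments according to the G0 set in effect. It is not the module's implementation: split_by_g0 is a hypothetical helper, and it assumes only the ESC ( F form of designation with the final characters shown in the table. The LC specification defines further forms (G1 designations, multi-byte sets, Extended Latin) that are ignored here.

```perl
use strict;
use warnings;

# Final characters for a few single-byte sets, following the LC MARC-8
# specification. This table is deliberately incomplete.
my %SET_NAME = (
    'B' => 'Basic Latin (ASCII)',
    'S' => 'Basic Greek',
    'N' => 'Basic Cyrillic',
    'Q' => 'Extended Cyrillic',
    '2' => 'Basic Hebrew',
    '3' => 'Basic Arabic',
    '4' => 'Extended Arabic',
);

# Hypothetical helper: return a list of [set name, bytes] pairs, one per
# run of bytes that shares the same working G0 set.
sub split_by_g0 {
    my ($marc8) = @_;
    my $current = $SET_NAME{'B'};    # ASCII is assumed as the initial G0 set
    my @segments;

    # The capturing group makes split() return the final characters too:
    # (text, final, text, final, text, ...)
    my @parts = split /\x1B\(([BSNQ234])/, $marc8;

    my $text = shift @parts;
    push @segments, [ $current, $text ] if defined $text && length $text;

    while (@parts) {
        my $final = shift @parts;
        my $chunk = shift @parts;
        $current = $SET_NAME{$final};
        push @segments, [ $current, $chunk ]
            if defined $chunk && length $chunk;
    }
    return @segments;
}

# ASCII text, a switch to Basic Greek, then a switch back to ASCII
my @segments = split_by_g0("abc\x1B(Sabg\x1B(Bdef");
```

Each segment could then be looked up in the mapping table for its set, which is the step a real converter performs after tracking the escape sequences.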
Returns an object representing the character set that is being used as the first graphic character set (G0). If you pass in a MARC::Charset::* object you will set the G0 character set, and as a side effect you'll get the previous G0 value returned to you. You probably don't ever need to call this since character set changes are handled when you call to_utf8(), but it's here if you want it.
    ## set the G0 character set to Greek
    my $charset = MARC::Charset->new();
    $charset->g0( MARC::Charset::Greek->new() );
Same as g0() above, but operates on the second graphic set that is available.
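The difference between the two working sets comes down to byte ranges: in an 8-bit environment, graphic bytes 0x21-0x7E are looked up in G0 and bytes 0xA1-0xFE in G1. The sketch below shows that classification; graphic_set_for_byte is a hypothetical helper written for this example and is not a MARC::Charset method.

```perl
use strict;
use warnings;

# Hypothetical helper: report which working set a byte value belongs to.
sub graphic_set_for_byte {
    my ($byte) = @_;
    return 'G0' if $byte >= 0x21 && $byte <= 0x7E;
    return 'G1' if $byte >= 0xA1 && $byte <= 0xFE;
    return 'control or space';    # everything outside the graphic ranges
}

my $for_letter    = graphic_set_for_byte(0x41);    # ASCII 'A'
my $for_copyright = graphic_set_for_byte(0xC3);    # Ansel copyright symbol
my $for_escape    = graphic_set_for_byte(0x1B);    # ESC
```

With the default ASCII/Ansel setup, this is why a plain letter is resolved through the G0 object and the Ansel copyright byte through the G1 object.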
to_marc8()
A function for going from Unicode to MARC-8 character encodings.
Support for 32bit MARC-8 characters:
This concerns the East Asian character sets: Han, Hiragana, Katakana, Hangul and Punctuation. I'm a bit confused about whether 7/8 bit character sets can interoperate with 32 bit character sets, for example if ASCII is designated as the working G0 character set and East Asian as the working G1 character set. While I've tried to program towards supporting 32 bit character sets, I need to know exactly how they are implemented in the 'real world'. So if you have any East Asian MARC data, please email it to me!
v.01 - 2002.07.17 (ehs)