NAME

Unicruft - Perl interface to the unicruft transliteration library

SYNOPSIS

 use Unicruft;
 
 $libversion = Unicruft::library_version();
 
 $u8str = Unicruft::latin1_to_utf8($l1str);
 $astr  = Unicruft::utf8_to_ascii($u8str);
 $l1str = Unicruft::utf8_to_latin1($u8str);
 $l1str = Unicruft::utf8_to_latin1_de($u8str);
 $u8str = Unicruft::utf8_to_utf8_de($u8str);

DESCRIPTION

The perl Unicruft package provides a perl interface to the libunicruft library, which is itself derived in part from the Text::Unidecode perl module.

EXPORTS

Nothing is exported by default, but the Unicruft module supports the following export tags:

:std

Standard conversion functions (those without a "ux_" prefix).

:guts

Low-level conversion functions (those with a "ux_" prefix).

:all

All conversion functions exported by :std and :guts.
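
As a rough illustration of the export tags, the sketch below imports the :std functions at run time. The guarded eval is an addition for this example only (it lets the script run where Unicruft is not installed); in ordinary code a plain "use Unicruft qw(:std);" would be the usual form. The sample input string is made up.

```perl
use strict;
use warnings;

# Load Unicruft at run time so this sketch also runs where the module
# is not installed. Equivalent compile-time form: use Unicruft qw(:std);
my $have_unicruft = eval {
    require Unicruft;
    Unicruft->import(':std');
    1;
} ? 1 : 0;

if ($have_unicruft) {
    # With :std imported, the high-level functions can be called unqualified.
    # The argument is the UTF-8 byte sequence for "Gruesse" written with
    # u-umlaut and sharp s.
    my $astr = utf8_to_ascii("Gr\xC3\xBC\xC3\x9Fe");
    print "$astr\n";
}
else {
    print "Unicruft is not installed; nothing imported\n";
}
```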

HIGH-LEVEL CONVERSION FUNCTIONS

library_version

Returns the version string of the unicruft C library against which this perl module was compiled.

latin1_to_utf8

 $u8str = Unicruft::latin1_to_utf8($l1str);

Converts the Latin-1 (ISO-8859-1) string $l1str to UTF-8. This task is better accomplished either with perl's utf8::upgrade() function or the perl Encode module; it is included here only for completeness' sake.

$l1str may be either a byte-string or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set). The returned string $u8str will have its UTF-8 flag set.
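
For comparison, the two core-Perl alternatives mentioned above can be sketched as follows. This uses only utf8::upgrade() and the Encode module, not Unicruft itself; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A Latin-1 (ISO-8859-1) byte string: 5 bytes, two of them non-ASCII.
my $l1str = "Gr\xFC\xDFe";

# Alternative 1: utf8::upgrade() switches a copy of the scalar to perl's
# internal UTF-8 representation, in place, and sets the UTF-8 flag.
my $upgraded = $l1str;
utf8::upgrade($upgraded);

# Alternative 2: Encode::decode() interprets the bytes as Latin-1 and
# returns a character string.
my $decoded = decode('ISO-8859-1', $l1str);

# Serializing the character string gives the actual UTF-8 byte sequence.
my $utf8_bytes = encode('UTF-8', $decoded);

printf "characters: %d, UTF-8 bytes: %d\n", length($decoded), length($utf8_bytes);
```

Both alternatives yield the same five-character string; only the final encode() step produces a byte string whose length reflects the UTF-8 encoding.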

utf8_to_ascii

 $astr  = Unicruft::utf8_to_ascii($u8str);

Approximate the UTF-8 string $u8str as 7-bit ASCII. This is basically just a (fast) re-implementation of Text::Unidecode::unidecode($u8str).

$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence) or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set). The returned string $astr will have its UTF-8 flag cleared (although this is pretty arbitrary here, since 7-bit ASCII is also valid UTF-8).
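
The two accepted input forms can be illustrated with a short sketch. The strings are built with the core Encode module; the Unicruft call is guarded so the example also runs where the module is not installed, and no particular output is claimed for it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Form 1: a byte string holding a valid UTF-8 sequence (6 bytes).
my $bytes = "na\xC3\xAFve";

# Form 2: the same text as a perl-native character string (5 characters,
# UTF-8 flag set).
my $chars = decode('UTF-8', $bytes);

# Both forms are documented as acceptable arguments.
if (eval { require Unicruft; 1 }) {
    for my $input ($bytes, $chars) {
        my $astr = Unicruft::utf8_to_ascii($input);
        printf "input length %d -> %s\n", length($input), $astr;
    }
}
```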

utf8_to_latin1

 $l1str = Unicruft::utf8_to_latin1($u8str);

Approximate the UTF-8 string $u8str as 8-bit Latin-1 (ISO-8859-1).

$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence) or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set). The returned string $l1str will have its UTF-8 flag cleared.

utf8_to_latin1_de

 $l1str = Unicruft::utf8_to_latin1_de($u8str);

Approximate the UTF-8 string $u8str as 8-bit Latin-1 (ISO-8859-1) using only characters which occur in contemporary German orthography.

$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence) or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set). The returned string $l1str will have its UTF-8 flag cleared.

utf8_to_utf8_de

 $u8str = Unicruft::utf8_to_utf8_de($u8str);

Approximate the UTF-8 string $u8str as 8-bit-safe UTF-8 using only characters which occur in contemporary German orthography. Really just a wrapper for:

 utf8::upgrade(my $s = Unicruft::utf8_to_latin1_de($u8str));
 return $s;

LOW-LEVEL UTILITY FUNCTIONS

The following functions are available, but not expected to be of much use to the casual user.

ux_latin1_bytes

 $bytes = ux_latin1_bytes($string);

Returns a Latin-1 encoded byte string representing its argument. Respects perl's UTF-8 flag.

ux_utf8_bytes

 $bytes = ux_utf8_bytes($string);

Returns a UTF-8 encoded byte string representing its argument. Respects perl's UTF-8 flag.
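
To make the difference between the two byte encodings concrete, here is a core-Perl analogy using Encode::encode(). It is not the module's implementation, and whether it matches the ux_*_bytes functions in every case is an assumption; it only shows what Latin-1 and UTF-8 byte strings look like for the same character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).
my $string = "\x{E4}";

# Latin-1 represents it as a single byte, UTF-8 as two bytes.
my $l1_bytes = encode('ISO-8859-1', $string);
my $u8_bytes = encode('UTF-8',      $string);

# Show each byte string as hexadecimal digits.
printf "latin1: %s\n", unpack('H*', $l1_bytes);
printf "utf8:   %s\n", unpack('H*', $u8_bytes);
```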

LOW-LEVEL CONVERSION FUNCTIONS

For each conversion function X_to_Y, there is an underlying ux_X_to_Y function which places stricter requirements on its argument string (potentially downgrading it to a byte-string), but which is slightly faster since no copying or perl-level conditionals are required.

ux_latin1_to_utf8

Like latin1_to_utf8(), but requires its argument to be a Latin-1-encoded byte string.

ux_utf8_to_ascii

Like utf8_to_ascii(), but requires its argument to be a UTF-8-encoded byte string.

ux_utf8_to_latin1

Like utf8_to_latin1(), but requires its argument to be a UTF-8-encoded byte string.

ux_utf8_to_latin1_de

Like utf8_to_latin1_de(), but requires its argument to be a UTF-8-encoded byte string.
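
A minimal sketch of preparing an argument for one of the low-level functions follows. It assumes that ux_utf8_to_ascii() can be called by its fully qualified name (otherwise import it via the :guts tag), and it passes a copy because the low-level functions may alter their argument. The Unicruft call is guarded so the sketch runs without the module.

```perl
use strict;
use warnings;

# A perl-native character string (5 characters).
my $chars = "\x{E4}rger";

# The ux_ functions want a UTF-8 encoded byte string, so encode a copy
# in place with the core utf8::encode() and leave $chars untouched.
my $u8bytes = $chars;
utf8::encode($u8bytes);

if (eval { require Unicruft; 1 }) {
    # Assumption: the function is reachable under the Unicruft:: namespace.
    my $astr = Unicruft::ux_utf8_to_ascii($u8bytes);
    print "$astr\n";
}
```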

SEE ALSO

Text::Unidecode(3pm), unicruft(1), perl(1).

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2013 by Bryan Jurish

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.