package Unicruft;
use 5.008004;
use strict;
use warnings;
use Carp;
use AutoLoader;
use Exporter;
#use Encode; ##-- slower than pack/unpack!
our @ISA = qw(Exporter);
our $VERSION = '0.06';
require XSLoader;
XSLoader::load('Unicruft', $VERSION);
##======================================================================
## Exports
##======================================================================
our (%EXPORT_TAGS, @EXPORT_OK, @EXPORT);
BEGIN {
%EXPORT_TAGS =
(
std => [qw(latin1_to_utf8 utf8_to_ascii utf8_to_latin1 utf8_to_latin1_de utf8_to_utf8_de)],
guts => [qw(ux_latin1_to_utf8 ux_utf8_to_ascii ux_utf8_to_latin1 ux_utf8_to_latin1_de),
qw(ux_latin1_bytes ux_utf8_bytes),
],
);
$EXPORT_TAGS{all} = [@{$EXPORT_TAGS{std}}, @{$EXPORT_TAGS{guts}}];
@EXPORT_OK = @{$EXPORT_TAGS{all}};
@EXPORT = qw();
}
##======================================================================
## Constants
##======================================================================
##======================================================================
## Utils
##======================================================================
## $u8bytes = ux_utf8_bytes($str)
## + returns UTF-8 byte-string encoded version of $str; respects perl UTF-8 flag
sub ux_utf8_bytes {
return utf8::is_utf8($_[0]) ? pack('C0C*',unpack('U0C*',$_[0])) : $_[0];
}
## $l1bytes = ux_latin1_bytes($str)
## + returns Latin-1 byte-string encoded version of $str; respects perl UTF-8 flag
sub ux_latin1_bytes {
return utf8::is_utf8($_[0]) ? pack('C0C*',unpack('U0U*',$_[0])) : $_[0];
}
##======================================================================
## Wrappers
##======================================================================
## $u8str = latin1_to_utf8($l1str)
sub latin1_to_utf8 {
ux_latin1_to_utf8(ux_latin1_bytes($_[0]));
}
## $astr = utf8_to_ascii($u8str)
sub utf8_to_ascii {
ux_utf8_to_ascii(ux_utf8_bytes($_[0]));
}
## $l1str = utf8_to_latin1($u8str)
sub utf8_to_latin1 {
ux_utf8_to_latin1(ux_utf8_bytes($_[0]));
}
## $destr = utf8_to_latin1_de($u8str)
sub utf8_to_latin1_de {
ux_utf8_to_latin1_de(ux_utf8_bytes($_[0]));
}
## $destr = utf8_to_utf8_de($u8str)
sub utf8_to_utf8_de {
utf8::upgrade(my $s = ux_utf8_to_latin1_de(ux_utf8_bytes($_[0])));
return $s;
}
##======================================================================
## Exports: finish
##======================================================================
1;
__END__
=head1 NAME
Unicruft - Perl interface to the unicruft transliteration library
=head1 SYNOPSIS
use Unicruft;
$libversion = Unicruft::library_version();
$u8str = Unicruft::latin1_to_utf8($l1str);
$astr = Unicruft::utf8_to_ascii($u8str);
$l1str = Unicruft::utf8_to_latin1($u8str);
$l1str = Unicruft::utf8_to_latin1_de($u8str);
$u8str = Unicruft::utf8_to_utf8_de($u8str);
=head1 DESCRIPTION
The perl Unicruft package provides a perl interface to the
libunicruft library, which is itself derived in part from
the Text::Unidecode perl module.
=head2 EXPORTS
Nothing is exported by default, but
the Unicruft module supports the following export tags:
=over 4
=item :std
Standard conversion functions (those without a "ux_" prefix)
=item :guts
Low-level conversion and utility functions (those with a "ux_" prefix).
=item :all
All functions exported by :std and :guts.
=back
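As an illustration of the tags above, the following minimal sketch imports the :std functions and calls two of them by their short names. The sample string is an arbitrary choice made for this example, and no particular transliteration output is claimed here; only the UTF-8 flag behavior stated in the function descriptions is shown.

```perl
use strict;
use warnings;
use Unicruft qw(:std);   # imports only the wrappers without a "ux_" prefix

# Sample character string (an assumption for this sketch). U+263A lies above
# 0xFF, so perl stores the whole string with the UTF-8 flag set.
my $u8str = "caf\x{e9} \x{263a}";

my $ascii = utf8_to_ascii($u8str);     # 7-bit ASCII approximation
my $de    = utf8_to_utf8_de($u8str);   # German-orthography approximation

# Per the function descriptions, $ascii has its UTF-8 flag cleared,
# while $de is returned upgraded (flag set).
printf "ascii flag=%d, de flag=%d\n",
    (utf8::is_utf8($ascii) ? 1 : 0),
    (utf8::is_utf8($de)    ? 1 : 0);
```

Requesting :all instead of :std would additionally import the "ux_" functions.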
=head2 HIGH-LEVEL CONVERSION FUNCTIONS
=head3 library_version
Returns the version string of the unicruft C library against which
this perl module was compiled.
=head3 latin1_to_utf8
$u8str = Unicruft::latin1_to_utf8($l1str);
Converts the Latin-1 (ISO-8859-1) string $l1str to UTF-8.
This task is better accomplished either with perl's utf8::upgrade() function
or the perl Encode module; it is included here only for completeness' sake.
$l1str may be either a byte-string or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set).
The returned string $u8str will have its UTF-8 flag set.
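For comparison, the two core-perl alternatives mentioned above can be sketched as follows. This example does not use Unicruft at all; the sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

# "cafe" with e-acute stored as the single Latin-1 byte 0xE9.
my $l1bytes = "caf\xE9";

# Option 1: utf8::upgrade() works in place; the characters stay the same
# and only perl's internal representation (and the UTF-8 flag) changes.
my $upgraded = $l1bytes;
utf8::upgrade($upgraded);

# Option 2: Encode::decode() interprets the octets as ISO-8859-1.
my $decoded = Encode::decode('iso-8859-1', $l1bytes);

# Both yield the same four-character string ...
print "same text\n" if $upgraded eq $decoded;

# ... which occupies five octets once serialized as UTF-8.
my $octets = Encode::encode('UTF-8', $decoded);
printf "%d characters, %d UTF-8 octets\n", length($decoded), length($octets);
```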
=head3 utf8_to_ascii
$astr = Unicruft::utf8_to_ascii($u8str);
Approximate the UTF-8 string $u8str as 7-bit ASCII.
This is basically just a (fast) re-implementation of Text::Unidecode::unidecode($u8str).
$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence)
or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set).
The returned string $astr will have its UTF-8 flag cleared
(although this is pretty arbitrary here, since 7-bit ASCII is also valid UTF-8).
=head3 utf8_to_latin1
$l1str = Unicruft::utf8_to_latin1($u8str);
Approximate the UTF-8 string $u8str as 8-bit Latin-1 (ISO-8859-1).
$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence)
or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set).
The returned string $l1str will have its UTF-8 flag cleared.
=head3 utf8_to_latin1_de
$l1str = Unicruft::utf8_to_latin1_de($u8str);
Approximate the UTF-8 string $u8str as 8-bit Latin-1 (ISO-8859-1) using only
characters which occur in contemporary German orthography.
$u8str may be either a byte-string (assumed to contain a valid UTF-8 byte sequence)
or a perl-native UTF-8 string (i.e. a scalar with the SvUTF8 flag set).
The returned string $l1str will have its UTF-8 flag cleared.
=head3 utf8_to_utf8_de
$u8str = Unicruft::utf8_to_utf8_de($u8str);
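Approximate the UTF-8 string $u8str using only
characters which occur in contemporary German orthography.
The result contains only characters in the Latin-1 range, but is returned
as a perl-native UTF-8 string (i.e. with its UTF-8 flag set).
This is just a wrapper for: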
utf8::upgrade(my $s = Unicruft::utf8_to_latin1_de($u8str));
return $s;
=head2 LOW-LEVEL UTILITY FUNCTIONS
The following functions are available, but not expected to be
of much use to the casual user.
=head3 ux_latin1_bytes
$bytes = ux_latin1_bytes($string);
Returns a Latin-1-encoded byte string representing its argument.
Respects perl UTF-8 flag.
=head3 ux_utf8_bytes
$bytes = ux_utf8_bytes($string);
Returns a UTF-8-encoded byte string representing its argument.
Respects perl UTF-8 flag.
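To make "respects perl UTF-8 flag" concrete, here is a pure-perl sketch of the same contract. The helpers sketch_utf8_bytes() and sketch_latin1_bytes() are hypothetical names invented for this example; they are not part of Unicruft and use utf8::encode() and utf8::downgrade() rather than the module's pack/unpack implementation. In particular, the Latin-1 sketch dies on characters above 0xFF, which may differ from what the module does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicruft): UTF-8 octets for $str.
sub sketch_utf8_bytes {
    my ($str) = @_;
    return $str unless utf8::is_utf8($str);  # unflagged: assumed to be bytes already
    utf8::encode($str);                      # flagged: characters -> UTF-8 octets
    return $str;
}

# Hypothetical helper (not part of Unicruft): Latin-1 octets for $str.
sub sketch_latin1_bytes {
    my ($str) = @_;
    return $str unless utf8::is_utf8($str);  # unflagged: assumed to be bytes already
    utf8::downgrade($str);                   # dies if a character exceeds 0xFF
    return $str;
}

# One character, U+00E9, stored with the UTF-8 flag set.
my $flagged = "\xE9";
utf8::upgrade($flagged);

printf "utf8 octets:   %d\n", length(sketch_utf8_bytes($flagged));    # 2 (0xC3 0xA9)
printf "latin1 octets: %d\n", length(sketch_latin1_bytes($flagged));  # 1 (0xE9)
```

An unflagged argument is returned unchanged by both helpers, so a caller that already holds encoded bytes pays only for the flag test.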
=head2 LOW-LEVEL CONVERSION FUNCTIONS
For each conversion function C<X_to_Y>, there is an underlying
C<ux_X_to_Y> function which places stricter requirements on its
argument string (potentially downgrading it to a byte-string),
but which is slightly faster since no copying or perl-level
conditionals are required.
=head3 ux_latin1_to_utf8
Like L<latin1_to_utf8()|/latin1_to_utf8>, but requires its argument to be a Latin-1-encoded byte string.
=head3 ux_utf8_to_ascii
Like L<utf8_to_ascii()|/utf8_to_ascii>, but requires its argument to be a UTF-8-encoded byte string.
=head3 ux_utf8_to_latin1
Like L<utf8_to_latin1()|/utf8_to_latin1>, but requires its argument to be a UTF-8-encoded byte string.
=head3 ux_utf8_to_latin1_de
Like L<utf8_to_latin1_de()|/utf8_to_latin1_de>, but requires its argument to be a UTF-8-encoded byte string.
=head1 SEE ALSO
Text::Unidecode(3pm),
unicruft(1),
perl(1).
=head1 AUTHOR
Bryan Jurish E<lt>moocow@cpan.orgE<gt>
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2009-2013 by Bryan Jurish
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.14.2 or,
at your option, any later version of Perl 5 you may have available.
=cut