NAME

Unicode::Diacritic::Strip - strip diacritics from Unicode text

SYNOPSIS

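use utf8;
use Unicode::Diacritic::Strip ':all';
# Encode output as UTF-8 so that print does not warn about wide characters.
binmode STDOUT, ':encoding(UTF-8)';
my $in = 'àÀâÂäçéÉèÈêÊëîïôùÙûüÜがぎぐげご';
print strip_diacritics ($in), "\n";
print fast_strip ($in), "\n";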

produces output

aAaAaceEeEeEeiiouUuuUかきくけこ
aAaAaceEeEeEeiiouUuuUがぎぐげご

(This example is included as synopsis.pl in the distribution.)

VERSION

This documents Unicode::Diacritic::Strip version 0.14 corresponding to git commit b7ac4488df75b33bfbf0ace7b8eb2b81b2bf52a8 released on Wed Dec 7 12:16:07 2022 +0900.

DESCRIPTION

This module offers two ways to remove diacritics from Unicode text. One of them, "strip_diacritics", uses the Unicode decompositions to break the characters down. The other one, "fast_strip", is a faster alternative based on a hash of alphabetical characters with and without diacritics. There is also "strip_alphabet", which works in the same way as "strip_diacritics", but also returns a hash reference mapping each changed character to its replacement.

FUNCTIONS

strip_diacritics

my $stripped = strip_diacritics ($text);

Strip diacritics from $text. The diacritics are as defined by the Unicode Character Database. See Unicode::UCD.
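
As a rough illustration of what decomposition-based stripping means, here is a minimal sketch using only the core module Unicode::UCD. It is not the code of "strip_diacritics". The names sketch_base_char and sketch_strip are hypothetical, and leaving compatibility decompositions untouched is an assumption made for this sketch.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD 'charinfo';

# Hypothetical helper, not part of Unicode::Diacritic::Strip: follow the
# canonical decompositions in the Unicode Character Database until a
# character with no further decomposition is reached.
sub sketch_base_char {
    my ($char) = @_;
    my $info = charinfo (ord $char);
    # Unassigned code points and characters without a decomposition
    # are returned unchanged.
    return $char unless $info && $info->{decomposition};
    # A canonical decomposition looks like "0061 0300".
    my @parts = split ' ', $info->{decomposition};
    # Compatibility decompositions start with a tag such as "<compat>".
    # Assumption of this sketch: leave those characters alone.
    return $char if $parts[0] =~ /^</;
    # Keep only the first code point (the base) and decompose it again.
    return sketch_base_char (chr hex $parts[0]);
}

# Hypothetical helper: apply sketch_base_char to each character in turn.
sub sketch_strip {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        $out .= sketch_base_char ($char);
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print sketch_strip ('àÉがü'), "\n";   # prints aEかu
```

Looking up every character with charinfo is slow, so a sketch like this is only suitable for short strings.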

strip_alphabet

my ($stripped, $swaps) = strip_alphabet ($text);

Strip diacritics from $text in the same way as "strip_diacritics", and also return the alphabet of diacritic to non-diacritic characters as a hash reference.

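use utf8;
use Unicode::Diacritic::Strip 'strip_alphabet';
# Encode output as UTF-8 so that print does not warn about wide characters.
binmode STDOUT, ':encoding(UTF-8)';
my $stuff = '89. ročník udílení Oscarů';
my ($out, $list) = strip_alphabet ($stuff);
# Hash order is not fixed, so these lines may be printed in any order.
for my $k (keys %$list) {
    print "$k was converted to $list->{$k}\n";
}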

produces output

č was converted to c
ů was converted to u
í was converted to i

(This example is included as strip-alphabet.pl in the distribution.)

This was added to the module in version 0.08. Prior to that it was in another module called Unicode::StripDiacritics which I wrote as a duplicate of this module, but fortunately hadn't released to CPAN.

fast_strip

my $stripped = fast_strip ($text);

Rapidly strip alphabetical Unicode characters to the nearest plain ASCII equivalents. This is just a big list of characters and a substitution statement which zaps them into ASCII. It also contains a few other things like the thorn character and a ligature.

use utf8;
use Unicode::Diacritic::Strip 'fast_strip';
# Encode output as UTF-8 so that print does not warn about wide characters.
binmode STDOUT, ':encoding(UTF-8)';
my $unicode = 'Bjørn Łódź';
print fast_strip ($unicode), "\n";

produces output

Bjorn Lodz

(This example is included as ask.pl in the distribution.)

This was added to the module in version 0.07. It has been in service for several years at the following website: http://www.sljfaq.org/cgi/e2k.cgi for converting the user's inputs into the closest English equivalent. It was changed from a tr to a substitution in version 0.12.
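
The general idea of a lookup table plus a substitution can be sketched as follows. This is an illustration only, not the module's code. The table %plain is a tiny hypothetical sample, the name sketch_fast_strip is made up, and the replacements chosen for thorn and the ligature are assumptions.

```perl
use strict;
use warnings;
use utf8;

# A deliberately tiny, hypothetical table. The table inside
# Unicode::Diacritic::Strip is much larger and is not reproduced here.
my %plain = (
    'ø' => 'o',  'Ø' => 'O',
    'ł' => 'l',  'Ł' => 'L',
    'ó' => 'o',  'ź' => 'z',
    # Assumed replacements for thorn and a ligature.
    'þ' => 'th', 'Þ' => 'Th',
    'æ' => 'ae',
);

# Build one character class from the keys. All the keys are letters,
# so nothing in them needs escaping inside the class.
my $class = join '', keys %plain;
my $re = qr/([$class])/;

# Hypothetical helper: replace every listed character with its entry.
sub sketch_fast_strip {
    my ($text) = @_;
    $text =~ s/$re/$plain{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print sketch_fast_strip ('Bjørn Łódź'), "\n";   # prints Bjorn Lodz
```

A substitution can replace one character with several, as in the thorn entry of this sample table, which a tr statement cannot do.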

DEPENDENCIES

Unicode::UCD

EXPORTS

Nothing is exported by default. The functions "strip_diacritics", "strip_alphabet", and "fast_strip" are exported on demand. A tag :all exports all the functions from the module.

BUGS

Test failures on Perl 5.14

The test failures reported by CPAN testers for version 0.08 of this module on Perl version 5.14, containing the error string

perl: hv.c:2663: S_unshare_hek_or_pvn: Assertion `he->shared_he_he.hent_hek == hek' failed.

are due to a bug in that version of Perl, and are completely beyond my control. Unicode::Diacritic::Strip is a pure Perl module with no XS components. (I have tried to contact the tester responsible for these reports with no success, due to the registered email address bouncing.)

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.