Text::Transliterator::Unaccent - Compile a transliterator from Unicode tables, to remove accents from text
my $unaccenter = Text::Transliterator::Unaccent->new(script => 'Latin', wide => 0, upper => 0);
$unaccenter->($string);
my $map = Text::Transliterator::Unaccent->char_map(script => 'Latin');
my $descr = Text::Transliterator::Unaccent->char_map_descr();
This package compiles a transliteration function that replaces accented characters with unaccented characters. That function is fast, because it uses Perl's built-in tr/.../.../ operator; it is compact, because it only handles the Unicode subset that you need for your language; and it is complete, because it relies on the built-in Unicode character tables shipped with your Perl installation.
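As a rough illustration of what "compiling" a transliterator can look like, the sketch below builds a closure around tr/.../.../ from a small hand-written character map, using a string eval. The helper name compile_tr and the sample map are hypothetical, and this is not necessarily how the module itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): compile a closure around
# tr/.../.../ from a hash mapping single characters to their replacements.
sub compile_tr {
    my (%map) = @_;
    my @chars = sort keys %map;

    # Spell every character as a \x{...} escape so that the generated source
    # is plain ASCII and no character is mistaken for a tr metacharacter.
    my $from = join '', map { sprintf '\x{%04X}', ord $_       } @chars;
    my $to   = join '', map { sprintf '\x{%04X}', ord $map{$_} } @chars;

    # tr/// builds its translation table at compile time, hence the string eval.
    my $source = "sub { my \$count = 0; \$count += tr/$from/$to/ for \@_; \$count }";
    my $sub    = eval $source or die "cannot compile transliterator: $@";
    return $sub;
}

my $unaccent = compile_tr("\x{E9}" => 'e', "\x{E0}" => 'a', "\x{E7}" => 'c');
my $text     = "d\x{E9}j\x{E0} re\x{E7}u";
my $changed  = $unaccent->($text);    # modifies $text in place
print "$text ($changed characters replaced)\n";
```

In this sketch the closure returns the total number of replaced characters across all arguments; that is a choice made for the example only.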
The algorithm for detecting accented characters is derived from the notion of composition in Unicode; that notion is explained in perluniintro. Characters considered "accented" are the precomposed characters whose Unicode canonical decomposition contains more than one codepoint; for such decompositions, the first codepoint is the unaccented character to which the accented character will be mapped. This definition seems to work well for the Latin script; I presume that it also makes sense for other scripts, but I'm not able to test.
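The detection rule can be tried out with the core Unicode::Normalize module. The following is only a sketch of the idea, assuming that NFD is an acceptable stand-in for "canonical decomposition"; the helper name base_char is hypothetical and the module's own code may differ.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: return the base (first) character of the canonical
# decomposition, or undef when the decomposition is a single code point.
sub base_char {
    my ($char) = @_;
    my $decomposed = NFD($char);
    return length($decomposed) > 1 ? substr($decomposed, 0, 1) : undef;
}

# e-acute, C-cedilla, plain e, sharp s
for my $char ("\x{E9}", "\x{C7}", "e", "\x{DF}") {
    my $base = base_char($char);
    printf "U+%04X => %s\n", ord($char),
        defined $base ? sprintf('U+%04X', ord $base) : '(not accented)';
}
```

Here U+00E9 and U+00C7 map to U+0065 and U+0043, while plain "e" and the sharp s (U+00DF) have no multi-codepoint decomposition and are therefore not treated as accented.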
my $unaccenter = Text::Transliterator::Unaccent->new(@range_description);
# or
my $unaccenter = Text::Transliterator::Unaccent->new(); # script => 'Latin'
Compiles a new 'unaccenter' function. The @range_description argument specifies which ranges of characters will be handled, and consists of key-value pairs of the following forms:
script => $unicode_script
$unicode_script is the name of a Unicode script, such as 'Latin', 'Greek' or 'Cyrillic'. For a complete list of Unicode scripts, run
perl -MUnicode::UCD=charscripts -e "print join ', ', keys %{charscripts()}"
block => $unicode_block
$unicode_block is the name of a Unicode block. For a complete list of Unicode blocks, run
perl -MUnicode::UCD=charblocks -e "print join ', ', keys %{charblocks()}"
range => \@codepoint_ranges
@codepoint_ranges is a list of arrayrefs that contain start-of-range, end-of-range code point pairs.
wide => $bool
Determines whether wide characters (i.e. characters with code points above 255) are kept in the map. The default is true.
upper => $bool
Determines whether uppercase characters are kept in the map. The default is true.
lower => $bool
Determines whether lowercase characters are kept in the map. The default is true.
The @range_description may contain a list of several scripts, blocks and/or ranges; all will get concatenated into a single correspondence map. If the list is empty, the default range is script => 'Latin'.
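The options above can be combined in a single call. The following is a minimal sketch, assuming the module is installed; the variable names and sample strings are illustrative only, and the block name is one of those reported by charblocks().

```perl
use strict;
use warnings;
use Text::Transliterator::Unaccent;

# Latin script only, restricted to lowercase characters with code points
# up to 255 (wide and uppercase characters are left out of the map).
my $narrow_lower = Text::Transliterator::Unaccent->new(
    script => 'Latin',
    wide   => 0,
    upper  => 0,
);

# Several descriptions combined into one map: a whole script plus one block.
my $greek_latin1 = Text::Transliterator::Unaccent->new(
    script => 'Greek',
    block  => 'Latin-1 Supplement',
);

my $french = "\x{C9}t\x{E9}";          # E-acute, t, e-acute
$narrow_lower->($french);              # only the lowercase e-acute is in the map

my $greek = "\x{3AC}";                 # Greek small alpha with tonos
$greek_latin1->($greek);               # becomes plain alpha (U+03B1)
```

Leaving uppercase or wide characters out of the map keeps the compiled function smaller, at the price of leaving those characters untouched in the input.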
The return value of the new method is actually a reference to a function, not an object. That function is called as
$unaccenter->(@strings);
and modifies every member of @strings in place, like the tr/.../.../ operator. The return value is the number of transliterated characters in the last member of @strings.
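Because the function works in place, callers who need to keep the original text have to copy it first. The following sketch, with made-up sample data and assuming the module is installed, shows both patterns.

```perl
use strict;
use warnings;
use Text::Transliterator::Unaccent;

my $unaccenter = Text::Transliterator::Unaccent->new;   # script => 'Latin'

# Every element of the array is modified in place.
my @names = ("Fran\x{E7}ois", "\x{C5}sa", "Zo\x{EB}");
$unaccenter->(@names);
print join(', ', @names), "\n";

# Copy first when the accented original must be preserved.
my $original = "cr\x{E8}me br\x{FB}l\x{E9}e";
my $plain    = $original;
$unaccenter->($plain);
```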
my $map = Text::Transliterator::Unaccent->char_map(@range_description);
Utility class method that returns a hashref of the accented characters in @range_description, mapped to their corresponding unaccented characters, according to the algorithm described in the introduction. The @range_description format is exactly the same as for the new() method.
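The map can be inspected directly, for instance to check which characters a given range description covers. This is a small sketch assuming the module is installed; the output layout is arbitrary.

```perl
use strict;
use warnings;
use Text::Transliterator::Unaccent;

# Accented Latin characters with code points up to 255 only.
my $map = Text::Transliterator::Unaccent->char_map(script => 'Latin', wide => 0);

binmode STDOUT, ':encoding(UTF-8)';
for my $accented (sort keys %$map) {
    printf "U+%04X %s => %s\n", ord($accented), $accented, $map->{$accented};
}
```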
my $descr = Text::Transliterator::Unaccent->char_map_descr(@range_descr);
Utility class method that returns a textual description of the map generated by @range_descr.
Text::Unaccent is another unaccenter module, with a C and a Pure Perl version. It is based on iconv instead of Perl's internal Unicode tables, and therefore may produce slightly different results. According to some experimental benchmarks, the C version of Text::Unaccent is faster than Text::Transliterator::Unaccent on short strings and on a small number of calls, and slower on long strings or a high number of calls (but this may be a side effect of the fact that it returns a copy of the string instead of replacing characters in place); however, I am not able to give a general rule about which module is faster in which circumstances.
Text::StripAccents is a Pure Perl module. It only handles Latin1, and is several orders of magnitude slower because it does an internal split and join of the whole string.
Search::Tokenizer uses the present module for building an unaccent tokenizer.
Laurent Dami, <dami@cpan.org>
Please report any bugs or feature requests to bug-text-transliterator at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Transliterator. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Text::Transliterator::Unaccent
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Transliterator
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Text-Transliterator
CPAN Ratings
http://cpanratings.perl.org/d/Text-Transliterator
Search CPAN
http://search.cpan.org/dist/Text-Transliterator/
Copyright 2010, 2017 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.