Text::Levenshtein::Flexible - XS Levenshtein distance calculation with bounds and costs
use Text::Levenshtein::Flexible;
Yet another Levenshtein module written in C, but a tad more flexible than the rest.
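This module uses code from PostgreSQL's levenshtein distance function to go beyond the plain distance calculation offered by Text::Levenshtein::XS and others: it adds distance limits (bounds), configurable costs for insertion, deletion and substitution, and list functions that return all strings within a limit.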
Nothing is exported by default.
The following functions can be exported upon request, e.g.:
use Text::Levenshtein::Flexible qw( levenshtein levenshtein_l_all );
levenshtein
levenshtein_c
levenshtein_l
levenshtein_lc
levenshtein_l_all
levenshtein_lc_all
The functions listed under "Exportable" constitute the module's procedural API. Neither the names nor the huge parameter lists are particularly pretty, so the OO interface is usually recommended.
levenshtein($src, $dst)
Plain Levenshtein distance between the two strings $src and $dst. Always returns an integer. If the strings are too long (currently there is a hard-coded limit of 255 characters), the function may die(), so call it in an eval block if that is a possibility.
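A minimal sketch of guarding a call with eval, as suggested above. It assumes the two-argument call form levenshtein($src, $dst); the sample strings are purely illustrative.

```perl
use strict;
use warnings;
use Text::Levenshtein::Flexible qw( levenshtein );

# Assumed call form: levenshtein($src, $dst).
my ($src, $dst) = ("kitten", "sitting");

# The function may die() on over-long input, so trap the exception.
my $distance = eval { levenshtein($src, $dst) };
if (defined $distance) {
    print "distance: $distance\n";
} else {
    warn "levenshtein() failed: $@";
}
```

Since a successful call always returns an integer, an undefined result here can only mean that the call died.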
levenshtein_c($src, $dst, $cost_ins, $cost_del, $cost_sub)
Distance between the two strings $src and $dst using the specified costs for insertion, deletion and substitution respectively. Always returns an integer unless it dies.
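To illustrate what per-operation costs mean, here is a pure-Perl reference sketch of a cost-weighted edit distance. It is not the module's C code: weighted_distance is a hypothetical helper, and it assumes that costs apply to the edits that turn $src into $dst.

```perl
use strict;
use warnings;

# Pure-Perl reference sketch (hypothetical helper, not the module's code).
# Assumption: costs apply to the edits that turn $src into $dst.
sub weighted_distance {
    my ($src, $dst, $cost_ins, $cost_del, $cost_sub) = @_;
    my @s = split //, $src;
    my @d = split //, $dst;

    # Row 0: building the first $j chars of $dst from nothing is $j insertions.
    my @prev = map { $_ * $cost_ins } 0 .. scalar(@d);
    for my $i (1 .. scalar(@s)) {
        # Column 0: reducing the first $i chars of $src to nothing is $i deletions.
        my @cur = ($i * $cost_del);
        for my $j (1 .. scalar(@d)) {
            my $sub = $prev[$j - 1] + ($s[$i - 1] eq $d[$j - 1] ? 0 : $cost_sub);
            my $del = $prev[$j] + $cost_del;
            my $ins = $cur[$j - 1] + $cost_ins;
            my $min = $sub < $del ? $sub : $del;
            $cur[$j] = $min < $ins ? $min : $ins;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print weighted_distance("kitten", "sitting", 1, 1, 1), "\n";   # unit costs
print weighted_distance("kitten", "sitting", 1, 1, 3), "\n";   # costly substitution
```

With a substitution cost of 3, a delete followed by an insert (total cost 2) is cheaper than substituting, so the result depends on the cost ratios and not just on the number of edits.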
levenshtein_l($src, $dst, $max_distance)
Distance between $src and $dst unless it is bigger than $max_distance (the _l suffix stands for "limit"), in which case undef is returned. May die just like the other functions.
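Because a limited call can return undef, callers need a defined check before using the result as a number. A short sketch, assuming the call form levenshtein_l($src, $dst, $max_distance); the word list is illustrative.

```perl
use strict;
use warnings;
use Text::Levenshtein::Flexible qw( levenshtein_l );

# Assumed call form: levenshtein_l($src, $dst, $max_distance).
for my $candidate (qw( baz blah foo )) {
    my $distance = levenshtein_l("bar", $candidate, 2);
    if (defined $distance) {
        print "$candidate: $distance\n";
    } else {
        print "$candidate: more than 2 edits away\n";
    }
}
```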
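levenshtein_lc($src, $dst, $max_distance, $cost_ins, $cost_del, $cost_sub)
Distance between $src and $dst using the specified costs, up to $max_distance; beyond that, undef is returned as with levenshtein_l.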
levenshtein_l_all($max_distance, $src, @dst)
For an array @dst of strings, return all that are up to $max_distance from $src. The result is a list of 2-element array references, each holding a string and its distance. To get a list of strings sorted by distance:
my @sorted = map { $_->[0] } sort { $a->[1] <=> $b->[1] } levenshtein_l_all(2, "bar", "foo", "blah", "baz");
Note that the *_all functions were converted to XS as well. To avoid too much XS code duplication, this function delegates to the OO version internally, so the OO interface is preferable for this in any case.
levenshtein_lc_all($max_distance, $cost_ins, $cost_del, $cost_sub, $src, @dst)
For an array @dst of strings, return all that are up to $max_distance from $src when using the specified costs as in levenshtein_c. The result is the same as for levenshtein_l_all, and the remark about the OO version applies equally here.
Note that there is no levenshtein_all() function because it is trivial to write using map.
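For completeness, the map-based equivalent could look like the following sketch. The helper name levenshtein_all is hypothetical and is defined here, not provided by the module; it mirrors the string-distance pair shape of the *_all functions.

```perl
use strict;
use warnings;
use Text::Levenshtein::Flexible qw( levenshtein );

# Hypothetical helper, not part of the module: pair every candidate
# with its distance from $src, mirroring the *_all result shape.
sub levenshtein_all {
    my ($src, @dst) = @_;
    return map { [ $_, levenshtein($src, $_) ] } @dst;
}

printf "%s => %d\n", @$_ for levenshtein_all("bar", "foo", "blah", "baz");
```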
The OO API will usually be more convenient except for trivial calculations because it lets you specify limits and costs once and pass only the variable data to object methods. Being implemented in C/XS, it is just as fast as the procedural one, or faster in the case of the list functions.
new($max_distance, $cost_ins, $cost_del, $cost_sub)
All four constructor arguments are optional but must be defined if they are used, i.e. you have to specify a number for $max_distance if you want to use the costs. If you don't care about a setting, pass 1 for each cost and, for $max_distance, some number over 255 times the largest of the cost values. Passing something significantly bigger doesn't hurt, in case the hard-coded limit for calculations should grow some day.
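The "don't care" limit described above can be computed from the costs. A sketch, assuming the constructor takes its arguments in the order new($max_distance, $cost_ins, $cost_del, $cost_sub); the cost values and the factor 1024 are illustrative choices.

```perl
use strict;
use warnings;
use List::Util qw( max );
use Text::Levenshtein::Flexible;

# Costs for insertion, deletion and substitution (illustrative values).
my ($cost_ins, $cost_del, $cost_sub) = (1, 1, 2);

# "Don't care" limit: comfortably above 255 times the largest cost.
my $max_distance = 1024 * max($cost_ins, $cost_del, $cost_sub);

# Assumed argument order: new($max_distance, $cost_ins, $cost_del, $cost_sub).
my $lev = Text::Levenshtein::Flexible->new($max_distance, $cost_ins, $cost_del, $cost_sub);
print ref($lev), "\n";
```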
distance($src, $dst)
Just for orthogonality, this does the same as levenshtein().
distance_c($src, $dst)
Just like levenshtein_c() but using the previously specified costs.
distance_l($src, $dst)
levenshtein_l()'s modern brother.
distance_lc($src, $dst)
The nicer variant of levenshtein_lc().
distance_l_all($src, @dst)
Not quite as ugly but otherwise equivalent to levenshtein_l_all().
distance_lc_all($src, @dst)
Where levenshtein_lc_all() gets really nasty, this does the same in a saner way.
Of course there's no distance_all() method either.
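Putting the OO interface together, the sorted-matches example from the procedural section might be written as below. This is a sketch under assumptions: the constructor argument order new($max_distance, $cost_ins, $cost_del, $cost_sub) and the method name distance_l_all($src, @dst) are inferred from the descriptions above.

```perl
use strict;
use warnings;
use Text::Levenshtein::Flexible;

# Limit of 2 with unit costs; argument order assumed as
# new($max_distance, $cost_ins, $cost_del, $cost_sub).
my $lev = Text::Levenshtein::Flexible->new(2, 1, 1, 1);

# Assumed method: distance_l_all($src, @dst), returning [string, distance] pairs.
my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             $lev->distance_l_all("bar", "foo", "blah", "baz");
print "@sorted\n";
```

The limit and costs are fixed once in the constructor, so each call only passes the strings to compare.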
According to a few completely made-up benchmarks, Text::Levenshtein::Flexible is at least as fast as either Text::Levenshtein::XS or Text::Fuzzy (Core i5 920) and between 25% and 48% faster on some systems (Phenom II X6 1090T). For pure 8-bit character sets, Text::LevenshteinXS is usually a tad faster, but it can't deal with multibyte characters at all. A small benchmark script is included so you can test on your own system; I'd be interested to hear about any unexpectedly good or bad performance.
Text::Levenshtein::XS Text::LevenshteinXS Text::Fuzzy
Don't even bother with anything else unless you're more interested in the algorithm than in practical applications: this algorithm is one of the better examples of something that is reasonably efficient in C but that Perl is terrible at.
To find this module's latest updates that are not on CPAN yet, check https://github.com/mbethke/Text-Levenshtein-Flexible
All the credit for speed and algorithmic cleverness goes to Joe Conway and Volkan Yazici who wrote the bulk of this module's code, originally for PostgreSQL.
Matthias Bethke, <matthias@towiski.de>
Copyright (C) 2014-2018 by Matthias Bethke
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available. Significant portions of the code are (C) PostgreSQL Global Development Group and The Regents of the University of California. All modified versions must retain the file COPYRIGHT included in the distribution.