Text::WagnerFischer::Armenian - a variation on Text::WagnerFischer for Armenian-language strings
    use Text::WagnerFischer::Armenian qw( distance );
    use utf8;  # for the Armenian characters in the source code

    print distance("ձեռն", "ձեռան") . "\n";
    # "dzerrn" -> "dzerran"; prints 1

    print distance("ձեռն", "ձերն") . "\n";
    # "dzerrn" -> "dzern"; prints 0.5

    print distance("կինք", "կին") . "\n";
    # "kink'" -> "kin"; prints 0.5

    my @words = qw( զօրսն Զորս զզօրսն );
    my @distances = distance( "զօրս", @words );
    print "@distances\n";
    # "zors" -> "zorsn, Zors, zzorsn"
    # prints "0.5 0.25 1"

    # Change the cost of a letter case mismatch to 1
    my $edit_values = [ 0, 1, 1, 1, 0.5, 0.5, 0.5 ];
    print distance( $edit_values, "ձեռն", "Ձեռն" ) . "\n";
    # "dzerrn" -> "DZerrn"; prints 1
This module implements the Wagner-Fischer edit distance algorithm, modified for Armenian strings. The Armenian language has a number of single-letter prefixes and suffixes that do not change the basic meaning of a word but function as definite articles, prepositions, or grammatical markers. Adding or removing one of these affixes, or substituting letters that represent vocalic equivalence, should therefore cost less than an ordinary character substitution.
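In addition to the standard match, insertion/deletion, and mismatch costs (a through c), the Armenian weight function recognizes four extra edit types (d through g):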
                  / a: x = y  (cost for letter match)
                  | b: x = - or y = -  (cost for letter insertion/deletion)
    w( x, y ) =   | c: x != y  (cost for letter mismatch)
                  | d: x = X  (cost for case mismatch)
                  | e: x ~ y  (cost for letter vocalic equivalence)
                  | f: x = (z|y|ts) && y = - (or vice versa)
                  |    (cost for grammatic prefix)
                  | g: x = (n|k'|s|d) && y = - (or vice versa)
                  \    (cost for grammatic suffix)
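The case analysis above can be sketched in Perl roughly as follows. This is a minimal illustration, not the module's own code: the function name `weight`, the three lookup tables, and the mapping of the transliterated letters to Armenian characters are assumptions, and the default costs are inferred from the synopsis. Unlike a real implementation, the sketch does not check whether an affix letter actually sits at the start or end of the word.

```perl
use strict;
use warnings;
use utf8;    # Armenian characters appear literally in this source

# Costs in the order a..g. Inferred from the synopsis, not read from the
# module source.
my @WEIGHTS = ( 0, 1, 1, 0.25, 0.5, 0.5, 0.5 );

# Hypothetical lookup tables (assumed Armenian letters for z|y|ts and n|k'|s|d).
my %PREFIX = map { $_ => 1 } qw( զ յ ց );
my %SUFFIX = map { $_ => 1 } qw( ն ք ս դ );

# Only one equivalence pair is listed: the one implied by the synopsis
# example "dzerrn" -> "dzern".
my %EQUIV = ( 'ռ' => 'ր', 'ր' => 'ռ' );

# Return the cost of aligning $x with $y, where '-' stands for a gap.
sub weight {
    my ( $x, $y, $w ) = @_;
    $w ||= \@WEIGHTS;

    return $w->[0] if $x eq $y;                          # a: match

    if ( $x eq '-' or $y eq '-' ) {
        my $letter = $x eq '-' ? $y : $x;
        return $w->[5] if $PREFIX{$letter};              # f: prefix letter
        return $w->[6] if $SUFFIX{$letter};              # g: suffix letter
        return $w->[1];                                  # b: plain indel
    }

    return $w->[3] if lc($x) eq lc($y);                  # d: case mismatch
    return $w->[4]
        if ( $EQUIV{ lc($x) } // '' ) eq lc($y);         # e: equivalence
    return $w->[2];                                      # c: mismatch
}
```

Passing a different array reference as the third argument changes the costs, which mirrors how an override of the default edit penalties would feed into the weight function.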
- distance( \@editweight, $string1, $string2, [ .. $stringN ] );
- distance( $string1, $string2, [ .. $stringN ] );
The main exported function of this module. Takes a list of two or more strings and returns the edit distance between the first string and each of the others. The optional first argument, an array reference (\@editweight), lets users override the default edit penalties; it holds the seven costs in the order a through g described above.
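To make the calling convention concrete, here is a self-contained sketch of a weighted Wagner-Fischer computation that accepts the same argument shapes. It is an assumption-laden stand-in, not the module's implementation: the name `sketch_distance` is hypothetical, only the first four costs (match, insertion/deletion, mismatch, case mismatch) are modeled, and the scalar-context behaviour shown is a design choice of the sketch.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for distance(); uses only costs a..d.
sub sketch_distance {
    my @args = @_;

    # An optional leading array reference overrides the default costs.
    my $w = ref( $args[0] ) eq 'ARRAY' ? shift @args : [ 0, 1, 1, 0.25 ];
    my ( $source, @targets ) = @args;
    my @s = split //, $source;

    my @results;
    for my $target (@targets) {
        my @t = split //, $target;

        # Row 0: cost of building each prefix of the target by insertion.
        my @prev = (0);
        push @prev, $prev[-1] + $w->[1] for @t;

        for my $i ( 1 .. scalar @s ) {
            my @cur = ( $prev[0] + $w->[1] );    # delete s[0 .. i-1]
            for my $j ( 1 .. scalar @t ) {
                my ( $x, $y ) = ( $s[ $i - 1 ], $t[ $j - 1 ] );
                my $sub =
                      $x eq $y         ? $w->[0]
                    : lc($x) eq lc($y) ? $w->[3]
                    :                    $w->[2];
                my $best = $prev[ $j - 1 ] + $sub;    # substitute or match
                my $del  = $prev[$j] + $w->[1];       # delete from source
                my $ins  = $cur[ $j - 1 ] + $w->[1];  # insert into source
                $best = $del if $del < $best;
                $best = $ins if $ins < $best;
                push @cur, $best;
            }
            @prev = @cur;
        }
        push @results, $prev[-1];
    }

    # One distance per target in list context; the first one otherwise.
    return wantarray ? @results : $results[0];
}
```

With the sketch's defaults, comparing "ձեռն" with "Ձեռն" costs 0.25, and passing `[ 0, 1, 1, 1 ]` as the first argument raises that to 1, in the same way the synopsis raises the case-mismatch penalty.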
- am_lc( $char )
A small utility function, useful for Armenian text. Returns the lowercase version of the character passed in.
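For illustration, lowercasing a single Armenian character can be sketched from the Unicode layout alone: the capitals occupy U+0531 to U+0556 and each lowercase letter sits 0x30 code points higher. The name `sketch_am_lc` is hypothetical, and this is not necessarily how am_lc is implemented.

```perl
use strict;
use warnings;

# Hypothetical sketch: map an Armenian capital to its lowercase letter
# and return any other character unchanged.
sub sketch_am_lc {
    my ($char) = @_;
    my $cp = ord $char;
    return ( $cp >= 0x0531 && $cp <= 0x0556 ) ? chr( $cp + 0x30 ) : $char;
}
```

On recent Perl versions the built-in `lc` gives the same result for these characters when the input is a decoded character string.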
There are many cases of Armenian word equivalence that this module does not handle perfectly; it is meant as a rough heuristic for comparing transcriptions of handwriting. In particular, multi-letter suffixes and some orthographic equivalences, e.g. "o" -> "aw", are not handled at all.
This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.
Tara L Andrews, firstname.lastname@example.org