NAME
Text::WagnerFischer::Armenian - a variation on Text::WagnerFischer for Armenian-language strings
SYNOPSIS
    use utf8;    # the strings below are literal Armenian text
    use Text::WagnerFischer::Armenian qw( distance );

    print distance( "ձեռն", "ձեռան" ) . "\n";
    # "dzerrn" -> "dzerran"; prints 1
    print distance( "ձեռն", "ձերն" ) . "\n";
    # "dzerrn" -> "dzern"; prints 0.5
    print distance( "կինք", "կին" ) . "\n";
    # "kink'" -> "kin"; prints 0.5

    my @words = qw( զօրսն Զորս զզօրսն );
    my @distances = distance( "զօրս", @words );
    print "@distances\n";
    # "zors" -> "zorsn, Zors, zzorsn"
    # prints "0.5 0.25 1"

    # Change the cost of a letter case mismatch to 1
    my $edit_values = [ 0, 1, 1, 1, 0.5, 0.5, 0.5 ];
    print distance( $edit_values, "ձեռն", "Ձեռն" ) . "\n";
    # "dzerrn" -> "DZerrn"; prints 1
DESCRIPTION
This module implements the Wagner-Fischer edit distance algorithm, modified for Armenian strings. Armenian has a number of single-letter prefixes and suffixes that function as definite articles, prepositions, or grammatical markers without changing the basic meaning of the word. Adding or removing one of these affixes, or substituting letters that are vocalically equivalent, should therefore cost less than an ordinary character substitution.
The Armenian weight function recognizes four extra edit types:
            / a: x = y  (cost for letter match)
            | b: x = - or y = -  (cost for letter insertion/deletion)
w( x, y ) = | c: x != y  (cost for letter mismatch)
            | d: x = X  (cost for case mismatch)
            | e: x ~ y  (cost for letter vocalic equivalence)
            | f: x = (z|y|ts) && y = - (or vice versa)
            |      (cost for grammatic prefix)
            | g: x = (n|k'|s|d) && y = - (or vice versa)
            \      (cost for grammatic suffix)
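As a rough illustration of how such a weight function plugs into the Wagner-Fischer recurrence, here is a minimal, self-contained sketch. It is not the module's implementation. The cost values are assumptions inferred from the SYNOPSIS output, and the names sketch_distance, sub_cost, and indel_cost, the Armenian letters chosen for the prefix and suffix sets, and the single equivalence pair are all hypothetical.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal Armenian letters
use List::Util qw( min );

# Costs in the order a..g of the weight function above.
# ASSUMPTION: values inferred from the SYNOPSIS output, not read from the module.
my ( $MATCH, $INDEL, $MISMATCH, $CASE, $EQUIV, $PREFIX, $SUFFIX )
    = ( 0, 1, 1, 0.25, 0.5, 0.5, 0.5 );

# ASSUMPTION: hypothetical letter sets, for illustration only.
my %IS_PREFIX  = map { $_ => 1 } qw( զ յ ց );      # z, y, ts
my %IS_SUFFIX  = map { $_ => 1 } qw( ն ք ս դ );    # n, k', s, d
my %EQUIVALENT = ( 'ռ' => 'ր', 'ր' => 'ռ' );       # rr ~ r

# Cost of substituting character $x with character $y (cases a, c, d, e).
sub sub_cost {
    my ( $x, $y ) = @_;
    return $MATCH if $x eq $y;
    return $CASE  if lc($x) eq lc($y);
    return $EQUIV if ( $EQUIVALENT{ lc($x) } // '' ) eq lc($y);
    return $MISMATCH;
}

# Cost of inserting or deleting $char (cases b, f, g). $pos is the index
# of $char in its own word and $last is that word's final index.
sub indel_cost {
    my ( $char, $pos, $last ) = @_;
    return $PREFIX if $pos == 0     && $IS_PREFIX{ lc($char) };
    return $SUFFIX if $pos == $last && $IS_SUFFIX{ lc($char) };
    return $INDEL;
}

# Weighted Wagner-Fischer distance between two character strings.
sub sketch_distance {
    my ( $s, $t ) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ( $m, $n ) = ( scalar @s, scalar @t );

    # $d[$i][$j] is the cost of turning the first $i characters of $s
    # into the first $j characters of $t.
    my @d = ( [0] );
    for my $i ( 1 .. $m ) {
        $d[$i][0] = $d[ $i - 1 ][0] + indel_cost( $s[ $i - 1 ], $i - 1, $#s );
    }
    for my $j ( 1 .. $n ) {
        $d[0][$j] = $d[0][ $j - 1 ] + indel_cost( $t[ $j - 1 ], $j - 1, $#t );
    }
    for my $i ( 1 .. $m ) {
        for my $j ( 1 .. $n ) {
            $d[$i][$j] = min(
                $d[ $i - 1 ][$j] + indel_cost( $s[ $i - 1 ], $i - 1, $#s ),
                $d[$i][ $j - 1 ] + indel_cost( $t[ $j - 1 ], $j - 1, $#t ),
                $d[ $i - 1 ][ $j - 1 ] + sub_cost( $s[ $i - 1 ], $t[ $j - 1 ] ),
            );
        }
    }
    return $d[$m][$n];
}

print sketch_distance( "ձեռն", "ձեռան" ), "\n";    # prints 1
print sketch_distance( "ձեռն", "ձերն" ),  "\n";    # prints 0.5
print sketch_distance( "կինք", "կին" ),   "\n";    # prints 0.5
```

The sketch applies the reduced prefix and suffix costs only when the inserted or deleted letter is the first or last character of its own word. That is one simple reading of cases f and g, and the module may decide this differently.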
SUBROUTINES
- distance( \@editweight, $string1, $string2, [ .. $stringN ] );
- distance( $string1, $string2, [ .. $stringN ] );
  The main exported function of this module. Takes a list of two or more strings and returns the edit distance between the first string and each of the others. The array reference \@editweight is an optional first argument, with which users may override the default edit costs, given in the order a through g of the weight function described above.
- am_lc( $char )
  A small utility function, useful for Armenian text. Returns the lowercase version of the character passed in.
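For comparison, a similar effect can be had with Perl's built-in lc, which applies Unicode case mapping to decoded character strings. The helper below is a hypothetical stand-in written for illustration. It is not the module's am_lc and makes no claim about how am_lc is implemented.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal Armenian letters

# Hypothetical helper, NOT the module's am_lc: lowercases one character
# using Perl's built-in Unicode-aware lc.
sub lc_char {
    my ($char) = @_;
    return lc $char;
}

# Case-insensitive comparison of two letters, the kind of check a
# case-mismatch cost needs.
sub same_letter {
    my ( $x, $y ) = @_;
    return lc_char($x) eq lc_char($y);
}

binmode STDOUT, ':encoding(UTF-8)';
print lc_char("Ձ"), "\n";                            # prints ձ
print same_letter( "Ձ", "ձ" ) ? "same" : "different", "\n";    # prints same
```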
LIMITATIONS
There are many cases of Armenian word equivalence that this module does not handle perfectly; it is meant to be a rough heuristic for comparing transcriptions of handwriting. In particular, multi-letter suffixes and some orthographic equivalences, e.g. "o" -> "aw", are not handled at all.
LICENSE
This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.
AUTHOR
Tara L Andrews, aurum@cpan.org