
NAME

Algorithm::AM::algorithm - How the Analogical Modeling algorithm works

VERSION

version 2.44

DESCRIPTION

First, the user must create a set of data items, each with an outcome, and some test items. All of these items are represented by feature vectors. The AM algorithm does not create these feature vectors; they can be generated by hand or by script, and AM does not care which. All feature vectors must have the same length; call it n.

The supracontextual lattice

AM requires the construction of a supracontextual lattice. It is merely a complete distributive lattice of sets called supracontexts, each one labeled with an integer in the range 0 to 2^n - 1. If a and b are labels of two supracontexts, then a & b = b (that's bitwise AND) iff the supracontext labeled by a is a superset of the supracontext labeled by b.

The supracontextual lattice starts out with every supracontext being the empty set. The AM algorithm then adds subcontexts to the supracontexts one at a time.
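The superset relation between supracontext labels can be checked with a single bitwise AND. The following is a minimal sketch in Python, written only to illustrate the labeling described above; the function name supracontext_contains is hypothetical and is not part of Algorithm::AM.

```python
def supracontext_contains(a: int, b: int) -> bool:
    """Return True if the supracontext labeled a is a superset of the one labeled b.

    Hypothetical helper: it only restates the rule a & b == b from the text.
    """
    return a & b == b


n = 3                    # length of the feature vectors (assumed for the example)
labels = range(2 ** n)   # every supracontext label, 0 to 2^n - 1

# Labels of all supracontexts that are supersets of the one labeled 0b101.
supersets = [a for a in labels if supracontext_contains(a, 0b101)]
print([format(a, "03b") for a in supersets])
```

Every label in the result keeps the 1 bits of 0b101 and may set additional bits.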

Subcontexts

Subcontexts are also sets, and each one is also labeled with an integer in the range 0 to 2^n - 1. The elements of the subcontexts are data items.

Example

Suppose that the test item has feature vector

('S', 'O', '0', 'S', 'R', '0', 'T', 'A')

and a data item has the feature vector

('P', 'Y', 'V', 'S', 'R', '0', 'T', 'a').

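Compare the corresponding features position by position, writing a 1 where they differ and a 0 where they are the same. (Note that the final features, 'A' and 'a', count as different.) Then the binary number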

0b11100001

is the label of the subcontext to which the data item belongs.
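The comparison above can be sketched in a few lines of Python. This is an illustration only, assuming features are compared with plain (case-sensitive) equality; the function name subcontext_label is hypothetical and is not part of Algorithm::AM.

```python
def subcontext_label(test_item, data_item):
    """Build the subcontext label: one bit per feature, 1 = different, 0 = same.

    The first feature becomes the most significant bit.
    """
    if len(test_item) != len(data_item):
        raise ValueError("feature vectors must have the same length")
    label = 0
    for t, d in zip(test_item, data_item):
        label = (label << 1) | int(t != d)
    return label


test_item = ('S', 'O', '0', 'S', 'R', '0', 'T', 'A')
data_item = ('P', 'Y', 'V', 'S', 'R', '0', 'T', 'a')

label = subcontext_label(test_item, data_item)
print(format(label, "08b"))
```

For the two vectors from the example, this reproduces the label 0b11100001.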

Filling the supracontextual lattice

The elements of the supracontexts are the subcontexts. If a subcontext has label a, then it will be an element of any supracontext whose label b satisfies b & a = a.

Thus, the subcontext labeled by 0b11100001 will belong to 16 different supracontexts. The labels of these supracontexts are found by replacing the 0s by 1s in all possible ways; the original 1s are left untouched.
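The 16 labels can be enumerated directly. The sketch below, in Python, walks through every way of turning the 0 bits of the subcontext label into 1s; the function name containing_supracontexts is hypothetical and is not part of Algorithm::AM.

```python
def containing_supracontexts(sub_label: int, n: int) -> list:
    """List the labels b of all supracontexts with b & sub_label == sub_label."""
    full = (1 << n) - 1
    free = full & ~sub_label       # positions where sub_label has a 0
    labels = []
    subset = free
    while True:
        # Keep the original 1s and switch on one choice of the free bits.
        labels.append(sub_label | subset)
        if subset == 0:
            break
        subset = (subset - 1) & free   # next smaller subset of the free bits
    return sorted(labels)


labels = containing_supracontexts(0b11100001, 8)
print(len(labels))
print([format(b, "08b") for b in labels])
```

A subcontext label with k zero bits belongs to 2^k supracontexts; here k = 4.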

Homogeneity and Heterogeneity

A supracontext is heterogeneous if the following two conditions hold:

  1. The supracontext contains two or more subcontexts.

  2. The data items contained in these subcontexts do not share a common outcome.

Otherwise, the supracontext is homogeneous. Only homogeneous supracontexts determine the possible outcomes.
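The two conditions translate into a short test. The following Python sketch assumes, purely for illustration, that a supracontext is given as a list of its subcontexts and that each subcontext is a list of (item, outcome) pairs; this representation and the name is_heterogeneous are hypothetical, not the module's own data structures.

```python
def is_heterogeneous(subcontexts):
    """Apply the two conditions from the text to one supracontext."""
    nonempty = [s for s in subcontexts if s]
    if len(nonempty) < 2:
        return False               # condition 1 fails: fewer than two subcontexts
    outcomes = {outcome for s in nonempty for _, outcome in s}
    return len(outcomes) > 1       # condition 2: no common outcome


# One subcontext with mixed outcomes: still homogeneous.
single_mixed = [[("item1", "r"), ("item2", "e")]]
# Two subcontexts that agree on the outcome: homogeneous.
two_agreeing = [[("item1", "r")], [("item2", "r")]]
# Two subcontexts with different outcomes: heterogeneous.
two_differing = [[("item1", "r")], [("item2", "e")]]

print(is_heterogeneous(single_mixed),
      is_heterogeneous(two_agreeing),
      is_heterogeneous(two_differing))
```

Note the first case: a supracontext holding a single subcontext is homogeneous even if the data items in that subcontext disagree, because the first condition does not hold.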

The Analogical Set

The analogical set is a list of all data items that appear in the subcontexts in the homogeneous supracontexts. Each data item is associated with a number, which is one of the following:

Number of occurrences

This is merely the number of homogeneous supracontexts containing the subcontext containing the data item.

Number of pointers

Assign to each homogeneous supracontext a number representing the total number of data items in the subcontexts it contains. The number of pointers of a particular data item is the sum of the numbers assigned to the homogeneous supracontexts containing the subcontext containing the data item. (Number of occurrences can be thought of as assigning 1 to each supracontext.)

Using pointers gives rise to gang effects: the ability of many data items less similar to the test item but appearing in the same subcontext to have more influence on the outcome than a few data items more similar to the test item.

With the analogical set in hand, one can then describe the likelihood of the various outcomes actually occurring by looking at the numbers assigned to its data items.
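The whole procedure can be put together in a brute-force sketch that visits every one of the 2^n supracontexts. It is written in Python for illustration only and is practical only for small n; it follows the description in this document and is not the implementation used by Algorithm::AM. The function names, the toy data, and the outcome labels 'r' and 'e' are all made up for the example.

```python
from collections import defaultdict


def analogical_set(test_item, data):
    """data: list of (feature_vector, outcome). Returns {index: (occurrences, pointers)}."""
    n = len(test_item)
    subcontexts = defaultdict(list)            # subcontext label -> data item indices
    for i, (features, _) in enumerate(data):
        label = 0
        for t, d in zip(test_item, features):
            label = (label << 1) | int(t != d)
        subcontexts[label].append(i)

    occurrences = defaultdict(int)
    pointers = defaultdict(int)
    for supra in range(2 ** n):
        members = [a for a in subcontexts if supra & a == a]
        items = [i for a in members for i in subcontexts[a]]
        outcomes = {data[i][1] for i in items}
        if len(members) >= 2 and len(outcomes) > 1:
            continue                           # heterogeneous: contributes nothing
        for i in items:
            occurrences[i] += 1                # each homogeneous supracontext counts 1
            pointers[i] += len(items)          # or counts its number of data items
    return {i: (occurrences[i], pointers[i]) for i in occurrences}


def outcome_scores(data, aset, use_pointers=True):
    """Turn the numbers in the analogical set into relative outcome likelihoods."""
    scores = defaultdict(int)
    for i, (occ, ptr) in aset.items():
        scores[data[i][1]] += ptr if use_pointers else occ
    total = sum(scores.values())
    return {outcome: s / total for outcome, s in scores.items()}


test_item = ('a', 'b', 'c')
data = [
    (('a', 'b', 'x'), 'r'),   # subcontext 0b001
    (('a', 'y', 'x'), 'r'),   # subcontext 0b011
    (('z', 'b', 'c'), 'e'),   # subcontext 0b100
]
aset = analogical_set(test_item, data)
by_pointers = outcome_scores(data, aset)
by_occurrences = outcome_scores(data, aset, use_pointers=False)
print(aset, by_pointers, by_occurrences)
```

In this toy data, item 1 is less similar to the test item than item 2, but it shares the homogeneous supracontext 0b011 with item 0, so both 'r' items gain pointers from it. Outcome 'r' receives 5 of the 7 pointers but only 3 of the 5 occurrences, a small instance of the gang effect.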

OVERVIEW

Exemplar-based modeling works as follows: there is a set of data items, each assigned an outcome, and there is a test item. The test item is compared with the data items; the result of this comparison tells what the possible outcomes of the test item are, along with their likelihoods.

Exemplar-based modeling is often contrasted with rule-based modeling. Note that in rule-based modeling, there can be only one possible outcome, unless the model is fudged by introducing probability. (Some types of exemplar-based modeling also give only one possible outcome.)

Analogical Modeling (AM) is one way to do the comparison and determine the outcome. Some of its salient features are as follows:

  • Exemplars that seem less similar to the test item can still have a magnified effect, even outweighing exemplars that seem more similar, if there are many of them. This is known as the gang effect.

  • AM accounts for leakage.

    For instance, it is possible for someone to accidentally say "snew" instead of "snowed", in analogy with "know/knew", "grow/grew", "throw/threw", "blow/blew", etc. (I've never done this myself, though I know someone who has.) In rule-based modeling, this could never occur; in AM, this is predicted to occur, though with very low frequency.

AUTHOR

Theron Stanford <shixilun@yahoo.com>, Nathan Glenn <garfieldnate@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Royal Skousen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.