The norm_decoder caches the 256 possible byte => float pairs, obviating the need to call decode_norm over and over for a scoring implementation that knows how to use it.

NAME

KinoSearch::Search::Similarity - Calculate how closely two things match.

SYNOPSIS

    # ./MySimilarity.pm
    package MySimilarity;
    use base qw( KinoSearch::Search::Similarity );

    sub length_norm {
        my ( $self, $num_tokens ) = @_;
        return $num_tokens == 0 ? 1 : log($num_tokens) + 1;
    }

    # ./MySchema.pm
    package MySchema;
    use base qw( KinoSearch::Schema );
    use MySimilarity;
    
    sub similarity { MySimilarity->new }

DESCRIPTION

KinoSearch uses a close approximation of boolean logic for determining which documents match a given query; then it uses a variant of the vector-space model for calculating scores. Much of the math used when calculating these scores is encapsulated within the Similarity class.

Similarity objects are used internally by KinoSearch's indexing and scoring classes. They are assigned using KinoSearch::Schema and KinoSearch::Schema::FieldSpec.

Only one method is publicly exposed at present.

SUBCLASSING

To build your own Similarity implementation, provide a new implementation of length_norm() under a new class name. The constructor will inherit the class name properly.

Similarity is implemented as a C-struct object, so you can't add any member variables to it.

METHODS

length_norm

    my $multiplier = $sim->length_norm($num_tokens);

After a field is broken up into terms at index-time, each term must be assigned a weight. One of the factors in calculating this weight is the number of tokens that the original field was broken into.

Typically, we assume that the more tokens in a field, the less important any one of them is -- so that, e.g. 5 mentions of "Kafka" in a short article are given more heft than 5 mentions of "Kafka" in an entire book. The default implementation of length_norm expresses this using an inverted square root.
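As a rough illustration of the paragraph above, the following standalone sketch computes an inverted square root of the token count for a short and a long field. It does not use KinoSearch itself: the function name inv_sqrt_length_norm, the token counts, and the choice to return 0 for an empty field are all assumptions made for the example, not the library's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the behavior described above: the multiplier
# is the inverted square root of the field's token count.
# Returning 0 for an empty field is an assumption of this sketch.
sub inv_sqrt_length_norm {
    my ($num_tokens) = @_;
    return 0 if $num_tokens <= 0;
    return 1 / sqrt($num_tokens);
}

# Same number of "Kafka" mentions, very different field lengths.
# The token counts are invented for illustration.
my $article = inv_sqrt_length_norm(400);        # short article
my $book    = inv_sqrt_length_norm(160_000);    # entire book

printf "article: %.4f\n", $article;    # 1/20  = 0.0500
printf "book:    %.4f\n", $book;       # 1/400 = 0.0025
```

With these made-up lengths, each mention in the article carries 20 times the multiplier of a mention in the book.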

However, the inverted square root has a tendency to reward very short fields highly, which isn't always appropriate for fields you expect to have a lot of tokens on average. See KinoSearch::Contrib::LongFieldSim for a discussion.
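To make the short-field effect concrete, the sketch below prints the inverted square root for a range of field lengths next to a hypothetical alternative that treats every field shorter than a floor as if it had exactly that many tokens. The names inv_sqrt_norm and plateau_norm and the floor of 100 are assumptions chosen for illustration; this is not presented as the formula used by KinoSearch::Contrib::LongFieldSim.

```perl
use strict;
use warnings;

# Plain inverted square root of the token count.
# Returning 0 for an empty field is an assumption of this sketch.
sub inv_sqrt_norm {
    my ($num_tokens) = @_;
    return 0 if $num_tokens <= 0;
    return 1 / sqrt($num_tokens);
}

# Hypothetical variant: any field shorter than $floor tokens is treated as
# if it had exactly $floor tokens, which caps the reward for short fields.
sub plateau_norm {
    my ( $num_tokens, $floor ) = @_;
    $floor = 100 unless defined $floor;
    $num_tokens = $floor if $num_tokens < $floor;
    return 1 / sqrt($num_tokens);
}

for my $n ( 1, 10, 100, 1_000, 10_000 ) {
    printf "%6d tokens  inv_sqrt=%.4f  plateau=%.4f\n",
        $n, inv_sqrt_norm($n), plateau_norm($n);
}
```

Under the plain inverted square root, a one-token field receives 100 times the multiplier of a 10,000-token field; with the hypothetical floor of 100 the gap shrinks to a factor of 10.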

COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20_01.