
NAME

OurNet::FuzzyIndex - Inverted search for double-byte characters

SYNOPSIS

    use OurNet::FuzzyIndex;

    my $idxfile  = 'test.idx'; # Name of the database file
    my $pagesize = undef;      # Page size (twice the size of an average record)
    my $cache    = undef;      # Cache size (undef to use default)
    my $subdbs   = 0;          # Number of child dbs; 0 for none

    # Initiate the DB from scratch
    unlink $idxfile if -e $idxfile;
    my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);

    # Index a record: key = 'Doc1', content = 'Some text here'
    $db->insert('Doc1', 'Some text here');

    # Alternatively, parse the content first with different weights
    my %words = $db->parse("Some other text here", 5);
    %words = $db->parse_xs("Some more texts here", 2, \%words);

    # Then index the resulting hash with 'Doc2' as its key
    $db->insert('Doc2', %words);

    # Perform a query: the 2nd argument is the match-type flag
    my %result = $db->query('search for some text', $MATCH_FUZZY);

    # Combine the result with another query
    %result = $db->query('more please', $MATCH_NOT, \%result);

    # Dump the results; note you have to call $db->getkey each time
    foreach my $idx (sort {$result{$b} <=> $result{$a}} keys(%result)) {
        my $val = $result{$idx};
        print "Matched: ".$db->getkey($idx)." (score $val)\n";
    }

    # Set database variables
    $db->setvar('variable', "fetch success!\n");
    print $db->getvar('variable');

    # Get all records: the optional 0 says we want an array of keys
    print "These records are indexed:\n";
    print join(',', $db->getkeys(0));

    # Alternatively, get it with its internal index number
    my %allkeys = $db->getkeys(1);

DESCRIPTION

OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism designed specifically for multi-byte encodings such as Big5 or UTF-8.

It uses DB_File to create an associative mapping from each character to the one that follows it, utilizing DB_BTREE's duplicate-key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on the query result.
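The duplicate-key feature mentioned above can be tried on its own with plain DB_File. The following is a minimal sketch, not the module's actual storage layout: the add_pairs helper and the "next character:doc id" value format are assumptions made for illustration, and an in-memory B-tree stands in for the index file.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# A B-tree that allows several values under the same key.
my $btree = DB_File::BTREEINFO->new;
$btree->{flags} = R_DUP;

# Assumption: an undef filename gives an in-memory database for the demo.
my %index;
my $db = tie %index, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $btree
    or die "Cannot create in-memory B-tree: $!";

# Hypothetical layout: key = a character, value = "following character:doc id".
sub add_pairs {
    my ($doc_id, $text) = @_;
    my @chars = split //, $text;
    for my $i (0 .. $#chars - 1) {
        $db->put($chars[$i], "$chars[$i + 1]:$doc_id");
    }
}

add_pairs(1, 'abc');
add_pairs(2, 'abd');

# One lookup returns every follower recorded for 'b', across all documents.
my @followers = sort $db->get_dup('b');
print "b is followed by: @followers\n";
```

Because all followers of a character sit under one key, a query only has to visit the keys for the characters it actually contains.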

This module also supports a distributed database option, which optimizes each query to access only a small portion of the database.

Although this module currently supports only the Big5 encoding internally, you can override the parse.c module for extensions, or add your own translation maps.

METHODS

OurNet::FuzzyIndex->new($dbfile, [ $pagesize, $cachesize, $split, $submin, $submax ])

The constructor method; normally only needs the first argument.

$self->parse($content, [$weight], [\%words])

Parses $content into two-word chunks, stored as keys in %words, with values equal to their occurrence counts multiplied by $weight (defaults to 1). May also be invoked as a normal function without $self.

Returns the hash (or hash reference in scalar context) representing the parsed words and their frequencies.
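The counting and weighting behaviour described above can be illustrated in pure Perl. This is a rough sketch only: bigram_counts is a hypothetical helper, it pairs adjacent characters within whitespace-separated tokens, and it does not reproduce the Big5-aware chunking done by the module's own parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's parser: counts overlapping
# two-character chunks and adds $weight for each occurrence.
sub bigram_counts {
    my ($content, $weight, $words) = @_;
    $weight = 1 unless defined $weight;
    $words ||= {};

    # Assumption: split on whitespace, then pair adjacent characters per token.
    for my $token (split ' ', $content) {
        my @chars = split //, $token;
        $words->{ $chars[$_] . $chars[$_ + 1] } += $weight for 0 .. $#chars - 1;
    }

    # Mirror the documented return convention: hash in list context,
    # hash reference in scalar context.
    return wantarray ? %$words : $words;
}

# 'ab' occurs twice and 'ba' once, each scaled by the weight of 5.
my %words = bigram_counts('abab', 5);

# Passing \%words accumulates a second text with a different weight.
my $merged = bigram_counts('ab', 2, \%words);
```

Accumulating into one hash is what lets a title and a body, for example, be indexed under the same key with different weights.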

$self->parse_xs($content, [$weight], [\%words])

Same as parse(), but implemented in XS.

$self->insert($key, [$content | \%words])

Insert an entry, given either as unparsed text in $content or as an already parsed hash in %words. The $key is the name of the entry in the database.

Returns the database ID of the newly created entry.

$self->query($query, $flag, [\%match])

Perform a query on the database represented by $self; $query contains a free-form query string. The type of query is specified by $flag, as one of the constants below:

MATCH_FUZZY (default)

Match the query string with fuzzy scoring heuristics.

MATCH_EXACT

Match the exact string $query.

MATCH_PART

Match each individual character fuzzily, in addition to normal fuzzy matching.

MATCH_NOT

Match only entries that have none of the phrases in the query string.

The %match hash, if specified, contains the result of a previous query(), and indicates that this is a subquery limited by the previous search.

Returns the hash (or hash reference in scalar context) containing the matched entry IDs as keys, and their scores as values.

$self->sync()

Synchronize the in-memory records to disk.

$self->setvar($varname, $value)

Sets a user-defined variable in the database. Such variables do not affect operations on the database.

$self->getvar($varname)

Returns the value of a previously set variable, or undef if no such variable exists.

$self->getvars($partial, [$wanthash])

Get all variables beginning with $partial; returns an array of the variable names, or a hash with the variable values as hash values if $wanthash is specified.

$self->getkey($seq)

Returns the name of the entry with $seq as the ID, or undef if there is no such entry. Usually called after a query() to fetch the matched entries.

$self->findkey($key)

Find the ID of the entry with the name $key; the reverse operation of getkey().

$self->delete($key)

Delete the entry with name $key.

$self->delkey($seq)

Delete the entry with the ID $seq. This function's name is a bit of a misnomer; sorry about that.

$self->getkeys([$wanthash])

Return all entry names as an array, or as a hash with their IDs as hash values if $wanthash is specified.

$self->_store($varname, $value)

Private function to store an internal variable to the database. Do not call this directly.

CAVEATS

The query() function uses a time-consuming callback function, _parse_q(), to parse the query string; it is expected to be replaced by a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to make a significant difference.)

The MATCH_EXACT flag is misleading; FuzzyIndex couldn't tell if a query matches the content exactly from the info stored in the index file alone. You are encouraged to write your own grep-like post filter.
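One possible shape for such a post filter is sketched below. The exact_filter helper and the $fetch callback are hypothetical names, not part of this module; the sketch assumes you can retrieve the original text of an entry yourself, since the index does not store it.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only the candidates whose full text
# contains $query verbatim. $fetch is a caller-supplied code ref that
# returns the original text for an entry name (or undef if unavailable).
sub exact_filter {
    my ($query, $scores, $fetch) = @_;
    my %kept;
    for my $name (keys %$scores) {
        my $text = $fetch->($name);
        next unless defined $text;
        $kept{$name} = $scores->{$name} if index($text, $query) >= 0;
    }
    return %kept;
}

# Stand-in data: entry names mapped to scores, as if the IDs returned
# by query() had already been resolved with getkey().
my %docs   = (Doc1 => 'Some text here', Doc2 => 'Some other text here');
my %scores = (Doc1 => 30, Doc2 => 12);

my %exact = exact_filter('Some text', \%scores, sub { $docs{ $_[0] } });
print "Exact match: $_ (score $exact{$_})\n" for sort keys %exact;
```

Running the filter over the candidates from a fuzzy query keeps the index lookup fast while leaving the final literal comparison to your own code.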

TODO

  • Internal handling of locale/unicode mappings

  • Boolean / selective search using combined MATCH_* flags

  • Fix bugs concerning sub_dbs, or deprecate them altogether

  • Use Lingua::ZH::TaBE for better word-segmenting algorithms

SEE ALSO

fzindex, fzquery, OurNet::ChatBot

AUTHORS

Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.

COPYRIGHT

Copyright 2001, 2003 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html