OurNet::FuzzyIndex - Inverted search for double-byte characters
    use OurNet::FuzzyIndex;

    my $idxfile  = 'test.idx'; # Name of the database file
    my $pagesize = undef;      # Page size (twice the size of an average record)
    my $cache    = undef;      # Cache size (undef to use default)
    my $subdbs   = 0;          # Number of child dbs; 0 for none

    # Initiate the DB from scratch
    unlink $idxfile if -e $idxfile;
    my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);

    # Index a record: key = 'Doc1', content = 'Some text here'
    $db->insert('Doc1', 'Some text here');

    # Alternatively, parse the content first with different weights
    my %words = $db->parse("Some other text here", 5);
    %words = $db->parse_xs("Some more texts here", 2, \%words);

    # Then index the resulting hash with 'Doc2' as its key
    $db->insert('Doc2', %words);

    # Perform a query: the 2nd argument is the 'exact match' flag
    my %result = $db->query('search for some text', $MATCH_FUZZY);

    # Combine the result with another query
    %result = $db->query('more please', $MATCH_NOT, \%result);

    # Dump the results; note you have to call $db->getkey each time
    foreach my $idx (sort {$result{$b} <=> $result{$a}} keys(%result)) {
        my $val = $result{$idx};
        print "Matched: ".$db->getkey($idx)." (score $val)\n";
    }

    # Set database variables
    $db->setvar('variable', "fetch success!\n");
    print $db->getvar('variable');

    # Get all records: the optional 0 says we want an array of keys
    print "These records are indexed:\n";
    print join(',', $db->getkeys(0));

    # Alternatively, get it with its internal index number
    my %allkeys = $db->getkeys(1);
OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism specifically designed for multi-byte encodings, e.g. Big5 or UTF-8.
It uses DB_File to create an associative mapping from each character to the character that follows it, utilizing DB_BTREE's duplicate key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on the query result.
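As an illustration of the consecutive-character idea only, here is a minimal in-memory sketch. It does not use DB_File and does not reflect the module's actual on-disk format; the names `%index`, `index_doc` and `docs_with_pair` are hypothetical, and a plain list under each character stands in for B-tree duplicate keys. ASCII text is used for brevity; multi-byte text would need to be decoded into characters first.

```perl
use strict;
use warnings;

# In-memory stand-in for the on-disk B-tree: each character maps to a list
# of [next_character, document_key] pairs. The list plays the role of
# duplicate keys; this is an illustration, not the module's storage format.
my %index;

# Record every adjacent character pair of $text under the document $key.
sub index_doc {
    my ($key, $text) = @_;
    my @chars = split //, $text;
    for my $i (0 .. $#chars - 1) {
        push @{ $index{ $chars[$i] } }, [ $chars[$i + 1], $key ];
    }
    return;
}

# Return the sorted keys of documents containing $first followed by $second.
sub docs_with_pair {
    my ($first, $second) = @_;
    my %seen;
    for my $entry (@{ $index{$first} || [] }) {
        $seen{ $entry->[1] } = 1 if $entry->[0] eq $second;
    }
    return sort keys %seen;
}

index_doc('Doc1', 'abcd');
index_doc('Doc2', 'xbcy');

print join(',', docs_with_pair('b', 'c')), "\n";   # Doc1,Doc2
print join(',', docs_with_pair('a', 'b')), "\n";   # Doc1
```

A lookup only scans the entries stored under the first character of the pair, which is the property that grouping duplicates under one key is meant to provide.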
This module also supports a distributed database option, which optimizes each query to access only a small portion of the database.
Although this module currently supports only the Big5 encoding internally, you can override the parse.c module for extensions, or add your own translation maps.
The constructor method; normally only needs the first argument.
Parses $content into two-word chunks, stored as keys in %words, with values equal to their occurrence counts multiplied by $weight (defaults to 1). May also be invoked as a normal function without $self.
Returns the hash (or hash reference in scalar context) representing the parsed words and frequency.
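The counting and weighting described above can be sketched in pure Perl. This is a hypothetical analogue, not the module's parser: `count_pairs` is an invented name, it chunks by adjacent characters, and it makes no attempt at Big5 handling or whitespace rules. It mirrors the documented return convention (hash in list context, hash reference in scalar context) and accepts an optional hash reference to accumulate into.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl analogue of the parsing step: count adjacent
# character pairs and add $weight (default 1) for each occurrence.
# An optional hash reference lets a later call accumulate into earlier results.
sub count_pairs {
    my ($content, $weight, $words) = @_;
    $weight = 1 unless defined $weight;
    $words ||= {};
    my @chars = split //, $content;
    for my $i (0 .. $#chars - 1) {
        $words->{ $chars[$i] . $chars[$i + 1] } += $weight;
    }
    # Hash in list context, hash reference otherwise.
    return wantarray ? %$words : $words;
}

my $words = count_pairs('abab', 5);   # ab occurs twice, ba once
count_pairs('ab', 2, $words);         # adds 2 to the existing count for ab

print join(',', map { "$_=$words->{$_}" } sort keys %$words), "\n";   # ab=12,ba=5
```

Parsing different parts of a document with different weights and merging them into one hash is how a title, for example, could be made to count more than body text.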
Same as parse(), but implemented in XS.
Inserts an entry, given either as unparsed text in $content or as an already parsed hash in %words. The $key is the name of the entry in the database.
Returns the database ID of the newly created entry.
Perform a query on the database represented by $self; $query contains a free-form query string. The type of query is specified by $flag, as one of the constants below:
Match the query string with fuzzy scoring heuristics.
Match the exact string $query.
Match each individual character fuzzily, in addition to normal fuzzy matching.
Only match entries that have none of the phrases in the query string.
The %match hash, if specified, contains the result of a previous query(), and indicates that this is a subquery limited by the previous search.
Returns the hash (or hash reference in scalar context) containing the matched entry IDs as keys, and their scores as values.
Synchronize the in-memory records to disk.
Sets a user-defined variable in the database. Such variables do not affect operations on the database.
Returns the value of a previously set variable, or undef if no such variable exists.
Get all variables beginning with $partial; returns an array of the variable names, or a hash with the variable values as hash values if $wanthash is specified.
Returns the name of the entry with $seq as the ID, or undef if there is no such entry. Usually called after a query() to fetch the matched entries.
Find the ID of the entry with the name $key; the reverse operation of getkey().
Delete the entry with name $key.
Delete the entry with the ID $seq. This function's name is a bit of a misnomer; sorry about that.
Return all entry names as an array, or as a hash with their IDs as hash values if $wanthash is specified.
Private function to store an internal variable to the database. Do not call this directly.
The query() function uses a time-consuming callback function _parse_q() to parse the query string; it is expected to be changed to a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to cause a significant difference.)
The MATCH_EXACT flag is misleading: FuzzyIndex cannot tell from the information stored in the index file alone whether a query matches the content exactly. You are encouraged to write your own grep-like post-filter.
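One possible shape for such a post-filter is sketched below. Everything here is an assumption for illustration: the `%content` hash is a hypothetical store of the original documents kept by the caller (the index does not hold full text), `exact_filter` is an invented name, and `%candidates` stands for a query result whose IDs have already been translated to entry names.

```perl
use strict;
use warnings;

# Hypothetical store of the original documents, keyed by entry name.
# The index does not hold full content, so the caller must supply it.
my %content = (
    Doc1 => 'Some text here',
    Doc2 => 'Some other text here',
);

# Keep only candidates whose stored content contains $phrase verbatim.
# %$candidates maps entry names to scores.
sub exact_filter {
    my ($phrase, $candidates) = @_;
    my %kept;
    for my $name (keys %$candidates) {
        my $text = $content{$name};
        next unless defined $text;
        $kept{$name} = $candidates->{$name} if index($text, $phrase) >= 0;
    }
    return %kept;
}

my %candidates = (Doc1 => 12, Doc2 => 9);   # made-up scores
my %exact = exact_filter('Some text', \%candidates);

print join(',', sort keys %exact), "\n";    # Doc1
```

The index narrows the search to a handful of candidates, so the verbatim check only has to read those documents rather than the whole collection.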
Internal handling of locale/unicode mappings
Boolean / selective search using combined MATCH_* flags
Fix bugs concerning sub_dbs, or deprecate them altogether
Use Lingua::ZH::TaBE for better word-segmenting algorithms
fzindex, fzquery, OurNet::ChatBot
Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
Copyright 2001, 2003 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html