NAME
OurNet::FuzzyIndex - Inverted search for double-byte characters
SYNOPSIS
    use OurNet::FuzzyIndex;

    my $idxfile  = 'test.idx'; # Name of the database file
    my $pagesize = undef;      # Page size (twice the size of an average record)
    my $cache    = undef;      # Cache size (undef to use default)
    my $subdbs   = 0;          # Number of child dbs; 0 for none

    # Initiate the DB from scratch
    unlink $idxfile if -e $idxfile;
    my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);

    # Index a record: key = 'Doc1', content = 'Some text here'
    $db->insert('Doc1', 'Some text here');

    # Alternatively, parse the content first with different weights
    my %words = $db->parse("Some other text here", 5);
    %words = $db->parse_xs("Some more texts here", 2, \%words);

    # Then index the resulting hash with 'Doc2' as its key
    $db->insert('Doc2', %words);

    # Perform a query: the 2nd argument is the match-type flag
    my %result = $db->query('search for some text', $MATCH_FUZZY);

    # Combine the result with another query
    %result = $db->query('more please', $MATCH_NOT, \%result);

    # Dump the results; note you have to call $db->getkey each time
    foreach my $idx (sort { $result{$b} <=> $result{$a} } keys(%result)) {
        my $val = $result{$idx};
        print "Matched: " . $db->getkey($idx) . " (score $val)\n";
    }

    # Set database variables
    $db->setvar('variable', "fetch success!\n");
    print $db->getvar('variable');

    # Get all records: the optional 0 says we want an array of keys
    print "These records are indexed:\n";
    print join(',', $db->getkeys(0));

    # Alternatively, get it with its internal index number
    my %allkeys = $db->getkeys(1);
DESCRIPTION
OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism specifically designed for multi-byte encodings, e.g. Big5 or UTF-8.
It uses DB_File to create an associative mapping from each character to the one that follows it, utilizing DB_BTREE's duplicate key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on the query's result.
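The character-to-successor mapping can be pictured with a small standalone script. This is only a sketch of the idea, not the module's actual on-disk layout: the helper name build_successor_index is hypothetical, the database lives in memory, and plain ASCII letters stand in for double-byte characters.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;

# Hypothetical helper: map every character of $text to the character that
# follows it, keeping duplicates, and return the DB_File object.
sub build_successor_index {
    my ($text) = @_;

    my $btree = DB_File::BTREEINFO->new;
    $btree->{flags} = R_DUP;    # allow several values under one key

    my %index;
    # An undef filename asks DB_File for an in-memory database.
    my $db = tie %index, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $btree
        or die "Cannot create in-memory BTREE: $!";

    my @chars = grep { !/\s/ } split //, $text;
    for my $i (0 .. $#chars - 1) {
        # With R_DUP set, each store adds a record instead of overwriting.
        $index{ $chars[$i] } = $chars[ $i + 1 ];
    }
    return $db;
}

my $db = build_successor_index('abcad');

# get_dup() in list context returns every value stored under the key.
my @after_a = sort $db->get_dup('a');
my @after_b = sort $db->get_dup('b');
print "a is followed by: @after_a\n";
print "b is followed by: @after_b\n";
```

Looking up one character then yields all of its recorded successors in a single BTREE probe, which is the property the duplicate-key feature provides.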
This module also supports a distributed-database option, which optimizes each query to access only a small portion of the database.
Although this module currently only supports the Big5 encoding internally, you could override the parse.c module for extensions, or add your own translation maps.
METHODS
OurNet::FuzzyIndex->new($dbfile, [ $pagesize, $cachesize, $split, $submin, $submax ])
The constructor method; normally only needs the first argument.
$self->parse($content, [$weight], [\%words])
Parses $content into two-word chunks, stored as keys in %words, with values equal to their occurrence counts multiplied by $weight (defaults to 1). May also be invoked as a normal function without $self.
Returns the hash (or hash reference in scalar context) representing the parsed words and frequency.
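The weighting and merging behaviour described above can be approximated in pure Perl. The function bigram_counts below is a hypothetical stand-in, not the module's parser: it treats every character as one unit and assumes pairs never span whitespace, whereas the real tokenizer in parse.c may segment text differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parse(): count adjacent character pairs,
# add $weight per occurrence, and merge into an optional existing hash.
sub bigram_counts {
    my ($content, $weight, $words) = @_;
    $weight = 1 unless defined $weight;
    my %words = $words ? %$words : ();

    # Assumption: whitespace separates runs, and pairs never span it.
    for my $run (split /\s+/, $content) {
        my @chars = split //, $run;
        for my $i (0 .. $#chars - 1) {
            $words{ $chars[$i] . $chars[ $i + 1 ] } += $weight;
        }
    }
    # Mirror the documented return convention: hash or hash reference.
    return wantarray ? %words : \%words;
}

my %words = bigram_counts('abab', 5);       # "ab" twice, "ba" once
%words = bigram_counts('ab', 2, \%words);   # merge a second text at weight 2
printf "%s => %d\n", $_, $words{$_} for sort keys %words;
```

Passing the hash from the first call into the second is what lets one entry combine several pieces of text with different weights, for example a title weighted more heavily than a body.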
$self->parse_xs($content, [$weight], [\%words])
Same as parse(), but implemented in XS.
$self->insert($key, [$content | \%words])
Inserts an entry, given either as unparsed text in $content or as an already parsed hash in %words. The $key is the name of the entry in the database.
Returns the database ID of the newly created entry.
$self->query($query, $flag, [\%match])
Perform a query on the database represented by $self; $query contains a free-form query string. The type of query is specified by $flag, as one of the constants below:
- MATCH_FUZZY (default)
    Match the query string with fuzzy scoring heuristics.
- MATCH_EXACT
    Match the exact string $query.
- MATCH_PART
    Match each individual character fuzzily, in addition to normal fuzzy matching.
- MATCH_NOT
    Only match entries that have none of the phrases in the query string.
The %match hash, if specified, contains the result of a previous query(), and indicates that this is a subquery limited by the previous search.
Returns the hash (or hash reference in scalar context) containing the matched entry IDs as keys, and their scores as values.
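The effect of narrowing an earlier result with MATCH_NOT can be illustrated without the module. This is a sketch of the described semantics only, not of how query() is implemented; exclude_hits and the entry IDs and scores are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the entries of a previous result whose IDs
# do not appear in a second set of hits.
sub exclude_hits {
    my ($previous, $hits) = @_;
    my %kept = map  { $_ => $previous->{$_} }
               grep { !exists $hits->{$_} } keys %$previous;
    return wantarray ? %kept : \%kept;
}

my %previous = (1 => 30, 2 => 12, 3 => 7);   # made-up entry IDs and scores
my %unwanted = (2 => 9);                     # entries containing the excluded phrases
my %result   = exclude_hits(\%previous, \%unwanted);

# List the surviving entries, highest score first.
for my $id (sort { $result{$b} <=> $result{$a} } keys %result) {
    print "entry $id (score $result{$id})\n";
}
```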
$self->sync()
Synchronizes the in-memory records to disk.
$self->setvar($varname, $value)
Sets a user-defined variable in the database. Such variables do not affect operations on the database.
$self->getvar($varname)
Returns the value of a previously set variable, or undef if no such variable exists.
$self->getvars($partial, [$wanthash])
Get all variables beginning with $partial; returns an array of the variable names, or a hash with the variable values as hash values if $wanthash is specified.
$self->getkey($seq)
Returns the name of the entry with $seq as the ID, or undef if there is no such entry. Usually called after a query() to fetch the matched entries.
$self->findkey($key)
Find the ID of the entry with the name $key; the reverse operation of getkey().
$self->delete($key)
Delete the entry with name $key.
$self->delkey($seq)
Delete the entry with the ID $seq. This function's name is a bit of a misnomer; sorry about that.
$self->getkeys([$wanthash])
Return all entry names as an array, or as a hash with their IDs as hash values if $wanthash is specified.
$self->_store($varname, $value)
Private function to store an internal variable to the database. Do not call this directly.
CAVEATS
The query() function uses a time-consuming callback function _parse_q() to parse the query string; it is expected to be changed to a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to cause significant difference.)
The MATCH_EXACT flag is misleading; FuzzyIndex couldn't tell if a query matches the content exactly from the info stored in the index file alone. You are encouraged to write your own grep-like post filter.
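One possible shape for such a post filter is sketched below. Everything here is an assumption made for illustration: exact_filter is a hypothetical helper, and because the index file does not hold the original documents, the caller supplies a code reference that fetches the text for a given entry ID.

```perl
use strict;
use warnings;

# Hypothetical post-filter: keep only the hits whose original text really
# contains $query as a literal substring.
sub exact_filter {
    my ($query, $result, $fetch) = @_;
    my %exact;
    for my $id (keys %$result) {
        my $text = $fetch->($id);
        next unless defined $text;          # skip entries we cannot load
        $exact{$id} = $result->{$id} if index($text, $query) >= 0;
    }
    return wantarray ? %exact : \%exact;
}

# Made-up documents standing in for wherever the originals are kept.
my %docs   = (1 => 'Some text here', 2 => 'text some other', 3 => 'nothing');
my %result = (1 => 20, 2 => 15);            # as if returned by query()
my %exact  = exact_filter('Some text', \%result, sub { $docs{ $_[0] } });
print join(',', sort keys %exact), "\n";
```

In real use the fetch callback would typically call $db->getkey($id) to obtain the entry name and then load the corresponding document from wherever it is stored.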
TODO
Internal handling of locale/unicode mappings
Boolean / selective search using combined MATCH_* flags
Fix bugs concerning sub_dbs, or deprecate them altogether
Use Lingua::ZH::TaBE for better word-segmenting algorithms
SEE ALSO
fzindex, fzquery, OurNet::ChatBot
AUTHORS
Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
COPYRIGHT
Copyright 2001, 2003 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.