NAME
MMM::Text::Search - Perl module for indexing and searching text files and web objects
SYNOPSIS
use MMM::Text::Search;
my $srch = new MMM::Text::Search {    # for indexing...
    # index main file location...
    IndexPath => "/tmp/myindex.db",
    # local files... (optional)
    FileMask => '(?i)(\.txt|\.htm.?)$',
    Dirs => [ "/usr/doc", "/tmp" ],
    FollowSymLinks => 0|1,            # default = 0
    # web objects... (optional)
    URLs => [ "http://localhost/", ... ],
    Level => recursion-level,         # 0 = unlimited
    # common options...
    IgnoreLimit => 0.3,               # default = 2/3
    Verbose => 0|1
};
$srch->start_indexing_session();
$srch->commit_indexing_session();
$srch->index_default_locations();
$srch->index_content( { title => '...',
content=> '...',
id => '...' } );
$srch->makeindex;    # obsolete
my $srch = new MMM::Text::Search ( #for searching....
"/tmp/myindex.db", verbose_flag );
my $hashref = $srch->query("pizza","ciao", "-pasta" );
my $hashref = $srch->advanced_query("(pizza OR ciao) AND NOT pasta");
$srch->errstr();     # returns last error
                     # (only query syntax errors for now)
$srch->dump_word_stats(\*FH);
DESCRIPTION
Indexing
When a session is closed the following files will have been created (assuming IndexPath = /path/myindex.db, see constructor):
/path/myindex.db            word index database
/path/myindex-locations.db  filename/URL database
/path/myindex-titles.db     html title database
/path/myindex.stopwords     stop-words list
/path/myindex.filelist      readable list of indexed files/URLs
/path/myindex.deadlinks     broken http links

[... lots of important things missing ... ]
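The naming scheme in the file list above can be sketched as a small helper. This is only an illustration: index_files() is a hypothetical function, not part of MMM::Text::Search, and it assumes that IndexPath ends in ".db" as in the example; how the module names its files for other paths is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MMM::Text::Search): derive the companion
# file names documented above from an IndexPath that ends in ".db".
sub index_files {
    my ($index_path) = @_;
    (my $base = $index_path) =~ s/\.db\z//;   # strip the trailing ".db"
    return (
        $index_path,             # word index database
        "$base-locations.db",    # filename/URL database
        "$base-titles.db",       # html title database
        "$base.stopwords",       # stop-words list
        "$base.filelist",        # readable list of indexed files/URLs
        "$base.deadlinks",       # broken http links
    );
}

print "$_\n" for index_files("/path/myindex.db");
```

A list like this could be used, for instance, to check with -e which files exist after a session has been committed.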
start_indexing_session() starts session.
commit_indexing_session() commits and closes current session.
index_default_locations() indexes all files and URLs specified on construction.
index_content() pushes content into the indexing engine. The argument must be a hash reference with the following structure:
{ title => '...', content => '...', id => '...' }
makeindex() is obsolete. Equivalent to:
$srch->start_indexing_session();
$srch->index_default_locations();
$srch->commit_indexing_session();
dump_word_stats(\*FH) dumps all words sorted by occurrence frequency to the FH file handle (or to STDOUT if no parameter is specified). Stop-words get a frequency value of 1.
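As a rough sketch of how the session methods above might fit together when the content comes from memory rather than from files or URLs: the documents, their ids, and the to_record() helper below are invented for illustration. The method calls follow the SYNOPSIS; the session is only run if the module is installed, and a temporary directory is assumed to be an acceptable IndexPath location.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical in-memory documents; ids, titles and texts are made up.
my %docs = (
    'doc-1' => { title => 'Pizza', text => 'pizza margherita napoletana' },
    'doc-2' => { title => 'Pasta', text => 'pasta al pomodoro' },
);

# Hypothetical helper: build the argument structure index_content() expects.
sub to_record {
    my ($id, $doc) = @_;
    return { title => $doc->{title}, content => $doc->{text}, id => $id };
}

my @records = map { to_record($_, $docs{$_}) } sort keys %docs;

# Run the indexing session only if MMM::Text::Search is available.
if (eval { require MMM::Text::Search; 1 }) {
    my $dir  = tempdir(CLEANUP => 1);
    my $srch = MMM::Text::Search->new({ IndexPath => "$dir/myindex.db" });
    $srch->start_indexing_session();
    $srch->index_content($_) for @records;
    $srch->commit_indexing_session();
}
else {
    print "MMM::Text::Search not installed; built ",
          scalar(@records), " records only\n";
}
```

Keeping the record construction separate from the session makes it possible to inspect what will be indexed before anything is written to disk.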
Searching
Both query() and advanced_query() return a reference to a hash with the following structure:
(
  ignored  => [ string, string, ... ],    # ignored words
  searched => [ string, string, ... ],    # words searched for
  entries  => [ hashref, hashref, ... ]   # list of records found
)
The 'entries' element is a reference to an array of hashes, each having the following structure:
(
  location => string,   # file path or URL or anything
  score    => number,   # score
  title    => string    # HTML title
)
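A minimal sketch of walking such a result follows. The $result hash is mock data shaped like the structures above (all values are invented), and format_entries() is a hypothetical helper. It sorts by score on the assumption that a higher score means a better match and that entries may not arrive sorted; the documentation above states neither.

```perl
use strict;
use warnings;

# Mock result with the documented shape; the values are invented.
my $result = {
    ignored  => [ 'the' ],
    searched => [ 'pizza', 'ciao' ],
    entries  => [
        { location => '/usr/doc/a.txt',         score => 2, title => 'A' },
        { location => 'http://localhost/b.htm', score => 5, title => 'B' },
    ],
};

# Hypothetical helper: one tab-separated line per entry, highest score first.
sub format_entries {
    my ($res) = @_;
    my @sorted = sort { $b->{score} <=> $a->{score} } @{ $res->{entries} };
    return map { sprintf "%s\t%s\t%s", $_->{score}, $_->{title}, $_->{location} }
           @sorted;
}

print "searched: @{ $result->{searched} }\n";
print "ignored:  @{ $result->{ignored} }\n";
print "$_\n" for format_entries($result);
```

With a real index, $result would instead come from $srch->query(...) or $srch->advanced_query(...).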
NOTES
Note on implementation: The indexing technique is largely derived from the one described by Tim Kientzle in Dr. Dobb's Journal.
BUGS
Many, I guess.
AUTHOR
Max Muzi <maxim@comm2000.it>
SEE ALSO
perl(1).