NAME

MMM::Text::Search - Perl module for indexing and searching text files and web objects

SYNOPSIS

  use MMM::Text::Search;
	  
  my $srch = new MMM::Text::Search {	# for indexing...
	# main index file location...
		IndexPath => "/tmp/myindex.db",
	# local files... (optional)
		FileMask  => '(?i)(\.txt|\.htm.?)$',
		Dirs	  => [ "/usr/doc", "/tmp" ],
		FollowSymLinks => 0|1,		# default = 0
	# web objects... (optional)
		URLs	  => [ "http://localhost/", ... ],
		Level	  => recursion-level,	# 0 = unlimited
	# common options...
		IgnoreLimit => 0.3,		# default = 2/3
		Verbose => 0|1
  	};
  
  $srch->start_indexing_session();
	
  $srch->commit_indexing_session();
  
  $srch->index_default_locations();
        
  $srch->index_content( { title =>   '...', 
		    	  content=>  '...', 
		    	  id =>      '...'  } );
	 
  $srch->makeindex;	# obsolete

  my $srch = new MMM::Text::Search (  #for searching....
		  "/tmp/myindex.db", verbose_flag );
  
  my $hashref = $srch->query("pizza","ciao", "-pasta" );  
  my $hashref = $srch->advanced_query("(pizza OR ciao) AND NOT pasta");  

  $srch->errstr();	# returns the last error
			# (only query syntax errors, for the time being)

  
  $srch->dump_word_stats(\*FH)	

DESCRIPTION

  • Indexing

    When a session is closed the following files will have been created (assuming IndexPath = /path/myindex.db, see constructor):

    /path/myindex.db             word index database
    /path/myindex-locations.db   filename/URL database
    /path/myindex-titles.db      HTML title database
    /path/myindex.stopwords      stop-words list
    /path/myindex.filelist       readable list of indexed files/URLs
    /path/myindex.deadlinks      broken HTTP links
    
    [... lots of important things missing ... ]

    start_indexing_session() starts session.

    commit_indexing_session() commits and closes current session.

    index_default_locations() indexes all files and URLs specified on construction.

    index_content() pushes content into the indexing engine. The argument must be a hash reference with the following structure:

    { title =>   '...', content=>  '...', id =>      '...'  }
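    As a rough illustration of how index_content() might be combined with the session methods above, the following sketch feeds a list of documents through a single session. The helper name index_documents is hypothetical and not part of the module, and calling index_content() between start_indexing_session() and commit_indexing_session() is an assumption based on the makeindex() equivalence, not something this page states explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MMM::Text::Search): index a list of
# documents in one session. $srch is expected to be an object constructed
# for indexing; each document is a hashref with title, content and id keys.
sub index_documents {
    my ($srch, @docs) = @_;

    # Validate everything first so a bad record never leaves a session open.
    for my $doc (@docs) {
        for my $key (qw(title content id)) {
            die "document is missing '$key'\n" unless defined $doc->{$key};
        }
    }

    # Assumption: index_content() belongs inside a start/commit pair.
    $srch->start_indexing_session();
    for my $doc (@docs) {
        $srch->index_content({
            title   => $doc->{title},
            content => $doc->{content},
            id      => $doc->{id},
        });
    }
    $srch->commit_indexing_session();

    return scalar @docs;    # number of documents pushed
}
```

    With an object built as in the SYNOPSIS, a call could look like index_documents($srch, { title => 'Menu', content => 'pizza pasta', id => 'doc-1' }), where the document values are made up for illustration.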

    makeindex() is obsolete. It is equivalent to:

    $srch->start_indexing_session();
    $srch->index_default_locations();
    $srch->commit_indexing_session();

    dump_word_stats(\*FH) dumps all words, sorted by occurrence frequency, to the FH file handle (or to STDOUT if no parameter is given). Stop-words get a frequency value of 1.

  • Searching

    Both query() and advanced_query() return a reference to a hash with the following structure:

    {
     ignored  => [ string, string, ... ],   # ignored words
     searched => [ string, string, ... ],   # words searched for
     entries  => [ hashref, hashref, ... ]  # list of records found
    }

    The 'entries' element is a reference to an array of hashes, each having the following structure:

    {
     location => string,  # file path, URL, or anything else
     score    => number,  # score
     title    => string   # HTML title
    }
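    The result structure above can be walked with ordinary hash and array dereferencing. The sketch below is only an illustration: format_results is a hypothetical helper, the sample data is made up in the documented shape rather than taken from a real query, and sorting on the assumption that a higher score means a better match is a guess, since this page does not say how scores are ordered.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MMM::Text::Search): turn a result hash,
# shaped like the return value of query() or advanced_query(), into
# printable lines. Assumption: a higher score means a better match.
sub format_results {
    my ($result) = @_;
    my @lines;

    my @entries = sort { $b->{score} <=> $a->{score} }
                  @{ $result->{entries} || [] };

    for my $e (@entries) {
        # Plain-text files may have no HTML title, so fall back to a label.
        my $title = (defined $e->{title} && length $e->{title})
                  ? $e->{title}
                  : '(untitled)';
        push @lines, sprintf("%6.2f  %s  [%s]", $e->{score}, $title, $e->{location});
    }

    my @ignored = @{ $result->{ignored} || [] };
    push @lines, "ignored: " . join(', ', @ignored) if @ignored;

    return @lines;
}

# Made-up sample data standing in for a real query result.
my $sample = {
    ignored  => [ 'the' ],
    searched => [ 'pizza', 'ciao' ],
    entries  => [
        { location => '/usr/doc/menu.txt',       score => 1.5, title => '' },
        { location => 'http://localhost/a.html', score => 4,   title => 'Pizza' },
    ],
};

print "$_\n" for format_results($sample);
```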

NOTES

Note on implementation: the indexing technique is substantially derived from the one described by Tim Kientzle in Dr. Dobb's Journal.

BUGS

Many, I guess.

AUTHOR

Max Muzi <maxim@comm2000.it>

SEE ALSO

perl(1).
