NAME

DBIx::TextSearch

SYNOPSIS

Database-independent modules to index and search text and HTML files. Supports indexing local files and fetching files by HTTP and FTP.

 use DBIx::TextSearch;
 use DBIx::TextSearch::Pg; # to use PostgreSQL - other drivers available

 $dbh = DBI->connect(...); # see the DBD documentation

 $index = DBIx::TextSearch->new($dbh,
                               'index_name',
                               {debug => 1});

 $index = DBIx::TextSearch->open($dbh,
                                 'index_name',
                                 {debug => 1});

 # uri is file:/// ftp:// or http://
 $index->index_document(uri => $location);

 # $results is a ref to an array of hashrefs
 $results = $index->find_document(query => 'foo bar',
                                  parser => 'simple');
 $results = $index->find_document(query => 'foo and not bar',
                                  parser => 'advanced');

 foreach my $doc (@$results) {
     print "Title: ", $doc->{title}, "\n";
     print "Description: ", $doc->{description}, "\n";
     print "Location: ", $doc->{uri}, "\n";
 }

 $index->delete_document('http://localhost/foo.txt');

 $index->flush_index(); # clear the index

DESCRIPTION

DBIx::TextSearch consists of an abstraction layer (TextSearch.pm) providing a set of standard routines to index text and HTML files. These routines interface to a set of database-specific routines (not separately documented) in much the same way as the Perl DBI and DBD::foo modules do.

CURRENT DRIVERS

  • DBIx::TextSearch::Pg - PostgreSQL driver module

  • DBIx::TextSearch::DB2 - IBM DB2 support (untested)

  • DBIx::TextSearch::Sybase - Sybase and Microsoft SQL Server

METHODS

All methods return a true value on success, undef on failure.

new

 $index = DBIx::TextSearch->new($dbh,
                                'index_name',
                                {debug => 1});

Create a new index on the database referenced by $dbh. The database must exist.

debug is an optional parameter. When set, additional debugging information is written to STDOUT.

open

 $index = DBIx::TextSearch->open($dbh,
                                 'index_name',
                                 {debug => 1});

Connect to an existing index, options as per new() above.

index_document

Given a file:///, http:// or ftp:// URI, fetch and index the document.

For each document, this method stores the document URI, title, description, keywords (HTML only, taken from <meta name="keywords">), contents and modification time. If the URI points to an HTML file, the title is taken from the contents of the <title> tag and the description from the contents of <meta name="description">. The HTML tags are removed before the document is stored. If the URI points to plain text (i.e. not HTML), the title is the first non-blank line and the description is the next paragraph (terminated by two newlines).
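The plain-text rule above (first non-blank line as title, following paragraph as description) can be sketched in a few lines of core Perl. This is only an illustration of the rule as described, not the module's own code. The helper name extract_title_and_description and the choice to join the description into a single line are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBIx::TextSearch: split plain text into
# a title (first non-blank line) and a description (the paragraph after it).
sub extract_title_and_description {
    my ($text) = @_;

    # Drop leading blank lines so the first remaining line is the title.
    $text =~ s/\A(?:[ \t]*\r?\n)+//;

    # The title is everything up to the first newline.
    my ($title, $rest) = split /\r?\n/, $text, 2;
    $title = '' unless defined $title;
    $rest  = '' unless defined $rest;

    # Skip blank lines between the title and the next paragraph.
    $rest =~ s/\A(?:[ \t]*\r?\n)+//;

    # The description runs until the first blank line (two newlines).
    my ($description) = split /\r?\n[ \t]*\r?\n/, $rest, 2;
    $description = '' unless defined $description;

    # Assumption for this sketch: join wrapped lines into one line.
    $description =~ s/\s*\r?\n\s*/ /g;
    $description =~ s/\s+\z//;

    return ($title, $description);
}

my $text = "\nRelease notes\n\nThis file lists the changes\nin each release.\n\nMore text.\n";
my ($title, $description) = extract_title_and_description($text);
print "Title: $title\n";
print "Description: $description\n";
```

Text with no second paragraph simply yields an empty description.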

index_document compares the file's modification time against the modification time for that URI stored in the index, and will only index a document if that document is not already in the index, or if the document is more recent than the indexed copy.
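The decision described above reduces to a small comparison. The sketch below assumes both times are available as epoch seconds and uses a hypothetical helper, needs_indexing, which is not part of the module. Note that the changelog entry for 0.2 mentions MD5 checksums, so the check the module actually performs may differ from this timestamp-based illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBIx::TextSearch: decide whether a
# document should be (re)indexed. $indexed_mtime is the modification time
# stored in the index (undef if the URI has never been indexed) and
# $document_mtime is the document's current modification time, both in
# epoch seconds. For a local file the latter is (stat $path)[9].
sub needs_indexing {
    my ($indexed_mtime, $document_mtime) = @_;

    return 1 unless defined $indexed_mtime;            # not in the index yet
    return $document_mtime > $indexed_mtime ? 1 : 0;   # newer than indexed copy
}

my $now = time;
print "new document:   ", needs_indexing(undef, $now), "\n";
print "unchanged:      ", needs_indexing($now, $now), "\n";
print "modified since: ", needs_indexing($now - 60, $now), "\n";
```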

For file:/// URIs, you need to include the full (absolute) path.

FTP passwords

Pass the username and password in the ftp URI as shown here: ftp://user:password@foo.bar.com/wibble/barf.txt

Sample URIs

 file:///usr/doc/HOWTO/en-html/index.html
 /usr/doc/HOWTO/en-html/index.html
 http://www.foo.bar.com/
 ftp://foo.bar.com/wibble/barf.txt # anonymous - uses local email
                                   # address as password
 ftp://user:password@foo.bar.com/wibble/barf.txt
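To make the two FTP forms above concrete, here is a minimal sketch of how the login details could be separated from the rest of an ftp:// URI using core Perl only. The helper parse_ftp_uri is hypothetical and is not how the module itself is known to work. It also does not decode percent-escapes, so it is an illustration rather than a full URI parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBIx::TextSearch: pull the login details
# out of an ftp:// URI. Returns (user, password, host, path). User and
# password are undef for an anonymous URI. Returns an empty list if the
# string is not an ftp:// URI.
sub parse_ftp_uri {
    my ($uri) = @_;

    my ($userinfo, $host, $path) =
        $uri =~ m{\A ftp:// (?: ([^/@]*) @ )? ([^/@]+) (/.*)? \z}xi
        or return;

    my ($user, $password);
    ($user, $password) = split /:/, $userinfo, 2 if defined $userinfo;

    return ($user, $password, $host, defined $path ? $path : '/');
}

for my $uri ('ftp://foo.bar.com/wibble/barf.txt',
             'ftp://user:password@foo.bar.com/wibble/barf.txt') {
    my ($user, $password, $host, $path) = parse_ftp_uri($uri);
    printf "%s -> host=%s path=%s user=%s\n",
        $uri, $host, $path, defined $user ? $user : '(anonymous)';
}
```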

find_document

This method takes 2 parameters:

query

A boolean query string as per Text::Query::ParseSimple or Text::Query::ParseAdvanced (an AltaVista-style query).

parser

Either simple or advanced to use either Text::Query::ParseSimple or Text::Query::ParseAdvanced to parse the query.

find_document returns a reference to an array of hash references. The hash keys are uri, title and description.

The number of documents found by the last query is returned by $index->matches().

The following code prints information on every document matching a query:

 my $results = $index->find_document(query  => 'zot or grault',
                                     parser => 'advanced');

 foreach my $doc (@$results) {
     print "Title: ", $doc->{title}, "\n";
     print "Description: ", $doc->{description}, "\n";
     print "Location: ", $doc->{uri}, "\n";
 }

 print $index->matches(), " results found\n";

flush_index

Remove all stored document data from the index, leaving the index tables intact.

delete_document

Given a URI, remove that document from the index.

SEE ALSO

Text::Query::ParseAdvanced, Text::Query::ParseSimple, DBI

Interested in developing your own driver modules for DBIx::TextSearch?

The interfaces are all documented in DBIx::TextSearch::developing.

AUTHOR

Stephen Patterson <steve@patter.mine.nu> http://patter.mine.nu/

CHANGELOG

0.3

  • Added MySQL and DB2 support (DB2 is untested as I don't have access to any systems capable of creating DB2 databases).

0.2

  • Creating a new index checks existing table names to avoid overwriting them.

  • Switched from using timestamps to MD5 checksums to check if a document differs from the indexed version.

0.1

  • Initial release