NAME

Apache::Wyrd::Services::Index - Metadata index for word/data search engines

SYNOPSIS

    my $init = {
      file => '/var/lib/Wyrd/pageindex.db',
      strict => 1,
      attributes => [qw(author text subjects)],
      maps => [qw(subjects)]
    };
    my $index = Apache::Wyrd::Services::Index->new($init);

    my @subject_is_foobar = $index->word_search('foobar', 'subjects');

    my @pages =
      $index->word_search('+musthaveword -mustnothaveword
        other words to search for and add to results');
    foreach my $page (@pages) {
      print "title: $$page{title}, author: $$page{author}\n";
    }
    
    @pages = $index->parsed_search('(this AND that) OR "the other"');
    foreach my $page (@pages) {
      print "title: $$page{title}, author: $$page{author}\n";
    }

DESCRIPTION

General-purpose Index object for retrieving a variety of information on a class of objects. The objects can be of any type, but must implement, at a minimum, the Apache::Wyrd::Interfaces::Indexable interface.

The stored information is broken down into attributes. The main built-in attributes, which cannot be overridden, are data, word, title, and description, along with four internal attributes: reverse, timestamp, digest, and count. Additional attributes are specified via the hashref argument to the new method (see below). At most 254 attributes are allowed in total; when reversemaps is turned on, each map attribute counts as two attributes toward that limit.

Attributes are of two types, regular or map, and both relate to the main index, id. A regular attribute stores information on a one-id-to-one-attribute basis, such as title or description. A map attribute provides a reverse lookup from a token to the items associated with it, such as words in a document or subjects covered by documents; this makes it possible to find documents with the word "foo" in them or items classified as "bar". One built-in map exists, word, which reverse-indexes every word of the attribute data.
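The difference between the two attribute types can be pictured with plain Perl hashes. This is only an illustrative in-memory sketch: the variable names, the sample titles, and the token => { id => count } layout are assumptions made for the example, not the module's actual on-disk format.

```perl
use strict;
use warnings;

# Regular attribute (hypothetical data): exactly one value per entry id.
my %title = (
    1 => 'Intro to Foo',
    2 => 'Bar for Experts',
);

# Map attribute (hypothetical data): each token points back to the ids
# of the entries it belongs to, here with an occurrence count per id.
my %subjects = (
    foo => { 1 => 1 },
    bar => { 1 => 1, 2 => 1 },
);

# Reverse lookup: which entry ids are filed under a given token?
sub ids_for_token {
    my ($map, $token) = @_;
    my @ids = sort { $a <=> $b } keys %{ $map->{$token} || {} };
    return @ids;
}

# List the titles of everything classified as "bar".
my @bar_ids = ids_for_token(\%subjects, 'bar');
print join(', ', map { $title{$_} } @bar_ids), "\n";
```

A regular attribute answers "what is the title of entry 2?", while a map attribute answers "which entries are filed under bar?".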

The Index is meant to be used as a storage for meta-data about web pages, and in this capacity, data and word provide the exact match and word-search capacity respectively.

The internal attributes digest and timestamp are used to determine whether the stored information for an item is fresh. It is assumed that testing a timestamp is faster than producing a digest, and that producing a digest is faster than re-indexing a document, so both criteria are checked before an entry for a given item is updated. See update_entry. The count attribute keeps the total word count for an indexed item, for use in balancing the relative value of the results returned from a word search.
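The cheapest-test-first ordering described above might look roughly like the following. This is a minimal sketch and not the module's code: needs_reindex is a hypothetical helper, the stored entry is assumed to be a hashref with timestamp and digest keys, and SHA-1 from the core Digest::SHA module stands in for whatever digest the indexed object actually supplies.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: decide whether an item must be re-indexed.
#   $stored      - hashref with 'timestamp' and 'digest', or undef if new
#   $timestamp   - the item's current modification time
#   $get_content - code ref returning the content; only called if needed
sub needs_reindex {
    my ($stored, $timestamp, $get_content) = @_;
    return 1 unless $stored;                        # never indexed before

    # Cheapest test first: an unchanged timestamp means nothing to do.
    return 0 if $stored->{timestamp} == $timestamp;

    # More expensive: the content must be read to compute a digest.
    my $digest = sha1_hex($get_content->());
    return 0 if $stored->{digest} eq $digest;       # touched, not changed

    return 1;                                       # content really changed
}

my $stored = { timestamp => 100, digest => sha1_hex('hello') };
print needs_reindex($stored, 200, sub { 'hello, world' })
    ? "re-index\n"
    : "fresh\n";
```

Passing the content as a code ref keeps the expensive read out of the common case where the timestamp alone settles the question.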

The information in the Index is stored in a Berkeley DB, using the BerkeleyDB::Btree perl module. Because several Apache daemons in a pool of servers may use the index concurrently, it is important that this be a reasonably current version of BerkeleyDB which supports locking and read-during-update. This module was developed using Berkeley DB v. 3.3-4.1 on Darwin and Linux. Your results may vary.

When used with Berkeley DB versions above 4, Index will invoke concurrency and not locking.

Use with very large numbers of large documents is not recommended, but a reasonably large web site (hundreds of 1000-word pages) can be indexed and searched reasonably quickly(TM) on most cheap servers as of this writing. All hail Moore's Law.

METHODS

(format: (returns) name (arguments after self))

(Apache::Wyrd::Services::Index) new (hashref)

Create a new Index object, creating the associated DB file if necessary. The index is configured via a hashref argument. Important keys for this hashref:

file

Absolute path and filename for the DB file. Must be writable by the Apache process.

strict

Die on errors. Default 1 (yes).

quiet

If not strict, be quiet in the error log about problems. Use at your own risk.

concurrency

Use the underlying Concurrent Data Store application via BerkeleyDB when using Sleepycat Berkeley DB version 4 or higher. Default 0 (no).

transactions

Use the underlying transaction support via BerkeleyDB when using Sleepycat Berkeley DB version 4 or higher. Default 0 (no).

Note that concurrency and transactions are mutually exclusive options. If neither is specified, locking is used instead, to prevent separate Apache processes from trouncing each other's updates.

bigfile

By default (0), the data attribute is stored in the same DB file as the rest of the data. If this option is set to 1, the data attribute of the indexed objects is stored in a separate DB file. This may allow some lookups to be performed faster. The name of this file is derived from the file option, above, by adding "_big" at the end of the filename, but before the ".db" extension, if present.

reversemaps

By default (0), when an indexed item is changed, its mapped elements (like the words) are purged from every word entry. This is usually very CPU-intensive. This option tracks a reverse index on the map so that this purge can be done as quickly as possible. However, it doubles the space used to store mapped attributes, causing an overall, but usually smaller, speed decrease.

dirty

This is another potential speed increase, off (0) by default. When a purge is required, the data is not removed from the mapped attributes. Rather, a new reference is made for the entry and the previous references are removed. If, however, map data in the Index object is accessed directly via the db method and not through search/word_search, erroneous data will result unless "nameless" data is removed from the results. Therefore, reversemaps is the preferred method, if not the fastest.

attributes

Arrayref of attributes other than the default to use. For every attribute foo, an index_foo method should be implemented by the object being indexed. The value returned by this method will be stored under the attribute foo.

maps

Arrayref of which attributes to treat as maps. Any attribute that is a map must also be included in the list of attributes.
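Two of the option rules above are easy to state in code. The helpers below, bigfile_name and access_mode, are hypothetical names invented for this sketch and are not part of the module's interface. Dying when both concurrency and transactions are set is likewise an assumption about how the mutual exclusion could be enforced.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the separate "big" filename from the
# 'file' option by inserting "_big" before a trailing ".db", or by
# appending it when there is no ".db" extension.
sub bigfile_name {
    my ($file) = @_;
    my $big = $file;
    $big .= '_big' unless $big =~ s/\.db\z/_big.db/;
    return $big;
}

# Hypothetical helper: pick exactly one access mode from the init hashref.
sub access_mode {
    my ($init) = @_;
    die "concurrency and transactions are mutually exclusive\n"
        if $init->{concurrency} && $init->{transactions};
    return 'concurrency'  if $init->{concurrency};
    return 'transactions' if $init->{transactions};
    return 'locking';     # the fallback when neither option is given
}

print bigfile_name('/var/lib/Wyrd/pageindex.db'), "\n";
print access_mode({ transactions => 1 }), "\n";
```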

(void) delete_index (void)

Zero all data in the index and open a new one.

(scalar) update_entry (Apache::Wyrd::Interfaces::Indexable ref)

Called by an indexable object, passing itself as the argument, in order to update its entry in the index. This method calls index_foo for every attribute foo in the index, storing the returned value under that attribute in the object's entry. The method always returns a message about the process.
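The index_foo naming convention can be illustrated with a small stand-alone sketch. My::Page and collect_attributes are hypothetical names made up for the example; the real update_entry does considerably more (freshness checks, map handling, storage), so this shows only the method dispatch.

```perl
use strict;
use warnings;

# Hypothetical indexable class: one index_<attribute> method per attribute.
package My::Page;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
sub index_name   { $_[0]{path} }
sub index_title  { $_[0]{title} }
sub index_author { $_[0]{author} }

package main;

# Gather a value for each attribute by calling the matching index_*
# method; attributes the object does not implement are skipped.
sub collect_attributes {
    my ($object, @attributes) = @_;
    my %entry;
    foreach my $attribute (@attributes) {
        my $method = "index_$attribute";
        next unless $object->can($method);
        $entry{$attribute} = $object->$method();
    }
    return \%entry;
}

my $page = My::Page->new(
    path   => '/about.html',
    title  => 'About',
    author => 'A. Author',
);
my $entry = collect_attributes($page, qw(name title author subjects));
print "$_: $entry->{$_}\n" for sort keys %$entry;
```

Skipping unimplemented methods is a choice made for the sketch; a strict index might prefer to report them as errors.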

update_entry will always check index_timestamp and index_digest. If the stored value and the returned value agree on either attribute, the index will not be updated. This behavior can be overridden by returning a true value from method force_update.

Index will also check for an index_runtime_flags method and call it to determine if the indexed object is attempting to modify the behavior of the update during the process of updating for debugging purposes. Currently, it recognizes the following flags:

debug/nodebug

Turn on debugging messages for the course of this update, even if debug is not specified in the arguments to new.

nodata

Avoid processing, word-mapping, and storing the data attribute.

(hashref) entry_by_name (scalar)

Given the value of a name attribute, returns a hashref of all the regular attributes stored for that entry.

(scalar) clean_html (scalar)

Given a string of HTML, this method strips out all tags, comments, etc., and returns only clean lowercase text for breaking down into tokens.
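As a rough idea of what such a clean-up involves, here is a regex-based sketch. It is not the module's implementation: clean_html_sketch is a hypothetical function, it ignores character entities, and regular expressions are only an approximation of real HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical approximation of an HTML-to-text clean-up.
sub clean_html_sketch {
    my ($html) = @_;
    $html =~ s/<!--.*?-->/ /gs;                        # drop comments
    $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;    # drop scripts/styles and their contents
    $html =~ s/<[^>]*>/ /g;                            # drop remaining tags
    $html =~ s/\s+/ /g;                                # collapse whitespace
    $html =~ s/^\s+|\s+$//g;                           # trim both ends
    return lc $html;                                   # lowercase for tokenizing
}

print clean_html_sketch('<h1>Hello</h1> <p>Indexed <b>World</b></p>'), "\n";
```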

(array) word_search (scalar, [scalar])

Returns entries matching tokens in a string within a given map attribute. Because a map attribute stores single tokens, such as words, against which all entries are indexed, the string is broken into tokens before processing. Commas and whitespace delimit the tokens unless they are enclosed in double quotes.

If a token begins with a plus sign (+), results must contain the word; if it begins with a minus sign (-), they must not. These signs can also be placed to the left of phrases enclosed in double quotes.

Results are returned as an array of hashrefs ranked by "score". The key "score" is added to each hash and holds the number of matches for that entry. All other regular attributes of the indexed object appear as keys of each returned hash, with the stored values.

The default map to use for this method is 'word'. If the optional second argument is given, that map will be used.
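The tokenizing and ranking rules above can be sketched against an in-memory map. Everything here is an assumption made for illustration: parse_query and word_search_sketch are hypothetical functions, the map is a plain token => { id => count } hash, and a quoted phrase is simply looked up as one token, which is a simplification.

```perl
use strict;
use warnings;

# Split a query on commas and whitespace, keeping double-quoted phrases
# whole. A leading + marks a token as required, a leading - as excluded.
sub parse_query {
    my ($query) = @_;
    my (@must, @must_not, @optional);
    while ($query =~ /([+-]?)(?:"([^"]*)"|([^\s,"]+))/g) {
        my $sign  = $1;
        my $token = lc(defined $2 ? $2 : $3);
        next unless length $token;
        if    ($sign eq '+') { push @must,     $token }
        elsif ($sign eq '-') { push @must_not, $token }
        else                 { push @optional, $token }
    }
    return (\@must, \@must_not, \@optional);
}

# Score entries by summed match counts, then apply the +/- filters.
sub word_search_sketch {
    my ($map, $query) = @_;
    my ($must, $must_not, $optional) = parse_query($query);
    my %score;
    foreach my $token (@$must, @$optional) {
        my $hits = $map->{$token} or next;
        $score{$_} += $hits->{$_} for keys %$hits;
    }
    foreach my $token (@$must) {        # keep only entries with every + token
        my $hits = $map->{$token} || {};
        delete @score{ grep { !exists $hits->{$_} } keys %score };
    }
    foreach my $token (@$must_not) {    # drop entries with any - token
        my $hits = $map->{$token} || {};
        delete @score{ keys %$hits };
    }
    return map { +{ id => $_, score => $score{$_} } }
           sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

my %word = (
    apple => { page_a => 2, page_b => 1 },
    pie   => { page_a => 1, page_c => 3 },
    worm  => { page_b => 1 },
);
print "$_->{id}: $_->{score}\n" for word_search_sketch(\%word, '+apple pie -worm');
```

The real method returns the entry's regular attributes alongside the score; the sketch returns only an id and a score to keep the ranking logic visible.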

(array) search (scalar, [scalar])

Alias for word_search. Required by Apache::Wyrd::Services::SearchParser.

(array) parsed_search (scalar, [scalar])

Same as word_search, but with support for the logical qualifiers AND, OR, NOT and DIFF. More complex searches can be accomplished this way, at a cost in speed proportional to the complexity of the logical phrase. See Apache::Wyrd::Services::SearchParser for a description of this type of search.

AUTHOR

Barry King <wyrd@nospam.wyrdwright.com>

SEE ALSO

Apache::Wyrd

General-purpose HTML-embeddable perl object

Apache::Wyrd::Interfaces::Indexable

Methods to be implemented by any item that wants to be indexed.

Apache::Wyrd::Services::SearchParser

Parser for handling logical searches (AND/OR/NOT/DIFF).

LICENSE

Copyright 2002-2007 Wyrdwright, Inc. and licensed under the GNU GPL.

See LICENSE under the documentation for Apache::Wyrd.