The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Search::Kinosearch - Search Engine Library

DEPRECATED

Search::Kinosearch has been superseded by KinoSearch. Please use the new version.

SYNOPSIS

First, write an application to build a 'kindex' from your document collection. (A 'kindex' is a Kinosearch index.)

    use Search::Kinosearch::Kindexer;
    
    my $kindexer = Search::Kinosearch::Kindexer->new(
        -mainpath       => '/foo/bar/kindex',
        -temp_directory => '/foo/bar',
        );
    
    for my $field ('title', 'bodytext') {
        $kindexer->define_field(
            -name      => $field,
            -lowercase => 1,
            -tokenize  => 1,
            -stem      => 1,
            );
    }

    while (my ($title, $bodytext) = each %docs) {
        my $doc = $kindexer->new_doc( $title );
        
        $doc->set_field( title    => $title    );
        $doc->set_field( bodytext => $bodytext );
        
        $kindexer->add_doc( $doc );
    }
    
    $kindexer->generate;
    $kindexer->write_kindex;

Then, write a second application to search the kindex:

    use Search::Kinosearch::KSearch;
    use Search::Kinosearch::Query;
     
    my $ksearch = Search::Kinosearch::KSearch->new(
        -mainpath   => '/foo/bar/kindex',
        );
    
    my $query = Search::Kinosearch::Query->new(
        -string     => 'this AND NOT (that OR "the other thing")',
        -lowercase  => 1,
        -tokenize   => 1,
        -stem       => 1,
        );
    $ksearch->add_query( $query );
    $ksearch->process;
    
    while (my $hit = $ksearch->fetch_hit_hashref) {
        print "$hit->{title}\n";
    }

DESCRIPTION

Primary Features

  • Match 'any' or 'all' search terms

  • Match phrases

  • Boolean operators AND, OR, and AND NOT

  • Support for parenthetical groupings of arbitrary depth

  • Prepended +plus or -minus to require or negate a term

  • Sort results by relevance or by datetime

  • Stemming

  • Algorithmic selection of relevant excerpts

  • Hilighting of search terms in excerpts, even when stemmed

  • Fast, efficient algorithms for both indexing and searching

  • Works well with large or small document collections

  • High quality ranking algorithm based on term frequency / inverse document frequency (tf/idf)

General Introduction

Search::Kinosearch (hereafter abreviated 'Kinosearch') is a search engine library -- it handles the tricky, behind-the-scenes work involved in running a search application. It is up to you how to present the search results to the end-user.

Kinosearch has two main parts: Search::Kinosearch::Kindexer and Search::Kinosearch::KSearch (hereafter abbreviated 'Kindexer' and 'KSearch'). When you want to know which pages of a book are most relevant for a given subject (e.g. avocados, Aesop, Alekhine's Defense...) you look up the term in the book's index and it tells you which pages may be of interest to you; the 'kindex' produced by Kinosearch's Kindexer performs an analogous role -- using the interface tools provided by the KSearch module, you consult the kindex to find which documents within the document collection Kinosearch considers most relevant to your query.

[The Search::Kinosearch module itself doesn't do very much, and as an abstract base class, it does nothing on its own; this documentation is an overview which ties together the various components of the Kinosearch suite.]

The Kindexer thinks of your documents as database rows, each with as many fields as you define. HTML documents might be parsed, prepared, and fed to the Kindexer like so:

  1. Store the URL in a kindex field called 'url'.

  2. Store the text surrounded by the <title> tags in a kindex field called 'title';

  3. Isolate portions of the document that are not relevant to content (such as navigation panels or advertisements) and remove them.

  4. Strip all html tags.

  5. Store what's left in a field called 'bodytext'.

Most of the time, you will want to take advantage of three crucial functions performed by the Kindexer, executed on a per-field basis depending on the parameters passed to define_field():

-lowercase

This does exactly what you would expect - lc the text to be indexed (but leave the copy of the text to be stored intact). If you select the same option for your queries at search-time, your searches will be case-insensitive.

-tokenize

Tokenizing breaks up input text into an array of words (roughly speaking). If you tokenize the text "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb", then searches for "Strangelove" or "Bomb" will match. If you don't tokenize it, then only a search for the complete string "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb" will match.

-stem

Stemming reduces words to a root form. For instance, "horse", "horses", and "horsing" all become "hors" -- so that a search for 'horse' will also match documents containing 'horses' and 'horsing'. For more information, see the documentation for Lingua::Stem.

Once you have the completed the indexing phase, you will need a search application. Most often, this takes the form of a CGI script accessed via web browser. The interface design requirements of such an app will be familiar to anyone who's surfed the web.

Getting started

If you want to get started right away, copy the sample applications out of Search::Kinosearch::Tutorial, get them functioning, and then swap in some of your own material.

You may wish to consult the documentation for Search::Kinosearch::Kindexer, Search::Kinosearch::KSearch and Search::Kinosearch::Query before continuing on to the next section.

Fine Tuning Kinosearch

Performance Optimizations

The bottleneck in search applications is the ranking and sorting of a large number of relevant documents. Minimizing the time it takes to return results is a central design goal of Kinosearch.

The single most important optimization available for Kinosearch apps is to store either some or all of the kindex files in ram -- in particular, the files stored within the directory specified by the '-freqpath' parameter. These files contain all of the kindex data upon which search-time speed depends. Storing the other kindex files in ram won't hurt, but will not yield anywhere near the same benefit.

An additional search-time optimization is available when running Kinosearch applications under mod_perl. See the Search::Kinosearch::Kindex documentation for details.

Stoplists

A "stoplist" is collection of "stopwords": words which are common enough to be of little use when determining search results. For example, so many documents in English contain "the", "if", and "maybe" that it is best if they are blocked, not just at search time, but at index time. Removing them from the equation saves time and space, plus improves relevance.

By default, the Kindexer excludes a small list of stopwords from the kindex and KSearch reports when it detects any of them in a search query. It is possible to disable the stoplist or use a different one, if you so desire.

Phrase Matching Algorithm

Kinosearch's phrase matching implementation involves storing concatenated pairs of words in the kindex. This strategy yields a substantial performance benefit at search time (since the extremely fast hash-based algorithm used to scan for individual terms can be extended to detect phrases as well), but at the cost of increased kindex size. Blocks of text are broken into overlapping couplets, which are stored as individual terms in the kindex -- e.g. the text "brush your teeth" produces four kindex entries: "brush", "teeth", "brush your", and "your teeth" ("your" on its own is a stopword, so it is excluded from the kindex by default). If a user searches for the the specific phrase "brush your teeth", Kinosearch will return only documents which contain both "brush your" and "your teeth".

Ranking Algorithm

Kinosearch uses a variant of the well-established "Term Frequency / Inverse Document Frequency" weighting scheme. Explaining TF/IDF is beyond the scope of this documentation, but in a nutshell:

  • in a search for "skate park", documents which score well for the comparatively rare term "skate" will rank higher than documents which score well for the common term "park".

  • a 10-word text which has one occurrence each of both "skate" and "park" will rank higher than a 1000-word text which also contains one occurrence of each.

A web search for "tf idf" will turn up many excellent explanations of the algorithm.

Field Weights

Kinosearch will allow you to weight the title of a document more heavily than the bodytext -- or vice versa -- by assigning a weight to each field at either index-time or search-time. However, the multipliers don't have to be that large, because under TF/IDF a single occurrence of a word within the 10-word title will automatically contribute more to the score of a document than a single occurrence of the same word in the 1000-word bodytext. See the documentation for Kindexer's define_field() method for more details.

TO DO

  • Overhaul the API. Kinosearch's major classes have grown too large, and need to be reimplemented as extensible abstract base classes.

    For instance, the current version of Search::Kinosearch::Kindex is limited because it specifies particular file configurations and tie classes. It is being reworked as a a Perl complex data structure which represents only the abstract format of a kindex. Happily, the new version will work fine as an in-memory kindex on its own, speeding testing and development. Subclasses, such as the Search::Kinosearch::Kindex::FS, will be implemented primarily through the use of Perl's TIE mechanism.

  • Dispense with DB_File altogether and implement a composite index format.

  • Shrink proximity data. Right now it's solid packed 32 bit integers.

  • Enable customizable tokenizing and stemming.

  • Implement docrank.

  • Fix excerpt highlighting for phrases so that a search for 'we the people' doesn't cause every instance of 'the' to be highlighted. -- DONE in development version.

  • Complete support for UTF-8. At present, everything works so long as the encoding of all supplied files is ASCII compatible and is consistent across all files.

  • Add unicode normalization.

  • Add support for other encodings.

  • Implement -cache_level option for Search::Kinosearch::Kindexer, in order to allow user control over memory requirements. This feature is dependent on the implementation of -mem_threshold in Sort::External, being tested as of this writing.

  • Implement descending term weighting.

  • Add support for 64-bit timestamps.

  • Enable support for psuedo-stemming using Lingua::Spelling::Alternative for languages where no Snowball stemmer is available.

BUGS

  • Excerpt may be up to 20 characters longer than requested.

  • Spurious results can turn up in searches for phrases 3 terms or longer: for instance, a document containing "brush your hair and floss your teeth" will be returned in a search for '"brush your teeth"'. -- FIXED in development version.

FILES

Kinosearch requires several other CPAN distributions:

SEE ALSO

ACKNOWLEDGEMENTS

Chris Nandor has been helpful with debugging.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.