
NAME

KinoSearch::Docs::Tutorial - Sample indexing and search applications.

OVERVIEW

Most people start off with KinoSearch by copying the supplied sample applications for indexing and searching an HTML presentation of the US Constitution and adapting them to their needs.

This tutorial explores how the sample apps work. The materials can be found in the sample directory at the root of the KinoSearch distribution:

    sample/USConSchema.pm    # custom KinoSearch::Schema subclass
    sample/invindexer.plx    # indexing app
    sample/search.cgi        # search app
    sample/us_constitution   # html documents

Define the structure of your index in a subclass of KinoSearch::Schema

Before KinoSearch can write to an index, or properly interpret the contents of an existing index, it needs to know how the index's data is structured. Setting this up is roughly analogous to the step of defining a table in SQL, and is accomplished by subclassing KinoSearch::Schema.

The most important thing a Schema specifies is which fields are present and how they're defined, information communicated via the %FIELDS hash and subclasses of KinoSearch::Schema::FieldSpec. For our search of the US Constitution, we'll use three fields: title, content, and url.

    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'USConSchema::UnIndexedField',    # custom subclass
    );

title and content will be indexed (the default) so that they can be searched. url won't be, because our URLs don't contain any information that would help people find what they're looking for.

    package USConSchema::UnIndexedField;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub indexed {0}

The Schema subclass itself, which follows the field definitions in USConSchema.pm, supplies an analyzer() subroutine:

    package USConSchema; 
    use base qw( KinoSearch::Schema );
    use KinoSearch::Analysis::PolyAnalyzer;

    sub analyzer {
        return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    }
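Assembled into a single file, USConSchema.pm amounts to roughly the following. The ordering of the two packages and the trailing true value are our reconstruction; the file shipped in sample/ may differ in detail:

```perl
package USConSchema::UnIndexedField;
use base qw( KinoSearch::Schema::FieldSpec );
sub indexed {0}

package USConSchema;
use base qw( KinoSearch::Schema );
use KinoSearch::Analysis::PolyAnalyzer;

our %FIELDS = (
    title   => 'KinoSearch::Schema::FieldSpec',
    content => 'KinoSearch::Schema::FieldSpec',
    url     => 'USConSchema::UnIndexedField',    # custom subclass
);

sub analyzer {
    return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
}

1;    # modules must return a true value
```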

An Analyzer in KinoSearch is an object which transforms text from one form to another, possibly breaking it into smaller chunks or performing additional filtering. For USConSchema, we've selected a general-purpose PolyAnalyzer, which breaks text into words, ignores all punctuation save apostrophes, lowercases everything so we get case-insensitive matching, and performs stemming so that a search for "senator" will also match documents containing the plural form "senators".
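You don't need the KinoSearch API to get a feel for what this analysis chain does. A rough pure-Perl approximation of the tokenizing and case-folding steps (stemming omitted) looks like this; the crude_analyze() helper is ours, not part of KinoSearch:

```perl
use strict;
use warnings;

# Rough approximation of PolyAnalyzer's tokenizing and lowercasing.
# (Stemming, as performed by KinoSearch's Stemmer, is omitted here.)
sub crude_analyze {
    my ($text) = @_;
    # Grab runs of word characters plus apostrophes, discarding
    # all other punctuation.
    my @tokens = $text =~ /([\w']+)/g;
    return map { lc } @tokens;
}

my @tokens = crude_analyze("We the People, don't forget!");
print join( ' ', @tokens ), "\n";    # we the people don't forget
```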

Locating USConSchema.pm

USConSchema.pm needs to be placed in a location where both your indexing script and your search script can find it. Putting it somewhere outside the cgi-bin directory is best, but if you run into permissions issues, dumping it into cgi-bin will work -- just don't leave it there if you ever add any sensitive information you wouldn't want a mis-configured server to reveal to potential attackers.

Note: It's crucial that the Schema you use at search time be identical to the one used at index time. KinoSearch will get very confused if you change the list of fields, for example.

Create an index and add content

Now that we've created an abstract set of rules defining how our index is structured, we need to create an actual index and add documents to it.

Since our collection is very small, we'll just overwrite the index every time, by choosing the Schema factory method clobber (instead of open):

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => USConSchema->clobber('/path/to/invindex'),
    );

The bulk of the indexing script is dedicated to general tasks such as reading files and parsing HTML. The KinoSearch code itself doesn't occupy much space, and can be summed up as: create an InvIndexer, add documents to it, and call finish().
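Stripped of the file-reading and HTML-parsing chores, that core might be sketched like so. The @html_files list and the parse_file() helper are placeholders standing in for the sample app's real plumbing:

```perl
use KinoSearch::InvIndexer;
use USConSchema;

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => USConSchema->clobber('/path/to/invindex'),
);

for my $file (@html_files) {
    # parse_file() is a hypothetical helper which extracts a hash
    # with title, content, and url keys from one HTML file.
    my %doc = parse_file($file);
    $invindexer->add_doc( \%doc );
}

$invindexer->finish;
```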

Note that a proper indexer for HTML documents would not rely on quick-n-dirty regular expressions for stripping tags, as this one does for the sake of brevity; it would use a dedicated parsing module such as HTML::Parser.
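A small demonstration of why: a naive tag-stripping regex handles simple markup fine, but leaks text as soon as an attribute value contains a > character. This snippet is our illustration, not code from the sample app:

```perl
use strict;
use warnings;

sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//gs;    # quick-n-dirty: delete anything tag-shaped
    return $html;
}

print strip_tags('<p>Article I</p>'), "\n";    # Article I

# The regex stops at the first '>', so part of the attribute survives:
print strip_tags('<p title="a > b">Article I</p>'), "\n";    # prints: b">Article I
```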

search.cgi

At search time, we open, rather than clobber, our invindex:

    my $searcher = KinoSearch::Searcher->new(
        invindex => USConSchema->open('/path/to/invindex'),
    );

search.cgi contains a fair amount of code, but as with invindexer.plx, the KinoSearch portion only occupies a fraction of the space.
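That fraction might be sketched as follows. Treat this as an approximation: the search() call and the fetch_hit_hashref() iteration reflect our recollection of the 0.20 API, and the query string is illustrative:

```perl
use KinoSearch::Searcher;
use USConSchema;

my $searcher = KinoSearch::Searcher->new(
    invindex => USConSchema->open('/path/to/invindex'),
);

# Run a query and walk the hits, each returned as a hashref of
# stored fields.
my $hits = $searcher->search( query => 'congress' );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}: $hit->{url}\n";
}
```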

Experiments

Try a different analyzer

Try swapping out the PolyAnalyzer in USConSchema for a Tokenizer:

    package USConSchema;
    use base qw( KinoSearch::Schema );
    use KinoSearch::Analysis::Tokenizer;
    sub analyzer { return KinoSearch::Analysis::Tokenizer->new }

Regenerate the index, then try searching. Note that searches are no longer case-insensitive, and that searches for "Senate", "Senator", and "Senators" all now return distinct result sets.

Boost individual documents

Modify invindexer.plx to use document boosts:

    my $doc_boost = $url =~ /amend/ ? 1000 : 1;
    $invindexer->add_doc( \%doc, boost => $doc_boost );

Now Amendments should score higher than Articles most of the time.

Change Field definition

Try changing the url field definition so that it gets indexed:

    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'KinoSearch::Schema::FieldSpec', # now indexed
    );

Now try a search for "html" or "us_constitution".

COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20.