KinoSearch::Docs::Tutorial - Sample indexing and search applications.
Most people start off with KinoSearch by copying the supplied sample applications for indexing and searching an HTML presentation of the US Constitution and adapting them to their needs.
This tutorial explores how the sample apps work. The materials can be found in the sample directory at the root of the KinoSearch distribution:
    sample/USConSchema.pm     # custom KinoSearch::Schema subclass
    sample/invindexer.plx     # indexing app
    sample/search.cgi         # search app
    sample/us_constitution    # html documents
Before KinoSearch can write to an index, or properly interpret the contents of an existing index, it needs to know how the index's data is structured. Setting this up is roughly analogous to the step of defining a table in SQL, and is accomplished by subclassing KinoSearch::Schema.
The most important thing a Schema tells you is what fields are present and how they're defined, information which is communicated via the %FIELDS hash and subclasses of KinoSearch::Schema::FieldSpec. For our search of the US Constitution, we'll use three fields: title, content, and url.
    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'USConSchema::UnIndexedField',    # custom subclass
    );
title and content will be indexed (the default), so that they can be searched. url won't be, because our urls don't contain any information that would help people find what they're looking for.
    package USConSchema::UnIndexedField;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub indexed {0}
The Schema subclass itself, which follows the field definitions in USConSchema.pm, supplies an analyzer() subroutine:
    package USConSchema;
    use base qw( KinoSearch::Schema );
    use KinoSearch::Analysis::PolyAnalyzer;

    sub analyzer {
        return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    }
An Analyzer in KinoSearch is an object which transforms text from one form to another, possibly breaking it into smaller chunks or performing additional filtering. For USConSchema, we've selected a general-purpose PolyAnalyzer, which breaks text into words, ignores all punctuation save apostrophes, lowercases everything so we get case-insensitive matching, and performs stemming so that a search for "senator" will also match documents containing the plural form "senators".
USConSchema.pm needs to be placed in a location where both your indexing script and your search script can find it. Putting it somewhere outside the cgi-bin directory is best, but if you run into permissions issues, dumping it into cgi-bin will work -- just don't leave it there if you ever add any sensitive information you wouldn't want a mis-configured server to reveal to potential attackers.
Note: It's crucial that the Schema that you use at search-time be identical to the one used at index time. KinoSearch will get very confused if you change up the list of fields, for example.
Now that we've created an abstract set of rules defining how our index is structured, we need to create an actual index and add documents to it.
Since our collection is very small, we'll just overwrite the index every time, by choosing the Schema factory method clobber (instead of open):
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => USConSchema->clobber('/path/to/invindex'),
    );
The bulk of the indexing script is dedicated to general tasks such as reading files and parsing html. The KinoSearch code itself doesn't occupy a lot of space, and can be summed up as: create an InvIndexer, add documents to it, and call finish().
Note that a proper indexer for html documents would not rely on quick-n-dirty regular expressions for stripping tags as this one does for the sake of brevity -- it would use a dedicated parsing module such as HTML::Parser.
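The indexing flow described above can be sketched as follows. This is a minimal sketch, not the sample app itself: the glob pattern and the parse_file() helper are hypothetical stand-ins for invindexer.plx's own file-reading and tag-stripping code.

```perl
use strict;
use warnings;
use KinoSearch::InvIndexer;
use USConSchema;

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => USConSchema->clobber('/path/to/invindex'),
);

# parse_file() is a hypothetical helper that extracts a title,
# content, and url from one HTML file -- see invindexer.plx for
# how the sample app actually does it.
for my $file ( glob('us_constitution/*.html') ) {
    my %doc = parse_file($file);
    $invindexer->add_doc( \%doc );
}

# Commit everything to the invindex.
$invindexer->finish;
```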
At search-time, we open, rather than clobber, our invindex:
    my $searcher = KinoSearch::Searcher->new(
        invindex => USConSchema->open('/path/to/invindex'),
    );
search.cgi contains a fair amount of code, but as with invindexer.plx, the KinoSearch portion only occupies a fraction of the space.
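Stripped of its CGI and HTML-generation code, the KinoSearch portion of the search step can be sketched like so. This is a minimal sketch assuming the KinoSearch 0.20 API, where search() returns a Hits object iterated via fetch_hit_hashref(); the query string and printed fields are illustrative.

```perl
use strict;
use warnings;
use KinoSearch::Searcher;
use USConSchema;

my $searcher = KinoSearch::Searcher->new(
    invindex => USConSchema->open('/path/to/invindex'),
);

# Run a query and walk the hits, highest-scoring first.
my $hits = $searcher->search( query => 'senator' );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}: $hit->{url}\n";
}
```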
Try swapping out the PolyAnalyzer in USConSchema for a Tokenizer:
    package USConSchema;
    use base qw( KinoSearch::Schema );
    use KinoSearch::Analysis::Tokenizer;

    sub analyzer { return KinoSearch::Analysis::Tokenizer->new }
Regenerate the index, then try searching. Note that searches are no longer case-insensitive, and that searches for "Senate", "Senator", and "Senators" all now return distinct result sets.
Modify invindexer.plx to use document boosts:
    my $doc_boost = $url =~ /amend/ ? 1000 : 1;
    $invindexer->add_doc( \%doc, boost => $doc_boost );
Now Amendments should score higher than Articles most of the time.
Try setting the url field so that it gets indexed.
    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'KinoSearch::Schema::FieldSpec',    # will be indexed, now
    );
Now try a search for "html" or "us_constitution".
Copyright 2005-2007 Marvin Humphrey
This documentation covers KinoSearch version 0.20.