Changes for version 0.20_01 - 2007-02-26

  • KinoSearch 0.20 is a major rewrite, adding many new features. It also breaks backwards compatibility in a number of ways.
  • Two key features, UTF-8 support and custom sorting, were not possible to implement while preserving backwards compatibility. Once the decision was made to proceed with them, breaking all existing installations, it made little sense to proceed by half measures, so the API has been given a significant overhaul.
  • KinoSearch has always carried an "alpha code" warning, and this release invokes it. The "alpha" warning will remain for a short while longer, but the point of packing so many changes into one release is to cause disruption only once. Once the code in 0.20 proves itself, no further backwards-incompatible changes should be needed any time soon.
  • New behaviors:
    • KinoSearch now uses UTF-8 for all input and output, throughout the entire library. This affects many classes, but particularly those under Analysis, Highlight, and QueryParser.
    • The default scoring algorithm has changed subtly -- aggressive per-field boosting is no longer important or even desirable. The old behavior is available from KinoSearch::Contrib::LongFieldSim.
  • New public classes:
    • KinoSearch::Schema
    • KinoSearch::Schema::Field
    • KinoSearch::InvIndex
    • KinoSearch::Analysis::Token
    • KinoSearch::Search::RangeFilter
    • KinoSearch::Search::SortSpec
    • KinoSearch::Search::Similarity
    • KinoSearch::Contrib::LongFieldSim
  • New documentation:
    • KinoSearch::Docs::NFS
  • Removed classes:
    • KinoSearch::Document::Doc
    • KinoSearch::Document::Field
    • KinoSearch::Search::Hit
  • Renamed classes:
    • KinoSearch::Store::InvIndex => KinoSearch::Store::Folder
    • KinoSearch::Store::FSInvIndex => KinoSearch::Store::FSFolder
    • KinoSearch::Store::RAMInvIndex => KinoSearch::Store::RAMFolder
  • Updated documentation:
    • KinoSearch
    • KinoSearch::Docs::DevGuide
    • KinoSearch::Docs::FileFormat
    • KinoSearch::Docs::Tutorial
  • Classes with API changes:
    • KinoSearch::InvIndexer
      o new() - Args changed.
        • create - Removed.
        • analyzer - Removed.
        • lock_id - Added.
      o spec_field() - Removed.
      o new_doc() - Removed.
      o add_doc() - Args changed.
        • Takes a hashref rather than a Doc object.
        • Accepts optional labeled param 'boost'.
      o delete_docs_by_term() - Removed.
      o delete_by_term() - Added. (Behavior differs subtly from delete_docs_by_term()).
    • KinoSearch::Searcher
      o new() - Args changed.
        • analyzer - Removed.
      o search() - Now calls Hits->seek before returning Hits object. Args changed.
        • offset - Added.
        • num_wanted - Added.
        • sort_spec - Added.
    • KinoSearch::Search::Hits
      o Now comes pre-seeked, courtesy of changes to Searcher.
      o seek() - No longer triggers new number crunching if requested values can be accommodated using results of prior search.
      o fetch_hit() - Removed.
      o create_excerpts() - Now puts multiple excerpts under $hit->{excerpts} rather than one under $hit->{excerpt}.
    • KinoSearch::Search::MultiSearcher
      o new() - Args changed.
        • schema - Added.
        • analyzer - Removed.
    • KinoSearch::Highlight::Highlighter
      o new() - Args changed.
        • fields - Added.
        • excerpt_length - Now specified in characters rather than bytes.
        • excerpt_field - Removed.
        • pre_tag - Removed.
        • post_tag - Removed.
    • KinoSearch::QueryParser::QueryParser
      o new() - Args changed.
        • schema - Added.
        • default_field - Removed.
        • analyzer - No longer required -- now used to override schema.
    • KinoSearch::Analysis::TokenBatch
      o new() - Args changed.
        • text - Added.
      o next() - Returns a Token instead of a boolean.
      o reset() - Added.
      o add_many_tokens() - Added.
      o set_text(), get_text(), set_start_offset(), get_start_offset(), set_end_offset(), get_end_offset(), set_pos_inc(), get_pos_inc() - All removed.
  • Internal changes:
    • Large-scale refactoring has taken place. The most significant changes are...
    • OO framework imposed on C code via boilerplater.pl, with KinoSearch::Util::Obj as the base class.
    • Charmonizer added.
    • perlapi functions and data structures replaced whenever possible.
    • Lots of classes, especially under KinoSearch::Index, reorganized around Schema and SegInfo.
    • Many tests added, removed, or revised to accommodate changes in the main library code.
    • C code moved to dedicated files.
    • Build.PL custom code moved to buildlib/KinoSearchBuild.pm
  • File Format:
    • Significantly redesigned. The most visible change is that the segments file is now encoded using YAML rather than an arbitrary binary format.
    • Old indexes cannot be read and must be regenerated.
  • Locking:
    • write.lock files now located in the index directory rather than under /tmp.
    • Commit locks are no longer needed due to file format changes.
    • Stale write locks are now removed without warning.
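
One practical consequence of the Highlighter change listed above is that excerpt_length now counts characters rather than bytes, and the two differ as soon as the text contains non-ASCII characters. The following is a minimal sketch in core Perl only. It does not call KinoSearch, and truncate_chars() is a hypothetical helper invented for illustration, not part of the library.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper (not a KinoSearch function): keep at most
# $max_chars characters of an already-decoded string.
sub truncate_chars {
    my ( $text, $max_chars ) = @_;
    return substr( $text, 0, $max_chars );
}

# UTF-8 encoded octets for "naïve café", as they might arrive from a file.
my $bytes = "na\xC3\xAFve caf\xC3\xA9";

# Decode to a character string before measuring or cutting.
my $text = decode( 'UTF-8', $bytes );

my $char_count = length $text;                       # length in characters
my $byte_count = length encode( 'UTF-8', $text );    # length in octets

# Cutting by characters never splits a multi-byte character.
my $excerpt = truncate_chars( $text, 5 );

binmode STDOUT, ':encoding(UTF-8)';
print "$char_count characters, $byte_count bytes\n";    # 10 characters, 12 bytes
print "first 5 characters: $excerpt\n";                 # first 5 characters: naïve
```

Cutting the same data by bytes instead (for example, taking the first 3 octets of $bytes) would end in the middle of the two-byte "ï". Callers who sized excerpts in bytes under earlier releases may therefore want to revisit the value they pass as excerpt_length.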

Documentation

Generate boilerplate OO code for KinoSearch
dump the contents of an index
Hacking/debugging KinoSearch.
Overview of invindex file format.
Managing invindexes on NFS.
Sample indexing and search applications.

Modules

Module::Build subclass for KinoSearch
Search engine library.
Base class for analyzers.
Convert input to lower case.
Multiple analyzers in series.
Reduce related words to a shared root.
Suppress a "stoplist" of common words.
A collection of tokens.
Customizable tokenizing.
Similarity optimized for long fields.
Encode excerpted text.
Format highlighted bits within excerpts.
Create and highlight excerpts.
Encode a few HTML entities.
Surround highlight bits with tags.
String of text associated with a field.
An inverted index.
Build inverted indexes.
Transform a string into a Query object.
User-created specification for an inverted index.
Define a field's behavior.
Match boolean combinations of Queries.
Access search results.
Aggregate results from multiple searchers.
Match ordered list of Terms.
Base class for search queries.
Build a filter based on results of a query.
Filter search results by range of values.
Connect to a remote SearchServer.
Make a Searcher remotely accessible.
Calculate how closely two things match.
Specify a custom sort order for search results.
Match individual Terms.
Execute searches.
File System implementation of Folder.
Abstract class representing a directory.
In-memory Folder.
Class-building utility.
Namespace pollution.

Provides

in lib/KinoSearch/Index/CompoundFileReader.pm
in lib/KinoSearch/Index/CompoundFileWriter.pm
in lib/KinoSearch/Index/DelDocs.pm
in lib/KinoSearch/Index/DocReader.pm
in lib/KinoSearch/Index/DocVector.pm
in lib/KinoSearch/Index/DocWriter.pm
in lib/KinoSearch/Index/IndexFileNames.pm
in lib/KinoSearch/Index/IndexReader.pm
in lib/KinoSearch/Index/MultiReader.pm
in lib/KinoSearch/Index/MultiTermDocs.pm
in lib/KinoSearch/Index/MultiTermList.pm
in lib/KinoSearch/Index/PostingsWriter.pm
in lib/KinoSearch/Index/SegInfo.pm
in lib/KinoSearch/Index/SegInfos.pm
in lib/KinoSearch/Index/SegReader.pm
in lib/KinoSearch/Index/SegTermDocs.pm
in lib/KinoSearch/Index/SegTermList.pm
in lib/KinoSearch/Index/SegWriter.pm
in lib/KinoSearch/Index/TermDocs.pm
in lib/KinoSearch/Index/TermInfo.pm
in lib/KinoSearch/Index/TermList.pm
in lib/KinoSearch/Index/TermListCache.pm
in lib/KinoSearch/Index/TermListReader.pm
in lib/KinoSearch/Index/TermListWriter.pm
in lib/KinoSearch/Index/TermVector.pm
in lib/KinoSearch/Index/TermVectorsReader.pm
in lib/KinoSearch/Index/TermVectorsWriter.pm
in lib/KinoSearch/Search/BooleanClause.pm
in lib/KinoSearch/Search/BooleanScorer.pm
in lib/KinoSearch/Search/BooleanQuery.pm
in lib/KinoSearch/Search/MultiSearcher.pm
in lib/KinoSearch/Search/FieldDoc.pm
in lib/KinoSearch/Search/FieldDocCollator.pm
in lib/KinoSearch/Search/HitCollector.pm
in lib/KinoSearch/Search/HitQueue.pm
in lib/KinoSearch/Search/PhraseScorer.pm
in lib/KinoSearch/Search/PhraseQuery.pm
in lib/KinoSearch/Search/ScoreDoc.pm
in lib/KinoSearch/Search/Scorer.pm
in lib/KinoSearch/Search/Searchable.pm
in lib/KinoSearch/Search/SortCollector.pm
in lib/KinoSearch/Search/SortedHitQueue.pm
in lib/KinoSearch/Search/TermScorer.pm
in lib/KinoSearch/Search/TermQuery.pm
in lib/KinoSearch/Search/TopDocCollector.pm
in lib/KinoSearch/Search/TopDocs.pm
in lib/KinoSearch/Search/Weight.pm
in lib/KinoSearch/Store/FSFileDes.pm
in lib/KinoSearch/Store/FileDes.pm
in lib/KinoSearch/Store/InStream.pm
in lib/KinoSearch/Store/Lock.pm
in lib/KinoSearch/Store/OutStream.pm
in lib/KinoSearch/Store/RAMFileDes.pm
in lib/KinoSearch/Util/BitVector.pm
in lib/KinoSearch/Util/ByteBuf.pm
in lib/KinoSearch/Util/CClass.pm
in lib/KinoSearch/Util/DynVirtualTable.pm
in lib/KinoSearch/Util/Hash.pm
in lib/KinoSearch/Util/IntMap.pm
in lib/KinoSearch/Util/Obj.pm
in lib/KinoSearch/Util/PriorityQueue.pm
in lib/KinoSearch/Util/SortExternal.pm
in lib/KinoSearch/Util/StringHelper.pm
in lib/KinoSearch/Util/ToStringUtils.pm
in lib/KinoSearch/Util/VArray.pm
in lib/KinoSearch/Util/VerifyArgs.pm
in lib/KinoSearch/Util/ViewByteBuf.pm
in lib/KinoSearch/Util/YAML.pm

Examples