The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

KinoSearch::Docs::FileFormat - Overview of invindex file format.

OVERVIEW

It is not necessary to understand the current implementation details of KinoSearch's file format in order to use the library effectively, but it may be helpful if you are interested in tweaking for high performance, exotic usage, or debugging and development.

On a file system, all the files in an invindex exist in one, flat directory. Conceptually, the files have a hierarchical relationship: an invindex is made up of "segments", each of which is an independent inverted index, and each segment is made up of several subsections.

    [invindex]--|
                |-"segments_XXX.yaml"
                |-"write.lock"
                |
                |-[segments]------|
                                  |--[seg _0]--|
                                  |            |--[term list]
                                  |            |--[postings]
                                  |            |--[documents]
                                  |            |--[term vectors]
                                  |            |--[deletions]
                                  |
                                  |--[seg _1]--|
                                  |            |--[term list]
                                  |            |--[postings]
                                  |            |--[documents]
                                  |            |--[term vectors]
                                  |            |--[deletions]
                                  |
                                  |--[ ... ]---| 
                                

Write-once philosophy

All the files which belong to a given segment share a common prefix consisting of an underscore followed by a number in base 36: _0, _5m, _p9s2. Once a segment is finished and sent "live", its name is never re-used.

Segments are written once, and with the exception of the list of deleted document numbers, are never modified once written. Old segments become obsolete and can be removed when their data has been consolidated into new segments during the process of segment-merging and optimization. A fully-optimized index has only one segment.

The segments_XXX.yaml file also utilizes the unique-base-36-number naming convention -- "XXX" is a placeholder, so real files would be named e.g. segments_3r.yaml. The higher the number, the more recent the file.

There is only one file whose name gets re-used: write.lock.

Because all index file names other than write.lock are unique, there is no danger of overwriting an active file, as well as a greatly diminished need for file locking in general.

Top-level files

There are two "top-level" files which belong to the entire invindex rather than to a particular segment. Their contents are stored in human-readable YAML to facilitate spelunking and debugging.

segments_XXX.yaml

The segments_XXX.yaml file stores information about which segments are active and what they contain. It is updated every time the invindex changes.

When a new segment is being written, KinoSearch may put files into the invindex directory, but until a new segments_XXX.yaml file gets written, a Searcher reading the index won't know about them.

write.lock

Only one indexing process may safely modify the invindex at any given time. Processes reserve an invindex by laying claim to the write.lock file.

A segment's component parts

Each segment can be said to have five logical parts: term list, postings, document storage, term vectors, and deletions.

All segment files use non-human-readable binary formats, but they all adhere to the same simple pattern: records stacked end-to-end-to-end. Given unlimited memory, any segment file may be read into an array by running a single deserialization function against it over and over until the file ends (though for some of the simpler files, which are just solid arrays of bits or integers, a dedicated deserializer would be overkill).

Term List

A segment's Term List is housed in two files, one primary and one auxiliary.

  • _XXX.tl - Term List. Enumerates all the indexed terms in the segment.

  • _XXX.tlx - Term List IndeX. Houses periodic samples from the main term list -- typically, every 16th term.

Terms are ordered first by field number, then according to the field's sort order. The default sort order is determined by lexical comparison of term text, but may be overridden.

When a Searcher is launched, the entire .tlx file is decoded and cached as an in-memory array of Terms. To locate a term in the main term list, first a binary search is performed against the term list index array. The results of this search point to a small slice of the main on-disk term list file, which is then scanned linearly.

Postings

"Posting" is a technical term from the field of information retrieval, defined as a single instance of a one term indexing one document. If you are looking at the index in the back of a book, and you see that "freedom" is referenced on pages 8, 86, and 240, that would be three postings, which taken together form a "posting list". The same terminology applies to an index in electronic form.

The number of postings files per segment in KinoSearch varies: each segment has one such file per indexed field. The files follow the naming pattern _XXX.pYYY, where "_XXX" is the segment, and "YYY" is the field number. So, if you were indexing plain text files and you had two fields, only one of which was indexed...

    package MySchema::content;
    use base qw( KinoSearch::Schema::FieldSpec );
    
    package MySchema::filepath;
    use base qw( KinoSearch::Schema::FieldSpec );

    sub indexed {0}
    sub analyzed {0}
    
    package MySchema;
    ...

... there would only be 1 postings file per segment, "_XXX.p0", containing in this example the postings for the "content" field.

When a search is performed for a single term, first that term is looked up in the term list. If the term exists in a segment, the record in the term list will contain information about which postings file to look at and where to look.

The first thing any Posting record tells you is a document number. By iterating over all the Postings associated with a term, you can find all the documents that match that term, a process which is analogous to looking up page numbers in a book's index.

However, each Posting in KinoSearch typically contains other information in addition to document number.

  • doc_num - Document number.

  • field_boost A multiplier which applies to this term in this field, any position. By default, each position's boost is calculated based on field length in combination with document boost and field boost.

  • freq - The number of times the term appears in the document.

  • positions - The positions, specified by token number, that a term appears at in the document. If freq is 9, that implies that there will be 9 positions.

  • pos_boosts - For some applications, it makes sense to store boost per-position rather than apply the same boost to each position.

Documents

The document storage section is essentially an object-oriented database where the "original" documents added to an invindex are housed. It is organized into two files:

  • _XXX.ds - Document Storage. Serialized documents.

  • _XXX.dsx - Document Storage IndeX. A solid array of 64-bit integers. Each integer location corresponds to a document number, and the value at that location points at a file position in the _XXX.ds file.

When a document number turns up as a hit in a search and its associated document must be retrieved, KinoSearch look up the doc number in the _XXX.dsx file to find the file pointer, then seeks to that location in the _XXX.ds file and deserializes the document.

Not all information associated with the original document can be recovered at search-time. For instance, if a field is spec'd as indexed but not stored, its terms will be in the term list, but its original content will not be found along with the rest of the document in the relevant _XXX.ds file.

Term Vectors

The files which store Term Vectors, used for excerpting and highlighting, are organized similarly to the files used to store documents.

  • _XXX.tv - Term Vectors. 1 array of terms per document.

  • _XXX.tvx - Term Vector IndeX. As with the _XXX.dsx file, a solid array of 64-bit file pointers.

Deletions

When a document is "deleted" from a segment, it is not actually purged right away; it is merely marked as "deleted", via the .del file. The .del file contains a bit vector with one bit for each document in the segment; if bit #254 is set then document 254 is deleted, and if that document turns up in a search it will be masked out.

It is only when a segment's contents are rewritten to a new segment during the segment-merging process that deleted documents truly go away.

Deletions may be performed against a given segment several times during its lifespan, thus .del filenames use two base-36 numbers: _XXX_YYY.del. The first number corresponds to segment name, and the second to the file's "generation", so the file "_aa_1.del" would supersede "_aa_0.del" as a list of the deletions for segment "_aa".

Compound File

If you peer inside an index directory, you won't actually find any files with the extensions tl, tlx, ds, dsx, or pYYY, unless there is a live indexing process underway. What you will find is one file per segment named "_XXX.cf". (cf => Compound File)

To minimize the need for file descriptors at search-time, KinoSearch concatenates all per-segment files onto each other at the close of each indexing session. Information about where each file begins and ends is stored in segments_XXX.yaml. When the segment is opened for reading, a single file descriptor per compound file gets shared by multiple InStream objects.

Document Numbers

Document numbers in KinoSearch are ephemeral. They change every time a document gets moved from one segment to a new one during segment merging and optimization.

If you need to assign a primary key to each document, you need to create a field and populate it with an externally generated unique identifier.

A Typical Search

Here's a simplified version of how a search for "freedom" against a given segment plays out:

  1. The searcher asks the TermListIndex file, "Do you know anything about 'freedom'?" TermListIndex replies, "Can't say for sure, but if the TermList file does, 'freedom' is probably somewhere around byte 21008".

  2. The TermList tells the searcher "One moment, let me scan our records... Yes, we have 2 documents which contain 'freedom'. You'll find them in the _6.p2 Postings file starting at byte 66991."

  3. The Postings file says "Yep, we have 'freedom', all right! Document number 40 has 1 'freedom', and document 44 has 8. If you need to know more, like if any 'freedom' is part of the phrase 'freedom of speech', ask me about positions!

  4. If the searcher is only looking for 'freedom' in isolation, that's where it stops. It now knows enough to assign the documents scores against "freedom", with the 8-freedom document likely ranking higher than the single-freedom document.

COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20_01.