The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Search::InvertedIndex - A manager for inverted index maps

SYNOPSIS

  use Search::InvertedIndex;

  my $database = Search::InvertedIndex::DB::DB_File_SplitHash->new({
             -map_name => '/www/search-engine/databases/test-maps/test',
                                -multi => 4,
            -file_mode => 0644,
            -lock_mode => 'EX',
         -lock_timeout => 30,
       -blocking_locks => 0,
            -cachesize => 1000000,
        -write_through => 0, 
      -read_write_mode => 'RDWR';
        });

  my $scratch_database = Search::InvertedIndex::DB::DB_File_SplitHash->new({
             -map_name => '/www/search-engine/databases/test-maps/test',
                                -multi => 4,
            -file_mode => 0644,
            -lock_mode => 'EX',
         -lock_timeout => 30,
       -blocking_locks => 0,
            -cachesize => 1000000,
        -write_through => 0, 
      -read_write_mode => 'RDWR';
        });


  my $inv_map = Search::Inverted->new({ -database => $database, -scratch_database => $scratch_database });
  my $query = Search::InvertedIndex::Query->new(...);
  my $result = $inv_map->query({ -query => $query });

  my $update = Search::InvertedIndex::Update->new(...);
  my $result = $inv_map->update({ -update => $update });

  $inv_map->close;

DESCRIPTION

Provides the core of an inverted map based search engine. By mapping 'keys' to 'indexes' it provides ultra-fast look ups of all 'indexes' containing specific 'keys'. This produces highly scalable behavior where thousands, or even millions of records can be searched extremely quickly.

CHANGES

 1.00 1999.06.16 - Initial release

 1.01 1999.06.17 - Documentation fixes and fix to 'close' method in 
                   Search::InvertedIndex::DB::DB_File_SplitHash

 1.02 1999.06.18 - Major bugfix to locking system. 
                   Performance tweaking. Roughly 3x improvement.

 1.03 1999.06.30 - Documentation fixes.
 
 1.04 1999.07.01 - Documentation fixes and caching system bugfixes.

 1.05 1999.10.20 - Altered ranking computation on search results  

Public API

new($parm_ref);

Provides the interface for obtaining a new Search::InvertedIndex object for manipulating a inverted database.

Example 1:

 my $database = Search::InvertedIndex::DB::DB_File_SplitHash->new({
             -map_name => 
                                     '/www/search-engine/databases/test-map_names/test',
                                -multi => 4,
            -file_mode => 0644,
            -lock_mode => 'EX',
         -lock_timeout => 30,
       -blocking_locks => 0,
            -cachesize => 1000000,
        -write_through => 0, 
      -read_write_mode => 'RDONLY';
        });

 my $inv_map = Search::InvertedIndex->new({ 
                '-database' => $database,
       '-search_cache_size' => 1000,
        '-search_cache_dir' => '/var/tmp/search_cache',
     });

The -database parameter is required and must be a 'Search::InvertedIndex::DB::...' type database object. The other two parameters are optional and define the location and size of the search cache. If omitted, no search caching will be done.

lock($parm_ref);

Changes a lock on the underlaying database.

Forces 'sync' if the stat is changed from 'EX' to a lower lock state (i.e. 'SH' or 'UN'). Croaks on errors.

Example:

    $inv->lock({ -lock_mode => 'EX' [, -lock_timeout => 30] [, -blocking_locks => 0],
          });

The only _required_ parameter is the -lock_mode. The other parameters can be inherited from the object state. If the other parameters are used, they change the object state to match the new settings.

status($parm_ref);

Returns the requested status line for the database. Allowed requests are '-open', and '-lock'.

Example 1: my $status = $inv_map->status(-open); # Returns either '1' or '0'

Example 2: my $status = $inv_map->status(-lock_mode); # Returns 'UN', 'SH' or 'EX'

update($parm_ref);

Performs an update on the map. This is designed for adding/changing/deleting a bunch of related information in a single block update. It takes a Search::InvertedIndex::Update object as input. It assumes that you wish to remove all references to the specified index and replace them with a new list of references. It can also will update the -data for the -index. If -data is passed and the -index does not already exist, a new index record will be created. It is a fatal error to pass a non-existant index without a -data parm to initialize it. It is also a fatal error to pass an update for a non-existant -group.

Passing an empty -keys has the effect of deleting the index from group (but not from the system).

Example:

 my $update = Search::InvertedIndex::Update->new(...);
 $inv_map->update({ -update => $update });

It is much faster to update a index using the update method than the add_entry_to_group method in most cases because the batching of changes allows for efficiency optimizations when there is more than one key.

preload_update($parm_ref);

'preload_update' places the passed 'update' object data into a pending queue which is not reflected in the searchable database until the 'update_group' method has been called. This allows the loading process to be streamlined for maximum performance on large full updates. This method is not appropriate to incremental updates as the 'update_group' method destroys the previous searchable data set on execution.

It also places the database effectively offline during the update, so this is not a suitable method for updating a 'online' database. Updates should happen on an 'offline' copy that is then swapped into place with the 'online' database.

Example:

 my $update = Search::InvertedIndex::Update->new(...);
 $inv_map->preload_update({ -update => $update });
        .
                .
                .
 $inv_map->update_group({ -group => 'test' });
clear_preload_update_for_group($parm_ref);

This clears all the data from the preload area for the specified group.

update_group($parm_ref);

This clears the specifed group and loads all preloaded data (updates batch loaded through the 'preload_update' method pending finalization.

This is by far the fastest way to load a large set of data into the search system - but it is an 'all or nothing' approach. No 'incremental' updating is possible via this interface - the update_group completely erases all previously searchable data from the group and replaces it with the pending 'preload'ed data.

Examples:

  $inv_map->update_group({ -group => 'test' });

  $inv_map->update_group({ -group => 'test', -block_size => 65536 });

-block_size determines the 'chunking factor' used to limit the amount of memory the update uses (it corresponds roughly to the number of line entry items to be processed in memory at one time). Higher '-block_size's will improve performance until you run out of real memory. The default is 65536.

Since an exclusive lock should be held during the entire process, the database is essentially inaccessible until the update is complete. It is probably inadvisable to use this method of updating without keeping an 'online' and a seperate 'offline' database and copy over the 'offline' to 'online' after completion of the mass update on the 'offline' database.

search($parm_ref);

Performs a query on the map and returns the results as a Search::InvertedIndex::Result object containing the keys and rankings.

Example:

 my $query = Search::InvertedIndex::Query->new(...);
 my $result = $inv_map->search({ -query => $query });

Performs a complex multi-key match search with boolean logic and optional search term weighting.

The search request is formatted as follows:

my $result = $inv_map->search({ -query => $query });

where '$query' is a Search::InvertedIndex::Query object.

Each node can either be a specific search term with an optional weighting term (a Search::InvertedIndex::Query::Leaf object) or a logic term with its own sub-branches (a Search::Inverted::Query object).

The weightings are applied to the returned matches for each search term by multiplication of their base ranking before combination with the other logic terms.

This allows recursive use of search to resolve arbitrarily complex boolean searches and weight different search terms.

data_for_index($parm_ref);

Returns the data record for the passed -index. Returns undef if no matching -index is in the system.

Example: my $data = $self->data_for_index({ -index => $index });

clear_all;

Completely clears the contents of the database and the search cache.

clear_cache;

Completely clears the contents of the search cache.

close;

Closes the currently open -map and flushes all associated buffers.

DESTROY;

Closes the currently open -map and flushes all associated buffers.

number_of_groups;

Returns the raw number of groups in the system.

Example: my $n = $inv_map->number_of_groups;

number_of_indexes;

Returns the raw number of indexes in the system.

Example: my $n = $inv_map->number_of_indexes;

number_of_keys;

Returns the raw number of keys in the system.

Example: my $n = $inv_map->number_of_keys;

number_of_indexes_in_group($parm_ref);

Returns the raw number of indexes in a specific group.

Example: my $n = $inv_map->number_of_indexes_in_group({ -group => $group });

number_of_keys_in_group($parm_ref);

Returns the raw number of keys in a specific group.

Example: my $n = $inv_map->number_of_keys_in_group({ -group => $group });

add_group($parm_ref);

Adds a new '-group' to the map. There is normally no need to call this method from outside the module. The addition of new -groups is done automatically when adding new entries.

Example: $inv_map->add_group({ -group => $group });

croaks if unable to successfuly create the group for some reason.

It silently eats attempts to create an existing group.

add_index($parm_ref);

Adds a index entry to the system.

Example: $inv_map->add_index({ -index => $index, -data => $data });

If the 'index' is the same as an existing index, the '-data' for that index will be updated.

-data can be pretty much any scalar. strings/object/hash/array references are ok. They will be transparently serialized using Storable.

This method should be called to set the '-data' record returned by searches to something useful. If you do not, you will have to maintain the information you want to show to users seperately from the main search engine core.

The method returns the index_enum of the index.

add_index_to_group($parm_ref);

Adds an index entry to a group. If the index does not already exist in the system, adds it to the system as well.

Example: $inv_map->add_index_to_group({ -group => $group, -index => $index[, -data => $data]});

Returns the 'index_enum' for the index record.

If the 'index' is the same as an existing key, the 'index_enum' of the existing index will be returned.

There is normally no need to call this method directly. Addition of index to groups is handled automatically during addition of new entries.

It cannot be used to add index to non-existant groups. This is a feature not a bug.

add_key_to_group($parm_ref);

Adds a key entry to a group.

Example: $inv_map->_add_key({ -group => $group, -key => $key });

Returns the 'key_enum' for the key record.

If the 'key' is the same as an existing key, the 'key_enum' of the existing key will be returned.

There is normally no need to call this method directly. Addition of keys to groups is handled automatically during addition of new entries.

It cannot be used to add keys to non-existant groups. This is a feature not a bug.

add_entry_to_group($parm_ref);

Adds a reference to a particular index for a key with a ranking to a specific group.

Example: $inv_map->add_entry_to_group({ -group => $group, -key => $key, -index => $index, -ranking => $ranking });

This method cannot be used to create new -indexes or -groups. This is a feature, not a bug. It *will* create new -keys as needed.

remove_group($parm_ref);

Remove all entries for a group from the map.

Example: $inv_map->remove_group({ -group => $group });

This removes all key and key/index entries for the group and all other group specific data from the map.

Use this method when you wish to completely delete a searchable 'group' from the map without disturbing other existing groups.

remove_entry_from_group($parm_ref);

Remove a specific key<->index entry from the map for a group.

Example: $inv_map->remove_entry_from_group({ -group => $group, -key => $key, -index => $index });

Does not remove the -key or -index from the database or the group - only the entries mapping the two to each other.

remove_index_from_group ($parm_ref);

Remove all references to a specific index for all keys for a group.

Example: $inv_map->_remove_index_from_group({ -group => $group, -index => $index });

Note: This *does not* remove the index from the _system_ - just a specific group.

It is a null operation to remove an undeclared index or to remove a declared index from a group where it is not used.

remove_index_from_all ($parm_ref);

Remove all references to a specific index from the system.

Example: $inv_map->_remove_index_from_all({ -index => $index });

This *completely* removes it from all groups and the master system entries.

It is a null operation to remove an undefined index.

remove_key_from_group($parm_ref);

Remove all references to a specific key for all indexes for a group.

Example: $inv_map->remove({ -group => $group, -key => $key });

Returns undef if the key speced was not even in database. Returns '1' if the key speced was in the database, and has been successfully deleted.

croaks on errors.

list_all_keys_in_group($parm_ref);

Returns an anonymous array containing a list of all defined keys in the specified group.

Example: $keys = $inv_map->list_all_keys_in_group({ -group => $group });

Note: This can result in *HUGE* returned lists. If you have a lot of records in the group, you are better off using the iteration support ('first_key_in_group', 'next_key_in_group').

first_key_in_group($parm_ref);

Returns the 'first' key in the -group based on hash ordering.

Returns 'undef' if there are no keys in the group.

Example: my $first_key = $inv_map->first_key_in_group({-group => $group});

next_key_in_group($parm_ref);

Returns the 'next' key in the group based on hash ordering.

Returns 'undef' when there are no more keys in the group or if the passed -key is not in the group map.

Example: my $next_key = $inv_map->next_key_in_group({ -group => $group, -key => $key });

list_all_indexes_in_group($parm_ref);

Returns an anonymous array containing a list of all defined indexes in the group

Example: $indexes = $inv_map->list_all_indexes_in_group({ -group => $group });

Note: This can result in *HUGE* returned lists. If you have a lot of records in the group, you are better off using the iteration support (first_index_in_group(), next_index_in_group())

first_index_in_group;

Returns the 'first' index in the -group based on hash ordering. Returns 'undef' if there are no indexes in the group.

Example: my $first_index = $inv_map->first_index_in_group({ -group => $group });

next_index_in_group({-group = $group, -index => $index});>

Returns the 'next' index in the -group based on hash ordering. Returns 'undef' if there are no more indexes.

Example: my $next_index = $inv_map->next_index_in_group({-group => group, -index => $index});

list_all_indexes;

Returns an anonymous array containing a list of all defined indexes in the map.

Example: $indexes = $inv_map->list_all_indexes;

Note: This can result in *HUGE* returned lists. If you have a lot of records in the map or do not have a lot memory, you are better off using the iteration support ('first_index', 'next_index')

first_index;

Returns the 'first' index in the system based on hash ordering. Returns 'undef' if there are no indexes.

Example: my $first_index = $inv_map->first_index;

next_index({-index = $index});>

Returns the 'next' index in the system based on hash ordering. Returns 'undef' if there are no more indexes.

Example: my $next_index = $inv_map->next_index({-index => $index});

list_all_groups;

Returns an anonymous array containing a list of all defined groups in the map.

Example: $groups = $inv_map->list_all_groups;

If you have a lot of groups in the map or do not have a lot of memory, you are better off using the iteration support ('first_group', 'next_group')

first_group;

Returns the 'first' group in the system based on hash ordering. Returns 'undef' if there are no groups.

Example: my $first_group = $inv_map->first_group;

next_group ({-group = $group });>

Returns the 'next' group in the system based on hash ordering. Returns 'undef' if there are no more groups.

Example: my $next_group = $inv_map->next_group({-group => $group});

Internals

The routines after this point are _internal_ to the object. Do not access them from outside the object.

They are documented in the POD for code maintainence reasons only.

You Have Been Warned. ;)

_bare_search($parm_ref);

Performs a query on the map and returns the results as a an anonymous array containing the keys and rankings.

Example:

 my $query = Search::InvertedIndex::Query->new(...);
 my $result = $inv_map->search({ -query => $query });

Performs a complex multi-key match search with boolean logic and optional search term weighting.

The search request is formatted as follows:

my $result = $inv_map->search({ -query => $query });

where '$query' is a Search::InvertedIndex::Query object.

Each node can either be a specific search term with an optional weighting term (a Search::InvertedIndex::Query::Leaf object) or a logic term with its own sub-branches (a Search::Inverted::Query object).

The weightings are applied to the returned matches for each search term by multiplication of their base ranking before combination with the other logic terms.

This allows recursive use of search to resolve arbitrarily complex boolean searches and weight different search terms.

Returns a reference to a hash of indexes and their rankings.

_get_data_for_index_enum($parm_ref);

Returns the data record for the passed -index_enum.

Returns undef if no data record exists for the requested -index_enum.

Example: my $data = $self->_get_data_for_index_enum({ -index_enum => $index_enum });

_and($terms);

Takes the passed list of search data results and merges them via logical _and. Merged ranking is the sum of the individual rankings.

_nand($terms);

Takes the passed list of search data results and merges them via logical NAND (Not And). Merged ranking is the sum of the individual rankings.

_or($terms);

Takes the passed list of search data results and merges them via logical OR. Merged ranking is the sum of the individual rankings.

_pack_list($hash_ref);

Internal method. Not for access outside of the module.

Packs the passed hash ref of enum keys and signed 16 bit int values into a dense binary structure. There is an endian dependancy here.

_unpack_list($packed_list);

Internal method. Not for access outside of the module.

Unpacks the passed dense binary structure into an anonymous hash of enum keys and signed 16 bit int values. There is an endian dependancy here.

_increment_enum($enum_value);

Internal method. Not for access outside of the module.

Increments an 'enum' (internally a 12 digit hexadecimal number) by 1.

_untaint($string);

Untaints the passed string. Use with care.

DATABASE STRUCTURES

The inverted database uses a complex overlay built on a generic key/value accessible database (it really is fairly 'database agnostic').

It is organized into sub-sets of information by database key name space:

 ; Counter. Incremented for new groups, decremented for deleted groups.
 number_of_groups        -> # (decimal integer) 
 
 ; Counter. Incremented for new indexes, decremented for deleted indexes.
 number_of_indexes       -> # (decimal integer) 

 ; Counter. Incremented for new keys, decremented for deleted keys.
 number_of_keys          -> # (decimal integer) 
 
 ; The 'high water' mark used in assigning new index_enum keys
 index_enum_counter      -> # (12 digit hex number)

 ; Maps an index ("file") to its assigned index enumeration key
 $INDEX<index>               -> index_enum  

 ; Maps the assigned index enumeration back to the index ("file") and 
 ; provides pointers to the 'next' and 'prev' index_enums in the system
 $INDEX_ENUM<index_enum>         -> _next_index_enum_ _prev_index_enum_ index 

 ; Maps the 'first' 'index_enum' for the system 
 ${INDEX_ENUM}first_index_enum     -> index_enum of 'first' index_enum for the system 

 ; Data record for the index ("File"). Wrapped using 'Storable' 
 $INDEX_ENUM_DATA<index_enum>_data   -> data
 
 ; The 'high water' mark used in assigning new group_enum keys
 group_enum_counter      -> # (12 digit hex number)

 ; Maps a group's name to its assigned group enumeration key
 $GROUP<groupname>           -> group_enum  

 ; Maps the assigned group enumeration key to a group and provides
 ; pointers to the 'next' and 'previous' groups in the system.
 $GROUP_ENUM<group_enum>         -> _prev_group_enum_ _next_group_enum_ $group  

 ; Maps the 'first' 'group_enum' for the system 
 ${GROUP_ENUM}first_group_enum     -> group_enum of 'first' group_enum for the system 

 ; Counter. Incremented for new keys, decremented for deleted keys.
 $GROUP_ENUM_DATA<group_enum>_number_of_keys     -> # (decimal integer)

 ; Counter. Incremented for new indexes, decremented for deleted indexes.
 $GROUP_ENUM_DATA<group_enum>_number_of_indexes  -> #  (decimal integer)

 ; 'High water' mark used in assigning new key_enum values for keys
 $GROUP_ENUM_DATA<group_enum>_key_enum_counter   -> # (12 digit hex number) 

 ; Maps the 'first' 'key_enum' for the group
 $GROUP_ENUM_DATA<group_enum>_first_key_enum         -> key_enum of 'first' key_enum

 ; Maps the 'first' 'index_enum' for the group
 $GROUP_ENUM_DATA<group_enum>_first_index_enum       -> index_enum of 'first' index_enum for the group

 ; network order packed list of (6 byte) key_enums and 
 ; (16 bit signed) relevance rankings for the specified group_enum 
 ; and index_enum
 $GROUP_ENUM_DATA<group_enum>$INDEXED_KEY_LIST<index_enum>           -> key_list 

 ; Pointers to the 'next' and 'previous' index_enums for this group. 
 $GROUP_ENUM_DATA<group_enum>$INDEX_ENUM_GROUP_CHAIN<index_enum>           -> _prev_index_enum_ _next_index_enum_ 

 ; network order packed list of (6 byte) index_enums 
 ; and (16 bit signed) relevance rankings for the specified group_enum
 ; and key_enum
 $GROUP_ENUM_DATA<group_enum>$KEYED_INDEX_LIST<key_enum>             -> index_list

 ; Maps 'key's to 'key_enum's
 $GROUP_ENUM_DATA<group_enum>$KEY_TO_KEY_ENUM<key>                  -> key_enum 

 ; Maps 'key_enum's to 'key's and provides pointers to the
 ; 'next' and 'previous' keys for the group
 $GROUP_ENUM_DATA<group_enum>$KEY_ENUM_TO_KEY_AND_CHAIN<key_enum>             -> _prev_key_enum_ _next_key_enum_ key 

COPYRIGHT

Copyright 1999, Benjamin Franz (<URL:http://www.nihongo.org/snowhare/>) and FreeRun Technologies, Inc. (<URL:http://www.freeruntech.com/>). All Rights Reserved. This software may be copied or redistributed under the same terms as Perl itelf.

AUTHOR

Benjamin Franz

TODO

Everything.

Concept item for evaluation: By storing a dense list of all indexed keywords, you would be able to use a regular expression or other fuzzy search matching scheme comparatively efficiently, locate possible words via a grep and then search on the possibilities. Seems to make sense to implement that as _another_ module that uses this module as a backend. 'Search::InvertedIndex::Fuzzy' perhaps.

SEE ALSO

 Search::InvertedIndex::Query  Search::InvertedIndex::Query::Leaf
 Search::InvertedIndex::Result Search::InvertedIndex::Update
 Search::InvertedIndex::DB::DB_File_SplitHash