The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Search.pm - provide framework for multiple searches

SYNOPSIS

    use Search::TextSearch;
    use Search::Glimpse;

DESCRIPTION

This module is a base class interfacing to search engines. It defines an interface that can be used by any of the Search search modules, such as Search::Glimpse or Search::TextSearch, which are the standard ones included with the module.

It exports no routines, just provides methods for the other classes.

Search Comparisons

                         Text           Glimpse 
                        -------        --------
 Speed                   Medium        Dependent on uniqueness of terms
 Requires add'l software No            Yes   
 Search individually     Yes           Yes
  for fields               
 Search multiple files   Yes           Yes
 Allows spelling errors  No            Yes
 Fully grouped matching  No            No

Methods

Only virtual methods are supported by the Search::Base class. Those search modules that inherit from it may export static methods if desired. There are the "Global Parameter Method", "Column Setting Functions", and "Row Setting Functions". "SEE ALSO" Search::Glimpse, Search::TextSearch.

Global Parameter Method

    $s->global(param,val); 
    $status = $s->global(param); 
    %globals = $s->global();

Allows setting of the parameters that are global to the search. The standard ones are listed below.

base_directory

Those engines which look for matches in index files can read this to get the base directory of the images.

case_sensitive

This is a global version of the cases field. If set, the search engine should return only matches which exactly match, including distinction between lower- and upper-case letters. The default is not set, to ignore case.

error_page

A page or filename to be displayed when a user error occurs. Passed along with a single string parameter to the error_routine, as in:

     &{$self->{global}->{error_routine}}
        ($self->{global}->{error_page}, $msg);
error_routine

Reference to a subroutine which will send errors to the user. The default is '\&croak'.

exact_match

Strings sent as match specifications will have double quotes put around them, meaning the words must be found in the order they are put in. Any double quotes contained in the string will be silently deleted.

first_match

The number of the first match returned in a of more_matches. This tells the calling program where to start their increment.

history

The number of searches to keep in the history.

head_skip

Used for the TextSearch module to indicate lines that should be skipped at the beginning of the file. Allows a header to be skipped if it should not be searched.

The number of searches to keep in the history.

index_delim

The delimiter character to terminate the return code in an ascii index file. In Search::Glimpse and Search::TextSearch, the default is "\t", or TAB. This is also the default for {return_delim} if that is not set.

If field-matching is being used, this the character/string used for splitting the fields. If properly escaped, and {return_delim} is used for joining fields, it can be a regular expression -- Perl style.

index_file

A specification of an index file (or files) to search. The usage is left to the module -- it could, for example, be an anonymous array (as in Search::TextSearch) or wild-card specification for multiple indices (as in Search::Glimpse).

log_routine

A reference to a subroutine to log an error or status indication. By default, it is '\&Carp::carp';

match_limit

The number of matches to return. Not to be confused with max_matches, at which number the search will terminate. Additional matches will be stored in the file pointed to by save_dir, session_id, and search_mod. The default is 50.

matches

Set by the search routine to indicate the number of matches in the last search. If the engine can return the total number of matches (without the data) then that is the result.

min_string

The minimum size of search string that is supported. Using a size of less than 4 for Glimpse, for example, is not wise.

next_pointer

The pointer to the next list of matches. This is for engines that can return only a subset of the elements starting from an element. For making a next match list.

next_url

The base URL that should be used to invoke the more_matches function. Provided as an object-contained scratchpad value for the calling routine -- it will not be used or modified by Search::Base.

There are a couple of useful ways to use this to invoke the proper more_matches search that are shown in the example search CGI provided with this module set. Both involve setting the next_url and combining it with a random session_id and search_mod.

If set, the search engine should return matches which match any of the search patterns. The default is not set, requiring matches to all of the keywords for a match return.

overflow

Set by the search routine if it matched the maximum number before reaching the end of the search. Set to undef if not supported.

record_delim

This sets the type of record that will be searched. For the Search::TextSearch module, this is the same as Perl -- in fact, the default is to use $/ (at the time of object creation) as the default.

For the Search::Glimpse module, the following mappings occur:

    $/ = "\n"        Glimpse default
    $/ = "\n\n"      -d '$$'
    $/ = ""          -d '$$'
    $/ = undef       -d 'NeVAiRbE'
    anything else    passed on as is (-d 'whatever')

One useful pattern is '^From ', which will make each email message in a folder a separate record.

If you are doing this, and expect to be doing field returns, it will probably be useful to set "\n\n" or "\n" as the default index_delim. If used in combination with the obscure anonymous hash definition of return_fields, you can search and return mail headers on each message that matches.

return_delim

The delimiter character to join fields that are cut out of the matched line/paragraph/page. The default is to set it to be the same as {index_delim} if not explicitly set.

return_fields

The fields that will be returned from the line/paragraph/page matched. This is not to be confused with the fields setting -- it will not affect the matching, only the returned fields. The default (when it is undefined) is to return only the first field. There are several options for this field.

If the value is an ARRAY reference, an integer list of the columns to be returned is assumed.

If the value is a HASH reference, then all words found AFTER the key of the hash (with assumed but not required whitespace as a separator) up to the value of the item (used as a delimiter). The following example will print the value of the From:, Reply-to: and Date: headers from any message in your (UNIX) system mailbox that contains 'foobar'.

    $s = new Search::TextSearch
            return_fields => {
                            From: => "\n",
                            Reply-To: => "\n",
                            Date: => "\n",
                            },
            record_delim    => "\nFrom ",
            search_file     => $ENV{MAIL};

    print $s->search('foobar');
return_file_name

All return options will be ignored, and the file names of any matches will be returned instead. The limit match-to-field routines are still enabled for Search::TextSearch, but not for Glimpse, since the 'glimpse -l' option is used for that.

save_dir

The directory that search caches (for the more_matches function) will be saved in. Only applies to file save mode.

search_history

An anonymous array [size history] of 'unevaled' strings that contain the search history. When evaled by the Search::Base module, will create an environment in the object that duplicates previous searches. It is up to the application to save the value.

search_mod

This is used to develop a unique search save id for a user with a consistent session_id. For the more_matches function.

search_port

The port (passed to glimpse with the -K option) that is to be used for a network-attached search server.

search_server

The host name of a network-attached search server, passed to glimpse with the -J option.

session_id

This is used to determine the save file name or hash key used to cache a search (for the more_matches) function.

speed

The speed of search desired, in an integer value from one to 10. Those engines that have a faster method to search (possibly at a cost of comprensivity) can adjust their routines accordingly.

spelling_errors

Those engines that support "tolerant matching" can honor this parameter to set the number of spelling errors that they will allow for. This can slow search dramatically on big databases. Ignored by search engines that don't have the capability.

substring_match

If set, the search engine should return partial, or substring, matches. The default is not set, to indicate whole word matching. This can slow search dramatically on big databases.

uneval_routine

A reference to a subroutine to save the search history to a cache. By default, it is '\&uneval', the routine supplied with Search::Base.

METHODS

Virtual methods provided

more_matches

Given a file with return codes from previous searches, one per line, returns an array with the correct matches in the array. Opens the file in directory save_dir, with session information appended (the session_id and search_mod), and returns match_limit matches, starting at next_pointer.

This is the main method defined in the individual search engine. You can submit a single parameter for a quick search, which will be interpreted as the one and only search specification, overriding any settings residing in the specs array. Options can be specified at object creation, or separately with the global method. Or a search_spec can be specified, which will temporarily override the setting in specs (for that invocation only).

Otherwise, the parameters are named search options as documented above. Examples:

    # Simple search with default options for 'foobar' in the named file
    $s = new Search::TextSearch search_file => '/var/adm/messages');
    @found = $s->search('foobar');

    # Search for 'foobar' in /var/adm/messages, return only fields 0 and 2
    # where fields are separated by spaces
    $s = new Search::TextSearch;
    @found = $s->search( search_file   => '/var/adm/messages',
                         search_spec   => 'foobar',
                         return_fields => [0,2],
                         return_delim  => ' ',
                         index_delim   => '\s+'                 );

    # Search for 'foobar' in any file containing 'messages' in
    # the default glimpse index, return the file names
    $s = new Search::Glimpse;
    @found = $s->search( search_spec   => 'foobar',
                         search_file   => 'messages',
                         return_file_name  => 1,       );

    # Same as above except use glimpse index located in /var directory
    $s = new Search::Glimpse;
    @found = $s->search( search_spec       => 'foobar',
                         base_directory    => '/var',
                         search_file       => 'messages',
                         return_file_name  => 1,       );

    # Search all files in /etc
    # Return file names with  lines that have 'foo' in field 1
    # and 'bar' in field 3, with case sensitivity
    # (using the default field delimiter of \t)
    $s = new Search::TextSearch;
    $s->rowpush('foo', 1);
    $s->rowpush('bar', 3);
    chop(@files = `ls /etc`);
    @found = $s->search( search_file   => [@files],
                         case_sensitive  => 1,
                         return_file_name  => 1,       );

    # Same as above using direct access to specs/fields
    $s = new Search::TextSearch;
    $s->specs('foo', 'bar');
    $s->fields(1, 3);
    chop(@files = `ls /etc`);
    @found = $s->search( search_file   => [@files],
                         case_sensitive  => 1,
                         return_file_name  => 1,       );
    # Repeat search with above settings, except for specs,
    # if less than 4 matches are found
    if(@found < 4) {
        @found = $s->search('foo');
    }

Column Setting Methods

Column setting functions allow the setting of a series of columns of match criteria:

    $search->specs('foo', 'foo bar', 'foobar');
    $search->fields(1, 3, 4);

This is an example for the specs and fields match criteria, which are the search specifications and the the fields to search, respectively. Similar functions are provided for mods, links, cases, negates, open_parens, and close_parens.

For the included Search::Glimpse and Search::TextSearch modules, an item will match the above example only if field (or column) 1 contains 'foo', field 3 contains 'foo' and/or 'bar', and field 4 contains foobar. The setting of the case_sensitive, or_search, and substring_match terms will be honored as well.

For simple searches, only one term need be set, and the grouping functions links, open_parens, and close_parens are ignored. In most cases, if the setting for a particular column is not defined for a row, the value in the global setting is used.

specs

The search text, raw, per field. This is the only item that necessarily needs to be set to do a search.

If more than one specification is present, there are three forms of behavior supported by the included Search::TextSearch and Search::Glimpse modules. First, if there are multiple search specifications, they are combined together, just as they would if separated by spaces (and not quoted). Second, if the number of specs matches the number of fields, each spec must match the field that it is associated with (subject to the or_search and case_sensitive settings within that field). Last, if there are more fields than specs, only the columns in fields are searched for the combined specs.

fields

The column numbers to search, where a column is a field separated by index_delim. In the Search::TextSearch and Search::Glimpse modules, this becomes operative in one of two ways. If the number of specs match the number of fields, each specification is separately checked against its associated field. If the number of fields is different from the number of specs, all specs are applied, but only to the text in the specified fields. Both first match on all of the text in the row, then filter the match with another routine that checks for matches in the specified fields.

mods

Modifies the match criteria. Recognized modifications might be:

    start   Matches when the field starts with the spec
    sub     Match substrings.

Not supported in the included modules.

The link to the previous row. If there are two fields to search, with two different specs, this determines whether the search is AND, OR, or NEAR. For engines that support it, NEAR matches with in $self->global('near') words of the previous word (forward only). Not supported in the included modules.

cases

For advanced search engines that support full associative case-sensitivity. Determines whether the particular match in this set will be case-sensitive or not. If the search engine doesn't support independent case-sensitivity (like the Search::TextSearch and Search::Glimpse modules), the value in or_search will be used. Not supported in the included modules.

negates

Negates the sense of the match for that term. Allows searches like "spec1 AND NOT spec2" or "NOT spec1 AND spec2". Not supported in the included modules.

open_parens

Determines whether a logical parentheses will be placed on the left of the term. Allows grouping of search terms for more expressive matching, i.e. "(AUTHOR Shakespeare AND TYPE Play ) NOT TITLE Hamlet". Not supported in the included modules.

close_parens

Determines whether a logical parentheses will be placed on the right of the term. Not supported in the included modules.

Row Setting Methods

Row setting functions allow the setting of all columns in a row.

    $query->rowpush($field,$spec,$case,$mod,$link,$left,$right);
    ($field,$spec,$case,$mod,$link,$left,$right) = $query->rowpop();
    @oldvals = $query->rowset(n,$field,$spec,$case,$mod,$link,$left,$right);

You can ignore the trailing parameters for a simple search. For example:

    $field = 'author';
    $spec = 'forsythe';
    $limit = 25;
    $query = new Search::Glimpse;
    $query->rowpush($field,$spec);
    @rows = $query->search( match_limit => 25 );

This searches the field 'author' for the name 'forsythe', with all other options at their defaults (ignore case, match substrings, not negated, no links, no grouping), and will return 25 matches (sets the matchlimit global). For a more complex search, you can add the rest of the parameters as needed.

SEE ALSO

glimpse(1), Search::Glimpse(3L), Search::TextSearch(3L)

AUTHOR

Mike Heins, <mikeh@iac.net>

2 POD Errors

The following errors were encountered while parsing the POD:

Around line 326:

'=item' outside of any '=over'

Around line 401:

You forgot a '=back' before '=head2'