NAME

Search::Tools::Keywords - extract keywords from a search query

SYNOPSIS

 use Search::Tools::Keywords;
 use Search::Tools::RegExp;
 
 my $query = 'the quick fox color:brown and "lazy dog" not jumped';
 
 my $kw = Search::Tools::Keywords->new(
            stopwords           => 'the',
            and_word            => 'and',
            or_word             => 'or',
            not_word            => 'not',
            stemmer             => \&your_stemmer_here,
            ignore_first_char   => '\+\-',
            ignore_last_char    => '',
            word_characters     => $Search::Tools::RegExp::WordChar,
            debug               => 0,
            phrase_delim        => '"',
            charset             => 'iso-8859-1',
            lang                => 'en_US',
            locale              => 'en_US.iso-8859-1'
            );
            
 my @words = $kw->extract( $query );
 # returns:
 #   quick
 #   fox
 #   brown
 #   lazy dog
 
 

DESCRIPTION

Do not confuse this class with Search::Tools::RegExp::Keywords.

Search::Tools::Keywords extracts the meaningful words from a search query. Many search engines support a syntax that includes special characters, boolean words, stopwords, and fields, so search queries can become complicated. To separate the wheat from the chaff, the supporting words and symbols are removed and only the actual search terms (keywords) are returned.

This class is used internally by Search::Tools::RegExp. You probably don't need to use it directly. But if you do, read on.

METHODS

new( %opts )

The new() method instantiates a Search::Tools::Keywords object. With the exception of extract(), each of the following methods can also be passed as a key/value pair to new().

extract( query )

The extract method parses query and returns an array of meaningful words. query can either be a scalar string or an array reference (if multiple queries should be parsed simultaneously).

Only positive words are extracted. In other words, if you search for:

 foo not bar
 

then only foo is returned. Likewise:

 +foo -bar
 

would return only foo.
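
The positive-only rule can be illustrated with a short standalone sketch. This is not the module's implementation: positive_terms() is a hypothetical helper that hard-codes the documented default boolean words (and, or, not) and the + and - prefixes, and it ignores phrases, fields, and stopwords.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools: keep only positive terms.
sub positive_terms {
    my ($query) = @_;
    my @keep;
    my $negate_next = 0;
    for my $token ( split /\s+/, lc $query ) {
        next if $token eq '';              # leading whitespace yields an empty field
        if ( $token eq 'not' ) {           # NOT negates the term that follows
            $negate_next = 1;
            next;
        }
        next if $token eq 'and' or $token eq 'or';    # boolean glue words
        if ( $negate_next or $token =~ /^-/ ) {       # negated term: skip it
            $negate_next = 0;
            next;
        }
        $token =~ s/^\+//;                 # drop the "required" prefix
        push @keep, $token;
    }
    return @keep;
}

print join( ' ', positive_terms('foo not bar') ), "\n";    # foo
print join( ' ', positive_terms('+foo -bar') ),   "\n";    # foo
```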

NOTE: All queries are converted to UTF-8. See the charset param.

stemmer

The stemmer function is used to find the root 'stem' of a word. There are many stemming algorithms available, including many on CPAN. The stemmer function should expect to receive two parameters: the Keywords object and the word to be stemmed. It should return exactly one value: the stemmed word.

Example stemmer function:

 use Lingua::Stem;
 my $stemmer = Lingua::Stem->new;
 
 sub mystemfunc
 {
     my ($kw,$word) = @_;
     return $stemmer->stem($word)->[0];
 }
 
 # and pass to Keywords new() method:
 
 my $keyword_obj = Search::Tools::Keywords->new(stemmer => \&mystemfunc);
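
If Lingua::Stem is not available, the same two-argument contract can be tried out with a dependency-free callback. The naive_stem() function below is a hypothetical stand-in that strips a few English suffixes; it is a sketch of the calling convention only, not a real stemming algorithm.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer. It receives the Keywords
# object and the word, and returns exactly one value: the stemmed word.
sub naive_stem {
    my ( $kw, $word ) = @_;    # $kw is unused in this sketch
    # Strip a common suffix, but leave short words alone.
    $word =~ s/(?:ing|ed|s)$// if length($word) > 4;
    return $word;
}

# Called directly here, with undef standing in for the Keywords object.
print naive_stem( undef, 'jumped' ),    "\n";    # jump
print naive_stem( undef, 'searching' ), "\n";    # search
print naive_stem( undef, 'fox' ),       "\n";    # fox
```

A reference to the function (\&naive_stem) would then be passed as the stemmer value, in the same way as \&mystemfunc.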
     

stopwords

A list of common words that should be ignored in parsing out keywords. May be either a string that will be split on whitespace, or an array ref.

NOTE: If a phrase contains a stopword, the phrase is tokenized into words on whitespace and the stopwords are then removed.
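
The tokenize-then-filter behavior described in the note can be sketched as follows. strip_stopwords() is a hypothetical helper, not a method of this class, and comparing case-insensitively is an assumption of the sketch. It returns the surviving words as a list; whether the module rejoins them is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper: split a phrase on whitespace and drop stopwords.
sub strip_stopwords {
    my ( $phrase, $stopwords ) = @_;

    # Accept either a whitespace-separated string or an array ref,
    # mirroring the two forms the stopwords parameter allows.
    my @list = ref($stopwords) eq 'ARRAY'
        ? @$stopwords
        : split( ' ', $stopwords );

    my %is_stop;
    $is_stop{ lc($_) } = 1 for @list;

    return grep { !$is_stop{ lc($_) } } split( ' ', $phrase );
}

print join( ' ', strip_stopwords( 'the lazy dog', 'the' ) ), "\n";    # lazy dog
print join( ' ', strip_stopwords( 'a fox and a dog', [ 'a', 'and' ] ) ), "\n";    # fox dog
```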

ignore_first_char

String of characters to strip from the beginning of all words.

ignore_last_char

String of characters to strip from the end of all words.
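
One plausible reading of these two parameters is that each string is interpolated into a regular expression character class, which would explain why the SYNOPSIS passes the escaped value '\+\-'. The sketch below works under that assumption; strip_edges() is a hypothetical helper and may not match how the module applies the values internally.

```perl
use strict;
use warnings;

# Hypothetical helper: $first and $last are strings of characters,
# already escaped for use inside a character class (e.g. '\+\-').
sub strip_edges {
    my ( $word, $first, $last ) = @_;
    $word =~ s/^[$first]+// if length($first);    # strip leading characters
    $word =~ s/[$last]+$//  if length($last);     # strip trailing characters
    return $word;
}

print strip_edges( '+foo',  '\+\-', '' ),  "\n";    # foo
print strip_edges( '-bar',  '\+\-', '' ),  "\n";    # bar
print strip_edges( 'baz!!', '',     '!' ), "\n";    # baz
```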

ignore_case

If true, all queries are run through Perl's built-in lc() function before parsing. The default is 1 (true). Set to 0 (false) to preserve case.

and_word

Default: and|near\d*

or_word

Default: or

not_word

Default: not

wildcard

Default: *

locale

Set a locale explicitly for a Keywords object. If not set, the locale is inherited from the LC_CTYPE environment variable.

lang

Base language. If not set, extracted from locale or defaults to en_US.

charset

Base charset used for converting queries to UTF-8. If not set, extracted from locale or defaults to iso-8859-1.
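
The fallback rules for lang and charset, and the conversion to UTF-8, can be sketched with the core Encode module. Both split_locale() and to_utf8() are hypothetical helpers written for illustration. They assume a locale string of the form lang.charset, as in the SYNOPSIS, and are not taken from the module's source.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: derive lang and charset from a locale string such
# as 'en_US.iso-8859-1', falling back to the documented defaults.
sub split_locale {
    my ($locale) = @_;
    $locale = '' unless defined $locale;
    my ( $lang, $charset ) = split /\./, $locale, 2;
    $lang    = 'en_US'      unless defined($lang)    && length($lang);
    $charset = 'iso-8859-1' unless defined($charset) && length($charset);
    return ( $lang, $charset );
}

# Hypothetical helper: convert bytes in $charset to UTF-8 encoded bytes.
sub to_utf8 {
    my ( $bytes, $charset ) = @_;
    my $chars = decode( $charset, $bytes );    # bytes -> Perl characters
    return encode( 'UTF-8', $chars );          # characters -> UTF-8 bytes
}

my ( $lang, $charset ) = split_locale('en_US.iso-8859-1');
my $utf8 = to_utf8( "caf\xE9", $charset );     # "caf\xE9" is 'cafe' with an acute e in Latin-1
print "$lang $charset ", length($utf8), "\n";  # en_US iso-8859-1 5
```

The length grows from 4 to 5 because the single Latin-1 byte 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.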

AUTHOR

Peter Karman perl@peknet.com

Based on the HTML::HiLiter regular expression building code, originally by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of this module.

COPYRIGHT

Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, Search::QueryParser