The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Search::QueryParser - parses a query string into a data structure suitable for external search engines

SYNOPSIS

  my $qp = new Search::QueryParser;
  my $s = '+mandatoryWord -excludedWord +field:word "exact phrase"';
  my $query = $qp->parse($s)  or die "Error in query : " . $qp->err;
  $someIndexer->search($query);

  # query with comparison operators and implicit plus (second arg is true)
  $query = $qp->parse("txt~'^foo.*' date>='01.01.2001' date<='02.02.2002'", 1);

  # boolean operators (example below is equivalent to "+a +(b c) -d")
  $query = $qp->parse("a AND (b OR c) AND NOT d");

DESCRIPTION

This module parses a query string into a data structure to be handled by external search engines. For examples of such engines, see File::Tabular and Search::Indexer.

The query string can contain simple terms, "exact phrases", field names and comparison operators, '+/-' prefixes, parentheses, and boolean connectors.

The parser can be parameterized by regular expressions for specific notions of "term", "field name" or "operator" ; see the new method. The parser has no support for lemmatization or other term transformations : these should be done externally, before passing the query data structure to the search engine.

The data structure resulting from a parsed query is a tree of terms and operators, as described below in the parse method. The interpretation of the structure is up to the external search engine that will receive the parsed query ; the present module does not make any assumption about what it means to be "equal" or to "contain" a term.

QUERY STRING

The query string is decomposed into "items", where each item has an optional sign prefix, an optional field name and comparison operator, and a mandatory value.

Sign prefix

Prefix '+' means that the item is mandatory. Prefix '-' means that the item must be excluded. No prefix means that the item will be searched for, but is not mandatory.

As far as the result set is concerned, +a +b c is strictly equivalent to +a +b : the search engine will return documents containing both terms 'a' and 'b', and possibly also term 'c'. However, if the search engine also returns relevance scores, query +a +b c might give a better score to documents containing also term 'c'.

See also section "Boolean connectors" below, which is another way to combine items into a query.

Field name and comparison operator

Internally, each query item has a field name and comparison operator; if not written explicitly in the query, these take default values '' (empty field name) and ':' (colon operator).

Operators have a left operand (the field name) and a right operand (the value to be compared with); for example, foo:bar means "search documents containing term 'bar' in field 'foo'", whereas foo=bar means "search documents where field 'foo' has exact value 'bar'".

Here is the list of admitted operators with their intended meaning :

:

treat value as a term to be searched within field. This is the default operator.

~ or =~

treat value as a regex; match field against the regex.

!~

negation of above

== or =, <=, >=, !=, <, >

classical relational operators

Operators :, ~, =~ and !~ admit an empty left operand (so the field name will be ''). Search engines will usually interpret this as "any field" or "the whole data record".

Value

A value (right operand to a comparison operator) can be

  • just a term (as recognized by regex rxTerm, see new method below)

  • A quoted phrase, i.e. a collection of terms within single or double quotes.

    Quotes can be used not only for "exact phrases", but also to prevent misinterpretation of some values : for example -2 would mean "value '2' with prefix '-'", in other words "exclude term '2'", so if you want to search for value -2, you should write "-2" instead. In the last example of the synopsis, quotes were used to prevent splitting of dates into several search terms.

  • a subquery within parentheses. Field names and operators distribute over parentheses, so for example foo:(bar bie) is equivalent to foo:bar foo:bie. Nested field names such as foo:(bar:bie) are not allowed. Sign prefixes do not distribute : +(foo bar) +bie is not equivalent to +foo +bar +bie.

Boolean connectors

Queries can contain boolean connectors 'AND', 'OR', 'NOT' (or their equivalent in some other languages). This is mere syntactic sugar for the '+' and '-' prefixes : a AND b is translated into +a +b; a OR b is translated into (a b); NOT a is translated into -a. +a OR b does not make sense, but it is translated into (a b), under the assumption that the user understands "OR" better than a '+' prefix. -a OR b does not make sense either, but has no meaningful approximation, so it is rejected.

Combinations of AND/OR clauses must be surrounded by parentheses, i.e. (a AND b) OR c or a AND (b OR c) are allowed, but a AND b OR c is not.

METHODS

new
  new(rxTerm   => qr/.../, rxOp => qr/.../, ...)

Creates a new query parser, initialized with (optional) regular expressions :

rxTerm

Regular expression for matching a term. Of course it should not match the empty string. Default value is qr/[^\s()]+/. A term should not be allowed to include parenthesis, otherwise the parser might get into trouble.

rxField

Regular expression for matching a field name. Default value is qr/\w+/ (meaning of \w according to use locale).

rxOp

Regular expression for matching an operator. Default value is qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/. Note that the longest operators come first in the regex, because "alternatives are tried from left to right" (see "Version 8 Regular Expressions" in perlre) : this is to avoid a<=3 being parsed as a < '=3'.

rxOpNoField

Regular expression for a subset of the operators which admit an empty left operand (no field name). Default value is qr/=~|!~|~|:/. Such operators can be meaningful for comparisons with "any field" or with "the whole record" ; the precise interpretation depends on the search engine.

rxAnd

Regular expression for boolean connector AND. Default value is qr/AND|ET|UND|E/.

rxOr

Regular expression for boolean connector OR. Default value is qr/OR|OU|ODER|O/.

rxNot

Regular expression for boolean connector NOT. Default value is qr/NOT|PAS|NICHT|NON/.

parse
  $q = $queryParser->parse($queryString, $implicitPlus);

Returns a data structure corresponding to the parsed string. The second argument is optional; if true, it adds an implicit '+' in front of each term without prefix, so parse("+a b c -d", 1) is equivalent to parse("+a +b +c -d"). This is often seen in common WWW search engines as an option "match all words".

The return value has following structure :

  { '+' => [{field=>'f1', op=>':', value=>'v1'}, 
            {field=>'f2', op=>':', value=>'v2'}, ...],
    ''  => [...],
    '-' => [...]
  }

In other words, it is a hash ref with 3 keys '+', '' and '-', corresponding to the 3 sign prefixes (mandatory, ordinary or excluded items). Each key holds either a ref to an array of items, or undef (no items with this prefix in the query).

An item is a hash ref containing

field

scalar, field name (may be the empty string)

op

scalar, operator

value

Either

  • a scalar (simple term), or

  • an array ref (list of terms, corresponding to an "exact phrase" in the query), or

  • a recursive ref to another query structure. In that case, op is necessarily '()' ; this corresponds to a subquery in parentheses.

In case of a parsing error, parse returns undef; method err can be called to get an explanatory message.

err
  $msg = $queryParser->err;

Message describing the last parse error

unparse
  $s = $queryParser->unparse($query);

Returns a string representation of the $query data structure.

AUTHOR

Laurent Dami, <laurent.dami AT etat ge ch>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.