The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

App::ElasticSearch::Utilities::QueryString - CLI query string fixer

VERSION

version 8.8

SYNOPSIS

This class provides a pluggable architecture to expand query strings on the command-line into complex Elasticsearch queries.

ATTRIBUTES

context

Defaults to 'query', but can also be set to 'filter' so the elements will be added to the 'must' or 'filter' parameter.

search_path

An array reference of additional namespaces to search for loading the query string processing plugins. Example:

    $qs->search_path([qw(My::Company::QueryString)]);

This will search:

    App::ElasticSearch::Utilities::QueryString::*
    My::Company::QueryString::*

For query processing plugins.

default_join

When fixing up the query string, if two tokens are found next to eachother missing a joining token, join using this token. Can be either AND or OR, and defaults to AND.

plugins

Array reference of ordered query string processing plugins, lazily assembled.

fields_meta

A hash reference with the field data from App::ElasticSearch::Utilities::es_index_fields.

METHODS

expand_query_string(@tokens)

This function takes a list of tokens, often from the command line via @ARGV. Uses a plugin infrastructure to allow customization.

Returns: App::ElasticSearch::Utilities::Query object

TOKENS

The token expansion plugins can return undefined, which is basically a noop on the token. The plugin can return a hash reference, which marks that token as handled and no other plugins receive that token. The hash reference may contain:

query_string

This is the rewritten bits that will be reassembled in to the final query string.

condition

This is usually a hash reference representing the condition going into the bool query. For instance:

    { terms => { field => [qw(alice bob charlie)] } }

Or

    { prefix => { user_agent => 'Go ' } }

These conditions will wind up in the must or must_not section of the bool query depending on the state of the the invert flag.

invert

This is used by the bareword "not" to track whether the token invoked a flip from the must to the must_not state. After each token is processed, if it didn't set this flag, the flag is reset.

dangles

This is used for bare words like "not", "or", and "and" to denote that these terms cannot dangle from the beginning or end of the query_string. This allows the final pass of the query_string builder to strip these words to prevent syntax errors.

Extended Syntax

The search string is pre-analyzed before being sent to ElasticSearch. The following plugins work to manipulate the query string and provide richer, more complete syntax for CLI applications.

App::ElasticSearch::Utilities::QueryString::Barewords

The following barewords are transformed:

    or => OR
    and => AND
    not => NOT

App::ElasticSearch::Utilities::QueryString::Text

Provides field prefixes to manipulate the text search capabilities.

Terms Query via '='

Provide an '=' prefix to a query string parameter to promote that parameter to a term filter.

This allows for exact matches of a field without worrying about escaping Lucene special character filters.

E.g.:

    user_agent:"Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"

Is evaluated into a weird query that doesn't do what you want. However:

    =user_agent:"Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"

Is translated into:

    { term => { user_agent => "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1" } }

Wildcard Query via '*'

Provide an '*' prefix to a query string parameter to promote that parameter to a wildcard filter.

This uses the wild card match for text fields to making matching more intuitive.

E.g.:

    *user_agent:"Mozilla*"

Is translated into:

    { wildcard => { user_agent => "Mozilla* } }

Regexp Query via '/'

Provide an '/' prefix to a query string parameter to promote that parameter to a regexp filter.

If you want to use regexp matching for finding data, you can use:

    /message:'\\bden(ial|ied|y)'

Is translated into:

    { regexp => { message => "\\bden(ial|ied|y)" } }

Fuzzy Matching via '~'

Provide an '~' prefix to a query string parameter to promote that parameter to a fuzzy filter.

    ~message:deny

Is translated into:

    { fuzzy => { message => "deny" } }

Phrase Matching via '+'

Provide an '+' prefix to a query string parameter to promote that parameter to a match_phrase filter.

    +message:"login denied"

Is translated into:

    { match_phrase => { message => "login denied" } }

Automatic Match Queries for Text Fields

If the field meta data is provided and the field is a text type, the query will automatically be mapped to a match query.

    # message field is text
    message:"foo"

Is translated into:

    { match => { message => "foo" } }

App::ElasticSearch::Utilities::QueryString::IP

If a field is an IP address uses CIDR Notation, it's expanded to a range query.

    src_ip:10.0/8 => src_ip:[10.0.0.0 TO 10.255.255.255]

App::ElasticSearch::Utilities::QueryString::Ranges

This plugin translates some special comparison operators so you don't need to remember them anymore.

Example:

    price:<100

Will translate into a:

    { range: { price: { lt: 100 } } }

And:

    price:>50,<100

Will translate to:

    { range: { price: { gt: 50, lt: 100 } } }

Supported Operators

gt via >, gte via >=, lt via <, lte via <=

App::ElasticSearch::Utilities::QueryString::Underscored

This plugin translates some special underscore surrounded tokens into the Elasticsearch Query DSL.

Implemented:

_prefix_

Example query string:

    _prefix_:useragent:'Go '

Translates into:

    { prefix => { useragent => 'Go ' } }

App::ElasticSearch::Utilities::QueryString::FileExpansion

If the match ends in .dat, .txt, .csv, or .json then we attempt to read a file with that name and OR the condition:

    $ cat test.dat
    50  1.2.3.4
    40  1.2.3.5
    30  1.2.3.6
    20  1.2.3.7

Or

    $ cat test.csv
    50,1.2.3.4
    40,1.2.3.5
    30,1.2.3.6
    20,1.2.3.7

Or

    $ cat test.txt
    1.2.3.4
    1.2.3.5
    1.2.3.6
    1.2.3.7

Or

    $ cat test.json
    { "ip": "1.2.3.4" }
    { "ip": "1.2.3.5" }
    { "ip": "1.2.3.6" }
    { "ip": "1.2.3.7" }

We can source that file:

    src_ip:test.dat      => src_ip:(1.2.3.4 1.2.3.5 1.2.3.6 1.2.3.7)
    src_ip:test.json[ip] => src_ip:(1.2.3.4 1.2.3.5 1.2.3.6 1.2.3.7)

This make it simple to use the --data-file output options and build queries based off previous queries. For .txt and .dat file, the delimiter for columns in the file must be either a tab or a null. For files ending in .csv, Text::CSV_XS is used to accurate parsing of the file format. Files ending in .json are considered to be newline-delimited JSON.

You can also specify the column of the data file to use, the default being the last column or (-1). Columns are zero-based indexing. This means the first column is index 0, second is 1, .. The previous example can be rewritten as:

    src_ip:test.dat[1]

or: src_ip:test.dat[-1]

For newline delimited JSON files, you need to specify the key path you want to extract from the file. If we have a JSON source file with:

    { "first": { "second": { "third": [ "bob", "alice" ] } } }
    { "first": { "second": { "third": "ginger" } } }
    { "first": { "second": { "nope":  "fred" } } }

We could search using:

    actor:test.json[first.second.third]

Which would expand to:

    { "terms": { "actor": [ "alice", "bob", "ginger" ] } }

This option will iterate through the whole file and unique the elements of the list. They will then be transformed into an appropriate terms query.

Wildcards

We can also have a group of wildcard or regexp in a file:

    $ cat wildcards.dat
    *@gmail.com
    *@yahoo.com

To enable wildcard parsing, prefix the filename with a *.

    es-search.pl to_address:*wildcards.dat

Which expands the query to:

    {
      "bool": {
        "minimum_should_match":1,
        "should": [
           {"wildcard":{"to_outbound":{"value":"*@gmail.com"}}},
           {"wildcard":{"to_outbound":{"value":"*@yahoo.com"}}}
        ]
      }
    }

No attempt is made to verify or validate the wildcard patterns.

Regular Expressions

If you'd like to specify a file full of regexp, you can do that as well:

    $ cat regexp.dat
    .*google\.com$
    .*yahoo\.com$

To enable regexp parsing, prefix the filename with a ~.

    es-search.pl to_address:~regexp.dat

Which expands the query to:

    {
      "bool": {
        "minimum_should_match":1,
        "should": [
          {"regexp":{"to_outbound":{"value":".*google\\.com$"}}},
          {"regexp":{"to_outbound":{"value":".*yahoo\\.com$"}}}
        ]
      }
    }

No attempt is made to verify or validate the regexp expressions.

App::ElasticSearch::Utilities::QueryString::Nested

Implement the proposed nested query syntax early. Example:

    nested_path:"field:match AND string"

AUTHOR

Brad Lhotsky <brad@divisionbyzero.net>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2024 by Brad Lhotsky.

This is free software, licensed under:

  The (three-clause) BSD License