NAME

CWB::CEQL - The Common Elementary Query Language for CQP front-ends

SYNOPSIS

  # end users: see section "CEQL SYNTAX" below for an overview of CEQL notation

  use CWB::CEQL;
  our $CEQL = new CWB::CEQL;

  # configuration settings (see METHODS section for details and default values)
  $CEQL->SetParam("pos_attribute", "tag");          # p-attribute for POS tags
  $CEQL->SetParam("lemma_attribute", "lem");        # p-attribute for lemmas
  $CEQL->SetParam("simple_pos", \%lookup_table);    # lookup table for simple POS
  $self->SetParam("simple_pos_attribute", "class"); # p-attribute for simple POS
  $self->SetParam("s_attributes", {"s" => 1});      # s-attributes allowed in CEQL queries
  $self->SetParam("default_ignore_case", 1);        # if 1, default to case-folded search
  $self->SetParam("default_ignore_diac", 0);        # if 1, default to accent-folded search
  $self->SetParam("ignore_case", {"word_attribute" => 1, "lemma_attribute" => 1, ...}); # case/accent folding for individual attributes;
  $self->SetParam("ignore_diac", {"word_attribute" => 1, "lemma_attribute" => 0, ...}); # keys are the strings for attribute parameters (above) plus "s_attributes"
  $self->SetParam("tab_optimisation", 1);           # enable TAB query optimisation

  $cqp_query = $CEQL->Parse($ceql_query);
  if (not defined $cqp_query) {
    @error_msg = $CEQL->ErrorMessage;
    $html_msg = $CEQL->HtmlErrorMessage;
  }
  # $cqp_query can now be sent to CQP backend (e.g. with CWB::CQP module)

  #### extend or modify standard CEQL grammar by subclassing ####
  package BNCWEB::CEQL;
  use base 'CWB::CEQL';

  sub lemma {
    my ($self, $string) = @_;
    ## override 'lemma' rule here (e.g. to allow for BNCweb's ``{bucket/N}'' notation)
    my $orig_result = $self->SUPER::lemma($string); # call original rule if needed
  }

  ## you can now use BNCWEB::CEQL in the same way as CWB::CEQL

DESCRIPTION

This module implements the core syntax of the Common Elementary Query Language (CEQL) as a DPP grammar (see CWB::CEQL::Parser for details). It can either be used directly, adjusting configuration settings with the SetParam method as required, or subclassed in order to modify and/or extend the grammar. In the latter case, you are strongly advised not to change the meaning of core CEQL features, so that end users can rely on the same familiar syntax in all CEQL-based Web interfaces.

A complete specification of the core CEQL syntax can be found in section "CEQL SYNTAX" below. This is the most important part of the documentation for end users and can also be found online at http://cwb.sf.net/ceql.php.

Application developers can find an overview of relevant API methods and the available configuration parameters (CWB attributes for different linguistic annotations, default case/accent-folding, etc.) in section "METHODS".

Section "EXTENDING CEQL" explains how to extend or customise CEQL syntax by subclassing CWB::CEQL. It is highly recommended to read the technical documentation in section "STANDARD CEQL RULES" and the source code of the CWB::CEQL module. Extended rules are most conveniently implemented as modified copies of the methods defined there.

CEQL SYNTAX

A gentle tutorial-style introduction to CEQL syntax with many examples and exercises can be found in Chapter 6 (pp. 93-117) of Hoffmann et al. (2008), Corpus Linguistics with BNCweb. A quick reference covering the most commonly used CEQL features is included in the CQPweb user interface and can be accessed e.g. at https://cqpweb.lancs.ac.uk/doc/cqpweb-simple-syntax-help.pdf.

The present document aims to give a complete and precise specification of the core CEQL grammar.

Wildcard Patterns

CEQL is based on wildcard patterns for matching word forms and annotations. A wildcard pattern by itself finds all tokens whose surface form matches the pattern. Wildcard patterns must not contain blanks or other whitespace.

The basic wildcards are

    ?    a single arbitrary character
    *    zero or more characters
    +    one or more characters

These wildcards are often used for prefix or suffix searches, e.g. +able (all words ending in "-able" except for the word "able" itself). Clusters of wildcards specify a minimum number of characters, e.g. ???* for 3 or more.

Most other characters match only themselves. However, all CEQL metacharacters (not just wildcards) must be escaped with a backslash \ to match the literal character (e.g. \? to find a question mark). The full set of metacharacters in the core CEQL grammar is

    ? * + , : ! @ / ( ) [ ] { } _ - < >

Some of them are only interpreted as metacharacters in particular contexts. It is safest, and recommended, to escape every literal ASCII punctuation character with a backslash.
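
For example, to find a literal question mark or a hyphenated word (assuming the corpus tokeniser keeps hyphenated words as single tokens):

    \?              a literal question mark
    well\-known     the word "well-known"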

Groups of alternatives are separated by commas and enclosed in square brackets, e.g. [north,south,west,east]. They can include wildcards and an empty alternative can be appended to make the entire set optional (e.g. walk[s,ed,ing,] to match any form of the verb "walk").

Various escape sequences, consisting of a backslash followed by a letter, match specific sets and sequences of characters. Escape sequences recognised by the core CEQL grammar are:

    \a   any single letter
    \A   any sequence of letters (one or more)
    \l   any single lowercase letter
    \L   any sequence of lowercase letters (one or more)
    \u   any single uppercase letter
    \U   any sequence of uppercase letters (one or more)
    \d   any single digit
    \D   any sequence of digits (one or more)
    \w   a single "word" character (letter, number, apostrophe, hyphen)
    \W   any sequence of "word" characters (one or more)

The escape sequences are guaranteed to work correctly for UTF-8 encoded corpora, but may not be fully supported for legacy 8-bit encodings (in which case they might only match ASCII letters and digits).

Wildcard patterns can be negated with a leading exclamation mark !; a negated pattern finds any string that does not match the pattern.
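
For example:

    !*ing           any token that does not end in "ing"
    !\A             any token that does not consist of letters only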

Linguistic Annotation

CEQL queries provide access to three items of token-level annotation in addition to surface forms. They are described below as lemma, POS (part-of-speech tag) and simple POS, which reflects their intended use. However, keep in mind that corpus search interfaces may be configured to access other annotation layers (say, semantic tags instead of simple POS).

A lemma search is carried out by enclosing the wildcard pattern in curly braces, e.g. {go}. All elements of the wildcard pattern described above must be enclosed in the braces, including negation ({!go}). Note that word form and lemma constraints are mutually exclusive on the same token.

A single-token expression in CEQL combines such a lexical constraint with a part-of-speech tag, separated by an underscore _. The POS tag can either be matched directly with a wildcard pattern, or one of a pre-defined set of simple POS tags can be selected (in curly braces). There are four possible combinations for a full token expression:

    WORD_POS
    {LEMMA}_POS
    WORD_{Simple POS}
    {LEMMA}_{Simple POS}

Keep in mind that POS tags may differ between corpora and make sure to read documentation on the respective tagset for successful POS searches. Full POS constraints are wildcard patterns, which is convenient with complex tagsets. In particular, the pattern can be negated, e.g. can_!MD to exclude the frequent modal reading of can. Also keep in mind that simple POS tags are available only if they have been set up for the corpus at hand by an administrator. Even though simple POS constraints aren't wildcard patterns, they can be negated (e.g. {walk}_{!V}).

The lexical constraint can be omitted in order to match a token only by its POS tag. Assuming the Penn treebank tagset and a simple POS tag A for adjectives, these four token expressions are fully equivalent:

    _JJ*     *_JJ*
    _{A}     *_{A}

Optional modifier flags can be appended to each constraint: :c for case-insensitive matching, :d to ignore diacritics (Unicode combining marks, including all accents and umlauts) and :cd for both. If an annotation defaults to case- or diacritic-insensitive mode, this can be overridden with an uppercase modifier :C, :D or :CD. (Mixed combinations are allowed, e.g. :Cd to override a case-insensitive default but ignore diacritics.) Keep in mind that modifiers go outside curly braces:

    {fiancee}:cd_N*:C

Phrase Queries

Phrase queries match sequences of tokens. They consist of one or more token expressions separated by whitespace. Note that the query has to match the tokenization conventions of the corpus at hand. For example, a tag question (", isn't it?") is typically split into five tokens and can be found with the query

    \, is n't it \?

A single + stands for an arbitrary token, a single * for an optional token. Multiple + and/or * can (and should) be bundled for a flexible number of tokens, e.g. ++*** for 2 to 5 arbitrary tokens.

Groups of tokens can be enclosed in round parentheses within a phrase query. Such groups may contain alternatives delimited by pipe symbols (vertical bar, |):

    it was ( ...A... | ...B... | ...C... )

will find "it was" followed by a token sequence that matches either the phrase query A, the phrase query B or the phrase query C. Empty alternatives are not allowed in this case. Whitespace can be omitted after the opening parenthesis, around the pipe symbols and before the closing parenthesis.

A quantifier can be appended to the closing parenthesis of a group, whether or not it includes alternatives. Note that there must not be any whitespace between the closing parenthesis and the quantifier (otherwise it would be interpreted as a separate token expression). Quantifiers specify repetition of the group:

    ( ... )?        0 or 1 (group is optional)
    ( ... )*        0 or more
    ( ... )+        1 or more
    ( ... ){N}      exactly N
    ( ... ){N,M}    between N and M
    ( ... ){N,}     at least N
    ( ... ){0,M}    at most M

Groups can contain further subgroups with alternatives and quantification. Note that group notation is needed to match an open-ended number of arbitrary tokens; it can also be more readable for finite ranges:

    (+)?            same as: *
    (+)*            any number of arbitrary tokens
    (+)+            at least one arbitrary token
    (+){2,5}        same as: ++***

You can think of the group (+) as a matchall symbol for an arbitrary token.

A token expression can be marked as an anchor point with an initial @ sign (the "target" anchor). There must be no whitespace between the marker and the token expression. Numbered anchors are set with the prefixes @0: through @9:. By default, @0: sets the "target" anchor and @1: sets the "keyword" anchor. Further numbered anchors need special support from the GUI software executing the CEQL queries.
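
For example, the query

    in @+ of

marks the arbitrary token between "in" and "of" (e.g. "front" in "in front of") as the target anchor; with @1:+ instead of @+, the same token would be marked as the keyword anchor.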

Use XML tags to match the start and end of an s-attribute region, e.g. <s> for the start of a sentence and </s> for a sentence end. Since such tags denote token boundaries rather than full tokens, a tag by itself is not a valid query: always specify at least one token expression. A list of all <text> regions is obtained with

    <text> +

which matches the first token in each text. A pair of corresponding start and end tags matches a complete s-attribute region, e.g.

    <quote> (+)+ </quote>

a <quote> region containing an arbitrary number of tokens (but keep in mind that CQP imposes limits on the number of tokens that can be matched, so very long quotations might not be found).

Attributes on XML start tags can be tested with the notation

    <tag_attribute=PATTERN>

where PATTERN is a wildcard pattern, possibly including negation and case/diacritic modifier flags. It is a quirk of the underlying CQP query language that every XML tag annotation is represented as a separate s-attribute following the indicated naming convention. Therefore, multiple start tags must be specified in order to test several annotations. Also keep in mind that an end tag with the same name is required for matching a full region. A named entity annotated in the input text as

    ... <ne type="ORG" status="fictional">Sirius Cybernetics Corp.</ne> ...

would be matched by the query

    <ne_type=org:c> <ne_status=fict*> (+)+ </ne_type>

Phrase queries can use different matching strategies, selected by a modifier at the start of the query. The default strategy (explicitly selected with (?standard)) includes optional elements at the start of the query, but uses non-greedy matching afterwards; in particular all optional elements at the end of the query are dropped. In some cases, the (?longest) strategy can be useful to include such optional elements and enable greedy matching of quantifiers. See the CQP Query Language Tutorial, Sec. 6.1 for details on matching strategies.
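
For example, the query

    (?longest) as \A as (+)?

matches sequences such as "as happy as ever" including the final optional token, whereas the default strategy would end the match after the second "as".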

Proximity Queries

Proximity queries match co-occurrence patterns. They also build on token expressions, but do not allow any of the constructions of phrase queries. Instead, tokens are filtered based on their co-occurrence with other tokens. There are six basic forms of co-occurrence tests:

    A <<N>> B       B occurs within N tokens around A
    A <<N<< B       B occurs within N tokens to the left of A
    A >>N>> B       B occurs within N tokens to the right of A
    A <<REG>> B     A and B occur in the same region of s-attribute REG

    A <<K,N<< B     B occurs within N tokens to the left of A,
                    but at a distance of at least K tokens
    A >>K,N>> B     B occurs within N tokens to the right of A,
                    but at a distance of at least K tokens

In each case, those occurrences of token expression A are returned which satisfy the constraint. The corresponding positions of B cannot be accessed in the query result. As an example,

    {bucket} <<s>> {kick}_V*

would return all instances of the lemma "bucket" that occur in the same sentence as the verb "kick", but not the matching instances of "kick".

A and B can also be proximity queries themselves, using parentheses to determine the order of evaluation. As an example,

    (A <<3<< B) <<s>> (C <<2>> D)

finds all instances of A that are preceded by B (within 3 tokens to the left) and that also occur in the same sentence as a combination of C and D (within 2 tokens). Proximity queries can be nested to arbitrary depth.

There are two special cases for sequences without parentheses:

    A <<5>> B <<3<< C <<s>> D

applies multiple tests to the instance of A, i.e. it is implicitly parenthesised as

    ((A <<5>> B) <<3<< C) <<s>> D

A sequence of token expressions without any co-occurrence specifiers in between is interpreted as neighbouring tokens, i.e.

    out of {coin}

is rewritten to

    out >>1>> of >>2>> {coin}

and therefore returns only the positions of "out".

Neither XML tags nor anchor points are supported by proximity queries. Likewise, co-occurrence constraints cannot be negated, i.e. you cannot test for non-cooccurrence.

METHODS

The following API methods are inherited from CWB::CEQL::Parser. The explanations below focus on their application in a CEQL simple query frontend. The documentation of SetParam includes a complete listing of available configuration parameters as well as their usage and default values.

$CEQL = new CWB::CEQL;

Create parser object for CEQL queries. Use the Parse method of $CEQL to translate a CEQL query into CQP code.

$cqp_query = $CEQL->Parse($simple_query);

Parses a simple query in CEQL syntax and returns the equivalent CQP code. If there is a syntax error in $simple_query or parsing fails for some other reason, an undefined value is returned.

@text_lines = $CEQL->ErrorMessage;
$html_code = $CEQL->HtmlErrorMessage;

If the last CEQL query failed to parse, these methods return an error message either as a list of text lines (ErrorMessage) or as pre-formatted HTML code that can be used directly by a Web interface (HtmlErrorMessage). The error message includes a backtrace of the internal call stack in order to help users identify the precise location of the problem.
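
In a Web interface, these methods are typically combined with Parse along the following lines (a sketch; $user_input and the CQP back-end handle $cqp, e.g. a CWB::CQP object, are placeholders):

  my $cqp_query = $CEQL->Parse($user_input);
  if (not defined $cqp_query) {
    print $CEQL->HtmlErrorMessage;       # display formatted error report to the user
  }
  else {
    $cqp->exec("Result = $cqp_query");   # pass the translated query to the CQP back-end
  }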

$CEQL->SetParam($name, $value);

Change parameters of the CEQL grammar. Currently, the following parameters are available:

pos_attribute

The p-attribute used to store part-of-speech tags in the CWB corpus (default: pos). CEQL queries should not be used for corpora without POS tagging, which we consider to be a minimal level of annotation.

lemma_attribute

The p-attribute used to store lemmata (base forms) in the CWB corpus (default: lemma). Set to undef if the corpus has not been lemmatised.

simple_pos

Lookup table for simple part-of-speech tags (in CEQL constructions like run_{N}). Must be a hashref with simple POS tags as keys and CQP regular expressions matching an appropriate set of standard POS tags as the corresponding values. The default value is undef, indicating that no simple POS tags have been defined. A very basic setup for the Penn Treebank tag set might look like this:

  $CEQL->SetParam("simple_pos", {
      "N" => "NN.*",   # common nouns
      "V" => "V.*",    # any verb forms
      "A" => "JJ.*",   # adjectives
    });

simple_pos_attribute

Simple POS tags may use a different p-attribute than standard POS tags, specified by the simple_pos_attribute parameter. If it is set to undef (default), the pos_attribute will be used for simplified POS tags as well.
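
If the simplified tags are stored in a separate p-attribute, e.g. one named class as in the SYNOPSIS above, set the parameter accordingly:

  $CEQL->SetParam("simple_pos_attribute", "class");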

s_attributes

Lookup table indicating which s-attributes in the CWB corpus may be accessed in CEQL queries (using the XML tag notation, e.g. <s> or </s>, or as a distance operator in proximity queries, e.g. <<s>>). The main purpose of this table is to keep the CEQL parser from passing arbitrary tags through to the CQP code, which might generate confusing error messages. Must be a hashref with the names of valid s-attributes as keys mapped to TRUE values. The default setting only allows sentences (s-units), which should be annotated in every corpus:

  $CEQL->SetParam("s_attributes", { "s" => 1 });

default_ignore_case

Indicates whether CEQL queries should perform case-insensitive matching for word forms and lemmas (:c modifier), which can be overridden with an explicit :C modifier. By default, case-insensitive matching is activated, i.e. default_ignore_case is set to 1.

default_ignore_diac

Indicates whether CEQL queries should ignore accents (diacritics) for word forms and lemmas (:d modifier), which can be overridden with an explicit :D modifier. By default, matching does not ignore accents, i.e. default_ignore_diac is set to 0.

ignore_case

Individual case-insensitivity settings for different attributes. The parameter value is a hash reference with keys word_attribute, lemma_attribute, pos_attribute, simple_pos_attribute and s_attributes (for constraints on XML start tags), and values 0 or 1. If a key is not set in the hash, it defaults to default_ignore_case for word_attribute and lemma_attribute, and to 0 for all other attributes.

Extensions of the CEQL grammar can set and use further keys of their own choosing in the ignore_case and ignore_diac parameters.

ignore_diac

Individual diacritic-insensitivity settings for different attributes. The parameter value is a hash reference with keys word_attribute, lemma_attribute, pos_attribute, simple_pos_attribute and s_attributes, and values 0 or 1. If a key is not set in the hash, it defaults to default_ignore_diac for word_attribute and lemma_attribute, and to 0 for all other attributes.
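
For illustration, the following settings (as in the SYNOPSIS above) apply case-folding to word forms and lemmas, but ignore diacritics only for word forms:

  $CEQL->SetParam("ignore_case", { "word_attribute" => 1, "lemma_attribute" => 1 });
  $CEQL->SetParam("ignore_diac", { "word_attribute" => 1, "lemma_attribute" => 0 });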

tab_optimisation

Rewrite simple phrase searches (possibly with optional tokens, e.g. ++***) as TAB queries for much faster execution.

Note that the TAB rewrite may not be fully equivalent to the original phrase query in some corner cases. If there are optional gaps, it behaves similarly to the standard matching strategy. Therefore, tab_optimisation should be disabled if a different matching strategy has been selected in CQP.

See the CWB::CEQL::Parser manpage for more detailed information and further methods.

EXTENDING CEQL

While the core CEQL syntax documented above already constitutes a fairly complex and powerful query language, CEQL is designed to be customised and extended. Such CEQL extensions are implemented by subclassing the standard CEQL grammar. They are typically provided as a separate Perl module file (.pm), but small ad-hoc extensions can also be included directly in a Perl script.

The basic template for a CEQL extension in a separate .pm file is as follows:

    package My::CEQL;
    use base 'CWB::CEQL';
   
    # override selected CEQL grammar rules here
   
    1;

You can then use My::CEQL; in your Perl scripts in the same way as CWB::CEQL.
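
A minimal script using the extended grammar then looks exactly like one using the standard grammar:

    use My::CEQL;

    our $CEQL = new My::CEQL;
    my $cqp_query = $CEQL->Parse($simple_query);  # $simple_query as entered by the user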

Parameters

If you want to define new grammar parameters or change the default parameter settings, your grammar has to provide a constructor method that calls the constructor of the base grammar, e.g.

    sub new {
      my $class = shift;
      my $self = new CWB::CEQL;
   
      $self->NewParam("word_attribute", "word");
      $self->setParam("default_ignore_case", 0);
   
      return bless($self, $class);
    }

Overriding Grammar Rules

The standard CEQL grammar is split into many small rules. CEQL extensions are created by overriding individual rules completely. Start by copying the relevant rule from the CWB::CEQL source code into your .pm file, then modify it as required. See CWB::CEQL::Parser for details on how to write grammar rules. All rules of the standard CEQL grammar are listed in section "STANDARD CEQL RULES" below with short descriptions of their function and purpose.

For example, in order to make the word form attribute configurable (say, in a social media corpus that has original and normalised spellings) with the word_attribute parameter introduced above, you would have to override the wordform_pattern rule. Copy the original rule into your grammar and modify it as follows:

    sub wordform_pattern {
      my ($self, $wf) = @_;
      my $test = $self->Call("negated_wildcard_pattern", $wf);
      my $word_att = $self->GetParam("word_attribute"); # <-- NEW
      return $word_att.$test;                           # <-- MODIFIED
    }

In some cases, it is easier to implement a wrapper than to copy the full code of a complex grammar rule. The wrapper has to override the existing rule (otherwise all methods calling the rule would have to be changed), but can call into the base class method. An example is the wrapper below, which extends the wildcard_pattern rule to allow full character-level regular expressions (delimited by /.../).

    sub wildcard_pattern {
      my ($self, $input) = @_;
      if ($input =~ m{^/(.+)/$}) {
        my $regexp = $1;
        $regexp =~ s/"/""/g; # escape double quotes
        return "\"$regexp\"";
      }
      else {
        return $self->SUPER::wildcard_pattern($input);
      }
    }

STANDARD CEQL RULES

ceql_query
default

The default rule of CWB::CEQL is ceql_query. After sanitising whitespace, it uses a heuristic to determine whether the input string is a phrase query or a proximity query and delegates parsing to the appropriate rule (phrase_query or proximity_query).

Phrase Query

phrase_query

A phrase query is the standard form of CEQL syntax. It matches a single token described by constraints on word form, lemma and/or part-of-speech tag, a sequence of such tokens, or a complex lexico-grammatical pattern. The phrase_query rule splits its input into whitespace-separated token expressions, XML tags and metacharacters such as (, ) and |. Then it applies the phrase_element rule to each item in turn, and concatenates the results into the complete CQP query. The phrase query may start with an embedded modifier such as (?longest) to change the matching strategy.

phrase_element

A phrase element is either a token expression (delegated to rule token_expression), an XML tag for matching structure boundaries (delegated to rule xml_tag), a sequence of arbitrary (+) or skipped (*) tokens, or a phrase-level metacharacter (the latter two are handled by the phrase_element rule itself). Proper nesting of parenthesised groups is automatically ensured by the parser.

Token expressions can be preceded by @ to set a target marker, or @0: through @9: to set a numbered target marker.

xml_tag

A start or end tag matching the boundary of an s-attribute region. The xml_tag rule performs validation, in particular ensuring that the region name is listed as an allowed s-attribute in the parameter s_attributes, then passes the tag through to the CQP query.

For a start tag, an optional wildcard pattern constraint may be specified in the form <tag=pattern>. The parser does not check whether the selected s-attribute in fact has annotations. If pattern starts with !, the constraint is negated; case/diacritic-sensitivity flags (:c etc.) can be appended to the pattern, before the closing >.

Proximity Query

proximity_query

A proximity query searches for combinations of words within a certain distance of each other, specified either as a number of tokens (numeric distance) or as co-occurrence within an s-attribute region (structural distance). The proximity_query rule splits its input into a sequence of token patterns, distance operators and parentheses used for grouping. Shorthand notation for word sequences is expanded (e.g. as long as into as >>1>> long >>2>> as), and then the proximity_expression rule is applied to each item in turn. A shift-reduce algorithm in proximity_expression reduces the resulting list into a single CQP query (using the "MU" notation).

proximity_expression

A proximity expression is either a token expression (delegated to token_expression), a distance operator (delegated to distance_operator) or a parenthesis for grouping subexpressions (handled directly). At each step, the current result list is examined to check whether the respective type of proximity expression is valid here. When 3 elements have been collected in the result list (term, operator, term), they are reduced to a single term. This ensures that the Apply method in proximity_query returns only a single string containing the (almost) complete CQP query.

distance_operator

A distance operator specifies the allowed distance between two tokens or subexpressions in a proximity query. Numeric distances are given as a number of tokens and can be two-sided (<<n>>) or one-sided (<<n<< to find the second term to the left of the first, or >>n>> to find it to the right). Structural distances are always two-sided and specify an s-attribute region in which both items must co-occur (e.g. <<s>>).

Token Expression

token_expression

Evaluate a complete token expression with word form (or lemma) constraint and/or part-of-speech (or simple POS) constraint. The two parts of the token expression are passed on to word_or_lemma_constraint and pos_constraint, respectively. This rule returns a CQP token expression enclosed in square brackets.

Word Form / Lemma

word_or_lemma_constraint

Evaluate complete word form (without curly braces) or lemma constraint (in curly braces, or with alternative % marker appended), including case/diacritic flags, and return a single CQP constraint with appropriate %c and %d flags.

wordform_pattern

Translate wildcard pattern for word form into CQP constraint (using the default word attribute).

lemma_pattern

Translate wildcard pattern for lemma into CQP constraint, using the appropriate p-attribute for base forms (given by the parameter lemma_attribute).

Parts of Speech

pos_constraint

Evaluate a part-of-speech constraint (either a pos_tag or simple_pos), returning suitable CQP code to be included in a token expression.

pos_tag

Translate wildcard pattern for part-of-speech tag into CQP constraint, using the appropriate p-attribute for POS tags (given by the parameter pos_attribute).

simple_pos

Translate simple part-of-speech tag into CQP constraint. The specified tag is looked up in the hash provided by the simple_pos parameter, and replaced by the regular expression listed there. If the tag cannot be found, or if no simple tags have been defined, a helpful error message is generated.

Wildcard Patterns

negated_wildcard_pattern

Wildcard pattern with optional negation (leading !). Returns quoted regular expression preceded by appropriate CQP comparison operator (= or !=).

For backward compatibility, the pattern ! is interpreted as a literal exclamation mark.

wildcard_pattern

Translate string containing wildcards into regular expression, which is enclosed in double quotes so it can directly be interpolated into a CQP query.

Internally, the input string is split into wildcards and literal substrings, which are then processed one item at a time with the wildcard_item rule.

wildcard_item

Process an item of a wildcard pattern, which is either some metacharacter (handled directly) or a literal substring (delegated to the literal_string rule). Proper nesting of alternatives is ensured using the shift-reduce parsing mechanism (with BeginGroup and EndGroup calls).

literal_string

Translate literal string into regular expression, escaping all metacharacters with backslashes (backslashes in the input string are removed first).

Internal Subroutines

($has_empty_alt, @tokens) = $self->_remove_empty_alternatives(@tokens);

This internal method identifies and removes empty alternatives from a tokenised group of alternatives (@tokens), with alternatives separated by | tokens. In particular, leading and trailing separator tokens are removed, and multiple consecutive separators are collapsed to a single |. The first return value ($has_empty_alt) indicates whether one or more empty alternatives were found; it is followed by the sanitised list of tokens.

($input, $flags) = $self->_parse_constraint_flags($input);
$cqp_flags = $self->_apply_constraint_flags($flags, $attribute);

Match flags :c, :C, :d and :D at the end of a subexpression, which turn case- and/or diacritic-insensitivity on or off (overriding the default settings for attribute type $attribute).

_parse_constraint_flags returns $input with any flags removed, together with the CEQL flags as $flags. _apply_constraint_flags takes $flags and the attribute type $attribute, and returns the corresponding CQP flags (%c, %d, %cd or an empty string), taking the appropriate defaults into account.

The second parameter is an attribute TYPE, i.e. one of "word_attribute", "lemma_attribute", "pos_attribute", "simple_pos_attribute", "s_attributes" - not the actual name of an attribute.

This operation hasn't been implemented as a grammar rule because it does not fit the paradigm of taking a single input string and returning a CQP translation of the input. It had to be split into two separate methods because in many cases, the attribute type can only be determined after further processing of $input.
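
As an illustration only, a grammar rule in a CEQL extension might use the two methods roughly as follows (the rule name and the attribute name word are assumptions, not part of the standard grammar):

  sub my_wordform_pattern {
    my ($self, $input) = @_;
    my ($pattern, $flags) = $self->_parse_constraint_flags($input);           # strip :c/:C/:d/:D flags
    my $regexp = $self->Call("wildcard_pattern", $pattern);                   # quoted regular expression
    my $cqp_flags = $self->_apply_constraint_flags($flags, "word_attribute");
    return "word = $regexp$cqp_flags";                                        # e.g.  word = "colou?r"%c
  }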

COPYRIGHT

Copyright (C) 2005-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.