NAME

CWB::CEQL - The Common Elementary Query Language for CQP front-ends

SYNOPSIS

  use CWB::CEQL;
  our $CEQL = new CWB::CEQL;

  $CEQL->SetParam("pos_attribute", "tags"); # **TODO: parameters**

  $cqp_query = $CEQL->Parse($ceql_query);
  if (not defined $cqp_query) {
    @error_msg = $CEQL->ErrorMessage;
    $html_msg = $CEQL->HtmlErrorMessage;
  }

  ## extend or modify standard CEQL grammar by subclassing
  package BNCWEB::CEQL;
  use base 'CWB::CEQL';

  sub lemma {
    ## override 'lemma' rule here (e.g. to allow for BNCweb's ``{bucket/N}'' notation)
    my ($self, $string) = @_;
    my $orig_result = $self->SUPER::lemma($string); # call original rule if needed
    return $orig_result;
  }

  ## you can now use BNCWEB::CEQL in the same way as CWB::CEQL

DESCRIPTION

** TODO **

METHODS

The most important user-level methods are inherited from CWB::CEQL::Parser.

$CEQL = new CWB::CEQL;

Create parser object for CEQL queries. Use the Parse method of $CEQL to translate a CEQL query into CQP code.

$cqp_query = $CEQL->Parse($simple_query);

Parses a simple query in CEQL syntax and returns the equivalent CQP code. If there is a syntax error in $simple_query or parsing fails for some other reason, an undefined value is returned.

@text_lines = $CEQL->ErrorMessage;
$html_code = $CEQL->HtmlErrorMessage;

If the last CEQL query failed to parse, these methods return an error message either as a list of text lines (ErrorMessage) or as pre-formatted HTML code that can be used directly by a Web interface (HtmlErrorMessage). The error message includes a backtrace of the internal call stack in order to help users identify the precise location of the problem.

$CEQL->SetParam($name, $value);

Change parameters of the CEQL grammar. Currently, the following parameters are available:

pos_attribute

The p-attribute used to store part-of-speech tags in the CWB corpus (default: pos). CEQL queries should not be used for corpora without POS tagging, which we consider to be a minimal level of annotation.

lemma_attribute

The p-attribute used to store lemmata (base forms) in the CWB corpus (default: lemma). Set to undef if the corpus has not been lemmatised.

simple_pos

Lookup table for simple part-of-speech tags (in CEQL constructions like run_{N}). Must be a hashref with simple POS tags as keys and CQP regular expressions matching an appropriate set of standard POS tags as the corresponding values. The default value is undef, indicating that no simple POS tags have been defined. A very basic setup for the Penn Treebank tag set might look like this:

  $CEQL->SetParam("simple_pos", {
      "N" => "NN.*",   # common nouns
      "V" => "V.*",    # any verb forms
      "A" => "JJ.*",   # adjectives
    });

simple_pos_attribute

Simple POS tags may use a different p-attribute than standard POS tags, specified by the simple_pos_attribute parameter. If it is set to undef (default), the pos_attribute will be used for simplified POS tags as well.

s_attributes

Lookup table indicating which s-attributes in the CWB corpus may be accessed in CEQL queries (using the XML tag notation, e.g. <s> or </s>, or as a distance operator in proximity queries, e.g. <<s>>). The main purpose of this table is to keep the CEQL parser from passing through arbitrary tags to the CQP code, which might generate confusing error messages. Must be a hashref with the names of valid s-attributes as keys mapped to TRUE values. The default setting only allows the s attribute (sentences or s-units), which should be annotated in every corpus:

  $CEQL->SetParam("s_attributes", { "s" => 1 });

default_ignore_case

Indicates whether CEQL queries should perform case-insensitive matching for word forms and lemmas (:c modifier), which can be overridden with an explicit :C modifier. By default, case-insensitive matching is activated, i.e. default_ignore_case is set to 1.

default_ignore_diac

Indicates whether CEQL queries should ignore accents (diacritics) for word forms and lemmas (:d modifier), which can be overridden with an explicit :D modifier. By default, matching does not ignore accents, i.e. default_ignore_diac is set to 0.

See the CWB::CEQL::Parser manpage for more detailed information and further methods.
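The following is a minimal configuration sketch that sets every parameter documented above in one place. The attribute names, the additional p region and the tag patterns are assumptions about a hypothetical corpus, and the module is loaded only if it is installed, so the snippet also runs on systems without CWB::CEQL.

```perl
use strict;
use warnings;

# Hypothetical settings for a Penn-tagged, lemmatised corpus (assumptions).
my %params = (
    pos_attribute        => "pos",
    lemma_attribute      => "lemma",
    simple_pos           => { "N" => "NN.*", "V" => "V.*", "A" => "JJ.*" },
    simple_pos_attribute => undef,                    # fall back to pos_attribute
    s_attributes         => { "s" => 1, "p" => 1 },   # "p" is an assumed extra region
    default_ignore_case  => 1,
    default_ignore_diac  => 0,
);

# Use the real parser only if the module is available on this system.
if (eval { require CWB::CEQL; 1 }) {
    my $CEQL = CWB::CEQL->new;
    $CEQL->SetParam($_, $params{$_}) for sort keys %params;

    my $cqp_query = $CEQL->Parse("run_{N}");
    if (defined $cqp_query) {
        print "$cqp_query\n";
    }
    else {
        print "$_\n" for $CEQL->ErrorMessage;
    }
}
```

Keeping the settings in a single hash makes it easy to apply the same configuration to a subclassed grammar as well.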

CEQL SYNTAX

** TODO **

EXTENDING CEQL

** TODO **: How to extend the standard CEQL grammar by subclassing. Note that the grammar is split into many small rules, so it is easy to modify by overriding individual rules completely (without having to call the original rule in between or having to replicate complicated functionality).

See CWB::CEQL::Parser for details on how to write grammar rules. You should always have a copy of the CWB::CEQL source code file at hand when writing your extensions. All rules of the standard CEQL grammar are listed below with short descriptions of their function and purpose.

STANDARD CEQL RULES

ceql_query
default

The default rule of CWB::CEQL is ceql_query. After sanitising whitespace, it uses a heuristic to determine whether the input string is a phrase query or a proximity query and delegates parsing to the appropriate rule (phrase_query or proximity_query).
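The dispatch step can be sketched as follows. The heuristic shown here (treat any query containing << or >> as a proximity query) is an assumption made for illustration; the actual test used by ceql_query may differ, and classify_query is a hypothetical name.

```perl
use strict;
use warnings;

# Normalise whitespace, then guess which rule should parse the query.
# Assumption: a distance operator marker (<< or >>) signals a proximity query.
sub classify_query {
    my ($query) = @_;
    $query =~ s/\s+/ /g;       # collapse runs of whitespace
    $query =~ s/^ | $//g;      # trim leading and trailing blank
    my $rule = ($query =~ /<<|>>/) ? "proximity_query" : "phrase_query";
    return ($rule, $query);
}

for my $input ("  as  <<3>> long ", "the  old man") {
    my ($rule, $clean) = classify_query($input);
    print "$rule: '$clean'\n";
}
```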

Phrase Query

phrase_query

A phrase query is the standard form of CEQL syntax. It matches a single token described by constraints on word form, lemma and/or part-of-speech tag, a sequence of such tokens, or a complex lexico-grammatical pattern. The phrase_query rule splits its input into whitespace-separated token expressions, XML tags and metacharacters such as (, ) and |. Then it applies the phrase_element rule to each item in turn, and concatenates the results into the complete CQP query.

phrase_element

A phrase element is either a token expression (delegated to rule token_expression), an XML tag for matching structure boundaries (delegated to rule xml_tag), a sequence of arbitrary (+) or skipped (*) tokens, or a phrase-level metacharacter (the latter two are handled by the phrase_element rule itself). Proper nesting of parenthesised groups is automatically ensured by the parser.

xml_tag

A start or end tag matching the boundary of an s-attribute region. The xml_tag rule only performs validation, in particular ensuring that the region name is listed as an allowed s-attribute in the parameter s_attributes, then passes the tag through to the CQP query.
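A standalone sketch of this kind of validation is shown below. The function name check_xml_tag, the accepted tag syntax and the error messages are illustrative assumptions, and %s_attributes stands in for the s_attributes parameter.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the s_attributes parameter.
my %s_attributes = ( "s" => 1 );

# Validate a start or end tag and pass it through unchanged.
# Returns the tag on success and dies with a message otherwise.
sub check_xml_tag {
    my ($tag) = @_;
    my ($name) = $tag =~ m{^</?([A-Za-z_][A-Za-z0-9_-]*)>$}
        or die "'$tag' is not a valid XML tag\n";
    die "s-attribute <$name> is not allowed in queries\n"
        unless $s_attributes{$name};
    return $tag;
}

print check_xml_tag("<s>"), "\n";     # allowed, passed through
print check_xml_tag("</s>"), "\n";    # allowed, passed through
my $ok = eval { check_xml_tag("<div>"); 1 };
print "rejected: $@" unless $ok;
```

Rejecting unknown region names at this point keeps the error message close to the user's input instead of surfacing later as a CQP error.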

Proximity Query

proximity_query

A proximity query searches for combinations of words within a certain distance of each other, specified either as a number of tokens (numeric distance) or as co-occurrence within an s-attribute region (structural distance). The proximity_query rule splits its input into a sequence of token patterns, distance operators and parentheses used for grouping. Shorthand notation for word sequences is expanded (e.g. as long as into as >>1>> long >>2>> as), and then the proximity_expression rule is applied to each item in turn. A shift-reduce algorithm in proximity_expression reduces the resulting list into a single CQP query (using the undocumented "MU" notation).
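The shorthand expansion can be illustrated with a small sketch. The helper expand_sequence is hypothetical; it only reproduces the pattern of the example above, where each following word gets a one-sided distance measured from the first word.

```perl
use strict;
use warnings;

# Expand a plain word sequence into explicit one-sided distance operators,
# each measured from the first word of the sequence.
sub expand_sequence {
    my @words = @_;
    my $expanded = shift @words;
    my $distance = 0;
    for my $word (@words) {
        $distance++;
        $expanded .= " >>$distance>> $word";
    }
    return $expanded;
}

print expand_sequence(qw(as long as)), "\n";   # as >>1>> long >>2>> as
```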

proximity_expression

A proximity expression is either a token expression (delegated to token_expression), a distance operator (delegated to distance_operator) or a parenthesis for grouping subexpressions (handled directly). At each step, the current result list is examined to check whether the respective type of proximity expression is valid here. When 3 elements have been collected in the result list (term, operator, term), they are reduced to a single term. This ensures that the Apply method in proximity_query returns only a single string containing the (almost) complete CQP query.
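The reduction step can be sketched independently of the parser framework. This is a simplified illustration under stated assumptions: the function names are hypothetical, items are plain strings, and the bracketed output is a placeholder rather than real CQP "MU" code.

```perl
use strict;
use warnings;

# Shift items onto a result list; whenever the list holds
# (term, operator, term), reduce those three elements to one term.
sub reduce_proximity {
    my @items = @_;
    my @stack = ([]);                      # one result list per open group
    for my $item (@items) {
        if ($item eq "(") {
            push @stack, [];
        }
        elsif ($item eq ")") {
            die "unbalanced ')'\n" unless @stack > 1;
            my $group = pop @stack;
            die "incomplete group\n" unless @$group == 1;
            shift_term($stack[-1], $group->[0]);
        }
        elsif ($item =~ /^(?:<<|>>)/) {    # distance operator
            my $list = $stack[-1];
            die "operator '$item' must follow a term\n" unless @$list == 1;
            push @$list, $item;
        }
        else {                             # token expression
            shift_term($stack[-1], $item);
        }
    }
    die "unbalanced '('\n" unless @stack == 1;
    die "incomplete expression\n" unless @{ $stack[0] } == 1;
    return $stack[0][0];
}

# Add a term to the list and reduce if a full triple is present.
sub shift_term {
    my ($list, $term) = @_;
    die "term '$term' must follow an operator\n" if @$list == 1;
    push @$list, $term;
    if (@$list == 3) {
        my ($left, $op, $right) = splice(@$list, 0, 3);
        push @$list, "[$left $op $right]";
    }
}

print reduce_proximity(qw{as >>1>> long >>2>> as}), "\n";   # [[as >>1>> long] >>2>> as]
```

Because every reduction leaves exactly one term in the list, a valid input always ends with a single string, which mirrors the behaviour described for the Apply method.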

distance_operator

A distance operator specifies the allowed distance between two tokens or subexpressions in a proximity query. Numeric distances are given as a number of tokens and can be two-sided (<<n>>) or one-sided (<<n<< to find the second term to the left of the first, or >>n>> to find it to the right). Structural distances are always two-sided and specify an s-attribute region in which both items must co-occur (e.g. <<s>>).
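The four operator forms can be told apart with a few regular expressions. The sketch below only classifies an operator and returns a descriptive hash; the function name, the hash keys and the accepted region name syntax are assumptions, and no CQP code is generated.

```perl
use strict;
use warnings;

# Classify a distance operator as numeric (two-sided, left or right)
# or structural (co-occurrence within an s-attribute region).
sub parse_distance {
    my ($op) = @_;
    if ($op =~ /^<<(\d+)>>$/) {
        return { type => "numeric", side => "both", max => $1 + 0 };
    }
    if ($op =~ /^<<(\d+)<<$/) {
        return { type => "numeric", side => "left", max => $1 + 0 };
    }
    if ($op =~ /^>>(\d+)>>$/) {
        return { type => "numeric", side => "right", max => $1 + 0 };
    }
    if ($op =~ /^<<([A-Za-z_][A-Za-z0-9_-]*)>>$/) {
        return { type => "structural", region => $1 };
    }
    die "invalid distance operator '$op'\n";
}

for my $op ("<<3>>", "<<2<<", ">>5>>", "<<s>>") {
    my $d = parse_distance($op);
    my $detail = $d->{type} eq "numeric"
        ? "$d->{side}, up to $d->{max} tokens"
        : "within <$d->{region}>";
    print "$op: $d->{type} ($detail)\n";
}
```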

Token Expression

token_expression

Evaluate complete token expression with word form (or lemma) constraint and/or part-of-speech (or simple POS) constraint. The two parts of the token expression are passed on to word_or_lemma_constraint and pos_constraint, respectively. This rule returns a CQP token expression enclosed in square brackets.
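How the two partial results might be combined can be sketched as follows. The helper build_token_expression is hypothetical, and the constraint strings passed to it are assumptions about what the delegated rules return.

```perl
use strict;
use warnings;

# Join the constraints produced for the two parts of a token expression
# and wrap them in square brackets. Either part may be missing.
sub build_token_expression {
    my ($word_constraint, $pos_constraint) = @_;
    my @parts = grep { defined($_) && length($_) }
        ($word_constraint, $pos_constraint);
    return "[" . join(" & ", @parts) . "]";
}

print build_token_expression('word="run"%c', 'pos="NN.*"'), "\n";  # [word="run"%c & pos="NN.*"]
print build_token_expression(undef, 'pos="V.*"'), "\n";            # [pos="V.*"]
```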

Word Form / Lemma

word_or_lemma_constraint

Evaluate complete word form or lemma constraint, including case/diacritics flags, and return suitable CQP code to be included in a token expression.

word_or_lemma

Evaluate word form (without curly braces) or lemma constraint (with curly braces) and return a single CQP constraint, to which %c and %d flags can then be added.

wordform_pattern

Translate wildcard pattern for word form into CQP constraint (using the default word attribute).

lemma_pattern

Translate wildcard pattern for lemma into CQP constraint, using the appropriate p-attribute for base forms (given by the parameter lemma_attribute).

Parts of Speech

pos_constraint

Evaluate a part-of-speech constraint (either a pos_tag or simple_pos), returning suitable CQP code to be included in a token expression.

pos_tag

Translate wildcard pattern for part-of-speech tag into CQP constraint, using the appropriate p-attribute for POS tags (given by the parameter pos_attribute).

simple_pos

Translate simple part-of-speech tag into CQP constraint. The specified tag is looked up in the hash provided by the simple_pos parameter, and replaced by the regular expression listed there. If the tag cannot be found, or if no simple tags have been defined, a helpful error message is generated.
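The lookup and its two failure cases can be sketched as a standalone function. The name simple_pos_constraint, the wording of the error messages and the format of the returned constraint are illustrative assumptions.

```perl
use strict;
use warnings;

# Look up a simple POS tag in a table like the simple_pos parameter and
# return a CQP constraint on the given attribute.
sub simple_pos_constraint {
    my ($tag, $table, $attribute) = @_;
    die "no simple part-of-speech tags are available for this corpus\n"
        unless defined $table;
    my $regexp = $table->{$tag};
    if (not defined $regexp) {
        my $valid = join(", ", sort keys %$table);
        die "'$tag' is not a valid simple part-of-speech tag (available: $valid)\n";
    }
    return qq{$attribute="$regexp"};
}

my %simple_pos = ( "N" => "NN.*", "V" => "V.*", "A" => "JJ.*" );
print simple_pos_constraint("N", \%simple_pos, "pos"), "\n";   # pos="NN.*"

my $ok = eval { simple_pos_constraint("X", \%simple_pos, "pos"); 1 };
print "error: $@" unless $ok;
```

Listing the available tags in the error message is one way to make it helpful for users who mistype a tag.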

Wildcard Patterns

wildcard_pattern

Translate string containing wildcards into regular expression, which is enclosed in double quotes so it can directly be interpolated into a CQP query.

Internally, the input string is split into wildcards and literal substrings, which are then processed one item at a time with the wildcard_item rule.
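A reduced sketch of this split-and-translate approach is given below. It assumes only two wildcards, ? for a single character and * for any number of characters, which is a simplification chosen for this example; the function name is hypothetical, and Perl's quotemeta stands in for the literal_string rule.

```perl
use strict;
use warnings;

# Split a pattern into wildcards and literal substrings, translate each
# item, and wrap the result in double quotes.
sub simple_wildcard_pattern {
    my ($pattern) = @_;
    my $regexp = "";
    for my $item (split /([?*])/, $pattern) {
        next unless length $item;          # skip empty pieces from split
        if    ($item eq "?") { $regexp .= "." }
        elsif ($item eq "*") { $regexp .= ".*" }
        else                 { $regexp .= quotemeta($item) }
    }
    return qq{"$regexp"};
}

print simple_wildcard_pattern("walk*"), "\n";   # "walk.*"
print simple_wildcard_pattern("b?t"), "\n";     # "b.t"
```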

wildcard_item

Process an item of a wildcard pattern, which is either some metacharacter (handled directly) or a literal substring (delegated to the literal_string rule). Proper nesting of alternatives is ensured using the shift-reduce parsing mechanism (with BeginGroup and EndGroup calls).

literal_string

Translate literal string into regular expression, escaping all metacharacters with backslashes (backslashes in the input string are removed first).

Note that escaping of ^ and " isn't fully reliable because CQP might interpret the resulting escape sequences as LaTeX-style accents if they are followed by certain letters. Future versions of CQP should provide a safer escaping mechanism and/or allow interpretation of LaTeX-style accents to be turned off.
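The two steps of this rule (drop backslashes, then escape metacharacters) can be sketched in a few lines. The function name and the exact set of characters treated as metacharacters are assumptions for illustration only.

```perl
use strict;
use warnings;

# Remove backslashes from the input, then protect regular expression
# metacharacters with a backslash.
# Assumption: the character set below is illustrative only.
sub escape_literal {
    my ($string) = @_;
    $string =~ s/\\//g;
    $string =~ s/([.?*+|(){}\[\]^\$"])/\\$1/g;
    return $string;
}

print escape_literal('U.S.'), "\n";      # U\.S\.
print escape_literal('what\?'), "\n";    # what\?
```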

Internal Subroutines

($has_empty_alt, @tokens) = $self->_remove_empty_alternatives(@tokens);

This internal method identifies and removes empty alternatives from a tokenised group of alternatives (@tokens), with alternatives separated by | tokens. In particular, leading and trailing separator tokens are removed, and multiple consecutive separators are collapsed to a single |. The first return value ($has_empty_alt) indicates whether one or more empty alternatives were found; it is followed by the sanitised list of tokens.
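The behaviour described above can be reproduced with a standalone function. This is a sketch written from the description, not the module's own code, and it is a plain function rather than a method.

```perl
use strict;
use warnings;

# Remove empty alternatives from a list of tokens separated by "|".
sub remove_empty_alternatives {
    my @tokens = @_;
    my $has_empty_alt = 0;
    my @clean;
    for my $token (@tokens) {
        if ($token eq "|" and (!@clean || $clean[-1] eq "|")) {
            $has_empty_alt = 1;       # leading or repeated separator
            next;
        }
        push @clean, $token;
    }
    if (@clean and $clean[-1] eq "|") {
        $has_empty_alt = 1;           # trailing separator
        pop @clean;
    }
    return ($has_empty_alt, @clean);
}

my ($has_empty_alt, @tokens) =
    remove_empty_alternatives("|", "a", "|", "|", "b", "|");
print "$has_empty_alt: @tokens\n";    # 1: a | b
```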

COPYRIGHT

Copyright (C) 1999-2010 Stefan Evert [http://purl.org/stefan.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.