CWB::CEQL - The Common Elementary Query Language for CQP front-ends
    use CWB::CEQL;
    our $CEQL = new CWB::CEQL;

    $CEQL->SetParam("pos_attribute", "tags");   # **TODO: parameters**

    $cqp_query = $CEQL->Parse($ceql_query);
    if (not defined $cqp_query) {
        @error_msg = $CEQL->ErrorMessage;
        $html_msg  = $CEQL->HtmlErrorMessage;
    }

    ## extend or modify standard CEQL grammar by subclassing
    package BNCWEB::CEQL;
    use base 'CWB::CEQL';

    sub lemma {
        ## overwrite 'lemma' rule here (e.g. to allow for BNCweb's ``{bucket/N}'' notation)
        my ($self, $string) = @_;
        my $orig_result = $self->SUPER::lemma($string);   # call original rule if needed
        return $orig_result;
    }

    ## you can now use BNCWEB::CEQL in the same way as CWB::CEQL
** TODO **
The most important user-level methods are inherited from CWB::CEQL::Parser.
The constructor (new CWB::CEQL) creates a parser object for CEQL queries. Use the Parse method of $CEQL to translate a CEQL query into CQP code.
Parse($simple_query) parses a simple query in CEQL syntax and returns the equivalent CQP code. If there is a syntax error in $simple_query or parsing fails for some other reason, an undefined value is returned.
If the last CEQL query failed to parse, ErrorMessage and HtmlErrorMessage return an error message, either as a list of text lines (ErrorMessage) or as pre-formatted HTML code that can be used directly by a Web interface (HtmlErrorMessage). The error message includes a backtrace of the internal call stack in order to help users identify the precise location of the problem.
SetParam changes parameters of the CEQL grammar. Currently, the following parameters are available:
pos_attribute
The p-attribute used to store part-of-speech tags in the CWB corpus (default: pos). CEQL queries should not be used for corpora without POS tagging, which we consider to be a minimal level of annotation.
lemma_attribute
The p-attribute used to store lemmata (base forms) in the CWB corpus (default: lemma). Set to undef if the corpus has not been lemmatised.
simple_pos
Lookup table for simple part-of-speech tags (in CEQL constructions like run_{N}). Must be a hashref with simple POS tags as keys and CQP regular expressions matching an appropriate set of standard POS tags as the corresponding values. The default value is undef, indicating that no simple POS tags have been defined. A very basic setup for the Penn Treebank tag set might look like this:
    $CEQL->SetParam("simple_pos", {
        "N" => "NN.*",   # common nouns
        "V" => "V.*",    # any verb forms
        "A" => "JJ.*",   # adjectives
    });
simple_pos_attribute
Simple POS tags may use a different p-attribute than standard POS tags, specified by the simple_pos_attribute parameter. If it is set to undef (default), the pos_attribute will be used for simplified POS tags as well.
s_attributes
Lookup table indicating which s-attributes in the CWB corpus may be accessed in CEQL queries (using the XML tag notation, e.g. <s> or </s>, or as a distance operator in proximity queries, e.g. <<s>>). The main purpose of this table is to keep the CEQL parser from passing arbitrary tags through to the CQP code, which might generate confusing error messages. Must be a hashref with the names of valid s-attributes as keys mapped to TRUE values. The default setting only allows sentences (s-units), which should be annotated in every corpus:
$CEQL->SetParam("s_attributes", { "s" => 1 });
default_ignore_case
Indicates whether CEQL queries should perform case-insensitive matching for word forms and lemmas (:c modifier), which can be overridden with an explicit :C modifier. By default, case-insensitive matching is activated, i.e. default_ignore_case is set to 1.
default_ignore_diac
Indicates whether CEQL queries should ignore accents (diacritics) for word forms and lemmas (:d modifier), which can be overridden with an explicit :D modifier. By default, matching does not ignore accents, i.e. default_ignore_diac is set to 0.
See the CWB::CEQL::Parser manpage for more detailed information and further methods.
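The parameters described above can be combined into a single configuration step. The following is a minimal sketch, assuming that CWB::CEQL is installed; the s-attribute "p" and the sample query are illustrative assumptions, and all other names are taken from the parameter list above.

```perl
use strict;
use warnings;
use CWB::CEQL;

my $CEQL = CWB::CEQL->new;

# p-attributes for POS tags and lemmata (these are the documented defaults)
$CEQL->SetParam("pos_attribute", "pos");
$CEQL->SetParam("lemma_attribute", "lemma");

# simple POS tags for the Penn Treebank tag set (taken from the example above)
$CEQL->SetParam("simple_pos", { "N" => "NN.*", "V" => "V.*", "A" => "JJ.*" });

# allow sentences and, as an assumption for this sketch, paragraphs ("p")
$CEQL->SetParam("s_attributes", { "s" => 1, "p" => 1 });

# make matching case-sensitive unless a query uses the :c modifier
$CEQL->SetParam("default_ignore_case", 0);

my $cqp_query = $CEQL->Parse("run_{N}");
if (defined $cqp_query) {
    print "$cqp_query\n";
}
else {
    print STDERR "$_\n" for $CEQL->ErrorMessage;
}
```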
** TODO **: How to extend the standard CEQL grammar by subclassing. Note that the grammar is split into many small rules, so it is easy to modify by overriding individual rules completely (without having to call the original rule in between or having to replicate complicated functionality).
See CWB::CEQL::Parser for details on how to write grammar rules. You should always have a copy of the CWB::CEQL source code file at hand when writing your extensions. All rules of the standard CEQL grammar are listed below with short descriptions of their function and purpose.
ceql_query
default
The default rule of CWB::CEQL is ceql_query. After sanitising whitespace, it uses a heuristic to determine whether the input string is a phrase query or a proximity query and delegates parsing to the appropriate rule (phrase_query or proximity_query).
phrase_query
A phrase query is the standard form of CEQL syntax. It matches a single token described by constraints on word form, lemma and/or part-of-speech tag, a sequence of such tokens, or a complex lexico-grammatical pattern. The phrase_query rule splits its input into whitespace-separated token expressions, XML tags and metacharacters such as (, ) and |. Then it applies the phrase_element rule to each item in turn, and concatenates the results into the complete CQP query.
phrase_element
A phrase element is either a token expression (delegated to rule token_expression), an XML tag for matching structure boundaries (delegated to rule xml_tag), a sequence of arbitrary (+) or skipped (*) tokens, or a phrase-level metacharacter (the latter two are handled by the phrase_element rule itself). Proper nesting of parenthesised groups is automatically ensured by the parser.
xml_tag
A start or end tag matching the boundary of an s-attribute region. The xml_tag rule only performs validation, in particular ensuring that the region name is listed as an allowed s-attribute in the parameter s_attributes, then passes the tag through to the CQP query.
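The validation step can be illustrated with a small stand-alone sketch. The function validate_xml_tag and the assumed tag syntax are hypothetical and not part of CWB::CEQL; only the lookup table format follows the s_attributes parameter.

```perl
use strict;
use warnings;

# Hypothetical helper (not the CWB::CEQL implementation): check a start or
# end tag against a lookup table in the format of the s_attributes parameter.
sub validate_xml_tag {
    my ($tag, $s_attributes) = @_;
    # assumed tag syntax for this sketch: <name> or </name>
    my ($name) = $tag =~ m{^</?([A-Za-z_][A-Za-z0-9_-]*)>$}
        or die "'$tag' is not a valid XML tag\n";
    die "s-attribute '$name' is not allowed in this corpus\n"
        unless $s_attributes->{$name};
    return $tag;    # valid tags are passed through unchanged
}

my %allowed = ("s" => 1);
print validate_xml_tag("</s>", \%allowed), "\n";
```

A tag such as <p> would raise an error with this table, since only "s" is listed as an allowed region.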
proximity_query
A proximity query searches for combinations of words within a certain distance of each other, specified either as a number of tokens (numeric distance) or as co-occurrence within an s-attribute region (structural distance). The proximity_query rule splits its input into a sequence of token patterns, distance operators and parentheses used for grouping. Shorthand notation for word sequences is expanded (e.g. as long as into as >>1>> long >>2>> as), and then the proximity_expression rule is applied to each item in turn. A shift-reduce algorithm in proximity_expression reduces the resulting list into a single CQP query (using the undocumented "MU" notation).
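The shorthand expansion for word sequences can be sketched in isolation as follows. The function expand_word_sequence is a hypothetical helper that only handles a plain whitespace-separated sequence of words; the real rule also has to deal with explicit distance operators and parentheses.

```perl
use strict;
use warnings;

# Hypothetical helper: anchor every following word at an increasing
# distance to the right of the first word, e.g. "w1 >>1>> w2 >>2>> w3".
sub expand_word_sequence {
    my ($sequence) = @_;
    my @words = split " ", $sequence;
    my $expanded = shift @words;
    return "" unless defined $expanded;
    my $distance = 0;
    for my $word (@words) {
        $distance++;
        $expanded .= " >>$distance>> $word";
    }
    return $expanded;
}

print expand_word_sequence("as long as"), "\n";   # prints: as >>1>> long >>2>> as
```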
proximity_expression
A proximity expression is either a token expression (delegated to token_expression), a distance operator (delegated to distance_operator) or a parenthesis for grouping subexpressions (handled directly). At each step, the current result list is examined to check whether the respective type of proximity expression is valid here. When 3 elements have been collected in the result list (term, operator, term), they are reduced to a single term. This ensures that the Apply method in proximity_query returns only a single string containing the (almost) complete CQP query.
distance_operator
A distance operator specifies the allowed distance between two tokens or subexpressions in a proximity query. Numeric distances are given as a number of tokens and can be two-sided (<<n>>) or one-sided (<<n<< to find the second term to the left of the first, or >>n>> to find it to the right). Structural distances are always two-sided and specify an s-attribute region in which both items must co-occur (e.g. <<s>>).
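The four operator forms can be told apart with simple pattern matching. This is a minimal sketch; the function classify_distance_operator and the structure of its return value are hypothetical and do not reflect the CQP code generated by the actual distance_operator rule.

```perl
use strict;
use warnings;

# Hypothetical helper: describe a distance operator as a hashref,
# or return undef if the string is not a distance operator.
sub classify_distance_operator {
    my ($op) = @_;
    if ($op =~ /^<<(\d+)>>$/) {
        return { type => "numeric", direction => "both", distance => $1 };
    }
    if ($op =~ /^<<(\d+)<<$/) {
        return { type => "numeric", direction => "left", distance => $1 };
    }
    if ($op =~ /^>>(\d+)>>$/) {
        return { type => "numeric", direction => "right", distance => $1 };
    }
    if ($op =~ /^<<([A-Za-z_][A-Za-z0-9_-]*)>>$/) {
        return { type => "structural", direction => "both", region => $1 };
    }
    return undef;
}

my $info = classify_distance_operator("<<5<<");
print "$info->{type} distance, direction: $info->{direction}\n";
```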
token_expression
Evaluate a complete token expression with a word form (or lemma) constraint and/or a part-of-speech (or simple POS) constraint. The two parts of the token expression are passed on to word_or_lemma_constraint and pos_constraint, respectively. This rule returns a CQP token expression enclosed in square brackets.
word_or_lemma_constraint
Evaluate a complete word form or lemma constraint, including case/diacritics flags, and return suitable CQP code to be included in a token expression.
word_or_lemma
Evaluate word form (without curly braces) or lemma constraint (with curly braces) and return a single CQP constraint, to which %c and %d flags can then be added.
wordform_pattern
Translate wildcard pattern for word form into CQP constraint (using the default word attribute).
lemma_pattern
Translate wildcard pattern for lemma into CQP constraint, using the appropriate p-attribute for base forms (given by the parameter lemma_attribute).
pos_constraint
Evaluate a part-of-speech constraint (either a pos_tag or simple_pos), returning suitable CQP code to be included in a token expression.
pos_tag
Translate wildcard pattern for part-of-speech tag into CQP constraint, using the appropriate p-attribute for POS tags (given by the parameter pos_attribute).
simple_pos
Translate simple part-of-speech tag into CQP constraint. The specified tag is looked up in the hash provided by the simple_pos parameter, and replaced by the regular expression listed there. If the tag cannot be found, or if no simple tags have been defined, a helpful error message is generated.
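The lookup described here might be sketched as follows. The function simple_pos_constraint, its error messages and the exact form of the returned constraint are assumptions for illustration; only the format of the lookup table follows the simple_pos parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a simple POS tag into a CQP constraint
# on the given p-attribute, using a simple_pos-style lookup table.
sub simple_pos_constraint {
    my ($tag, $simple_pos, $attribute) = @_;
    die "no simple part-of-speech tags are available for this corpus\n"
        unless ref($simple_pos) eq "HASH";
    my $regexp = $simple_pos->{$tag};
    die "'$tag' is not a valid simple part-of-speech tag (available tags: "
        . join(", ", sort keys %$simple_pos) . ")\n"
        unless defined $regexp;
    return qq{$attribute="$regexp"};
}

my %simple_pos = ("N" => "NN.*", "V" => "V.*", "A" => "JJ.*");
print simple_pos_constraint("N", \%simple_pos, "pos"), "\n";   # prints: pos="NN.*"
```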
wildcard_pattern
Translate string containing wildcards into regular expression, which is enclosed in double quotes so it can directly be interpolated into a CQP query.
Internally, the input string is split into wildcards and literal substrings, which are then processed one item at a time with the wildcard_item rule.
wildcard_item
Process an item of a wildcard pattern, which is either some metacharacter (handled directly) or a literal substring (delegated to the literal_string rule). Proper nesting of alternatives is ensured using the shift-reduce parsing mechanism (with BeginGroup and EndGroup calls).
literal_string
Translate literal string into regular expression, escaping all metacharacters with backslashes (backslashes in the input string are removed first).
Note that escaping of ^ and " isn't fully reliable because CQP might interpret the resulting escape sequences as LaTeX-style accents if they are followed by certain letters. Future versions of CQP should provide a safer escaping mechanism and/or allow interpretation of LaTeX-style accents to be turned off.
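The two steps of literal string translation (remove backslashes, then escape metacharacters) can be sketched as follows. The function escape_literal and the exact set of escaped characters are assumptions for illustration, and the caveat about ^ and " applies to this sketch as well.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal string into a regular expression
# that matches exactly this string.
sub escape_literal {
    my ($string) = @_;
    $string =~ s/\\//g;                          # remove backslashes from the input first
    $string =~ s/([.?*+|(){}\[\]^\$"])/\\$1/g;   # then escape metacharacters (assumed set)
    return $string;
}

print escape_literal('U.S.\*'), "\n";   # prints: U\.S\.\*
```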
This internal method identifies and removes empty alternatives from a tokenised group of alternatives (@tokens), with alternatives separated by | tokens. In particular, leading and trailing separator tokens are removed, and multiple consecutive separators are collapsed to a single |. The first return value ($has_empty_alt) indicates whether one or more empty alternatives were found; it is followed by the sanitised list of tokens.
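The behaviour described above can be sketched as a stand-alone function. The name remove_empty_alternatives is chosen for this illustration and the code is not the module's own implementation; it only mirrors the documented return values.

```perl
use strict;
use warnings;

# Sketch: drop leading, trailing and repeated "|" separators and report
# whether any empty alternative was found.
sub remove_empty_alternatives {
    my @tokens = @_;
    my $has_empty_alt = 0;
    my @clean;
    for my $token (@tokens) {
        if ($token eq "|" and (@clean == 0 or $clean[-1] eq "|")) {
            $has_empty_alt = 1;    # leading or repeated separator
            next;
        }
        push @clean, $token;
    }
    if (@clean and $clean[-1] eq "|") {
        pop @clean;                # trailing separator
        $has_empty_alt = 1;
    }
    return ($has_empty_alt, @clean);
}

my ($has_empty_alt, @tokens) = remove_empty_alternatives("|", "a", "|", "|", "b", "|");
print "$has_empty_alt: @tokens\n";   # prints: 1: a | b
```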
Copyright (C) 1999-2010 Stefan Evert [http://purl.org/stefan.evert]
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.